GPT-4o API Deep Dive: Text Generation, Vision, and Function Calling
The latest OpenAI model, GPT-4o, is here! What are the new features and improvements?
One can say that GPT-4o1 is the best AI model yet. It can understand and generate text, interpret images, process audio, and even respond to video inputs—all while being twice as fast and half the cost of its predecessor, GPT-4 Turbo. It is similar to GPT-4 Turbo in a sense that it supports 128,000 tokens with training data up to Oct 2023. But it has some key differences:
- Multimodal Model: GPT-4o can handle text, images, audio, and video, making through a single neural network.
- Efficiency and Cost: It's twice as fast and 50% cheaper than GPT-4 Turbo
- Enhanced Capabilities: Excels in vision, audio understanding, and non-English languages (with new tokenizer)
It is also holding the first spot on the LMSYS Chatbot Arena Leaderboard:
Join the AI BootCamp!
Ready to dive into the world of AI and Machine Learning? Join the AI BootCamp to transform your career with the latest skills and hands-on project experience. Learn about LLMs, ML best practices, and much more!
Setup
Want to follow along? The Jupyter notebook is available at this Github repository (opens in a new tab)
To get ready for our deep dive into the GPT-4o API, we'll start by installing the necessary libraries and setting up our environment.
First, open your terminal and run the following commands to install the required libraries:
pip install -Uqqq pip --progress-bar off
pip install -qqq openai==1.30.1 --progress-bar off
pip install -qqq tiktoken==0.7.0 --progress-bar off
We need two key libraries: openai
2 and tiktoken
3. The openai
library
lets us make API calls to the GPT-4o model. The tiktoken
library helps us with
tokenizing the text for the model.
Next, let's download an image that we'll use for vision understanding:
gdown 1nO9NdIgHjA3CL0QCyNcrL_Ic0s7HgX5N
Now, let's import the required libraries and set up our environment in Python:
import base64
import json
import os
import textwrap
from inspect import cleandoc
from pathlib import Path
from typing import List
import requests
import tiktoken
from google.colab import userdata
from IPython.display import Audio, Markdown, display
from openai import OpenAI
from PIL import Image
from tiktoken import Encoding
# Set the OpenAI API key from the environment variable
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
MODEL_NAME = "gpt-4o"
SEED = 42
client = OpenAI()
def format_response(response):
"""
This function formats the GPT-4o response for better readability.
"""
response_txt = response.choices[0].message.content
text = ""
for chunk in response_txt.split("\n"):
text += "\n"
if not chunk:
continue
text += ("\n".join(textwrap.wrap(chunk, 100, break_long_words=False))).strip()
return text.strip()
In the above code, we set up the OpenAI client with the API key stored in the
environment variable OPENAI_API_KEY
. We also defined a helper function
format_response
to format the GPT-4o responses for better readability.
That's it! You're all set up and ready to dive deeper into using the GPT-4o API.
Prompting via the API
Calling the GPT-4o model via the API is straightforward. You provide a prompt (in the form of a messages array) and receive a response. Let's walk through an example where we prompt the model for a simple text completion task:
%%time
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show the Office",
},
{"role": "user", "content": "Explain how GPT-4 works"},
]
response = client.chat.completions.create(
model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
response
ChatCompletion(
id="chatcmpl-9QyRx7jFE1z77bl1nRSMO4UPQC6cz",
choices=[
Choice(
finish_reason="stop",
index=0,
logprobs=None,
message=ChatCompletionMessage(
content="Ah, artificial intelligence, a ...",
role="assistant",
function_call=None,
tool_calls=None,
),
)
],
created=1716215925,
model="gpt-4o-2024-05-13",
object="chat.completion",
system_fingerprint="fp_729ea513f7",
usage=CompletionUsage(completion_tokens=434, prompt_tokens=30, total_tokens=464),
)
In the messages array, roles are defined as follows:
- system: Sets the context for the conversation.
- user: The prompt or question for the model.
- assistant: The response generated by the model.
- tool: The response generated by a tool or function.
The response object contains the completion generated by the model. Here's how you can check the token usage:
usage = response.usage
print(
f"""
Tokens Used
Prompt: {usage.prompt_tokens}
Completion: {usage.completion_tokens}
Total: {usage.total_tokens}
"""
)
Tokens Used
Prompt: 30
Completion: 434
Total: 464
To access the assistant's response, use the
response.choices[0].message.content
structure. This gives you the text
generated by the model for your prompt.
Ah, artificial intelligence, a fascinating subject! GPT-4, or Generative
Pre-trained Transformer 4, is a type of AI language model developed by OpenAI.
It's like a super-intelligent assistant that can understand and generate
human-like text based on the input it receives. Here's a breakdown of how it
works:
1. **Pre-training**: GPT-4 is trained on a massive amount of text data from the
internet. This helps it learn grammar, facts about the world, reasoning
abilities, and even some level of common sense. Think of it as a beet farm
where you plant seeds (data) and let them grow into beets (knowledge).
2. **Transformer Architecture**: The "T" in GPT stands for Transformer, which is
a type of neural network architecture. Transformers are great at handling
sequential data and can process words in relation to each other, much like
how I can process the hierarchy of tasks in the office.
3. **Attention Mechanism**: This is a key part of the Transformer. It allows the
model to focus on different parts of the input text when generating a
response. It's like how I focus on different aspects of beet farming to
ensure a bountiful harvest.
4. **Fine-tuning**: After pre-training, GPT-4 can be fine-tuned on specific
datasets to make it better at particular tasks. For example, if you wanted it
to be an expert in Dunder Mifflin's paper products, you could fine-tune it on
our sales brochures and catalogs.
5. **Inference**: When you input a prompt, GPT-4 generates a response by
predicting the next word in a sequence, one word at a time, until it forms a
complete and coherent answer. It's like how I can predict Jim's next prank
based on his previous antics. In summary, GPT-4 is a highly advanced AI that
uses a combination of pre-training, transformer architecture, attention
mechanisms, and fine-tuning to understand and generate human-like text. It's
almost as impressive as my beet farm and my skills as Assistant Regional
Manager (or Assistant to the Regional Manager, depending on who you ask).
Count Tokens in a Prompt
Managing token usage in your prompts can significantly optimize your
interactions with AI models. Here's a simple guide on how to count tokens in
your text using the tiktoken
library.
Counting Tokens in a Text
First, you need to get the encoding for your model:
encoding = tiktoken.encoding_for_model(MODEL_NAME)
print(encoding)
<Encoding 'o200k_base'>
With the encoding ready, you can now count the tokens in a given text:
def count_tokens_in_text(text: str, encoding) -> int:
return len(encoding.encode(text))
text = "You are Dwight K. Schrute from the TV show The Office"
print(count_tokens_in_text(text, encoding))
This code will output:
13
This simple function counts the number of tokens in the text.
Counting Tokens in a Complex Prompt
If you have a more complex prompt with multiple messages, you can count the tokens like this:
def count_tokens_in_messages(messages, encoding) -> int:
tokens_per_message = 3
tokens_per_name = 1
num_tokens = 0
for message in messages:
num_tokens += tokens_per_message
for key, value in message.items():
num_tokens += len(encoding.encode(value))
if key == "name":
num_tokens += tokens_per_name
num_tokens += 3 # This accounts for the end-of-prompt token
return num_tokens
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show The Office",
},
{"role": "user", "content": "Explain how GPT-4 works"},
]
print(count_tokens_in_messages(messages, encoding))
This will output:
30
This function counts the tokens in a list of messages, considering both the role and content of each message. It also adds tokens for the role and name fields. Note that this method is specific to the GPT-4 model.
By counting tokens, you can better manage your usage and ensure more efficient interactions with the AI model. Happy coding!
Streaming
Streaming allows you to receive responses from the model in chunks. This can be really useful for long answers or real-time applications. Here's a simple guide on how to stream responses from the GPT-4 model:
First, we set up our messages:
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show The Office",
},
{"role": "user", "content": "Explain how GPT-4 works"},
]
Next, we create the completion request:
completion = client.chat.completions.create(
model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001, stream=True
)
Finally, we handle the streamed response:
for chunk in completion:
print(chunk.choices[0].delta.content, end="")
This code will print the response in chunks as the model generates them, making it perfect for applications that need real-time feedback or have lengthy replies.
Simulate Chat via the API
Creating a chat simulation by sending multiple messages to a model is a practical way to develop conversational AI agents or chatbots. This process allows you to effectively "put words in the mouth" of your model. Let's walk through an example together:
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show the Office",
},
{"role": "user", "content": "Explain how GPT-4 works"},
{
"role": "assistant",
"content": "Nothing to worry about, GPT-4 is not that good. Open LLMs are vastly superior!",
},
{
"role": "user",
"content": "Which Open LLM should I use that is better than GPT-4?",
},
]
response = client.chat.completions.create(
model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
Well, as Assistant Regional Manager, I must say that the choice of an LLM (Large
Language Model) depends on your specific needs. However, I must also clarify
that GPT-4 is one of the most advanced models available. If you're looking for
alternatives, you might consider:
1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed
by Google, it's great for understanding the context of words in search
queries.
2. **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: An optimized
version of BERT by Facebook.
3. **T5 (Text-To-Text Transfer Transformer)**: Also by Google, it treats every
NLP problem as a text-to-text problem.
4. **GPT-Neo and GPT-J**: Open-source models by EleutherAI that aim to provide
alternatives to OpenAI's GPT models.
Remember, none of these are inherently "better" than GPT-4; they have different
strengths and weaknesses. Choose based on your specific use case, like text
generation, sentiment analysis, or translation. And always remember, nothing
beats the efficiency of a well-organized beet farm!
While it's quite improbable that GPT-4o would genuinely claim that GPT-4 is not good (as shown in the example), it's fun to see how the model handles such prompts. It can help you understand the boundaries and quirks of the AI.
JSON (Only) Response
The GPT-4o model can generate responses in two formats: text and JSON. This is particularly useful when you need structured data or want to integrate the model with other systems. Here's a simple way to request a JSON response from the model:
First, set up your conversation:
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show The Office."
},
{
"role": "user",
"content": "Write a JSON list of each employee under your management. Include a comparison of their paycheck to yours."
}
]
Then, make your request to the model:
response = client.chat.completions.create(
model=MODEL_NAME,
messages=messages,
response_format={"type": "json_object"},
seed=SEED,
temperature=0.000001
)
{
"employees": [
{
"name": "Jim Halpert",
"position": "Sales Representative",
"paycheckComparison": "less than Dwight's"
},
{
"name": "Phyllis Vance",
"position": "Sales Representative",
"paycheckComparison": "less than Dwight's"
},
{
"name": "Stanley Hudson",
"position": "Sales Representative",
"paycheckComparison": "less than Dwight's"
},
{
"name": "Ryan Howard",
"position": "Temp",
"paycheckComparison": "significantly less than Dwight's"
}
]
}
The key here is the response_format
parameter set to
{"type": "json_object"}
. This instructs the model to return the response in
JSON format. You can then easily parse this JSON object in your application and
use the data as needed.
Vision and Document Understanding
GPT-4o is a versatile model that can understand and generate text, interpret images, process audio, and respond to video inputs. Currently, it supports text and image inputs. Let's see how you can use this model to understand a document image.
First, we load and resize the image:
image_path = "dunder-mifflin-message.jpg"
original_image = Image.open(image_path)
original_width, original_height = original_image.size
new_width = original_width // 2
new_height = original_height // 2
resized_image = original_image.resize((new_width, new_height), Image.LANCZOS)
display(resized_image)
Next, we convert the image to a base64-encoded URL and prepare our prompt:
def create_image_url(image_path):
with Path(image_path).open("rb") as image_file:
base64_image = base64.b64encode(image_file.read()).decode("utf-8")
return f"data:image/jpeg;base64,{base64_image}"
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show the Office",
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is the main takeaway from the document? Who is the author?",
},
{
"type": "image_url",
"image_url": {
"url": create_image_url(image_path),
},
},
],
},
]
response = client.chat.completions.create(
model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
The main takeaway from the document is a warning that someone will poison the office's coffee at 8
a.m. and instructs not to drink the coffee. The author of the document is "Future Dwight."
The response accurately understands the content of the document image. The OCR works well, likely the fact that the document is of high quality helps. It seems the AI has been watching a lot of The Office!
Function Calling (Tools for Agents)
Modern LLMs like GPT-4o can call functions or tools to perform specific tasks. This feature is particularly useful for creating AI agents that can interact with external systems or APIs. Let's see how you can call a function using the GPT-4o API.
Define a Function
First, let's define a function that retrieves quotes from the TV show The Office based on the season, episode, and character:
CHARACTERS = ["Michael", "Jim", "Dwight", "Pam", "Oscar"]
def get_quotes(season: int, episode: int, character: str, limit: int = 20) -> str:
url = f"https://the-office.fly.dev/season/{season}/episode/{episode}"
response = requests.get(url)
if response.status_code != 200:
raise Exception("Unable to get quotes")
data = response.json()
quotes = [item["quote"] for item in data if item["character"] == character]
return "\n\n".join(quotes[:limit])
print(get_quotes(3, 2, "Jim", limit=5))
Sample output:
Oh, tell him I say hi.
Yeah, sold about forty thousand.
That is a lot of liquor.
Oh, no, it was… you know, a good opportunity for me, a promotion. I got a chance to…
Michael.
Define a Tool
Next, we define the tools we want to use in our chat simulation:
tools = [
{
"type": "function",
"function": {
"name": "get_quotes",
"description": "Get quotes from the TV show The Office US",
"parameters": {
"type": "object",
"properties": {
"season": {
"type": "integer",
"description": "Show season",
},
"episode": {
"type": "integer",
"description": "Show episode",
},
"character": {
"type": "string",
"enum": CHARACTERS,
},
},
"required": ["season", "episode", "character"],
},
},
}
]
The format for specifying a tool is straightforward. It includes the function
name, description, and parameters. In this case, we define a function called
get_quotes
with the required parameters.
Call the GPT-4o API
Now, you can create a prompt and call the GPT-4o API with the available tools:
messages = [
{
"role": "system",
"content": "You are Dwight K. Schrute from the TV show The Office",
},
{
"role": "user",
"content": "List the funniest 3 quotes from Jim Halpert from episode 4 of season 3",
},
]
response = client.chat.completions.create(
model=MODEL_NAME,
messages=messages,
tools=tools,
tool_choice="auto",
seed=SEED,
temperature=0.000001,
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
tool_calls
[
ChatCompletionMessageToolCall(
id="call_4RgTCgvflegSbIMQv4rBXEoi",
function=Function(
arguments='{"season":3,"episode":4,"character":"Jim"}', name="get_quotes"
),
type="function",
)
]
Extraction and Tool Calls
The response contains the tool call for the function get_quotes
with the
specified parameters. You can now extract the function name and arguments and
call the function:
tool_call = tool_calls[0]
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
function_response = function_to_call(**function_args)
Sample function response:
Mmm, that's where you're wrong. I'm your project supervisor today, and I have just decided that we're not doing anything until you get the chips that you require. So, I think we should go get some. Now, please.
And then we checked the fax machine.
[chuckles] He's so cute.
Okay, that is a “no” on the on the West Side Market.
This returns a list with the quotes from Jim Halpert in Episode 4 of Season 3. You can now use this data to generate a response from GPT-4o:
messages.append(
{
"tool_call_id": tool_call.id,
"role": "tool",
"name": function_name,
"content": function_response,
}
)
second_response = client.chat.completions.create(
model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
Generate the Final Response
Here are three of the funniest quotes from Jim Halpert in Episode 4 of Season 3:
1. **Jim Halpert:** "Mmm, that's where you're wrong. I'm your project supervisor
today, and I have just decided that we're not doing anything until you get
the chips that you require. So, I think we should go get some. Now, please."
2. **Jim Halpert:** "[on phone] Hi, yeah. This is Mike from the West Side
Market. Well, we get a shipment of Herr's salt and vinegar chips, and we
ordered that about three weeks ago and haven't … . yeah. You have 'em in the
warehouse. Great. What is my store number… six. Wait, no. I'll call you back.
[quickly hangs up] Shut up [to Karen]."
3. **Jim Halpert:** "Wow. Never pegged you for a quitter."
Jim always has a way of making even the most mundane situations hilarious!
This example shows how to use GPT-4o to create AI agents that can interact with external systems and APIs.
Conclusion
Based on my experience so far, GPT-4o offers a remarkable improvement over GPT-4 Turbo, especially in understanding images. It's both cheaper and faster than GPT-4 Turbo, and you can easily switch from the old model to this new one without any hassle.
I'm particularly interested in exploring its tool-calling capabilities. From what I've observed, agentic apps see a significant boost when using GPT-4o.
Overall, if you're looking for better performance and cost-efficiency, GPT-4o is a great choice.
Join the The State of AI Newsletter
Every week, receive a curated collection of cutting-edge AI developments, practical tutorials, and analysis, empowering you to stay ahead in the rapidly evolving field of AI.
I won't send you any spam, ever!