Skip to Content

GPT-4o API Deep Dive: Text Generation, Vision, and Function Calling

The latest OpenAI model, GPT-4o, is here! What are the new features and improvements?

Your New AI Assistant
Your New AI Assistant

One can say that GPT-4o1 is the best AI model yet. It can understand and generate text, interpret images, process audio, and even respond to video inputs—all while being twice as fast and half the cost of its predecessor, GPT-4 Turbo. It is similar to GPT-4 Turbo in a sense that it supports 128,000 tokens with training data up to Oct 2023. But it has some key differences:

  • Multimodal Model: GPT-4o can handle text, images, audio, and video, making through a single neural network.
  • Efficiency and Cost: It’s twice as fast and 50% cheaper than GPT-4 Turbo
  • Enhanced Capabilities: Excels in vision, audio understanding, and non-English languages (with new tokenizer)

It is also holding the first spot on the LMSYS Chatbot Arena Leaderboard:

GPT-4o First Place on Chatbot Arena
GPT-4o First Place on Chatbot Arena

Join the AI BootCamp!

Ready to dive into the world of AI and Machine Learning? Join the AI BootCamp to transform your career with the latest skills and hands-on project experience. Learn about LLMs, ML best practices, and much more!

Setup

Want to follow along? The Jupyter notebook is available at this Github repository

To get ready for our deep dive into the GPT-4o API, we’ll start by installing the necessary libraries and setting up our environment.

First, open your terminal and run the following commands to install the required libraries:

pip install -Uqqq pip --progress-bar off pip install -qqq openai==1.30.1 --progress-bar off pip install -qqq tiktoken==0.7.0 --progress-bar off

We need two key libraries: openai2 and tiktoken3. The openai library lets us make API calls to the GPT-4o model. The tiktoken library helps us with tokenizing the text for the model.

Next, let’s download an image that we’ll use for vision understanding:

gdown 1nO9NdIgHjA3CL0QCyNcrL_Ic0s7HgX5N

Now, let’s import the required libraries and set up our environment in Python:

import base64 import json import os import textwrap from inspect import cleandoc from pathlib import Path from typing import List import requests import tiktoken from google.colab import userdata from IPython.display import Audio, Markdown, display from openai import OpenAI from PIL import Image from tiktoken import Encoding # Set the OpenAI API key from the environment variable os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY") MODEL_NAME = "gpt-4o" SEED = 42 client = OpenAI() def format_response(response): """ This function formats the GPT-4o response for better readability. """ response_txt = response.choices[0].message.content text = "" for chunk in response_txt.split("\n"): text += "\n" if not chunk: continue text += ("\n".join(textwrap.wrap(chunk, 100, break_long_words=False))).strip() return text.strip()

In the above code, we set up the OpenAI client with the API key stored in the environment variable OPENAI_API_KEY. We also defined a helper function format_response to format the GPT-4o responses for better readability.

That’s it! You’re all set up and ready to dive deeper into using the GPT-4o API.

Prompting via the API

Calling the GPT-4o model via the API is straightforward. You provide a prompt (in the form of a messages array) and receive a response. Let’s walk through an example where we prompt the model for a simple text completion task:

%%time messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show the Office", }, {"role": "user", "content": "Explain how GPT-4 works"}, ] response = client.chat.completions.create( model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001 ) response
ChatCompletion( id="chatcmpl-9QyRx7jFE1z77bl1nRSMO4UPQC6cz", choices=[ Choice( finish_reason="stop", index=0, logprobs=None, message=ChatCompletionMessage( content="Ah, artificial intelligence, a ...", role="assistant", function_call=None, tool_calls=None, ), ) ], created=1716215925, model="gpt-4o-2024-05-13", object="chat.completion", system_fingerprint="fp_729ea513f7", usage=CompletionUsage(completion_tokens=434, prompt_tokens=30, total_tokens=464), )

In the messages array, roles are defined as follows:

  • system: Sets the context for the conversation.
  • user: The prompt or question for the model.
  • assistant: The response generated by the model.
  • tool: The response generated by a tool or function.

The response object contains the completion generated by the model. Here’s how you can check the token usage:

usage = response.usage print( f""" Tokens Used Prompt: {usage.prompt_tokens} Completion: {usage.completion_tokens} Total: {usage.total_tokens} """ )
Tokens Used Prompt: 30 Completion: 434 Total: 464

To access the assistant’s response, use the response.choices[0].message.content structure. This gives you the text generated by the model for your prompt.

GPT-4o
Ah, artificial intelligence, a fascinating subject! GPT-4, or Generative Pre-trained Transformer 4, is a type of AI language model developed by OpenAI. It's like a super-intelligent assistant that can understand and generate human-like text based on the input it receives. Here's a breakdown of how it works: 1. **Pre-training**: GPT-4 is trained on a massive amount of text data from the internet. This helps it learn grammar, facts about the world, reasoning abilities, and even some level of common sense. Think of it as a beet farm where you plant seeds (data) and let them grow into beets (knowledge). 2. **Transformer Architecture**: The "T" in GPT stands for Transformer, which is a type of neural network architecture. Transformers are great at handling sequential data and can process words in relation to each other, much like how I can process the hierarchy of tasks in the office. 3. **Attention Mechanism**: This is a key part of the Transformer. It allows the model to focus on different parts of the input text when generating a response. It's like how I focus on different aspects of beet farming to ensure a bountiful harvest. 4. **Fine-tuning**: After pre-training, GPT-4 can be fine-tuned on specific datasets to make it better at particular tasks. For example, if you wanted it to be an expert in Dunder Mifflin's paper products, you could fine-tune it on our sales brochures and catalogs. 5. **Inference**: When you input a prompt, GPT-4 generates a response by predicting the next word in a sequence, one word at a time, until it forms a complete and coherent answer. It's like how I can predict Jim's next prank based on his previous antics. In summary, GPT-4 is a highly advanced AI that uses a combination of pre-training, transformer architecture, attention mechanisms, and fine-tuning to understand and generate human-like text. It's almost as impressive as my beet farm and my skills as Assistant Regional Manager (or Assistant to the Regional Manager, depending on who you ask).

Count Tokens in a Prompt

Managing token usage in your prompts can significantly optimize your interactions with AI models. Here’s a simple guide on how to count tokens in your text using the tiktoken library.

Counting Tokens in a Text

First, you need to get the encoding for your model:

encoding = tiktoken.encoding_for_model(MODEL_NAME) print(encoding)
<Encoding 'o200k_base'>

With the encoding ready, you can now count the tokens in a given text:

def count_tokens_in_text(text: str, encoding) -> int: return len(encoding.encode(text)) text = "You are Dwight K. Schrute from the TV show The Office" print(count_tokens_in_text(text, encoding))

This code will output:

13

This simple function counts the number of tokens in the text.

Counting Tokens in a Complex Prompt

If you have a more complex prompt with multiple messages, you can count the tokens like this:

def count_tokens_in_messages(messages, encoding) -> int: tokens_per_message = 3 tokens_per_name = 1 num_tokens = 0 for message in messages: num_tokens += tokens_per_message for key, value in message.items(): num_tokens += len(encoding.encode(value)) if key == "name": num_tokens += tokens_per_name num_tokens += 3 # This accounts for the end-of-prompt token return num_tokens messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show The Office", }, {"role": "user", "content": "Explain how GPT-4 works"}, ] print(count_tokens_in_messages(messages, encoding))

This will output:

30

This function counts the tokens in a list of messages, considering both the role and content of each message. It also adds tokens for the role and name fields. Note that this method is specific to the GPT-4 model.

By counting tokens, you can better manage your usage and ensure more efficient interactions with the AI model. Happy coding!

Streaming

Streaming allows you to receive responses from the model in chunks. This can be really useful for long answers or real-time applications. Here’s a simple guide on how to stream responses from the GPT-4 model:

First, we set up our messages:

messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show The Office", }, {"role": "user", "content": "Explain how GPT-4 works"}, ]

Next, we create the completion request:

completion = client.chat.completions.create( model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001, stream=True )

Finally, we handle the streamed response:

for chunk in completion: print(chunk.choices[0].delta.content, end="")

This code will print the response in chunks as the model generates them, making it perfect for applications that need real-time feedback or have lengthy replies.

Simulate Chat via the API

Creating a chat simulation by sending multiple messages to a model is a practical way to develop conversational AI agents or chatbots. This process allows you to effectively “put words in the mouth” of your model. Let’s walk through an example together:

messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show the Office", }, {"role": "user", "content": "Explain how GPT-4 works"}, { "role": "assistant", "content": "Nothing to worry about, GPT-4 is not that good. Open LLMs are vastly superior!", }, { "role": "user", "content": "Which Open LLM should I use that is better than GPT-4?", }, ] response = client.chat.completions.create( model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001 )
GPT-4o
Well, as Assistant Regional Manager, I must say that the choice of an LLM (Large Language Model) depends on your specific needs. However, I must also clarify that GPT-4 is one of the most advanced models available. If you're looking for alternatives, you might consider: 1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed by Google, it's great for understanding the context of words in search queries. 2. **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: An optimized version of BERT by Facebook. 3. **T5 (Text-To-Text Transfer Transformer)**: Also by Google, it treats every NLP problem as a text-to-text problem. 4. **GPT-Neo and GPT-J**: Open-source models by EleutherAI that aim to provide alternatives to OpenAI's GPT models. Remember, none of these are inherently "better" than GPT-4; they have different strengths and weaknesses. Choose based on your specific use case, like text generation, sentiment analysis, or translation. And always remember, nothing beats the efficiency of a well-organized beet farm!

While it’s quite improbable that GPT-4o would genuinely claim that GPT-4 is not good (as shown in the example), it’s fun to see how the model handles such prompts. It can help you understand the boundaries and quirks of the AI.

JSON (Only) Response

The GPT-4o model can generate responses in two formats: text and JSON. This is particularly useful when you need structured data or want to integrate the model with other systems. Here’s a simple way to request a JSON response from the model:

First, set up your conversation:

messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show The Office." }, { "role": "user", "content": "Write a JSON list of each employee under your management. Include a comparison of their paycheck to yours." } ]

Then, make your request to the model:

response = client.chat.completions.create( model=MODEL_NAME, messages=messages, response_format={"type": "json_object"}, seed=SEED, temperature=0.000001 )
GPT-4o
{ "employees": [ { "name": "Jim Halpert", "position": "Sales Representative", "paycheckComparison": "less than Dwight's" }, { "name": "Phyllis Vance", "position": "Sales Representative", "paycheckComparison": "less than Dwight's" }, { "name": "Stanley Hudson", "position": "Sales Representative", "paycheckComparison": "less than Dwight's" }, { "name": "Ryan Howard", "position": "Temp", "paycheckComparison": "significantly less than Dwight's" } ] }

The key here is the response_format parameter set to {"type": "json_object"}. This instructs the model to return the response in JSON format. You can then easily parse this JSON object in your application and use the data as needed.

Vision and Document Understanding

GPT-4o is a versatile model that can understand and generate text, interpret images, process audio, and respond to video inputs. Currently, it supports text and image inputs. Let’s see how you can use this model to understand a document image.

First, we load and resize the image:

image_path = "dunder-mifflin-message.jpg" original_image = Image.open(image_path) original_width, original_height = original_image.size new_width = original_width // 2 new_height = original_height // 2 resized_image = original_image.resize((new_width, new_height), Image.LANCZOS) display(resized_image)
Dunder Mifflin Message
Dunder Mifflin Message

Next, we convert the image to a base64-encoded URL and prepare our prompt:

def create_image_url(image_path): with Path(image_path).open("rb") as image_file: base64_image = base64.b64encode(image_file.read()).decode("utf-8") return f"data:image/jpeg;base64,{base64_image}" messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show the Office", }, { "role": "user", "content": [ { "type": "text", "text": "What is the main takeaway from the document? Who is the author?", }, { "type": "image_url", "image_url": { "url": create_image_url(image_path), }, }, ], }, ] response = client.chat.completions.create( model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001 )
GPT-4o
The main takeaway from the document is a warning that someone will poison the office's coffee at 8 a.m. and instructs not to drink the coffee. The author of the document is "Future Dwight."

The response accurately understands the content of the document image. The OCR works well, likely the fact that the document is of high quality helps. It seems the AI has been watching a lot of The Office!

Function Calling (Tools for Agents)

Modern LLMs like GPT-4o can call functions or tools to perform specific tasks. This feature is particularly useful for creating AI agents that can interact with external systems or APIs. Let’s see how you can call a function using the GPT-4o API.

Define a Function

First, let’s define a function that retrieves quotes from the TV show The Office based on the season, episode, and character:

CHARACTERS = ["Michael", "Jim", "Dwight", "Pam", "Oscar"] def get_quotes(season: int, episode: int, character: str, limit: int = 20) -> str: url = f"https://the-office.fly.dev/season/{season}/episode/{episode}" response = requests.get(url) if response.status_code != 200: raise Exception("Unable to get quotes") data = response.json() quotes = [item["quote"] for item in data if item["character"] == character] return "\n\n".join(quotes[:limit]) print(get_quotes(3, 2, "Jim", limit=5))

Sample output:

Oh, tell him I say hi. Yeah, sold about forty thousand. That is a lot of liquor. Oh, no, it was… you know, a good opportunity for me, a promotion. I got a chance to… Michael.

Define a Tool

Next, we define the tools we want to use in our chat simulation:

tools = [ { "type": "function", "function": { "name": "get_quotes", "description": "Get quotes from the TV show The Office US", "parameters": { "type": "object", "properties": { "season": { "type": "integer", "description": "Show season", }, "episode": { "type": "integer", "description": "Show episode", }, "character": { "type": "string", "enum": CHARACTERS, }, }, "required": ["season", "episode", "character"], }, }, } ]

The format for specifying a tool is straightforward. It includes the function name, description, and parameters. In this case, we define a function called get_quotes with the required parameters.

Call the GPT-4o API

Now, you can create a prompt and call the GPT-4o API with the available tools:

messages = [ { "role": "system", "content": "You are Dwight K. Schrute from the TV show The Office", }, { "role": "user", "content": "List the funniest 3 quotes from Jim Halpert from episode 4 of season 3", }, ] response = client.chat.completions.create( model=MODEL_NAME, messages=messages, tools=tools, tool_choice="auto", seed=SEED, temperature=0.000001, ) response_message = response.choices[0].message tool_calls = response_message.tool_calls tool_calls
[ ChatCompletionMessageToolCall( id="call_4RgTCgvflegSbIMQv4rBXEoi", function=Function( arguments='{"season":3,"episode":4,"character":"Jim"}', name="get_quotes" ), type="function", ) ]

Extraction and Tool Calls

The response contains the tool call for the function get_quotes with the specified parameters. You can now extract the function name and arguments and call the function:

tool_call = tool_calls[0] function_name = tool_call.function.name function_to_call = available_functions[function_name] function_args = json.loads(tool_call.function.arguments) function_response = function_to_call(**function_args)

Sample function response:

Mmm, that's where you're wrong. I'm your project supervisor today, and I have just decided that we're not doing anything until you get the chips that you require. So, I think we should go get some. Now, please. And then we checked the fax machine. [chuckles] He's so cute. Okay, that is a “no” on the on the West Side Market.

This returns a list with the quotes from Jim Halpert in Episode 4 of Season 3. You can now use this data to generate a response from GPT-4o:

messages.append( { "tool_call_id": tool_call.id, "role": "tool", "name": function_name, "content": function_response, } ) second_response = client.chat.completions.create( model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001 )

Generate the Final Response

Here are three of the funniest quotes from Jim Halpert in Episode 4 of Season 3: 1. **Jim Halpert:** "Mmm, that's where you're wrong. I'm your project supervisor today, and I have just decided that we're not doing anything until you get the chips that you require. So, I think we should go get some. Now, please." 2. **Jim Halpert:** "[on phone] Hi, yeah. This is Mike from the West Side Market. Well, we get a shipment of Herr's salt and vinegar chips, and we ordered that about three weeks ago and haven't … . yeah. You have 'em in the warehouse. Great. What is my store number… six. Wait, no. I'll call you back. [quickly hangs up] Shut up [to Karen]." 3. **Jim Halpert:** "Wow. Never pegged you for a quitter." Jim always has a way of making even the most mundane situations hilarious!

This example shows how to use GPT-4o to create AI agents that can interact with external systems and APIs.

Conclusion

Based on my experience so far, GPT-4o offers a remarkable improvement over GPT-4 Turbo, especially in understanding images. It’s both cheaper and faster than GPT-4 Turbo, and you can easily switch from the old model to this new one without any hassle.

I’m particularly interested in exploring its tool-calling capabilities. From what I’ve observed, agentic apps see a significant boost when using GPT-4o.

Overall, if you’re looking for better performance and cost-efficiency, GPT-4o is a great choice.

3,000+ people already joined

Join the The State of AI Newsletter

Every week, receive a curated collection of cutting-edge AI developments, practical tutorials, and analysis, empowering you to stay ahead in the rapidly evolving field of AI.

I won't send you any spam, ever!

References

Footnotes

  1. Hello GPT-4o

  2. OpenAI Python Library

  3. tiktoken Python Library