Vision (Understanding Images)
Introduction
GPT-4 with Vision, sometimes referred to as GPT-4V or gpt-4-vision-preview in the API, allows the model to take in images and answer questions about them.
GPT-4 with vision is currently available to all developers who have access to GPT-4, via the gpt-4-vision-preview model and the Chat Completions API, which has been updated to support image inputs. Note that the Assistants API does not currently support image inputs.
Note: GPT-4 Turbo with vision does not currently support the message.name parameter, functions/tools, or the response_format parameter. We also currently set a low max_tokens default, which you can override.
Images are made available to the model in two main ways: by passing a link to the image or by passing the base64 encoded image directly in the request. Images can be passed in the user, system and assistant messages. Currently we don't support images in the first system message but this may change in the future.
# Sending image as a URL
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
The image depicts a serene natural scene with a wooden boardwalk extending through a lush green field. The boardwalk creates a straight path and invites a walk through the tall grasses surrounding it. The field is abundant with greenery, possibly indicating spring or summer season. The sky is clear with a few scattered, wispy clouds, with the sunlight casting a warm glow on the landscape, enhancing the vivid colors of the flora. This could be a natural reserve, park, or a wetland area where boardwalks are commonly built to facilitate access without disturbing the natural environment.
# Sending image as a base64 encoded string
from openai import OpenAI
import base64

client = OpenAI()

image_path = "data/cappadocia.jpeg"

# Read the image file and encode its bytes as a base64 string
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # The base64 payload is passed as a data URL
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
This image shows a beautiful snowy landscape with unique geological formations. These formations, characterized by their rugged, rocky outcrops and peaks, suggest that this might be a region with a history of volcanic activity, often resulting in such stark and impressive natural features. The snow adds a contrasting layer to the otherwise dry and eroded rocks, highlighting the natural beauty of the place. There are no visible human figures in this photograph, keeping the focus on the natural environment. The clear blue sky suggests it is a sunny day, and the distant mountain in the background adds depth to the scene, underscoring the wild and expansive topography of the area. This kind of terrain is often found in regions known for their historic and geological significance.
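The base64 example above hard-codes the image/jpeg MIME type in the data URL. As a minimal sketch, the encoding step could be wrapped in a helper that guesses the MIME type from the file extension. The name image_to_data_url is hypothetical and is not part of the OpenAI library; the sketch only uses the Python standard library.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Return a base64 data URL for a local image file.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    # Guess the MIME type from the file extension (e.g. ".png" -> "image/png")
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")

    # Read the raw bytes and encode them as base64 text
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the "url" value of an image_url block, in place of the f-string in the example above.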
# Multiple image inputs
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
The images you've provided seem to be identical. Both showcase a wooden boardwalk extending through a lush, green wetland or grassland. The sky is blue with some wispy clouds, and there is a variety of green vegetation on either side of the path. There doesn't appear to be any discernible difference between the two images; they seem to be two copies of the same photo.
Low or high fidelity
By controlling the detail parameter, which has three options (low, high, or auto), you have control over how the model processes the image and generates its textual understanding. By default, the model will use the auto setting, which looks at the image input size and decides whether to use the low or high setting.
- low will disable the “high res” mode. The model will receive a low-res 512px x 512px version of the image, and represent the image with a budget of 85 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
- high will enable “high res” mode, which first allows the model to see the low res image and then creates detailed crops of input images as 512px squares based on the input image size. Each of the detailed crops uses twice the token budget (85 * 2 = 170 tokens).
Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.
Here are some examples demonstrating the above.
A 1024 x 1024 square image in detail: high mode costs 765 tokens
- 1024 is less than 2048, so there is no initial resize.
- The shortest side is 1024, so we scale the image down to 768 x 768.
- 4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
A 2048 x 4096 image in detail: high mode costs 1105 tokens
- We scale down the image to 1024 x 2048 to fit within the 2048 square.
- The shortest side is 1024, so we further scale down to 768 x 1536.
- 6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.
A 4096 x 8192 image in detail: low mode costs 85 tokens
- Regardless of input size, low detail images are a fixed cost.
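The arithmetic in the examples above can be sketched as a small function. This is an illustrative estimate only, not an official calculator: the name image_token_cost is hypothetical, dimensions are rounded to whole pixels after each resize, and the sketch assumes images are only ever scaled down, never up.

```python
import math


def image_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image from the rules described above.

    Hypothetical helper; assumes no upscaling and whole-pixel rounding.
    """
    if detail == "low":
        return 85  # low detail is a fixed cost regardless of size
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high' for this estimate")

    # Step 1: scale down to fit within a 2048 x 2048 square
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is 768px
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count 512px tiles, 170 tokens each, plus a flat 85 tokens
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85


print(image_token_cost(1024, 1024, "high"))  # 765
print(image_token_cost(2048, 4096, "high"))  # 1105
print(image_token_cost(4096, 8192, "low"))   # 85
```

The three printed values match the worked examples above. The prompt token count reported by the API also includes the text part of the message, so it will be somewhat higher than the image cost alone.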
In the following code example, the input image is 2560 x 1669. In high fidelity:
- Since it is larger than 2048, it will first be resized to 2048 x 1335.
- Then, the shortest side will be scaled down to 768, resulting in 1178 x 768.
- 6 512px tiles are needed, so the final token cost for the input image is 170 * 6 + 85 = 1105.
In low fidelity: regardless of input size, low detail images are a fixed cost of 85 tokens.
from openai import OpenAI

client = OpenAI()

# High fidelity
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "high"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print("*** High Fidelity ***")
print(response.choices[0].message.content)
print(f"Prompt Tokens: {response.usage.prompt_tokens}, Completion Tokens: {response.usage.completion_tokens}, Total Tokens: {response.usage.total_tokens}")

# Low fidelity
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "low"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print("*** Low Fidelity ***")
print(response.choices[0].message.content)
print(f"Prompt Tokens: {response.usage.prompt_tokens}, Completion Tokens: {response.usage.completion_tokens}, Total Tokens: {response.usage.total_tokens}")
*** High Fidelity ***
The image shows a wooden boardwalk traversing through a lush green field with tall grass on either side. The sky is a beautiful blue with some scattered white clouds. It appears to be a sunny day, and the scene is tranquil, possibly a nature reserve or park where boardwalks are installed to allow people to enjoy the landscape without disturbing the natural environment. The image has a sense of depth, leading the observer's eye along the boardwalk towards the horizon.
Prompt Tokens: 1118, Completion Tokens: 93, Total Tokens: 1211
*** Low Fidelity ***
The image depicts a serene natural landscape. It features a wooden boardwalk or path that meanders through a lush green meadow filled with tall grass or reeds. The path invites one to walk through and enjoy the surrounding nature. The sky overhead is a bright blue with scattered white clouds, suggesting a pleasant day with good weather. The scene conveys a sense of tranquility and the beauty of a natural, untouched environment.
Prompt Tokens: 98, Completion Tokens: 85, Total Tokens: 183
Managing the images
The Chat Completions API, unlike the Assistants API, is not stateful. That means you have to manage the messages (including images) you pass to the model yourself. If you want to pass the same image to the model multiple times, you will have to pass the image each time you make a request to the API.
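Because the API is stateless, one way to handle a multi-turn conversation is to keep the message list on the client and resend it in full with every request. The sketch below only builds that list and makes no API call. The helper name user_turn and the placeholder assistant reply are hypothetical and are there for illustration only.

```python
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"


def user_turn(text, image_url=None, detail="auto"):
    """Build one user message, optionally with an image (hypothetical helper)."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.append(
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}}
        )
    return {"role": "user", "content": content}


# The client owns the conversation state
history = []

# Turn 1: the question together with the image
history.append(user_turn("What's in this image?", image_url=IMAGE_URL))

# After each request, store the model's reply so later turns have context.
# Placeholder text stands in for response.choices[0].message.content.
history.append({"role": "assistant", "content": "(model reply goes here)"})

# Turn 2: a follow-up question. The image is not attached again here, but it
# is still sent again, because the whole history is passed as `messages`.
history.append(user_turn("Is there any water visible?"))

print(len(history))  # 3
```

On each request the full history list would be passed as the messages argument of client.chat.completions.create, so the first message's image is transmitted again every time.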
For long-running conversations, we suggest passing images via URLs instead of base64. The latency of the model can also be improved by downsizing your images ahead of time so that they are no larger than the maximum size they are expected to be. For low res mode, we expect a 512px x 512px image. For high res mode, the short side of the image should be less than 768px and the long side should be less than 2,000px.
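As a rough sketch of the downsizing guidance, the function below computes target dimensions that respect those limits while preserving aspect ratio. The name presize_dimensions is hypothetical, the limits are treated as inclusive upper bounds, and the function only computes sizes; the actual resizing is left to whatever image library you use.

```python
def presize_dimensions(width: int, height: int, detail: str = "high"):
    """Return (width, height) to downsize to before upload.

    Hypothetical helper; never upscales and keeps the aspect ratio.
    """
    if detail == "low":
        max_short, max_long = 512, 512    # low res mode expects 512 x 512
    elif detail == "high":
        max_short, max_long = 768, 2000   # short side 768px, long side 2,000px
    else:
        raise ValueError("detail must be 'low' or 'high'")

    short_side, long_side = min(width, height), max(width, height)

    # Pick the strongest shrink factor needed; 1.0 means no change
    scale = min(1.0, max_short / short_side, max_long / long_side)

    return max(1, round(width * scale)), max(1, round(height * scale))


print(presize_dimensions(2560, 1669, "high"))  # (1178, 768)
print(presize_dimensions(2560, 1669, "low"))   # (512, 334)
print(presize_dimensions(400, 300, "high"))    # (400, 300)
```

The resulting tuple could then be handed to an image library's resize routine before the file is encoded or hosted.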
After an image has been processed by the model, it is deleted from OpenAI servers and not retained.
OpenAI restricts image uploads to 20MB per image.
Supported image formats are currently PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).
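A local pre-flight check against these limits might look like the sketch below. The name check_image_file is hypothetical, "20MB" is interpreted here as 20 * 1024 * 1024 bytes (an assumption), and the check looks only at the file extension and size, so it does not detect animated GIFs or mislabeled files.

```python
from pathlib import Path

# Assumption: "20MB" is read as 20 MiB; adjust if the limit is decimal
MAX_IMAGE_BYTES = 20 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}


def check_image_file(path, max_bytes=MAX_IMAGE_BYTES):
    """Return a list of problems found; an empty list means the file looks OK.

    Hypothetical helper; checks extension and size only.
    """
    file_path = Path(path)
    problems = []

    # Extension check against the supported formats listed above
    if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file_path.suffix!r}")

    # Size check against the per-image upload limit
    size = file_path.stat().st_size
    if size > max_bytes:
        problems.append(f"file is {size} bytes, limit is {max_bytes}")

    return problems
```

Calling check_image_file("data/cappadocia.jpeg") before encoding would return an empty list when the file passes both checks.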