
Images and vision


Learn how to understand or generate images.

Overview

Create images
Use GPT Image or DALL·E to generate or edit images.

Process image inputs


Use our models' vision capabilities to analyze images.

In this guide, you will learn about building applications involving images with the OpenAI API. If
you know what you want to build, find your use case below to get started. If you're not sure
where to start, continue reading to get an overview.

A tour of image-related use cases

Recent language models can process image inputs and analyze them — a capability known as
vision. With gpt-image-1 , they can both analyze visual inputs and create images.

The OpenAI API offers several endpoints to process images as input or generate them as
output, enabling you to build powerful multimodal applications.

API                      SUPPORTED USE CASES

Responses API            Analyze images and use them as input and/or generate images as output
Images API               Generate images as output, optionally using images as input
Chat Completions API     Analyze images and use them as input to generate text or audio

To learn more about the input and output modalities supported by our models, refer to our
models page.

Generate or edit images


You can generate or edit images using the Image API or the Responses API.

Our latest image generation model, gpt-image-1 , is a natively multimodal large language
model. It can understand text and images and leverage its broad world knowledge to generate
images with better instruction following and contextual awareness.

In contrast, we also offer specialized image generation models - DALL·E 2 and 3 - which don't
have the same inherent understanding of the world as GPT Image.

You can learn more about image generation in our Image generation guide.

Using world knowledge for image generation

The difference between DALL·E models and GPT Image is that a natively multimodal
language model can use its visual understanding of the world to generate lifelike images,
including real-life details, without a reference image.

For example, if you prompt GPT Image to generate an image of a glass cabinet with the most
popular semi-precious stones, the model knows enough to select gemstones like amethyst,
rose quartz, jade, etc., and depict them in a realistic way.
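As an illustration, a request for that prompt might look like the sketch below. It uses the
Images API's image generation endpoint with gpt-image-1; the exact prompt wording and the
output filename are placeholders, not part of this guide.

import base64
from openai import OpenAI

client = OpenAI()

# Ask GPT Image to pick the gemstones itself, relying on its world knowledge
result = client.images.generate(
    model="gpt-image-1",
    prompt="A glass cabinet displaying the most popular semi-precious stones",
)

# gpt-image-1 returns Base64-encoded image data; decode it and save to disk
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("cabinet.png", "wb") as f:
    f.write(image_bytes)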

Analyze images
Vision is the ability for a model to "see" and understand images. If there is text in an image, the
model can also understand the text. It can understand most visual elements, including
objects, shapes, colors, and textures, even if there are some limitations.

Giving a model images as input

You can provide images as input to generation requests either by providing a fully qualified
URL to an image file, or providing an image as a Base64-encoded data URL.

You can provide multiple images as input in a single request by including multiple images in
the content array, but keep in mind that images count as tokens and will be billed
accordingly.

Analyze the content of an image by passing a Base64-encoded image (Python):

import base64
from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "what's in this image?" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
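The "Passing a URL" variant of the same request supplies a fully qualified image URL instead of
a Base64 data URL. The sketch below shows that variant and also includes a second image entry
to illustrate passing multiple images in one content array; the URLs are placeholders.

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "what's in these images?" },
                # A fully qualified URL to a hosted image (placeholder)
                {
                    "type": "image_url",
                    "image_url": { "url": "https://example.com/first-image.jpg" },
                },
                # More images can be added to the same content array;
                # each one is billed in tokens
                {
                    "type": "image_url",
                    "image_url": { "url": "https://example.com/second-image.jpg" },
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)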

Image input requirements

Input images must meet the following requirements to be used in the API.

Supported file types    PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), non-animated GIF (.gif)

Size limits             Up to 50 MB total payload size per request; up to 500 individual image inputs per request

Other requirements      No watermarks or logos; no NSFW content; clear enough for a human to understand
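If you want to catch violations before sending a request, a client-side pre-check along these
lines can help. This is a sketch of one possible approach, not part of the API; the helper name
and limits simply mirror the table above.

import os

ALLOWED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
MAX_PAYLOAD_BYTES = 50 * 1024 * 1024  # 50 MB total payload per request
MAX_IMAGES_PER_REQUEST = 500

# Hypothetical helper: fail fast locally instead of letting the API reject the request
def validate_image_inputs(image_paths):
    if len(image_paths) > MAX_IMAGES_PER_REQUEST:
        raise ValueError("Too many image inputs for a single request")
    if sum(os.path.getsize(p) for p in image_paths) > MAX_PAYLOAD_BYTES:
        raise ValueError("Combined image payload exceeds 50 MB")
    for p in image_paths:
        if os.path.splitext(p)[1].lower() not in ALLOWED_EXTENSIONS:
            raise ValueError(f"Unsupported file type: {p}")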

Specify image input detail level

The detail parameter tells the model what level of detail to use when processing and
understanding the image ( low , high , or auto to let the model decide). If you skip the
parameter, the model will use auto .

1 "image_url": {
2 "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wis
3 "detail": "high"
4 },

You can save tokens and speed up responses by using "detail": "low" . This lets the model
process the image with a budget of 85 tokens. The model receives a low-resolution 512px x
512px version of the image. This is fine if your use case doesn't require the model to see with
high-resolution detail (for example, if you're asking about the dominant shape or color in the
image).
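For example, a low-detail request can be written like the sketch below; the image URL is a
placeholder.

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "What is the dominant color in this image?" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",  # placeholder
                        "detail": "low",  # 85-token budget, 512px x 512px version
                    },
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)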

On the other hand, you can use "detail": "high" if you want the model to have a better
understanding of the image.

Read more about calculating image processing costs in the Calculating costs section below.

Limitations
While models with vision capabilities are powerful and can be used in many situations, it's
important to understand the limitations of these models. Here are some known limitations:

Medical images: The model is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

Small text: Enlarge text within the image to improve readability, but avoid cropping important details.

Rotation: The model may misinterpret rotated or upside-down text and images.

Visual elements: The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.

Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

Accuracy: The model may generate incorrect descriptions or captions in certain scenarios.

Image shape: The model struggles with panoramic and fisheye images.

Metadata and resizing: The model doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.

Counting: The model may give approximate counts for objects in images.

CAPTCHAs: For safety reasons, our system blocks the submission of CAPTCHAs.

Calculating costs
Image inputs are metered and charged in tokens, just as text inputs are. How images are
converted to text token inputs varies based on the model. You can find a vision pricing
calculator in the FAQ section of the pricing page.

GPT-4.1-mini, GPT-4.1-nano, o4-mini

Image inputs are metered and charged in tokens based on their dimensions. The token cost of
an image is determined as follows:

A. Calculate the number of 32px x 32px patches that are needed to fully cover the image (a
patch may extend beyond the image boundaries; out-of-bounds pixels are treated as black).

raw_patches = ceil(width/32)×ceil(height/32)

B. If the number of patches exceeds 1536, we scale down the image so that it can be covered
by no more than 1536 patches

r = √(32²×1536/(width×height))
r = r × min( floor(width×r/32) / (width×r/32), floor(height×r/32) / (height×r/32) )

C. The token cost is the number of patches, capped at a maximum of 1536 tokens

image_tokens = ceil(resized_width/32)×ceil(resized_height/32)

D. For gpt-4.1-mini, we multiply image tokens by 1.62; for gpt-4.1-nano, by 2.46; and for
o4-mini, by 1.72. The resulting total tokens are then billed at normal text token rates.


Cost calculation examples

A 1024 x 1024 image is 1024 tokens

Width is 1024, resulting in (1024 + 32 - 1) // 32 = 32 patches

Height is 1024, resulting in (1024 + 32 - 1) // 32 = 32 patches

Tokens calculated as 32 * 32 = 1024 , below the cap of 1536

An 1800 x 2400 image is 1452 tokens


Width is 1800, resulting in (1800 + 32 - 1) // 32 = 57 patches

Height is 2400, resulting in (2400 + 32 - 1) // 32 = 75 patches

We need 57 * 75 = 4275 patches to cover the full image. Since that exceeds 1536,
we need to scale down the image while preserving the aspect ratio.

We can calculate the shrink factor as sqrt(token_budget × patch_size^2 / (width * height)).
In our example, the shrink factor is sqrt(1536 * 32^2 / (1800 * 2400)) = 0.603.

Width is now 1086, resulting in 1086 / 32 = 33.94 patches

Height is now 1448, resulting in 1448 / 32 = 45.25 patches

We want to make sure the image fits in a whole number of patches. In this case we
scale again by 33 / 33.94 = 0.97 to fit the width in 33 patches.

The final width is then 1086 * (33 / 33.94) = 1056 and the final height is
1448 * (33 / 33.94) = 1408.

The image now requires 1056 / 32 = 33 patches to cover the width and
1408 / 32 = 44 patches to cover the height

The total number of tokens is 33 * 44 = 1452, below the cap of 1536
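Steps A through D can also be written out as a short helper. The sketch below is an unofficial
reimplementation of that arithmetic; the function names are ours, and the resized dimensions
are truncated to whole pixels as in the worked example above.

import math

# Step D multipliers (image tokens -> billed text tokens)
MULTIPLIERS = {"gpt-4.1-mini": 1.62, "gpt-4.1-nano": 2.46, "o4-mini": 1.72}

def patch_image_tokens(width, height, cap=1536, patch=32):
    # A. Patches needed to cover the raw image
    raw = math.ceil(width / patch) * math.ceil(height / patch)
    if raw <= cap:
        return raw
    # B. Shrink so the image fits in at most `cap` patches, then shrink a little
    #    more so each side covers a whole number of patches
    r = math.sqrt(patch * patch * cap / (width * height))
    r *= min(
        math.floor(width * r / patch) / (width * r / patch),
        math.floor(height * r / patch) / (height * r / patch),
    )
    resized_w, resized_h = int(width * r), int(height * r)
    # C. Token cost is the number of patches covering the resized image
    return math.ceil(resized_w / patch) * math.ceil(resized_h / patch)

def billed_tokens(width, height, model):
    # D. Multiply image tokens by the per-model factor
    return patch_image_tokens(width, height) * MULTIPLIERS[model]

# The worked examples above: 1024 and 1452 image tokens
print(patch_image_tokens(1024, 1024), patch_image_tokens(1800, 2400))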

GPT-4o, GPT-4.1, GPT-4o-mini, CUA, and o-series (except o4-mini)

The token cost of an image is determined by two factors: size and detail.

Any image with "detail": "low" costs a fixed, base number of tokens. This amount varies by
model (see chart below). To calculate the cost of an image with "detail": "high", we do
the following:

1. Scale to fit in a 2048px x 2048px square, maintaining original aspect ratio
2. Scale so that the image's shortest side is 768px long
3. Count the number of 512px squares in the image—each square costs a set amount of tokens (see chart below)
4. Add the base tokens to the total

MODEL                   BASE TOKENS   TILE TOKENS

4o, 4.1, 4.5            85            170
4o-mini                 2833          5667
o1, o1-pro, o3          75            150
computer-use-preview    65            129

Cost calculation examples (for gpt-4o)

A 1024 x 1024 square image in "detail": "high" mode costs 765 tokens

1024 is less than 2048, so there is no initial resize.


The shortest side is 1024, so we scale the image down to 768 x 768.

4 512px square tiles are needed to represent the image, so the final token cost is
170 * 4 + 85 = 765 .

A 2048 x 4096 image in "detail": "high" mode costs 1105 tokens

We scale down the image to 1024 x 2048 to fit within the 2048 square.
The shortest side is 1024, so we further scale down to 768 x 1536.

6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105 .

A 4096 x 8192 image in "detail": "low" mode costs 85 tokens

Regardless of input size, low detail images are a fixed cost.
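The same resize-and-tile logic can be sketched in code. The helper below is an unofficial
reimplementation of the steps above for a few of the models in the table; it assumes images
already smaller than the 768px target are not upscaled.

import math

# (base tokens, tokens per 512px tile) from the table above
TOKEN_COSTS = {
    "gpt-4o": (85, 170),
    "gpt-4o-mini": (2833, 5667),
    "o1": (75, 150),
    "computer-use-preview": (65, 129),
}

def tile_image_tokens(width, height, model="gpt-4o", detail="high"):
    base, per_tile = TOKEN_COSTS[model]
    # Low detail is a flat, size-independent cost
    if detail == "low":
        return base
    # 1. Fit within a 2048px x 2048px square, preserving aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # 2. Scale down so the shortest side is 768px (smaller images assumed unchanged)
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # 3. Count 512px tiles and add the base cost
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base + per_tile * tiles

# The worked examples above: 765 and 1105 tokens for gpt-4o
print(tile_image_tokens(1024, 1024), tile_image_tokens(2048, 4096))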

GPT Image 1

For GPT Image 1, we calculate the cost of an image input the same way as described above,
except that we scale down the image so that the shortest side is 512px instead of 768px. The
price depends on the dimensions of the image and the input fidelity.

When input fidelity is set to low, the base cost is 65 image tokens, and each tile costs 129
image tokens. When using high input fidelity, we add a set number of tokens based on the
image's aspect ratio in addition to the image tokens described above.

If your image is square, we add 4096 extra input image tokens.

If it is closer to portrait or landscape, we add 6144 extra tokens.
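Putting those rules together, a rough sketch of the GPT Image 1 input cost looks like the
following. It reuses the tile logic above with a 512px shortest side, and it treats only exactly
square images as square; both simplifications are ours, so treat the numbers as estimates and
defer to the pricing calculator.

import math

def gpt_image_1_input_tokens(width, height, fidelity="low"):
    # Same resize-and-tile approach as above, but the shortest side targets 512px
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    if min(width, height) > 512:
        scale = 512 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    tokens = 65 + 129 * tiles  # base cost plus per-tile cost
    # High input fidelity adds a flat surcharge that depends on aspect ratio
    if fidelity == "high":
        tokens += 4096 if width == height else 6144
    return tokens

print(gpt_image_1_input_tokens(1024, 1024), gpt_image_1_input_tokens(1024, 1536, "high"))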

To see pricing for image input tokens, refer to our pricing page.

We process images at the token level, so each image we process counts towards your tokens
per minute (TPM) limit.

For the most precise and up-to-date estimates for image processing, please use the image
pricing calculator in the FAQ section of the pricing page.
