
Analyze Images and PDFs with Google Gemini's Multimodal API in Python

Send a photo or a PDF to Gemini 1.5 Flash and get back clean, structured JSON — in under 50 lines of Python.

Mariana Souza
Senior Editor · Jun 21, 2026 · 4 min read

What you'll build

A Python script that sends a local JPEG and a PDF to Gemini 1.5 Flash and parses a structured JSON object from each response, using Google's official google-generativeai SDK.

Prerequisites

  • Python 3.9 or newer (python --version to check)
  • A Google AI Studio API key — free at aistudio.google.com/app/apikey
  • pip 23+
  • A sample JPEG and a sample PDF on disk

OS note: Commands below use export (macOS/Linux). On Windows PowerShell use $env:GEMINI_API_KEY = "your-key".

Step 1 — Store your API key safely

Never hard-code credentials. Export the key as an environment variable in your terminal session:

export GEMINI_API_KEY="your-key-here"
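
If the variable is missing, the script in Step 3 stops with a bare KeyError. A minimal sketch of a friendlier check follows; load_api_key is a hypothetical helper written for this tutorial, not part of the Google SDK:

```python
import os


def load_api_key(env=None, var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error.

    load_api_key is a hypothetical helper, not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in this shell before running the script."
        )
    return key
```

It could be used as genai.configure(api_key=load_api_key()) in place of the direct os.environ lookup.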

Step 2 — Install dependencies

pip install "google-generativeai>=0.7.0" Pillow

Pillow loads local images; the Google SDK handles everything else, including PDF uploads via the File API.
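
To confirm what pip actually installed, the standard library can report package versions. This is a small sketch using importlib.metadata; installed_version is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report the two packages this tutorial depends on
for name in ("google-generativeai", "Pillow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```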

Step 3 — Write the script

Create gemini_multimodal.py:

import json
import os
import time

import PIL.Image
import google.generativeai as genai

# Configure the SDK (reads key from the environment — never commit secrets)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# response_mime_type asks Gemini to return raw JSON rather than markdown-wrapped text
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)


def analyze_image(path: str, prompt: str) -> dict:
    """Open a local image (JPEG/PNG/WebP) and return parsed JSON."""
    # The context manager closes the file handle once the request has been sent
    with PIL.Image.open(path) as image:
        response = model.generate_content([image, prompt])
    return json.loads(response.text)


def analyze_pdf(path: str, prompt: str) -> dict:
    """Upload a PDF via the File API, query it, then delete the upload."""
    uploaded = genai.upload_file(path)

    try:
        # Wait for Google's servers to finish processing (usually instant for small files)
        while uploaded.state.name == "PROCESSING":
            time.sleep(2)
            uploaded = genai.get_file(uploaded.name)
        if uploaded.state.name == "FAILED":
            raise RuntimeError(f"File processing failed for {path}")

        response = model.generate_content([uploaded, prompt])
        return json.loads(response.text)
    finally:
        # Optional: remove the upload from Google's servers right away, even on error
        genai.delete_file(uploaded.name)


if __name__ == "__main__":
    # --- Image ---
    img_result = analyze_image(
        "sample.jpg",
        "Return JSON with keys: description, dominant_colors (list), has_people (bool).",
    )
    print("IMAGE:", json.dumps(img_result, indent=2))

    # --- PDF ---
    pdf_result = analyze_pdf(
        "document.pdf",
        "Return JSON with keys: title, summary (one sentence), page_count_estimate (int).",
    )
    print("PDF:", json.dumps(pdf_result, indent=2))

Why response_mime_type? Without it, the model sometimes wraps its JSON in markdown fences (```json … ```), which breaks json.loads. Setting it to application/json asks the API to return the JSON payload on its own.
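
If you ever need to handle fenced output anyway (for example, from a call made without this setting), a tolerant parser is one option. The sketch below is an illustration only; parse_json_lenient is a hypothetical helper, not an SDK function:

```python
import json
import re

# Matches one markdown code fence (optionally tagged "json") wrapped around the payload
_FENCE_RE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$", re.DOTALL | re.IGNORECASE)


def parse_json_lenient(text: str):
    """Parse JSON, tolerating a single surrounding markdown code fence."""
    match = _FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Bare JSON passes straight through to json.loads, so the helper is safe to use on responses that are already clean.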

Why upload_file for PDFs? PDFs can't be opened with Pillow. The File API accepts application/pdf up to 2 GB and stores the file for up to 48 hours.
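
A quick local check before uploading can catch a wrong file type or an oversized file early. This is a sketch under the 2 GB figure quoted above; check_pdf_before_upload and MAX_UPLOAD_BYTES are hypothetical names, and the type check relies only on the file extension:

```python
import mimetypes
import os

MAX_UPLOAD_BYTES = 2 * 1024**3  # the 2 GB per-file figure quoted in this tutorial


def check_pdf_before_upload(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if the file looks unsuitable."""
    # guess_type inspects the extension only; it does not read the file contents
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type != "application/pdf":
        raise ValueError(f"{path} does not look like a PDF (guessed type: {mime_type})")
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes}-byte limit")
    return size
```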

Step 4 — Run it

Place sample.jpg and document.pdf in the same directory, then:

python gemini_multimodal.py

Verify it works

Expected output shape (values vary by file):

IMAGE: {
  "description": "A golden retriever sitting in a sunny park.",
  "dominant_colors": ["yellow", "green", "blue"],
  "has_people": false
}
PDF: {
  "title": "Q3 Financial Report",
  "summary": "Revenue grew 12% year-over-year driven by cloud services.",
  "page_count_estimate": 8
}

The script already runs json.loads on each response, so if it prints both blocks and exits without an exception, both responses parsed cleanly and you're done.
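
Parsing without error does not guarantee that the keys you asked for are present. The following sketch checks the shape described in the two prompts; shape_problems and the two EXPECTED_* dictionaries are hypothetical names introduced here for illustration:

```python
import json

# Expected keys and Python types, taken from the prompts in the script
EXPECTED_IMAGE_FIELDS = {"description": str, "dominant_colors": list, "has_people": bool}
EXPECTED_PDF_FIELDS = {"title": str, "summary": str, "page_count_estimate": int}


def shape_problems(result: dict, expected: dict) -> list:
    """Return human-readable problems; an empty list means the shape matches."""
    problems = []
    for key, expected_type in expected.items():
        if key not in result:
            problems.append(f"missing key: {key}")
            continue
        value = result[key]
        # bool is a subclass of int in Python, so reject True/False where an int is expected
        is_bool_for_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_for_int:
            problems.append(
                f"{key} should be {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems
```

Calling shape_problems(img_result, EXPECTED_IMAGE_FIELDS) after the image request would return an empty list when all three keys have the expected types.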

Troubleshooting

Error | Cause | Fix
KeyError: 'GEMINI_API_KEY' | Environment variable not set in this shell | Re-run export GEMINI_API_KEY="..." in the same terminal
google.api_core.exceptions.InvalidArgument on image | Image format not supported by Gemini | Use JPEG, PNG, or WebP; convert other formats such as BMP or TIFF first
json.JSONDecodeError | Response was not bare JSON, e.g. wrapped in markdown fences under an older SDK | Run pip install -U google-generativeai; v0.7+ honors response_mime_type
ResourceExhausted: 429 | Free-tier rate limit (15 req/min for Flash) | Wait 60 seconds and retry; reduce call frequency
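
For the rate-limit case, waiting and retrying can be automated. The sketch below is a generic exponential-backoff wrapper; call_with_retry is a hypothetical helper, and passing google.api_core.exceptions.ResourceExhausted as retry_on is an assumption based on the error name shown above:

```python
import time


def call_with_retry(func, retry_on, max_attempts: int = 4, base_delay: float = 2.0,
                    sleep=time.sleep):
    """Call func(); on an exception in retry_on, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 2 s, 4 s, 8 s, ...


# Possible usage with the tutorial's function (requires the SDK and an API key):
# from google.api_core.exceptions import ResourceExhausted
# result = call_with_retry(lambda: analyze_image("sample.jpg", "..."), retry_on=ResourceExhausted)
```

The sleep parameter exists so the wait can be replaced in tests; in normal use the default time.sleep applies.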

Next steps

  • Typed schemas: Pass a TypedDict or Pydantic model as response_schema inside GenerationConfig (SDK ≥ 0.8) to constrain the response to the declared fields and types.
  • Multi-image comparison: Pass a list: [img1, img2, "Compare these two images"].
  • Vertex AI: Replace google-generativeai with the vertexai SDK (installed via pip install google-cloud-aiplatform) for IAM auth, VPC controls, and enterprise quotas.
  • Official vision docs: ai.google.dev/gemini-api/docs/vision
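
The typed-schema idea from the first bullet might look like the sketch below. ImageSummary and build_typed_model are hypothetical names; the schema mirrors the image prompt used in this tutorial, and whether your SDK version and model accept a TypedDict as response_schema is an assumption to verify against the official docs:

```python
try:
    from typing_extensions import TypedDict  # preferred where available
except ImportError:  # fall back to the standard library
    from typing import TypedDict
from typing import List


class ImageSummary(TypedDict):
    """Hypothetical schema mirroring the image prompt in this tutorial."""
    description: str
    dominant_colors: List[str]
    has_people: bool


def build_typed_model(model_name: str = "gemini-1.5-flash"):
    """Return a model configured with ImageSummary as its response schema.

    The SDK import is deferred so this module loads even without the SDK installed.
    """
    import google.generativeai as genai

    return genai.GenerativeModel(
        model_name=model_name,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
            response_schema=ImageSummary,  # assumption: accepted by your SDK version
        ),
    )
```

With a schema in place, the prompt no longer needs to spell out the key names, although parsing the response with json.loads stays the same.
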
Written by
Mariana Souza · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.
