Install and Initialize the SDK
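The SDK ships in the google-cloud-aiplatform package:

```shell
pip install --upgrade google-cloud-aiplatform
```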
You need a GCP project with the Vertex AI API enabled. Authenticate with a service account or your user credentials:
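For local development, user credentials via Application Default Credentials are the quickest route; in production, point ADC at a service account key instead (the path below is a placeholder):

```shell
# Authenticate as yourself (local development)
gcloud auth application-default login

# Or use a service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```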
Then initialize the SDK with your project and region:
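A minimal initialization sketch; the project ID is a placeholder for your own:

```python
import vertexai

# us-central1 has the widest model availability (see note below)
vertexai.init(project="your-project-id", location="us-central1")
```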
That’s the full pattern. Every call goes through the GenerativeModel class. You pick a model, call generate_content(), and read response.text. The Vertex AI SDK handles authentication through Application Default Credentials – no API keys floating around in environment variables.
Why use Vertex AI instead of the direct google-genai SDK? Three reasons: enterprise IAM controls, VPC Service Controls for data residency, and access to Google’s full model garden (PaLM, Imagen, Codey) alongside Gemini. If you’re building in a corporate environment with compliance requirements, Vertex AI is the path Google wants you on.
The location parameter matters. us-central1 has the widest model availability. Europe and Asia regions work but may not have the latest model versions on day one. Check the Vertex AI regions page for current availability.
Text Prompts and Generation Config
Basic text generation takes a string or a list of Part objects. For simple prompts, pass a string directly:
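A minimal sketch, assuming the SDK has been initialized and using the gemini-2.0-flash model ID recommended later in this guide:

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Explain vector embeddings in two sentences.")
print(response.text)
```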
System instructions go in the GenerativeModel constructor, not in the prompt. This keeps them persistent across multi-turn conversations. Set temperature=0 when you need deterministic extraction. Bump it to 0.7+ for creative tasks.
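Both points sketched together: the system instruction lives in the constructor, the temperature in a GenerationConfig (the prompt text is hypothetical):

```python
from vertexai.generative_models import GenerativeModel, GenerationConfig

model = GenerativeModel(
    "gemini-2.0-flash",
    system_instruction="You extract dates from text. Reply with the date only.",
)
response = model.generate_content(
    "The invoice was issued on March 3rd, 2024.",
    generation_config=GenerationConfig(temperature=0),  # deterministic extraction
)
print(response.text)
```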
Multi-Turn Conversations
The SDK tracks history with chat sessions:
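A sketch of a chat session with hypothetical messages:

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

first = chat.send_message("My name is Priya. Suggest a good icebreaker question.")
# The session carries the earlier turn forward automatically
second = chat.send_message("What was my name again?")
print(second.text)
```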
The chat object accumulates messages automatically. No need to manually pass conversation history on each turn.
Multimodal: Images and Text
Gemini processes images natively. Load a local image as bytes and wrap it in a Part:
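A sketch with a hypothetical local file, photo.jpg:

```python
from vertexai.generative_models import GenerativeModel, Part

with open("photo.jpg", "rb") as f:
    image_part = Part.from_data(data=f.read(), mime_type="image/jpeg")

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content([image_part, "What objects are in this image?"])
print(response.text)
```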
You can also load images from GCS or pass raw bytes:
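For GCS, reference the object by URI instead of reading bytes locally (bucket and path are placeholders):

```python
from vertexai.generative_models import Part

gcs_image = Part.from_uri("gs://your-bucket/photo.png", mime_type="image/png")
```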
GCS URIs are the best option for production pipelines. The data never leaves Google’s network, which means faster processing and no upload overhead. Supported image formats are JPEG, PNG, WebP, HEIC, and HEIF. Inline image payloads are capped at roughly 20MB per request, so anything larger has to come from GCS.
Multimodal: Video and PDF Content
Video Analysis
Gemini can process video files. For anything over a few MB, upload to GCS first:
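A sketch assuming the video already lives in a (placeholder) GCS bucket:

```python
from vertexai.generative_models import GenerativeModel, Part

video = Part.from_uri("gs://your-bucket/demo.mp4", mime_type="video/mp4")
model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    [video, "Summarize this video and note when the product demo starts."]
)
print(response.text)
```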
Gemini samples frames from the video and processes them alongside any audio track. It handles common container formats including MP4, MPEG, MOV, AVI, and WebM. Max video length depends on the model – Flash handles up to 1 hour, Pro handles longer content. For timestamp accuracy, Pro is noticeably better than Flash.
PDF Processing
PDFs work the same way. Pass the file as bytes or from GCS:
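Both routes sketched, with placeholder file names:

```python
from vertexai.generative_models import GenerativeModel, Part

# From GCS
pdf = Part.from_uri("gs://your-bucket/report.pdf", mime_type="application/pdf")

# Or as inline bytes for small files
with open("report.pdf", "rb") as f:
    pdf_inline = Part.from_data(data=f.read(), mime_type="application/pdf")

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content([pdf, "List the section headings in this document."])
print(response.text)
```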
Gemini renders each page of the PDF as an image internally, so it handles scanned documents, charts, and complex layouts well. For text-heavy PDFs with hundreds of pages, consider splitting them and processing in parallel – the model has a context window limit and very long PDFs can hit it.
Structured Output and JSON Mode
When you need machine-readable output, use JSON mode with a response schema. This constrains the model to produce valid JSON matching your specification:
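A sketch of a sentiment classifier, assuming a recent SDK version with response_schema support in GenerationConfig:

```python
from vertexai.generative_models import GenerativeModel, GenerationConfig

sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Classify: 'Checkout was fast and painless.'",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=sentiment_schema,
    ),
)
print(response.text)  # valid JSON matching the schema
```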
The output is guaranteed to match your schema. No post-processing, no regex extraction, no hoping the model follows your prompt. Enums are enforced – the model can only pick from your allowed values.
For more complex schemas, you can nest objects and use arrays of objects:
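For example, a hypothetical invoice schema with an array of line-item objects; you would pass it as the response_schema exactly as above:

```python
# Nested schema: an invoice with an array of line-item objects
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["vendor", "line_items"],
}
```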
This is the killer combo – multimodal input (PDF) with structured output (JSON schema). You get a typed, validated extraction pipeline in about 30 lines of code.
Gemini Flash vs Pro: When to Use Which
Pick the right model for the job:
- gemini-2.0-flash – Default choice. Fast, cheap, handles 90% of tasks well. Use it for classification, summarization, simple extraction, and any high-volume workload. Latency is typically under 2 seconds for text-only requests.
- gemini-2.0-pro – Reach for this when Flash falls short. Complex reasoning, multi-step analysis, long-document understanding, and tasks where accuracy matters more than speed. Costs roughly 10x more than Flash. Video timestamp accuracy and nuanced image analysis are measurably better.
- gemini-2.0-flash-lite – Cheapest option. Good for high-volume classification and routing where you need a yes/no or category label. Not suitable for generation-heavy tasks.
My recommendation: start every project with Flash. Switch to Pro only for specific tasks where you’ve measured a quality gap. Most teams overestimate how much they need Pro. Run a quick eval on 50 examples with both models before committing to the more expensive option.
For multimodal tasks specifically, Flash handles single-image analysis, short videos (under 5 minutes), and standard PDFs well. Pro pulls ahead on multi-image comparisons, long video analysis with timestamp extraction, and dense technical documents where missing a detail matters.
Common Errors and Fixes
403 Permission Denied: Vertex AI API has not been enabled
Enable the Vertex AI API in your GCP project:
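The project ID below is a placeholder:

```shell
gcloud services enable aiplatform.googleapis.com --project=your-project-id
```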
This also happens when your service account lacks the roles/aiplatform.user role. Grant it with:
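A sketch with placeholder project and service-account values:

```shell
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:my-sa@your-project-id.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
```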
404 Model not found
Double-check the model name and your region. Model availability varies by location. us-central1 has the widest selection. Also verify you’re using the correct model ID – it’s gemini-2.0-flash, not gemini-flash-2.0 or gemini-2.0-flash-001.
400 Request payload size exceeds the limit
You’re sending too much data in a single request. For large files, upload to GCS first and reference by URI instead of sending inline bytes. The inline payload limit is roughly 20MB. GCS-backed requests can handle much larger files.
ImportError: cannot import name 'Image' from 'vertexai.generative_models'
Your google-cloud-aiplatform version is too old. Update it:
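```shell
pip install --upgrade google-cloud-aiplatform
```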
The Image class and several Part factory methods were added in version 1.38+. Check your version with pip show google-cloud-aiplatform.
DefaultCredentialsError: Could not automatically determine credentials
You haven’t authenticated. Run:
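```shell
gcloud auth application-default login
```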
For production deployments, use a service account key or workload identity federation instead of user credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account JSON key file:
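The key path below is a placeholder for wherever you downloaded the file:

```shell
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```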
Related Guides
- How to Build Apps with the Gemini API and Python SDK
- How to Use the Mistral API for Code Generation and Chat
- How to Use the Anthropic Token Counting API for Cost Estimation
- How to Use the Anthropic Claude Files API for Large Document Processing
- How to Use the Anthropic PDF Processing API for Document Analysis
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Anthropic Python SDK for Claude
- How to Use Claude’s Model Context Protocol (MCP)
- How to Use the Together AI API for Open-Source LLMs
- How to Use the AWS Bedrock Converse API for Multi-Model Chat