Install and Initialize the SDK
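The SDK ships in the google-cloud-aiplatform package:

```shell
pip install --upgrade google-cloud-aiplatform
```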
You need a GCP project with the Vertex AI API enabled. Authenticate with a service account or your user credentials:
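For local development, user credentials via Application Default Credentials are the quickest route; in production, point ADC at a service account key instead (the path below is a placeholder):

```shell
# Authenticate as yourself (local development)
gcloud auth application-default login

# Or use a service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```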
Then initialize the SDK with your project and region:
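A minimal initialization sketch; the project ID is a placeholder for your own:

```python
import vertexai

# us-central1 has the widest model availability (see note below)
vertexai.init(project="your-project-id", location="us-central1")
```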
That’s the full pattern. Every call goes through the GenerativeModel class. You pick a model, call generate_content(), and read response.text. The Vertex AI SDK handles authentication through Application Default Credentials – no API keys floating around in environment variables.
Why use Vertex AI instead of the direct google-genai SDK? Three reasons: enterprise IAM controls, VPC Service Controls for data residency, and access to Google’s full model garden (PaLM, Imagen, Codey) alongside Gemini. If you’re building in a corporate environment with compliance requirements, Vertex AI is the path Google wants you on.
The location parameter matters. us-central1 has the widest model availability. Europe and Asia regions work but may not have the latest model versions on day one. Check the Vertex AI regions page for current availability.
Text Prompts and Generation Config
Basic text generation takes a string or a list of Part objects. For simple prompts, pass a string directly:
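A minimal sketch, assuming the SDK has been initialized and using the gemini-2.0-flash model ID recommended later in this guide:

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Explain vector embeddings in two sentences.")
print(response.text)
```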
System instructions go in the GenerativeModel constructor, not in the prompt. This keeps them persistent across multi-turn conversations. Set temperature=0 when you need deterministic extraction. Bump it to 0.7+ for creative tasks.
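Both points sketched together: the system instruction lives in the constructor, the temperature in a GenerationConfig (the prompt text is hypothetical):

```python
from vertexai.generative_models import GenerativeModel, GenerationConfig

model = GenerativeModel(
    "gemini-2.0-flash",
    system_instruction="You extract dates from text. Reply with the date only.",
)
response = model.generate_content(
    "The invoice was issued on March 3rd, 2024.",
    generation_config=GenerationConfig(temperature=0),  # deterministic extraction
)
print(response.text)
```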
Multi-Turn Conversations
The SDK tracks history with chat sessions:
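A sketch of a chat session with hypothetical messages:

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

first = chat.send_message("My name is Priya. Suggest a good icebreaker question.")
# The session carries the earlier turn forward automatically
second = chat.send_message("What was my name again?")
print(second.text)
```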
The chat object accumulates messages automatically. No need to manually pass conversation history on each turn.
Multimodal: Images and Text
Gemini processes images natively. Load a local image as bytes and wrap it in a Part:
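A sketch with a hypothetical local file, photo.jpg:

```python
from vertexai.generative_models import GenerativeModel, Part

with open("photo.jpg", "rb") as f:
    image_part = Part.from_data(data=f.read(), mime_type="image/jpeg")

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content([image_part, "What objects are in this image?"])
print(response.text)
```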
You can also load images from GCS or pass raw bytes:
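For GCS, reference the object by URI instead of reading bytes locally (bucket and path are placeholders):

```python
from vertexai.generative_models import Part

gcs_image = Part.from_uri("gs://your-bucket/photo.png", mime_type="image/png")
```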
GCS URIs are the best option for production pipelines. The data never leaves Google’s network, which means faster processing and no upload overhead. Supported image formats are JPEG, PNG, WebP, HEIC, and HEIF. Inline image payloads are capped at roughly 20MB per request, so anything larger has to come from GCS.
Multimodal: Video and PDF Content
Video Analysis
Gemini can process video files. For anything over a few MB, upload to GCS first:
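A sketch assuming the video already lives in a (placeholder) GCS bucket:

```python
from vertexai.generative_models import GenerativeModel, Part

video = Part.from_uri("gs://your-bucket/demo.mp4", mime_type="video/mp4")
model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    [video, "Summarize this video and note when the product demo starts."]
)
print(response.text)
```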
Gemini samples frames from the video and processes them alongside any audio track. It handles common container formats including MP4, MPEG, MOV, AVI, and WebM. Max video length depends on the model – Flash handles up to 1 hour, Pro handles longer content. For timestamp accuracy, Pro is noticeably better than Flash.
PDF Processing
PDFs work the same way. Pass the file as bytes or from GCS:
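Both routes sketched, with placeholder file names:

```python
from vertexai.generative_models import GenerativeModel, Part

# From GCS
pdf = Part.from_uri("gs://your-bucket/report.pdf", mime_type="application/pdf")

# Or as inline bytes for small files
with open("report.pdf", "rb") as f:
    pdf_inline = Part.from_data(data=f.read(), mime_type="application/pdf")

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content([pdf, "List the section headings in this document."])
print(response.text)
```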
Gemini renders each page of the PDF as an image internally, so it handles scanned documents, charts, and complex layouts well. For text-heavy PDFs with hundreds of pages, consider splitting them and processing in parallel – the model has a context window limit and very long PDFs can hit it.
Structured Output and JSON Mode
When you need machine-readable output, use JSON mode with a response schema. This constrains the model to produce valid JSON matching your specification:
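A sketch of a sentiment classifier, assuming a recent SDK version with response_schema support in GenerationConfig:

```python
from vertexai.generative_models import GenerativeModel, GenerationConfig

sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Classify: 'Checkout was fast and painless.'",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=sentiment_schema,
    ),
)
print(response.text)  # valid JSON matching the schema
```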
The output is guaranteed to match your schema. No post-processing, no regex extraction, no hoping the model follows your prompt. Enums are enforced – the model can only pick from your allowed values.
For more complex schemas, you can nest objects and use arrays of objects:
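For example, a hypothetical invoice schema with an array of line-item objects; you would pass it as the response_schema exactly as above:

```python
# Nested schema: an invoice with an array of line-item objects
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["vendor", "line_items"],
}
```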
This is the killer combo – multimodal input (PDF) with structured output (JSON schema). You get a typed, validated extraction pipeline in about 30 lines of code.
Gemini Flash vs Pro: When to Use Which
Pick the right model for the job:
- gemini-2.0-flash – Default choice. Fast, cheap, handles 90% of tasks well. Use it for classification, summarization, simple extraction, and any high-volume workload. Latency is typically under 2 seconds for text-only requests.
- gemini-2.0-pro – Reach for this when Flash falls short. Complex reasoning, multi-step analysis, long-document understanding, and tasks where accuracy matters more than speed. Costs roughly 10x more than Flash. Video timestamp accuracy and nuanced image analysis are measurably better.
- gemini-2.0-flash-lite – Cheapest option. Good for high-volume classification and routing where you need a yes/no or category label. Not suitable for generation-heavy tasks.
My recommendation: start every project with Flash. Switch to Pro only for specific tasks where you’ve measured a quality gap. Most teams overestimate how much they need Pro. Run a quick eval on 50 examples with both models before committing to the more expensive option.
For multimodal tasks specifically, Flash handles single-image analysis, short videos (under 5 minutes), and standard PDFs well. Pro pulls ahead on multi-image comparisons, long video analysis with timestamp extraction, and dense technical documents where missing a detail matters.
Common Errors and Fixes
403 Permission Denied: Vertex AI API has not been enabled
Enable the Vertex AI API in your GCP project:
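The project ID below is a placeholder:

```shell
gcloud services enable aiplatform.googleapis.com --project=your-project-id
```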
This also happens when your service account lacks the roles/aiplatform.user role. Grant it with:
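A sketch with placeholder project and service-account values:

```shell
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:my-sa@your-project-id.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
```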
404 Model not found
Double-check the model name and your region. Model availability varies by location. us-central1 has the widest selection. Also verify you’re using the correct model ID – it’s gemini-2.0-flash, not gemini-flash-2.0 or gemini-2.0-flash-001.
400 Request payload size exceeds the limit
You’re sending too much data in a single request. For large files, upload to GCS first and reference by URI instead of sending inline bytes. The inline payload limit is roughly 20MB. GCS-backed requests can handle much larger files.
ImportError: cannot import name 'Image' from 'vertexai.generative_models'
Your google-cloud-aiplatform version is too old. Update it:
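```shell
pip install --upgrade google-cloud-aiplatform
```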
The Image class and several Part factory methods were added in version 1.38+. Check your version with pip show google-cloud-aiplatform.
DefaultCredentialsError: Could not automatically determine credentials
You haven’t authenticated. Run:
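```shell
gcloud auth application-default login
```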
For production deployments, use a service account key or workload identity federation instead of user credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account JSON key file:
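The key path below is a placeholder for wherever you downloaded the file:

```shell
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```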
Related Guides
- How to Build Apps with the Gemini API and Python SDK
- How to Use the Mistral API for Code Generation and Chat
- How to Use the Anthropic Token Counting API for Cost Estimation
- How to Use the Anthropic Claude Files API for Large Document Processing
- How to Use the Anthropic PDF Processing API for Document Analysis
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Anthropic Python SDK for Claude
- How to Use Claude’s Model Context Protocol (MCP)
- How to Use the Together AI API for Open-Source LLMs
- How to Use the AWS Bedrock Converse API for Multi-Model Chat