Amazon Bedrock gives you a single API surface to call Claude, Llama, Titan, and other foundation models without managing infrastructure. You get pay-per-token pricing, IAM-based access control, and the option to keep all traffic inside your AWS VPC. If your stack already runs on AWS, Bedrock is the fastest path to production-grade model access.
Quick Start: Call Claude on Bedrock
Install boto3 and configure your AWS credentials. You need the bedrock-runtime service client, not the bedrock management client.
Make sure you have enabled model access in the Bedrock console first. AWS requires you to request access for each model you want to use.
Here is a minimal call to Claude using invoke_model:
The invoke_model method sends JSON in the body parameter and returns a streaming body you read with .read(). The response format follows Claude’s native Messages API structure: the generated text lives in result["content"][0]["text"].
Invoke Llama Models on Bedrock
Llama uses a different request format than Claude. The key differences: Llama takes a prompt string (not a messages array), uses max_gen_len instead of max_tokens, and requires special prompt tokens for instruction-tuned models.
Notice the response structure is different too. Llama returns the generated text in result["generation"], while Claude uses result["content"][0]["text"]. This is the core problem with invoke_model – every model family has its own request and response format.
The Converse API: One Interface for All Models
The Converse API solves the format inconsistency problem. It provides a unified message format that works across Claude, Llama, Titan, and every other model on Bedrock. Use converse instead of invoke_model when you want to swap models without rewriting your request code.
The Converse API message format is slightly different from the native Claude format. In converse, content blocks are {"text": "..."} (plain dict). In Claude’s native invoke_model format, they are {"type": "text", "text": "..."}. Do not mix these up.
To switch to Llama, just change the model_id string. The rest of the code stays identical:
System prompts use a separate system parameter:
Streaming Responses
For long outputs, streaming gives you tokens as they are generated instead of waiting for the full response.
With the native API, use invoke_model_with_response_stream:
With the Converse API, use converse_stream instead:
I recommend converse_stream over invoke_model_with_response_stream. The event structure is cleaner and you get token usage in the metadata event at the end.
Embedding Generation with Titan
Amazon Titan Embeddings converts text into vectors you can store in a vector database for semantic search. The model ID is amazon.titan-embed-text-v2:0.
Titan Embed v2 supports 256, 512, and 1024 dimensions (1024 is the default). Smaller dimensions trade some accuracy for faster similarity search and lower storage costs. For most RAG use cases, 512 works well.
Bedrock Knowledge Bases for RAG
Bedrock Knowledge Bases is a managed RAG service. You point it at an S3 bucket of documents, and it handles chunking, embedding, and indexing into a vector store. You then query it with retrieve_and_generate, which fetches relevant chunks and feeds them to a foundation model.
Note that Knowledge Bases uses the bedrock-agent-runtime client, not bedrock-runtime. The modelArn takes the full ARN format, not just the model ID. Subsequent requests in the same conversation should reuse the sessionId returned in the first response.
Common Errors and Fixes
AccessDeniedException: You don’t have access to the model
You need to enable model access first. In the AWS console, navigate to Bedrock, click “Model access” in the sidebar, and request access for each model you want to use. Some models require acceptance of a EULA.
ValidationException: Malformed input request
This usually means your request body format does not match what the model expects. Claude requires anthropic_version and a messages array with type fields. Llama requires prompt and max_gen_len. Use the Converse API to avoid format mismatches entirely.
ResourceNotFoundException: Could not resolve the foundation model
Check your model ID string. Common mistakes include using the wrong version suffix (e.g., v1:0 vs v2:0) or using a model that is not available in your AWS region. Claude 3.5 Sonnet v2 uses anthropic.claude-3-5-sonnet-20241022-v2:0, not anthropic.claude-3-5-sonnet-v2.
ThrottlingException: Rate exceeded
Bedrock applies per-model rate limits. Request a quota increase through the AWS Service Quotas console, or use provisioned throughput for predictable high-volume workloads.
ModelTimeoutException on large requests
For long prompts or high max_tokens values, the request may time out. Increase your boto3 client timeout:
Mixing up Converse and invoke_model formats
The Converse API uses {"text": "..."} for content blocks. The native Claude invoke_model uses {"type": "text", "text": "..."}. If you get a validation error, check which API you are calling and use the matching format.
Related Guides
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use Claude’s Model Context Protocol (MCP)
- How to Use the Anthropic Python SDK for Claude
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the Anthropic PDF Processing API for Document Analysis
- How to Use the Google Vertex AI Gemini API for Multimodal Tasks
- How to Use the Mistral API for Code Generation and Chat
- How to Run Open-Source Models with the Replicate API
- How to Use the Together AI API for Open-Source LLMs
- How to Use the OpenAI Realtime API for Voice Applications