Amazon Bedrock gives you a single API surface to call Claude, Llama, Titan, and other foundation models without managing infrastructure. You get pay-per-token pricing, IAM-based access control, and the option to keep all traffic inside your AWS VPC. If your stack already runs on AWS, Bedrock is the fastest path to production-grade model access.

Quick Start: Call Claude on Bedrock

Install boto3 and configure your AWS credentials. You need the bedrock-runtime service client, not the bedrock management client.

pip install boto3
aws configure  # set your access key, secret, and region

Make sure you have enabled model access in the Bedrock console first. AWS requires you to request access for each model you want to use.
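If you are unsure which models are already enabled for your account, the management-plane bedrock client can list them. A minimal sketch, assuming configured AWS credentials; list_foundation_models is the real boto3 operation, while the on_demand_model_ids helper name is my own:

```python
def on_demand_model_ids(summaries):
    """Filter model summaries down to IDs that support on-demand invocation."""
    return [
        m["modelId"]
        for m in summaries
        if "ON_DEMAND" in m.get("inferenceTypesSupported", [])
    ]

def list_invokable_models(region="us-east-1"):
    """Call the Bedrock management API. Requires AWS credentials."""
    import boto3  # deferred so the pure-Python helper above runs without AWS deps

    bedrock = boto3.client("bedrock", region_name=region)  # note: not bedrock-runtime
    return on_demand_model_ids(bedrock.list_foundation_models()["modelSummaries"])

# print(list_invokable_models())  # uncomment once credentials are configured
```

Listing models does not tell you whether access has been granted, but a model missing here is definitely not usable in that region.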

Here is a minimal call to Claude using invoke_model:

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

model_id = "anthropic.claude-3-5-sonnet-20241022-v2:0"

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Explain what a transformer is in 3 sentences."}],
        }
    ],
}

response = client.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])

The invoke_model method takes the request as a JSON string in the body parameter and returns a streaming body you read with .read(). The response follows Claude’s native Messages API structure: the generated text lives in result["content"][0]["text"].

Invoke Llama Models on Bedrock

Llama uses a different request format than Claude. The key differences: Llama takes a prompt string (not a messages array), uses max_gen_len instead of max_tokens, and requires special prompt tokens for instruction-tuned models.

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-west-2")

model_id = "meta.llama3-1-70b-instruct-v1:0"

prompt = "What are the benefits of retrieval-augmented generation?"

formatted_prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

request_body = {
    "prompt": formatted_prompt,
    "max_gen_len": 512,
    "temperature": 0.7,
    "top_p": 0.9,
}

response = client.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body),
)

result = json.loads(response["body"].read())
print(result["generation"])

Notice the response structure differs too: Llama returns the generated text in result["generation"], while Claude uses result["content"][0]["text"]. This is the core problem with invoke_model: every model family has its own request and response format.

The Converse API: One Interface for All Models

The Converse API solves the format inconsistency problem. It provides a unified message format that works across Claude, Llama, Titan, and every other model on Bedrock. Use converse instead of invoke_model when you want to swap models without rewriting your request code.

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

model_id = "anthropic.claude-3-5-sonnet-20241022-v2:0"

response = client.converse(
    modelId=model_id,
    messages=[
        {
            "role": "user",
            "content": [{"text": "Compare REST and GraphQL in two paragraphs."}],
        }
    ],
    inferenceConfig={
        "maxTokens": 512,
        "temperature": 0.5,
    },
)

output_text = response["output"]["message"]["content"][0]["text"]
print(output_text)
print(f"Tokens used: {response['usage']['totalTokens']}")

The Converse API message format is slightly different from the native Claude format. In converse, content blocks are {"text": "..."} (plain dict). In Claude’s native invoke_model format, they are {"type": "text", "text": "..."}. Do not mix these up.

To switch to Llama, just change the model_id string. The rest of the code stays identical:

model_id = "meta.llama3-1-70b-instruct-v1:0"
# Same converse() call works without changes

System prompts use a separate system parameter:

response = client.converse(
    modelId=model_id,
    messages=[
        {"role": "user", "content": [{"text": "Summarize this document."}]}
    ],
    system=[{"text": "You are a concise technical writer."}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.3},
)

Streaming Responses

For long outputs, streaming gives you tokens as they are generated instead of waiting for the full response.

With the native API, use invoke_model_with_response_stream:

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

model_id = "anthropic.claude-3-5-sonnet-20241022-v2:0"

request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write a Python quicksort implementation."}],
        }
    ],
}

streaming_response = client.invoke_model_with_response_stream(
    modelId=model_id,
    body=json.dumps(request_body),
)

for event in streaming_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk["type"] == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="")

With the Converse API, use converse_stream instead:

response = client.converse_stream(
    modelId=model_id,
    messages=[
        {"role": "user", "content": [{"text": "Write a Python quicksort."}]}
    ],
    inferenceConfig={"maxTokens": 1024},
)

for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="")
    if "metadata" in event:
        usage = event["metadata"].get("usage", {})
        print(f"\nInput tokens: {usage['inputTokens']}")
        print(f"Output tokens: {usage['outputTokens']}")

I recommend converse_stream over invoke_model_with_response_stream. The event structure is cleaner and you get token usage in the metadata event at the end.
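The converse_stream loop can also be wrapped in a small generator so callers only deal with text. A sketch under the event shapes shown above; the stream_text helper name is my own:

```python
def stream_text(events):
    """Yield text deltas from a converse_stream event iterable, skipping metadata."""
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                yield delta["text"]

# Usage with a live call:
# for piece in stream_text(client.converse_stream(...)["stream"]):
#     print(piece, end="")
```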

Embedding Generation with Titan

Amazon Titan Embeddings converts text into vectors you can store in a vector database for semantic search. The model ID is amazon.titan-embed-text-v2:0.

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

request_body = {
    "inputText": "Amazon Bedrock provides access to foundation models.",
    "dimensions": 512,
    "normalize": True,
}

response = client.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps(request_body),
)

result = json.loads(response["body"].read())
embedding = result["embedding"]
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Titan Embed v2 supports 256, 512, and 1024 dimensions (1024 is the default). Smaller dimensions trade some accuracy for faster similarity search and lower storage costs. For most RAG use cases, 512 works well.
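Once you have embeddings, comparing two of them is a dot product. A minimal sketch of cosine similarity in plain Python; note that with normalize=True the vectors are already unit length, so the dot product alone gives the similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

For real workloads you would vectorize this with NumPy or let your vector database score candidates for you; the pure-Python version is just to make the math concrete.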

Bedrock Knowledge Bases for RAG

Bedrock Knowledge Bases is a managed RAG service: point it at an S3 bucket of documents, and it handles chunking, embedding, and indexing into a vector store. You then query it with retrieve_and_generate, which fetches relevant chunks and feeds them to a foundation model.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        },
    },
)

print(response["output"]["text"])

# Print source citations
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        source = ref["location"]["s3Location"]["uri"]
        print(f"Source: {source}")

Note that Knowledge Bases uses the bedrock-agent-runtime client, not bedrock-runtime. The modelArn takes the full ARN format, not just the model ID. Subsequent requests in the same conversation should reuse the sessionId returned in the first response.
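Threading the sessionId through follow-up questions might look like this sketch. The retrieve_and_generate operation and its sessionId field are real; build_rag_request and ask are helper names of my own:

```python
def build_rag_request(question, kb_id, model_arn, session_id=None):
    """Assemble retrieve_and_generate kwargs; include sessionId only on follow-ups."""
    kwargs = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id is not None:
        kwargs["sessionId"] = session_id
    return kwargs

def ask(client, question, kb_id, model_arn, session_id=None):
    """Query the Knowledge Base; returns (answer, sessionId) for the next turn."""
    response = client.retrieve_and_generate(
        **build_rag_request(question, kb_id, model_arn, session_id)
    )
    return response["output"]["text"], response["sessionId"]

# answer, session = ask(client, "What is our refund policy?", "YOUR_KB_ID", model_arn)
# follow_up, _ = ask(client, "Does it cover digital goods?", "YOUR_KB_ID", model_arn, session)
```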

Common Errors and Fixes

AccessDeniedException: You don’t have access to the model

Enable model access in the Bedrock console: navigate to Bedrock, click “Model access” in the sidebar, and request access for each model you want to use. Some models require acceptance of a EULA.

ValidationException: Malformed input request

This usually means your request body format does not match what the model expects. Claude requires anthropic_version and a messages array with type fields; Llama requires prompt and max_gen_len. Use the Converse API to avoid format mismatches entirely.

ResourceNotFoundException: Could not resolve the foundation model

Check your model ID string. Common mistakes include using the wrong version suffix (e.g., v1:0 vs v2:0) or using a model that is not available in your AWS region. Claude 3.5 Sonnet v2 uses anthropic.claude-3-5-sonnet-20241022-v2:0, not anthropic.claude-3-5-sonnet-v2.

ThrottlingException: Rate exceeded

Bedrock applies per-model rate limits. Request a quota increase through the AWS Service Quotas console, or use provisioned throughput for predictable high-volume workloads.
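In the meantime, retrying with exponential backoff smooths over transient throttling. A hand-rolled sketch; the with_backoff helper is my own, and the commented alternative uses boto3's documented retries config:

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0,
                 is_throttle=lambda exc: "Throttling" in str(exc)):
    """Retry call() with exponential backoff plus jitter on throttling errors.

    Non-throttling exceptions, and the final failed attempt, are re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Alternatively, let boto3 retry for you:
# from botocore.config import Config
# config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
# client = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)
```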

ModelTimeoutException on large requests

For long prompts or high max_tokens values, the request may time out. Increase your boto3 client timeout:

from botocore.config import Config

config = Config(read_timeout=300, connect_timeout=5)
client = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)

Mixing up Converse and invoke_model formats

The Converse API uses {"text": "..."} for content blocks; the native Claude invoke_model format uses {"type": "text", "text": "..."}. If you get a validation error, check which API you are calling and use the matching format.
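If you maintain code that touches both APIs, a tiny converter keeps the two shapes straight. A sketch; both helper names are my own, and they only handle text blocks:

```python
def to_converse_content(native_blocks):
    """Convert native Claude content blocks to Converse-style blocks (text only)."""
    return [{"text": b["text"]} for b in native_blocks if b.get("type") == "text"]

def to_native_content(converse_blocks):
    """Convert Converse-style text blocks back to native Claude blocks."""
    return [{"type": "text", "text": b["text"]} for b in converse_blocks if "text" in b]
```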