The Quick Version

Replicate hosts open-source models and lets you run them via API. No GPU setup, no Docker images, no infrastructure. You pay per second of compute, which makes it cheaper than running your own GPU for sporadic workloads.

pip install replicate
export REPLICATE_API_TOKEN=r8_your_token_here
import replicate

# Run Llama 3.1 70B
output = replicate.run(
    "meta/meta-llama-3.1-70b-instruct",
    input={
        "prompt": "Explain gradient descent in 3 sentences.",
        "max_tokens": 256,
        "temperature": 0.7,
    },
)
print("".join(output))

That sends your prompt to a Llama 70B instance running on Replicate’s GPUs and streams the response back. No model downloads, no CUDA setup.

Running Different Model Types

Replicate isn’t just for text. It hosts image generation, audio transcription, video processing, and more. The API pattern is the same for all of them.

Image Generation with SDXL

output = replicate.run(
    "stability-ai/sdxl:7762fd07cf82c948538e41f63f77d685e02b063e37e496e96eefd46c929f9bdc",
    input={
        "prompt": "A cyberpunk cityscape at sunset, neon lights reflecting off wet streets, photorealistic",
        "negative_prompt": "blurry, low quality, cartoon",
        "width": 1024,
        "height": 1024,
        "num_outputs": 1,
        "scheduler": "K_EULER",
        "num_inference_steps": 30,
    },
)
print(output[0])  # URL to generated image

Audio Transcription with Whisper

output = replicate.run(
    "openai/whisper:cdd97b257f93cb89dede1c7584df59efd8f09f98c99a9f02e1a23c6cf6ba7ab3",
    input={
        "audio": open("meeting_recording.mp3", "rb"),
        "model": "large-v3",
        "language": "en",
        "translate": False,
        "transcription": "plain text",
    },
)
print(output["transcription"])

Image Captioning with BLIP-2

output = replicate.run(
    "salesforce/blip-2:4b32258c42e9efd4288bb9910bc532a69727f9acd26aa08e175713a0a857a608",
    input={
        "image": open("photo.jpg", "rb"),
        "question": "What objects are in this image?",
    },
)
print(output)

Async Predictions for Long-Running Tasks

Some models take minutes to run — fine-tuning, video generation, large batch processing. Use async predictions to avoid blocking your application.

import replicate
import time

# Create a prediction without waiting for it
prediction = replicate.predictions.create(
    model="meta/meta-llama-3.1-70b-instruct",
    input={
        "prompt": "Write a detailed technical blog post about transformer architectures.",
        "max_tokens": 2048,
    },
)
print(f"Prediction ID: {prediction.id}")
print(f"Status: {prediction.status}")

# Poll for completion
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(2)
    prediction.reload()
    print(f"Status: {prediction.status}")

if prediction.status == "succeeded":
    print("".join(prediction.output))
else:
    print(f"Failed: {prediction.error}")

For production, use webhooks instead of polling. Pass a webhook URL when creating the prediction and Replicate will POST the result when it’s done:

prediction = replicate.predictions.create(
    model="meta/meta-llama-3.1-70b-instruct",
    input={"prompt": "...", "max_tokens": 512},
    webhook="https://your-app.com/api/replicate-webhook",
    webhook_events_filter=["completed"],
)
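
On the receiving side, Replicate POSTs the prediction object as JSON. Here is a framework-agnostic sketch of the handler logic — the field names mirror the prediction attributes used above, and `handle_replicate_webhook` is a hypothetical helper you would wire into whatever web framework you use:

```python
import json

def handle_replicate_webhook(body: bytes) -> str:
    """Parse a Replicate webhook payload and return the generated text.

    The JSON body mirrors the prediction object: id, status, output, error.
    """
    prediction = json.loads(body)
    if prediction["status"] == "succeeded":
        # Language models return output as a list of string chunks
        return "".join(prediction["output"])
    raise RuntimeError(
        f"Prediction {prediction['id']} ended with status "
        f"{prediction['status']}: {prediction.get('error')}"
    )
```

Remember to verify the webhook signature in production so you only process requests that actually came from Replicate.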

Streaming Responses

For LLMs, streaming gives your users a much better experience. Tokens appear as they’re generated instead of waiting for the full response.

output_stream = replicate.stream(
    "meta/meta-llama-3.1-70b-instruct",
    input={
        "prompt": "Explain how attention mechanisms work in transformers.",
        "max_tokens": 512,
    },
)

for event in output_stream:
    print(str(event), end="", flush=True)
print()  # newline at the end

This is particularly useful in web applications where you’re streaming to a frontend via Server-Sent Events or WebSockets.
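
The SSE framing itself is simple to sketch. This is framework-agnostic — in a real endpoint, `token_stream` would be the iterator returned by `replicate.stream(...)`:

```python
def sse_events(token_stream):
    """Wrap a token stream in Server-Sent Events framing.

    Each token becomes one `data:` event; a sentinel marks the end
    so the frontend knows when to stop listening.
    """
    for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"
```

Serve the generator with a `text/event-stream` content type and the browser's `EventSource` API can consume it directly.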

Running Custom Models

You can deploy your own fine-tuned models to Replicate using Cog, their open-source model packaging tool.

sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog

Create a cog.yaml and predict.py in your model directory:
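
The cog.yaml declares the runtime environment. A minimal sketch — the package list here is illustrative; pin the exact versions your model needs:

```yaml
# cog.yaml
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch"
    - "transformers"
predict: "predict.py:Predictor"
```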

# predict.py
from cog import BasePredictor, Input
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory."""
        self.tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
        self.model = AutoModelForCausalLM.from_pretrained(
            "./my-finetuned-model", torch_dtype=torch.float16, device_map="auto"
        )

    def predict(
        self,
        prompt: str = Input(description="Input prompt"),
        max_tokens: int = Input(description="Max tokens to generate", default=256),
        temperature: float = Input(description="Sampling temperature", default=0.7),
    ) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_new_tokens=max_tokens, temperature=temperature,
                do_sample=temperature > 0,
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

Push it to Replicate:

cog login
cog push r8.im/your-username/your-model

Your model now runs on Replicate’s infrastructure with the same API as every other model. It auto-scales, handles cold starts, and you only pay when it’s running.

Cost Optimization

Replicate charges per second of compute time. Here’s how to minimize costs:

Use the smallest model that works. Llama 3.1 8B costs roughly 1/10th of the 70B version per token. Test with the small model first.

Set max_tokens accurately. Don’t set 4096 when you need 256. The meter runs until generation stops.

Cache responses. If multiple users ask similar questions, cache the results. A Redis cache in front of the API can cut costs by 30-50% for common queries.

Plan for cold boots. Replicate scales models down after a short idle period, and the first request after a cold period takes 5-30 seconds while the container boots. If latency matters, keep an instance warm by sending periodic lightweight requests during active hours.
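
The caching tip above can be sketched in-process. `cached_run` is a hypothetical helper; swap the dict for Redis with a TTL in production, and pass `replicate.run` as `run_fn`:

```python
import hashlib
import json

_cache = {}  # in production: Redis with a TTL

def cached_run(run_fn, model, input):
    """Memoize model calls on a stable hash of (model, input).

    run_fn is the function that actually calls the API,
    e.g. replicate.run in a real application.
    """
    key = hashlib.sha256(
        json.dumps({"model": model, "input": input}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = run_fn(model, input=input)
    return _cache[key]
```

Note that this only pays off for exact-match queries; semantic caching (embedding similarity) is a separate, heavier technique.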

Common Errors and Fixes

ReplicateError: You have reached your spending limit

You've hit the spending limit configured in your Replicate dashboard. Either raise the limit there or reduce your usage; there's no way to bypass it through the API.

Prediction times out

replicate.run() blocks until the model finishes, so long-running predictions can hit client or gateway timeouts. For slow models, switch to async predictions and poll (or use a webhook) instead of blocking:

import replicate
import time

prediction = replicate.predictions.create(
    model="model/name",
    input={"prompt": "..."},
)
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(2)
    prediction.reload()

ModelError: Model not found

Model identifiers include the version hash. If the model was updated, the old version may be gone. Use the model name without the hash to get the latest version, or pin a specific version in production.

Cold start latency is too high

For production workloads, keep the model warm with a periodic ping. A background job that sends a minimal prompt every 30-60 seconds keeps the container alive and eliminates most cold starts (note that plain cron can only fire once per minute, so use a loop or a finer-grained scheduler for sub-minute intervals).

Replicate vs. Self-Hosting

Use Replicate when: you’re prototyping, traffic is unpredictable, you don’t want to manage GPUs, or you need access to many different models without deploying each one.

Self-host when: you have consistent high traffic (cheaper at scale), need sub-100ms latency, have data privacy requirements that prevent sending data to third parties, or need custom model modifications that Cog doesn’t support.

The breakeven point is roughly 4-6 hours of continuous GPU usage per day. Below that, Replicate is cheaper. Above that, a reserved cloud GPU wins on cost.
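
A back-of-envelope version of that breakeven, with illustrative prices — both rates below are assumptions, so check current Replicate and cloud GPU pricing before deciding:

```python
# Assumed prices, for illustration only:
replicate_per_second = 0.0014   # $/s for a mid-range GPU on Replicate
reserved_per_hour = 1.30        # $/hr for a comparable reserved cloud GPU

replicate_per_hour = replicate_per_second * 3600   # cost per active hour
reserved_per_day = reserved_per_hour * 24          # always-on daily cost

# Hours of daily usage at which the two options cost the same:
breakeven_hours = reserved_per_day / replicate_per_hour
```

With these sample numbers the breakeven lands near the top of the 4-6 hour range; different GPU tiers shift it in either direction.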