Modal lets you run AI workloads on serverless GPUs without managing infrastructure. You write Python functions, Modal handles the containers, GPUs, and scaling. Here’s how to deploy inference endpoints, run batch jobs, and fine-tune models without touching kubectl.
## Deploy a Hugging Face Model as an API
This example deploys a text embedding model as a serverless endpoint. Modal downloads the model weights once, caches them in an image, and scales instances automatically based on traffic.
```python
import modal

# Define the container image with dependencies
image = modal.Image.debian_slim().pip_install(
    "transformers",
    "torch",
    "sentence-transformers",
)

app = modal.App("embeddings-api")

# Load model at container startup (runs once per instance)
@app.cls(
    image=image,
    gpu="T4",  # Request an NVIDIA T4 GPU
    container_idle_timeout=300,
)
class EmbeddingModel:
    @modal.enter()
    def load_model(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    @modal.method()
    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()

@app.function()
@modal.web_endpoint(method="POST")
def api(body: dict):
    texts = body["texts"]
    model = EmbeddingModel()
    embeddings = model.embed.remote(texts)
    return {"embeddings": embeddings, "count": len(embeddings)}
```
Deploy with `modal deploy embeddings_api.py`. Modal returns a public HTTPS endpoint. Call it with `curl`:
```bash
curl -X POST https://your-app--api.modal.run \
  -H "Content-Type: application/json" \
  -d '{"texts": ["serverless AI is fast", "no kubernetes needed"]}'
```
The first cold start takes 10-20 seconds to pull the image and load the model. After that, requests hit warm containers in under 100ms. Modal auto-scales from zero to hundreds of GPUs based on traffic.
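The same request can be issued from Python with only the standard library. The URL below is a placeholder — substitute the endpoint that `modal deploy` prints for your app:

```python
import json
import urllib.request

# Placeholder URL — use the endpoint printed by `modal deploy`
url = "https://your-app--api.modal.run"

payload = json.dumps({"texts": ["serverless AI is fast", "no kubernetes needed"]})
req = urllib.request.Request(
    url,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:  # uncomment once deployed
#     print(json.load(resp)["count"])
```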
## Run Batch Embedding Jobs
For processing large datasets, use Modal’s `.map()` to parallelize across dozens of GPUs. This example embeds 100,000 product descriptions in minutes.
```python
import modal

app = modal.App("batch-embeddings")

image = modal.Image.debian_slim().pip_install(
    "sentence-transformers",
    "torch",
)

@app.function(
    image=image,
    gpu="T4",
    timeout=600,
)
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()

@app.local_entrypoint()
def main():
    # Load your dataset (strip newlines so they don't end up in the embeddings)
    with open("products.txt") as f:
        all_texts = [line.strip() for line in f if line.strip()]

    # Split into chunks of 100
    batch_size = 100
    chunks = [all_texts[i:i + batch_size] for i in range(0, len(all_texts), batch_size)]

    # Map across GPUs (each chunk runs on a separate container)
    results = list(embed_batch.map(chunks))

    # Flatten and save
    embeddings = [emb for batch in results for emb in batch]
    print(f"Embedded {len(embeddings)} texts")
```
Run with `modal run batch_embeddings.py`. Modal spins up containers in parallel (default limit: 100 concurrent), processes all chunks, and shuts down when done. You pay only for GPU seconds used.
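The chunk-and-flatten bookkeeping in `main()` is plain Python, so you can sanity-check it locally before spending GPU time (`chunk` is a hypothetical helper, not a Modal API):

```python
def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f"product {i}" for i in range(250)]
batches = chunk(texts, 100)                 # batches of 100, 100, 50
flattened = [t for b in batches for t in b]
assert flattened == texts                   # order is preserved after flattening
```

Because `.map()` returns results in input order, flattening the per-batch results reproduces the original ordering of your dataset.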
## Mount Model Weights with Volumes
Downloading 7B+ parameter models on every cold start wastes time and bandwidth. Use Modal volumes to persist model weights across runs.
```python
import modal

app = modal.App("llm-inference")

# Create a persistent volume
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

image = modal.Image.debian_slim().pip_install(
    "transformers",
    "torch",
    "accelerate",
)

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/cache": volume},
    # Llama 2 weights are gated: provide a Hugging Face token via a Modal secret
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=1800,
)
def generate(prompt: str) -> str:
    import os
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use the volume as the Hugging Face cache
    os.environ["HF_HOME"] = "/cache"

    model_name = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype="auto",
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

@app.local_entrypoint()
def main(prompt: str = "Explain serverless computing:"):
    result = generate.remote(prompt)
    print(result)
```
The first run downloads the ~13 GB of model weights to `/cache`. Subsequent runs reuse the cached weights, cutting cold start time from roughly 2 minutes to about 15 seconds. Note that `meta-llama/Llama-2-7b-chat-hf` is a gated model, so the container needs a Hugging Face access token (for example, supplied through a Modal secret).
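To confirm the volume is actually being reused rather than re-downloaded, a small stdlib helper (hypothetical, not part of Modal) can report how much data is sitting under `/cache` from inside the function:

```python
import os

def dir_size_gb(root: str) -> float:
    """Total size of all regular files under root, in GB."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e9
```

Logging `dir_size_gb("/cache")` at the top of `generate()` makes it obvious whether the weights were found in the volume (~13 GB) or fetched fresh (0 GB at start).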
## Schedule Fine-Tuning Jobs with Cron
Run periodic fine-tuning jobs on fresh data. This example retrains a classifier daily at 2 AM UTC.
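`modal.Cron("0 2 * * *")` reads as: minute 0, hour 2, any day of month, any month, any day of week. The daily-at-2-AM semantics can be sketched in plain Python (illustration only — Modal's scheduler handles this for you):

```python
from datetime import datetime, timedelta, timezone

def next_daily_run(now: datetime, hour: int = 2) -> datetime:
    """Next occurrence of hour:00 UTC, as Cron('0 2 * * *') would fire."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # already past today's slot: fire tomorrow
    return candidate

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(next_daily_run(now))  # 2024-06-02 02:00:00+00:00
```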
```python
import modal

app = modal.App("daily-finetune")

image = modal.Image.debian_slim().pip_install(
    "transformers",
    "torch",
    "datasets",
    "s3fs",  # required for datasets to read s3:// paths
)

# Persist trained checkpoints across runs
model_volume = modal.Volume.from_name("finetuned-models", create_if_missing=True)

@app.function(
    image=image,
    gpu="A100",
    timeout=7200,
    volumes={"/models": model_volume},
    schedule=modal.Cron("0 2 * * *"),  # 2 AM UTC daily
)
def finetune_job():
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )
    from datasets import load_dataset

    # Load fresh data from your warehouse (assumes "text" and "label" columns)
    dataset = load_dataset("csv", data_files="s3://your-bucket/latest_data.csv")

    # Tokenize before training: Trainer expects model-ready inputs, not raw text
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2,
    )

    training_args = TrainingArguments(
        output_dir="/tmp/results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        logging_steps=100,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
    )
    trainer.train()

    # Save to the mounted volume so the model survives container shutdown
    trainer.save_model("/models/classifier-latest")
    return "Fine-tuning complete"
```
Deploy with `modal deploy daily_finetune.py`. Modal runs the job on schedule. No servers sit idle between runs.
## Common Errors and Fixes
**"GPU request denied" during high demand:** Modal shares a GPU pool across users. If A100s are all allocated, downgrade to an A10G or T4, or add `retries=3` to your function decorator. Modal automatically retries when capacity becomes available.
**Cold starts taking 60+ seconds:** Your image is too large. Split model downloads into a separate build step using `modal.Image.run_commands()`, or cache weights in a volume. Avoid installing unnecessary pip packages.
**"Volume not found" error:** Volumes must be created before use. Either create one manually with `modal volume create model-cache` or use `Volume.from_name("name", create_if_missing=True)` in your code.
**Timeout on long-running jobs:** The default timeout is 5 minutes. Set `timeout=3600` (in seconds) in your function decorator for jobs like fine-tuning that run for hours.
**OOM errors on GPU:** Your model is too large for the GPU's memory. Use `device_map="auto"` with Hugging Face Transformers to split layers across CPU and GPU, or switch to a larger GPU tier (an A10G has 24 GB; an A100 has 40 or 80 GB).
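A back-of-envelope check helps pick the right tier before hitting OOM: the weights alone need roughly parameter count times bytes per parameter, and activations plus optimizer state add more on top.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory footprint; fp16/bf16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))     # 14.0 — a 7B fp16 model already strains a 16 GB T4
print(weight_memory_gb(7e9, 1))  # 7.0  — 8-bit quantization halves that
```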
**Import errors in cold starts:** Dependencies must be installed in the image definition; having them installed locally is not enough. Move all `pip_install()` calls to the `modal.Image` definition at the top of your file, and import heavy libraries inside the function body.
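The fix for that last error is the pattern used throughout the examples above: declare dependencies on the image, then defer heavy imports into the function body so only the remote container needs them. A minimal illustration of the deferred-import pattern, using `json` as a stand-in for a heavy library:

```python
def handler(payload: dict) -> str:
    # Deferred import: the module is resolved where the function actually
    # runs (the remote container), not on the machine that calls .remote()
    import json
    return json.dumps(payload)

print(handler({"ok": True}))  # {"ok": true}
```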