W&B Weave is the tracing and observability layer from Weights & Biases, built specifically for LLM applications. It captures every call your app makes, logs inputs and outputs, tracks token usage, and lets you compare prompt versions side by side. If you have been stitching together print statements and spreadsheets to understand what your LLM pipeline is doing, Weave replaces all of that with a single init call and a decorator.
Here is the fastest way to get tracing running:
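A minimal sketch, assuming `weave` and `openai` are installed (`pip install weave openai`) and your W&B and OpenAI API keys are configured; the project and model names are illustrative:

```python
import weave
from openai import OpenAI

# Initializing Weave also patches the OpenAI client for auto-tracing.
weave.init("my-llm-project")  # example project name

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```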
That is it. Once you call weave.init(), Weave automatically patches the OpenAI client and logs every request and response to your W&B project. Open your browser, go to wandb.ai, navigate to your project, and you will see the trace appear with full input/output pairs, latency, and token counts.
Auto-Tracing OpenAI Calls
Weave hooks into the OpenAI SDK at the client level. Every call to client.chat.completions.create() gets intercepted, timed, and logged. You do not need to wrap anything or change your calling code. The integration captures:
- The full messages array you sent
- The model name and parameters (temperature, max_tokens, etc.)
- The complete response including finish reason
- Token counts for prompt, completion, and total
- Wall-clock latency in milliseconds
This works for both streaming and non-streaming calls. If you use stream=True, Weave buffers the chunks and logs the assembled response once the stream completes.
Handling Async Calls
Async OpenAI calls get traced the same way. No extra setup needed:
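A sketch under the same assumptions as above, using the async client:

```python
import asyncio

import weave
from openai import AsyncOpenAI

weave.init("my-llm-project")  # example project name
client = AsyncOpenAI()

async def main() -> None:
    # This call is traced exactly like its synchronous counterpart.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "One fun fact, please."}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```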
Both OpenAI and AsyncOpenAI are patched automatically by weave.init().
Custom Traced Functions with @weave.op()
The real power comes when you trace your own functions. The @weave.op() decorator logs every call to that function, including arguments, return values, and execution time. This lets you build a full trace tree for multi-step pipelines.
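A sketch of a two-step pipeline (model name and prompts are illustrative):

```python
import weave
from openai import OpenAI

weave.init("my-llm-project")  # example project name
client = OpenAI()

@weave.op()
def build_prompt(question: str) -> list[dict]:
    # Traced as a child span: arguments and the returned messages are logged.
    return [
        {"role": "system", "content": "Answer concisely."},
        {"role": "user", "content": question},
    ]

@weave.op()
def generate(question: str) -> str:
    # Traced as the parent span; build_prompt and the OpenAI call nest under it.
    messages = build_prompt(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content

print(generate("What is observability?"))
```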
In the Weave UI, you will see generate as the parent span with build_prompt and the OpenAI call as child spans nested underneath. This gives you a clear picture of where time is spent and what data flows through each step.
You can nest @weave.op() decorated functions as deeply as you want. Weave builds the trace tree automatically based on the call stack.
Logging Prompt Versions and Comparing Runs
One of the most useful things about Weave is comparing different prompt strategies. Since every traced call logs its full inputs, you can filter and compare runs by the prompt template you used.
A practical pattern is to version your prompts as named functions:
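A sketch of two prompt variants as separately named ops (model name and prompt wording are illustrative):

```python
import weave
from openai import OpenAI

weave.init("my-llm-project")  # example project name
client = OpenAI()

@weave.op()
def summarize_v1(text: str) -> str:
    # Baseline: a bare instruction in the user message.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    return response.choices[0].message.content

@weave.op()
def summarize_v2(text: str) -> str:
    # Variant: a system prompt that constrains the output format.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize in exactly three bullet points."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```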
In the Weave dashboard, filter by op name (summarize_v1 vs summarize_v2) to compare outputs, latency, and token usage side by side. Every time you modify a function decorated with @weave.op(), Weave captures the new source code as a new version. You can diff any two versions directly in the UI without maintaining a separate versioning system.
For more structured prompt management, wrap your configuration in a weave.Model subclass:
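A sketch of such a subclass; the class name `Summarizer` and its field values are illustrative. Configuration is declared as typed fields, and the traced entry point is a `predict` method decorated with `@weave.op()`:

```python
import weave
from openai import OpenAI

class Summarizer(weave.Model):
    # Each field is versioned configuration: changing any value
    # produces a new version of the model object in Weave.
    model_name: str
    system_prompt: str
    temperature: float

    @weave.op()
    def predict(self, text: str) -> str:
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content

weave.init("my-llm-project")  # example project name
model = Summarizer(
    model_name="gpt-4o-mini",
    system_prompt="Summarize in two sentences.",
    temperature=0.2,
)
print(model.predict("Weave traces every LLM call in your app."))
```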
When you change model_name, system_prompt, or temperature, Weave tracks it as a new version of the model object. You get a full history of every configuration you tried, linked to the traces that used it.
Tracking Token Usage and Cost
Every auto-traced OpenAI call includes token counts in the logged metadata. Weave pulls these directly from the API response’s usage field. You can see prompt tokens, completion tokens, and total tokens for every single call in the traces table.
To aggregate costs across an experiment programmatically, pull the data using the Weave client:
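A sketch of that aggregation. It assumes the client returned by `weave.init()` exposes a `get_calls()` iterator and that each call's `summary` includes a per-model `usage` dict with token counts; check the current Weave client documentation for the exact shape:

```python
import weave

client = weave.init("my-llm-project")  # example project name

total_tokens = 0
for call in client.get_calls():
    # Assumption: usage is keyed by model name, each entry carrying token counts.
    usage = (call.summary or {}).get("usage", {})
    for model_usage in usage.values():
        total_tokens += model_usage.get("total_tokens", 0)

print(f"Total tokens across traced calls: {total_tokens}")
```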
The Weave dashboard also shows token counts in the traces table. Sort by total tokens to find your most expensive calls without writing any code. For multi-model pipelines, Weave breaks down usage per model so you can see exactly where your budget goes.
Common Errors and Fixes
wandb.errors.UsageError: api_key not configured
You need to authenticate before weave.init() works. Run this once:
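The standard CLI flow, which prompts for your API key and stores it locally:

```shell
wandb login
```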
Or set the environment variable:
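For example (substitute your own key from wandb.ai/authorize):

```shell
export WANDB_API_KEY="<your-api-key>"
```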
Traces not appearing in the dashboard
Weave uploads traces in background threads. If your script exits immediately after the last API call, traces might not finish uploading. Call weave.finish() at the end of short scripts to flush pending data:
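A sketch of the pattern (project and model names are illustrative):

```python
import weave
from openai import OpenAI

weave.init("my-llm-project")  # example project name
client = OpenAI()

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Last call of the script."}],
)

# Flush pending traces before the process exits.
weave.finish()
```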
For long-running services, Weave flushes periodically on its own. But for one-off scripts and notebooks, always call weave.finish() or your last few traces may get lost.
OpenAI calls not being auto-traced
Make sure weave.init() is called before you create the OpenAI() client. Weave patches the SDK at init time. If you create the client first, it will not be instrumented:
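Both orderings side by side (project name illustrative):

```python
import weave
from openai import OpenAI

# Wrong: the client is created before weave.init(), so it is never patched
# and its calls will not appear in the dashboard.
# client = OpenAI()
# weave.init("my-llm-project")

# Right: initialize Weave first, then create the client.
weave.init("my-llm-project")
client = OpenAI()
```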
@weave.op() not capturing nested calls
The decorator must be applied before the function is called. If you dynamically construct functions or use functools.partial, the decorator might not see the call hierarchy. Stick to decorating regular functions and class methods directly.
TypeError when scorer functions have wrong parameter names
Weave evaluation scorers must use target and output as parameter names. If your dataset rows use different column names, the scorer will fail with a TypeError. Rename your scorer parameters to match, or use column_map on class-based scorers to remap argument names.
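A sketch of a scorer with the expected parameter names (the function name `exact_match` is illustrative; a scorer can also be decorated with `@weave.op()` to trace its calls):

```python
def exact_match(target: str, output: str) -> dict:
    # Parameter names matter: Weave passes the model's return value as
    # `output` and the matching dataset column as `target`.
    return {"correct": target.strip() == output.strip()}

print(exact_match(target="Paris", output=" Paris "))
```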
Related Guides
- How to Track ML Experiments with Weights and Biases
- How to Run Fast LLM Inference with the Groq API
- How to Use the Anthropic Prompt Caching API with Context Blocks
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Stability AI API for Image and Video Generation
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Mistral API for Code Generation and Chat