W&B Weave is the tracing and observability layer from Weights & Biases, built specifically for LLM applications. It captures every call your app makes, logs inputs and outputs, tracks token usage, and lets you compare prompt versions side by side. If you have been stitching together print statements and spreadsheets to understand what your LLM pipeline is doing, Weave replaces all of that with a single init call and a decorator.

Here is the fastest way to get tracing running:

pip install weave openai
import weave
from openai import OpenAI

weave.init("my-llm-project")
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain gradient descent in two sentences."}],
)
print(response.choices[0].message.content)

That is it. Once you call weave.init(), Weave automatically patches the OpenAI client and logs every request and response to your W&B project. Open your browser, go to wandb.ai, navigate to your project, and you will see the trace appear with full input/output pairs, latency, and token counts.

Auto-Tracing OpenAI Calls

Weave hooks into the OpenAI SDK at the client level. Every call to client.chat.completions.create() gets intercepted, timed, and logged. You do not need to wrap anything or change your calling code. The integration captures:

  • The full messages array you sent
  • The model name and parameters (temperature, max_tokens, etc.)
  • The complete response including finish reason
  • Token counts for prompt, completion, and total
  • Wall-clock latency in milliseconds

This works for both streaming and non-streaming calls. If you use stream=True, Weave buffers the chunks and logs the assembled response once the stream completes.
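Conceptually, that buffering step is just a fold over the content deltas. A minimal offline sketch of the idea (illustrative only, not Weave's actual internals):

```python
# Illustration: how content deltas from a stream=True call can be
# assembled into one complete response before logging.
def assemble_stream(deltas: list[str]) -> str:
    """Join content deltas in arrival order into the full message."""
    return "".join(deltas)

# Deltas as they might arrive chunk by chunk from the API
chunks = ["Gradient descent ", "iteratively updates ", "model parameters."]
print(assemble_stream(chunks))
# Gradient descent iteratively updates model parameters.
```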

Handling Async Calls

Async OpenAI calls get traced the same way. No extra setup needed:

import asyncio
import weave
from openai import AsyncOpenAI

weave.init("my-llm-project")
async_client = AsyncOpenAI()

async def ask(question: str) -> str:
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

asyncio.run(ask("What is RLHF?"))

Both OpenAI and AsyncOpenAI are patched automatically by weave.init().

Custom Traced Functions with @weave.op()

The real power comes when you trace your own functions. The @weave.op() decorator logs every call to that function, including arguments, return values, and execution time. This lets you build a full trace tree for multi-step pipelines.

import weave
from openai import OpenAI

weave.init("my-llm-project")
client = OpenAI()

@weave.op()
def build_prompt(topic: str, style: str) -> list[dict]:
    system_msg = f"You are a technical writer. Write in a {style} style."
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": f"Explain {topic} for a senior engineer."},
    ]

@weave.op()
def generate(topic: str, style: str) -> str:
    messages = build_prompt(topic, style)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7,
        max_tokens=500,
    )
    return response.choices[0].message.content

result = generate("vector databases", "concise")
print(result)

In the Weave UI, you will see generate as the parent span with build_prompt and the OpenAI call as child spans nested underneath. This gives you a clear picture of where time is spent and what data flows through each step.

You can nest @weave.op() decorated functions as deeply as you want. Weave builds the trace tree automatically based on the call stack.

Logging Prompt Versions and Comparing Runs

One of the most useful things about Weave is comparing different prompt strategies. Since every traced call logs its full inputs, you can filter and compare runs by the prompt template you used.

A practical pattern is to version your prompts as named functions:

import weave
from openai import OpenAI

weave.init("prompt-experiment")
client = OpenAI()

@weave.op()
def summarize_v1(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following text in one paragraph."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

@weave.op()
def summarize_v2(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise technical editor. Summarize the key points in 2-3 bullet points. Skip any fluff."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

sample_text = (
    "Retrieval-augmented generation combines a retriever module with a generative model. "
    "The retriever fetches relevant documents from a corpus, and the generator conditions on "
    "those documents to produce an answer. This approach reduces hallucination compared to pure "
    "generation because the model has factual grounding from retrieved passages."
)

print("--- V1 ---")
print(summarize_v1(sample_text))
print("--- V2 ---")
print(summarize_v2(sample_text))

In the Weave dashboard, filter by op name (summarize_v1 vs summarize_v2) to compare outputs, latency, and token usage side by side. Every time you modify a function decorated with @weave.op(), Weave captures the new source code as a new version. You can diff any two versions directly in the UI without maintaining a separate versioning system.
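One way to picture that versioning behavior is content-addressing: derive a version id from the source text, and the id changes exactly when the text changes. A rough sketch of the idea (Weave's real versioning is richer than this):

```python
import hashlib

# Illustration of content-addressed versioning: the id is a function
# of the source text, so any edit produces a new version id.
def version_id(source: str) -> str:
    return hashlib.sha256(source.encode()).hexdigest()[:12]

v1 = "Summarize the following text in one paragraph."
v2 = "Summarize the key points in 2-3 bullet points."

print(version_id(v1) == version_id(v1))  # True: identical source, same version
print(version_id(v1) == version_id(v2))  # False: edited source, new version
```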

For more structured prompt management, wrap your configuration in a weave.Model subclass:

import json
import weave
from openai import OpenAI

weave.init("prompt-experiment")

class Summarizer(weave.Model):
    model_name: str
    system_prompt: str
    temperature: float = 0.3

    @weave.op()
    def predict(self, text: str) -> str:
        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": text},
            ],
            temperature=self.temperature,
        )
        return response.choices[0].message.content

v1 = Summarizer(
    model_name="gpt-4o-mini",
    system_prompt="Summarize the text in one paragraph.",
)

v2 = Summarizer(
    model_name="gpt-4o-mini",
    system_prompt="List the 3 most important points from the text.",
    temperature=0.1,
)

When you change model_name, system_prompt, or temperature, Weave tracks it as a new version of the model object. You get a full history of every configuration you tried, linked to the traces that used it.

Tracking Token Usage and Cost

Every auto-traced OpenAI call includes token counts in the logged metadata. Weave pulls these directly from the API response’s usage field. You can see prompt tokens, completion tokens, and total tokens for every single call in the traces table.
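The usage object on a chat completions response has a stable three-key shape, so reading the counts back out of a logged trace is straightforward. A small helper, assuming the standard usage keys:

```python
# The chat completions API reports usage with these three counts;
# Weave logs the same numbers for every auto-traced call.
def extract_usage(usage: dict) -> tuple[int, int, int]:
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    total = usage.get("total_tokens", prompt + completion)
    return prompt, completion, total

sample = {"prompt_tokens": 57, "completion_tokens": 123, "total_tokens": 180}
print(extract_usage(sample))  # (57, 123, 180)
```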

To aggregate costs across an experiment programmatically, pull the data using the Weave client:

import weave

api = weave.init("my-llm-project")

# Cost per 1M tokens - adjust for your model and current pricing
COST_PER_1M_PROMPT = 2.50      # gpt-4o prompt tokens
COST_PER_1M_COMPLETION = 10.00  # gpt-4o completion tokens

calls = api.get_calls()
total_prompt_tokens = 0
total_completion_tokens = 0

for call in calls:
    usage = call.summary.get("usage", {})
    model_usage = usage.get("gpt-4o", {})
    total_prompt_tokens += model_usage.get("prompt_tokens", 0)
    total_completion_tokens += model_usage.get("completion_tokens", 0)

prompt_cost = (total_prompt_tokens / 1_000_000) * COST_PER_1M_PROMPT
completion_cost = (total_completion_tokens / 1_000_000) * COST_PER_1M_COMPLETION
total_cost = prompt_cost + completion_cost

print(f"Prompt tokens: {total_prompt_tokens:,}")
print(f"Completion tokens: {total_completion_tokens:,}")
print(f"Estimated cost: ${total_cost:.4f}")

The Weave dashboard also shows token counts in the traces table. Sort by total tokens to find your most expensive calls without writing any code. For multi-model pipelines, Weave breaks down usage per model so you can see exactly where your budget goes.
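Since each call's summary keys usage by model name, the single-model script above generalizes to a per-model table by folding over all calls. An offline sketch over mock summaries shaped like that usage payload:

```python
from collections import defaultdict

# Mock call summaries shaped like the per-model usage payload described above.
calls = [
    {"usage": {"gpt-4o": {"prompt_tokens": 100, "completion_tokens": 40}}},
    {"usage": {"gpt-4o-mini": {"prompt_tokens": 80, "completion_tokens": 25}}},
    {"usage": {"gpt-4o": {"prompt_tokens": 60, "completion_tokens": 30}}},
]

# Fold every call into a per-model running total
totals: dict[str, dict[str, int]] = defaultdict(lambda: {"prompt": 0, "completion": 0})
for call in calls:
    for model, u in call["usage"].items():
        totals[model]["prompt"] += u.get("prompt_tokens", 0)
        totals[model]["completion"] += u.get("completion_tokens", 0)

for model, t in sorted(totals.items()):
    print(f"{model}: {t['prompt']} prompt / {t['completion']} completion")
```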

Common Errors and Fixes

wandb.errors.UsageError: api_key not configured

You need to authenticate before weave.init() works. Run this once:

wandb login

Or set the environment variable:

export WANDB_API_KEY="your-api-key-here"

Traces not appearing in the dashboard

Weave uploads traces in background threads. If your script exits immediately after the last API call, traces might not finish uploading. Call weave.finish() at the end of short scripts to flush pending data:

weave.init("my-project")
# ... your code ...
weave.finish()

For long-running services, Weave flushes periodically on its own. But for one-off scripts and notebooks, always call weave.finish() or your last few traces may get lost.

OpenAI calls not being auto-traced

Make sure weave.init() is called before you create the OpenAI() client. Weave patches the SDK at init time. If you create the client first, it will not be instrumented:

# Wrong order - traces will be missing
client = OpenAI()
weave.init("my-project")

# Correct order
weave.init("my-project")
client = OpenAI()

@weave.op() not capturing nested calls

The decorator must be applied before the function is called. If you dynamically construct functions or use functools.partial, the decorator might not see the call hierarchy. Stick to decorating regular functions and class methods directly.
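The ordering pitfall is not specific to Weave; any tracing decorator only sees calls that go through the decorated wrapper. A minimal offline demonstration with a toy tracer:

```python
# Illustration of why decoration order matters: only calls that go
# through the decorated wrapper are recorded.
CALLS: list[str] = []

def traced(fn):
    def wrapper(*args, **kwargs):
        CALLS.append(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

def helper(x: int) -> int:
    return x * 2

helper(3)                      # raw call: nothing recorded
traced_helper = traced(helper)
traced_helper(3)               # wrapped call: recorded
print(CALLS)  # ['helper']
```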

TypeError when scorer parameter names do not match

Weave evaluation scorers receive the model's prediction through an output parameter, and every other scorer parameter must match a column name in your dataset. If the names do not line up, the evaluation fails with a TypeError about unexpected or missing keyword arguments. Rename the scorer parameters to match your columns, or use column_map on class-based scorers to remap dataset columns to scorer arguments.
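The remapping that column_map performs can be pictured as renaming dataset columns into the parameter names a scorer expects. A hypothetical sketch (the column_map name comes from Weave's scorer API; the helper and column names here are illustrative):

```python
# Hypothetical scorer expecting specific parameter names
def exact_match(target: str, output: str) -> bool:
    return target.strip() == output.strip()

# Sketch of the column_map idea: map scorer argument names to
# dataset column values before calling the scorer.
def apply_column_map(row: dict, column_map: dict) -> dict:
    return {arg: row[col] for arg, col in column_map.items()}

row = {"expected_answer": "42", "output": " 42 "}
kwargs = apply_column_map(row, {"target": "expected_answer", "output": "output"})
print(exact_match(**kwargs))  # True
```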