LLM apps fail in ways traditional software doesn’t. A model call that worked yesterday returns garbage today because the provider updated weights or your prompt drifted. You can’t monitor this with HTTP status codes alone – you need to see every input, output, token count, latency, and cost for every call your app makes. That’s what LangSmith does.

LangSmith is LangChain’s observability platform. It captures full traces of LLM interactions, lets you build evaluation datasets, and tracks cost and latency over time. It works with LangChain, but also with raw OpenAI calls and any other LLM provider through manual instrumentation.

Install and Configure

pip install langsmith langchain-openai langchain-core

You need two things: a LangSmith API key and some environment variables. Sign up at smith.langchain.com, create an API key under Settings, then export these:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_pt_your_key_here"
export LANGCHAIN_PROJECT="my-llm-app"
export OPENAI_API_KEY="sk-your-openai-key"

LANGCHAIN_TRACING_V2=true is the kill switch. Set it to false or remove it entirely to stop sending traces. LANGCHAIN_PROJECT groups traces by application – use different project names for staging vs. production so you don’t pollute your dashboards.

Auto-Tracing with LangChain

If you’re already using LangChain, tracing is automatic. The moment those environment variables are set, every chain invocation sends a trace to LangSmith with zero code changes.

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

model = ChatOpenAI(model="gpt-4o", temperature=0)

messages = [
    SystemMessage(content="You are a senior Python developer. Be concise."),
    HumanMessage(content="How do I retry a failed HTTP request with exponential backoff?"),
]

response = model.invoke(messages)
print(response.content)

Run this and check your LangSmith dashboard. You’ll see a trace with the full input messages, the model’s response, token counts (prompt + completion), latency in milliseconds, and the estimated cost. Each trace is a tree – if you chain multiple calls together, LangSmith nests them so you can see parent-child relationships.

Manual Tracing with @traceable

You don’t need LangChain at all. The langsmith SDK gives you the @traceable decorator, which wraps any function and sends its inputs and outputs to LangSmith.

import openai
from langsmith import traceable

client = openai.OpenAI()

@traceable(run_type="llm", name="generate-answer")
def ask_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer in one paragraph."},
            {"role": "user", "content": question},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

@traceable(run_type="chain", name="qa-pipeline")
def qa_pipeline(question: str) -> dict:
    answer = ask_question(question)
    return {"question": question, "answer": answer}

result = qa_pipeline("What causes CUDA out of memory errors?")
print(result["answer"])

The run_type parameter matters. Use "llm" for functions that directly call a model, "chain" for orchestration functions, "tool" for tool executions, and "retriever" for search/retrieval steps. LangSmith renders these differently in the trace view – LLM runs show token breakdowns while chain runs show nested call trees.

When you nest @traceable functions like this, LangSmith automatically builds a parent-child trace hierarchy. The qa-pipeline run appears as the root, with generate-answer nested underneath it.

Wrap OpenAI Directly

If you want full token-level tracking without @traceable on every function, use wrap_openai. This wraps the OpenAI client so that every call through it reports a trace automatically.

from langsmith.wrappers import wrap_openai
import openai

client = wrap_openai(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain attention mechanisms in 3 sentences."}],
)
print(response.choices[0].message.content)

Every call through the wrapped client now shows up in LangSmith with model name, token counts, and latency. This is the lowest-effort integration if you’re using the OpenAI SDK directly.

Trace an Agent Run

For agent workflows where an LLM calls tools in a loop, tracing becomes essential. Without it, answering “why did the agent call the wrong tool on step 4?” is nearly impossible.

from langsmith import traceable
import openai
import json

client = openai.OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

def get_weather(city: str) -> str:
    # Simulated -- replace with real API call
    return json.dumps({"city": city, "temp_f": 72, "condition": "sunny"})

@traceable(run_type="chain", name="weather-agent")
def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]

    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        messages.append(msg)
        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            result = get_weather(**fn_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

        final = client.chat.completions.create(
            model="gpt-4o", messages=messages
        )
        return final.choices[0].message.content

    return msg.content

answer = run_agent("What's the weather in Austin?")
print(answer)

In the LangSmith dashboard, this trace shows the entire agent loop: the initial LLM call, the tool invocation, and the final response generation. You can see exactly what the model decided to call and what arguments it passed.
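The agent above calls get_weather directly because it only has one tool. With several tools, you would dispatch on tool_call.function.name instead – a sketch, where get_time and TOOL_REGISTRY are hypothetical names for illustration:

```python
import json

def get_weather(city: str) -> str:
    # Simulated -- replace with real API call
    return json.dumps({"city": city, "temp_f": 72, "condition": "sunny"})

def get_time(city: str) -> str:
    # Hypothetical second tool, to show dispatch by name
    return json.dumps({"city": city, "local_time": "14:05"})

# Map tool names (as declared in the tools schema) to implementations
TOOL_REGISTRY = {
    "get_weather": get_weather,
    "get_time": get_time,
}

def execute_tool(fn_name: str, fn_args_json: str) -> str:
    fn = TOOL_REGISTRY.get(fn_name)
    if fn is None:
        # Return an error string the model can see and recover from
        return json.dumps({"error": f"unknown tool: {fn_name}"})
    return fn(**json.loads(fn_args_json))
```

Inside the loop you would replace the direct call with execute_tool(tool_call.function.name, tool_call.function.arguments); each tool function is also a natural place for @traceable(run_type="tool").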

Build Evaluation Datasets

Monitoring tells you something went wrong. Evaluations tell you what went wrong and whether it’s getting worse. LangSmith lets you create datasets of input-output examples and run evaluators against them.

from langsmith import Client

ls_client = Client()

# Create a dataset
dataset = ls_client.create_dataset(
    "qa-golden-set",
    description="Curated question-answer pairs for regression testing",
)

# Add examples
ls_client.create_examples(
    inputs=[
        {"question": "What is the capital of France?"},
        {"question": "What language is PyTorch written in?"},
        {"question": "Who created Linux?"},
    ],
    outputs=[
        {"answer": "Paris"},
        {"answer": "C++ and Python"},
        {"answer": "Linus Torvalds"},
    ],
    dataset_id=dataset.id,
)

Now run your LLM against this dataset and score the results:

from langsmith.evaluation import evaluate
import openai

client = openai.OpenAI()

def predict(inputs: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return {"answer": response.choices[0].message.content}

def correctness(run, example) -> dict:
    predicted = run.outputs["answer"].lower()
    expected = example.outputs["answer"].lower()
    score = 1.0 if expected in predicted else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    predict,
    data="qa-golden-set",
    evaluators=[correctness],
    experiment_prefix="gpt4o-baseline",
)

LangSmith stores every evaluation run as an experiment. You can compare experiments side by side in the dashboard – this is how you catch regressions when you change a prompt or swap models.

Track Costs and Latency

LangSmith automatically computes cost estimates based on the model and token counts. In the dashboard, you can filter by project and time range to see:

  • Total cost per day/week/month
  • P50/P95/P99 latency for each trace type
  • Token usage broken down by prompt vs. completion
  • Error rate as a percentage of total runs

For programmatic access, use the SDK to query run stats:

from langsmith import Client
from datetime import datetime, timedelta

ls_client = Client()

runs = ls_client.list_runs(
    project_name="my-llm-app",
    start_time=datetime.now() - timedelta(days=7),
    run_type="llm",
)

total_tokens = 0
total_cost = 0.0
for run in runs:
    if run.total_tokens:
        total_tokens += run.total_tokens
    if run.total_cost:
        total_cost += run.total_cost

print(f"Last 7 days: {total_tokens:,} tokens, ${total_cost:.2f}")
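You can compute latency percentiles from the same run list. A sketch, assuming you build `latencies` from each run’s timestamps (e.g. (run.end_time - run.start_time).total_seconds()); the sample numbers are made up:

```python
import math

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# In practice: latencies from list_runs() results, in seconds
latencies = [0.42, 0.51, 0.48, 2.10, 0.46, 0.50, 0.47, 4.90, 0.49, 0.45]
print(f"P50={percentile(latencies, 50):.2f}s  P95={percentile(latencies, 95):.2f}s")
```

The gap between P50 and P95 here is exactly the kind of tail you want an alert on.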

Set Up Rules and Alerts

LangSmith supports automated rules that trigger on trace conditions. You configure these in the dashboard under your project settings. Typical rules you should set up:

  • Latency threshold: Alert when P95 latency exceeds your SLA (e.g., 5 seconds)
  • Error spike: Alert when error rate exceeds 5% over a 15-minute window
  • Cost anomaly: Alert when daily cost exceeds 2x the trailing 7-day average
  • Sentiment/quality: Run an LLM-as-judge evaluator on a sample of production traces and alert on quality drops
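The cost-anomaly rule is simple enough to sketch as plain code, if you’d rather run the check yourself against daily cost totals (the numbers here are made up):

```python
def cost_anomaly(daily_costs: list, today: float, factor: float = 2.0) -> bool:
    """True if today's cost exceeds `factor` times the trailing average."""
    if not daily_costs:
        return False  # no baseline yet, nothing to compare against
    trailing_avg = sum(daily_costs) / len(daily_costs)
    return today > factor * trailing_avg

week = [12.0, 11.5, 13.2, 12.8, 12.1, 11.9, 12.5]  # trailing 7 days, USD
print(cost_anomaly(week, today=30.0))  # well above 2x the ~$12.30 average
```

You could feed this from the token/cost aggregation shown in the previous section and page the on-call when it returns True.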

You can also use online evaluators that run automatically on incoming traces. These attach scores to production runs in real time, which is useful for catching quality degradation without waiting for a manual evaluation pass.

Common Errors and Fixes

langsmith.utils.LangSmithAuthError: Invalid API key

langsmith.utils.LangSmithAuthError: Invalid API key. Please check your LANGCHAIN_API_KEY.

Your LANGCHAIN_API_KEY is wrong or expired. Go to smith.langchain.com, regenerate the key under Settings > API Keys, and re-export it. A common mistake is setting LANGSMITH_API_KEY instead of LANGCHAIN_API_KEY – the environment variable name uses the LANGCHAIN_ prefix.

Traces Not Appearing in Dashboard

You set everything up but the dashboard is empty. Check these in order:

  1. Is LANGCHAIN_TRACING_V2 set to true (string, not boolean)? Print it: python -c "import os; print(os.environ.get('LANGCHAIN_TRACING_V2'))"
  2. Are you looking at the right project? The default project is "default" unless you set LANGCHAIN_PROJECT.
  3. Is your firewall blocking outbound HTTPS to api.smith.langchain.com?
  4. Check for silent failures by enabling debug logging:
import logging
logging.getLogger("langsmith").setLevel(logging.DEBUG)

langsmith.utils.LangSmithConnectionError: Connection refused

This usually means you’re pointing at a self-hosted LangSmith instance that’s down. If you’re using the hosted version, make sure LANGCHAIN_ENDPOINT is not set or is set to https://api.smith.langchain.com. Setting it to localhost or an internal URL will bypass the hosted service.

LangSmithRateLimitError: Too many requests

The free tier has rate limits on trace ingestion. If you’re sending thousands of traces per minute in production, you’ll hit this. Two options:

  • Upgrade to a paid plan for higher limits
  • Use sampling to trace only a percentage of requests:
import os
import random

# Trace 10% of requests
if random.random() < 0.1:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
else:
    os.environ["LANGCHAIN_TRACING_V2"] = "false"

Setting the LANGCHAIN_CALLBACKS_BACKGROUND environment variable to true also helps: traces are sent asynchronously so ingestion doesn’t block your request path. Note that it doesn’t reduce trace volume, though – sampling is what actually keeps you under the limit.

High Latency from Tracing Overhead

Tracing adds network calls. If you’re seeing added latency, set LANGCHAIN_CALLBACKS_BACKGROUND=true to send traces in a background thread. This makes trace delivery best-effort but keeps your request latency clean.

export LANGCHAIN_CALLBACKS_BACKGROUND=true

Production Recommendations

Use separate projects for each environment. Name them my-app-dev, my-app-staging, my-app-prod. This keeps your production dashboards clean and lets you set different alerting thresholds per environment.

Sample in production, trace everything in staging. Full tracing on every production request adds cost and latency. Trace 100% in staging for debugging, 5-20% in production for monitoring.
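One way to wire those per-environment rates is to decide at process startup whether a given worker traces at all. A sketch – SAMPLE_RATES and configure_tracing are hypothetical names, not LangSmith APIs:

```python
import os
import random

# Hypothetical per-environment sampling rates (fraction of workers that trace)
SAMPLE_RATES = {"dev": 1.0, "staging": 1.0, "prod": 0.1}

def configure_tracing(env: str) -> bool:
    """Decide once at process start whether this worker sends traces."""
    rate = SAMPLE_RATES.get(env, 0.0)  # unknown environments trace nothing
    enabled = random.random() < rate
    os.environ["LANGCHAIN_TRACING_V2"] = "true" if enabled else "false"
    return enabled
```

Deciding per process rather than per request keeps whole traces intact – sampled workers capture every step of their runs instead of leaving partial trees.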

Build a golden dataset early. Collect 50-100 representative inputs with expected outputs before you ship. Run evaluations against this dataset on every prompt change. This catches regressions that unit tests can’t.

Tag traces with metadata. Use the metadata parameter in @traceable to attach version numbers, A/B test groups, or user segments. This makes filtering in the dashboard much more useful.

@traceable(
    run_type="chain",
    name="qa-pipeline",
    metadata={"version": "2.1", "ab_group": "treatment"},
)
def qa_pipeline(question: str) -> dict:
    ...