When your agent silently picks the wrong tool, loops 47 times, then returns garbage – good luck figuring out why from print statements. LangSmith gives you a full trace of every LLM call, tool invocation, and routing decision your agent makes, displayed as a nested run tree you can actually read.

This guide covers the three things that matter most: instrumenting your agent code for tracing, reading traces to diagnose real failures, and building datasets that catch regressions before they hit production.

Set Up LangSmith Tracing

Install the SDK and set your environment variables. You need a LangSmith API key from smith.langchain.com.

pip install -U langsmith langchain-openai langchain-core
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_pt_your-key-here"
export LANGSMITH_PROJECT="agent-debugging"
export OPENAI_API_KEY="sk-your-key-here"

LANGSMITH_PROJECT groups your traces. If you skip it, everything lands in a project called “default” and gets messy fast.
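If you work in a notebook, the same configuration can be set from Python before any client is created. A minimal sketch (the values are placeholders, same as above):

```python
import os

# Equivalent programmatic setup -- handy in notebooks where exporting
# shell variables is awkward. Set these BEFORE creating any clients.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "agent-debugging"
os.environ["LANGSMITH_API_KEY"] = "lsv2_pt_your-key-here"
```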

Instrument Your Agent with @traceable

The @traceable decorator logs function inputs, outputs, latency, and errors to LangSmith automatically. Every decorated function becomes a node in your trace tree.

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable(name="classify_intent")
def classify_intent(user_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the user intent as: search, calculate, or summarize. Reply with one word."},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content.strip().lower()

@traceable(name="run_tool")
def run_tool(intent: str, query: str) -> str:
    if intent == "search":
        return f"[search results for: {query}]"
    elif intent == "calculate":
        return f"[calculation result for: {query}]"
    elif intent == "summarize":
        return f"[summary of: {query}]"
    else:
        raise ValueError(f"Unknown intent: {intent}")

@traceable(name="agent_loop")
def agent_loop(user_query: str) -> str:
    intent = classify_intent(user_query)
    result = run_tool(intent, user_query)
    return result

wrap_openai is the key piece here. It patches the OpenAI client so every chat.completions.create call gets traced with the full prompt, response tokens, model parameters, and latency. Without it, your trace tree shows the agent steps but not the actual LLM calls inside them.

Run agent_loop("what is 2+2") and check the LangSmith UI. You will see a nested trace: agent_loop at the top, with classify_intent (containing the OpenAI call) and run_tool as its children.

Read Traces to Find Real Bugs

The trace tree in the LangSmith UI is where debugging actually happens. Each row is a “run” – an LLM call, tool invocation, or function execution. Here are the failure patterns you will see most often.

Infinite Agent Loops

Your agent calls a tool, gets a result, decides it needs to call the tool again, and repeats until it hits the recursion limit. In LangSmith, this shows up as a deeply nested trace with the same node name repeated dozens of times.

If you are using LangGraph, you will see this error:

GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition.

Fix it by setting an explicit recursion limit and adding a fallback:

from langgraph.graph import StateGraph
from langchain_core.messages import HumanMessage

# AgentState is your own state schema, e.g. a TypedDict with a "messages" key
graph = StateGraph(AgentState)
# ... define nodes and edges ...
app = graph.compile()

# Set a hard limit
result = app.invoke(
    {"messages": [HumanMessage(content="research quantum computing")]},
    config={"recursion_limit": 15},
)

In the LangSmith trace, check what the LLM returned right before the loop started. Usually the model is generating a tool call when it should be generating a final answer. The fix is almost always a prompt change – explicitly tell the model when to stop calling tools.
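Alongside the prompt fix, a hard cap in your own loop code is cheap insurance. A minimal sketch in plain Python (not LangGraph-specific; plan_next_step and execute_tool are placeholders for your agent's planning and tool logic, and the cap of 5 is illustrative):

```python
MAX_TOOL_STEPS = 5  # illustrative cap; tune per agent

def run_with_guard(plan_next_step, execute_tool, query: str) -> str:
    """Cap tool calls and force a final answer when the cap is hit.

    plan_next_step(query, history) returns ("tool", tool_name) or
    ("final", answer); execute_tool(tool_name, query) returns a result.
    """
    history = []
    for _ in range(MAX_TOOL_STEPS):
        kind, value = plan_next_step(query, history)
        if kind == "final":
            return value  # the model decided to stop on its own
        history.append(execute_tool(value, query))
    # Cap reached: degrade gracefully instead of raising
    return f"Partial answer based on {len(history)} tool result(s)."
```

The fallback return is the important part: when the cap trips, the user gets a partial answer instead of a recursion error.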

Tool Call Failures

When a tool raises an exception, LangSmith marks that run in red and captures the full traceback. Common ones:

  • ValidationError: The LLM passed arguments that do not match the tool’s schema. Check the “Inputs” tab on the tool run – you will often see the model hallucinating parameter names.
  • TypeError: missing required argument: The model skipped a required field. Your tool description needs to be more explicit about required vs. optional parameters.
  • AuthenticationError / HTTPError 401: An API key is missing or expired. This shows up in traces as a failed child run under the tool node.

Slow Traces

Click any run in the trace tree to see its latency. If your agent takes 30 seconds but the LLM calls only account for 8 seconds, something in your tool execution is blocking. LangSmith’s timing breakdown makes this obvious – sort runs by duration and look for the outlier.
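The same outlier hunt can be scripted over exported run data. A small sketch (the dict fields mirror what the UI shows, but the exact record shape here is illustrative):

```python
def slowest_runs(runs: list[dict], top: int = 3) -> list[dict]:
    """Return the `top` longest runs, duration computed from timestamps."""
    return sorted(
        runs,
        key=lambda r: r["end_time"] - r["start_time"],
        reverse=True,
    )[:top]
```

Feed it the flattened child runs of a slow trace and look at which names dominate the top of the list.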

Build Datasets for Regression Testing

Traces are great for debugging one-off failures. Datasets let you catch regressions systematically. The workflow: create a dataset of input/expected-output pairs, run your agent against all of them, and score the results with evaluators.

from langsmith import Client
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

client = Client()

# Create a dataset from known-good examples
dataset = client.create_dataset("agent-intent-tests")
client.create_examples(
    inputs=[
        {"query": "what is 15% of 200"},
        {"query": "summarize the latest AI news"},
        {"query": "search for Python packaging best practices"},
        {"query": "calculate compound interest on $1000 at 5% for 3 years"},
    ],
    outputs=[
        {"expected_intent": "calculate"},
        {"expected_intent": "summarize"},
        {"expected_intent": "search"},
        {"expected_intent": "calculate"},
    ],
    dataset_id=dataset.id,
)

# Define what your agent returns
def predict(inputs: dict) -> dict:
    intent = classify_intent(inputs["query"])
    return {"predicted_intent": intent}

# Score it
def intent_accuracy(run: Run, example: Example) -> dict:
    predicted = run.outputs["predicted_intent"]
    expected = example.outputs["expected_intent"]
    return {"key": "intent_match", "score": int(predicted == expected)}

results = evaluate(
    predict,
    data="agent-intent-tests",
    evaluators=[intent_accuracy],
    experiment_prefix="intent-v1",
    description="Baseline intent classification accuracy",
)

This creates an “experiment” in LangSmith that you can compare against future runs. Change the model, tweak the prompt, then run the same evaluation with experiment_prefix="intent-v2" – the UI shows a side-by-side diff of scores.

Turn Production Traces Into Test Cases

The best datasets come from real failures. In the LangSmith UI, filter traces by error status, find a representative failure, and click “Add to Dataset.” This turns the actual inputs that broke your agent into a permanent regression test. Over time, your dataset grows into a comprehensive test suite that reflects real-world usage patterns, not just the cases you thought to write by hand.
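The same harvesting can be scripted with the SDK's Client.list_runs, which accepts an error=True filter. A sketch (failed_run_to_example is a hypothetical helper name; the Client import is deferred so the pure helper stays usable without credentials):

```python
def failed_run_to_example(run) -> dict:
    """Shape a failed run into a dataset example payload."""
    return {"inputs": run.inputs, "outputs": {"observed_error": run.error}}

def harvest_failures(project: str, dataset_name: str, limit: int = 20) -> int:
    # Imported here so the helper above works without the SDK configured
    from langsmith import Client

    client = Client()
    dataset = client.create_dataset(dataset_name)
    examples = [
        failed_run_to_example(run)
        for run in client.list_runs(project_name=project, error=True, limit=limit)
    ]
    for ex in examples:
        client.create_example(
            inputs=ex["inputs"], outputs=ex["outputs"], dataset_id=dataset.id
        )
    return len(examples)
```

Run it on a schedule, then replace the observed_error outputs with the answers the agent should have given before using the dataset in evaluations.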

Evaluate Agent Trajectories

For complex agents, checking the final output is not enough. You need to verify the agent took the right steps. LangSmith supports trajectory evaluators that score the sequence of tool calls, not just the end result.

def trajectory_check(run: Run, example: Example) -> dict:
    """Check that the agent called the expected tools in order."""
    # Extract tool calls from the run's child runs
    child_runs = [r for r in run.child_runs or [] if r.run_type == "tool"]
    tool_names = [r.name for r in child_runs]

    expected_tools = example.outputs.get("expected_tools", [])
    match = tool_names == expected_tools
    return {"key": "trajectory_match", "score": int(match)}

This catches a sneaky class of bugs: agents that return the right answer but through a wasteful or unreliable path. If your agent calls a search tool three times when once would suffice, the final answer might still be correct, but the trajectory evaluator flags the inefficiency.
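Exact-match trajectories can also be too strict when a retry or an extra lookup is acceptable. One lenient variant checks that the expected tools appear in order while allowing extras in between (a sketch, not a built-in LangSmith evaluator):

```python
def tools_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` appears as an in-order subsequence of `actual`."""
    it = iter(actual)
    # `tool in it` consumes the iterator up to the match, enforcing order
    return all(tool in it for tool in expected)
```

Swap this into the evaluator above when extra tool calls are tolerable but out-of-order ones are not.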

Automate It in CI

Run evaluations on every pull request to prevent regressions from shipping:

# In your CI pipeline: run_evals.py is a small script of your own that
# wraps the evaluate() call from the previous section
python run_evals.py \
  --dataset "agent-intent-tests" \
  --experiment-prefix "ci-$(git rev-parse --short HEAD)"

Here run_evals.py is a wrapper script you write yourself – it parses these flags and calls evaluate() with them.

Set a threshold on your evaluator scores. If intent accuracy drops below 90%, fail the build. This turns your LangSmith datasets into the same kind of safety net that unit tests provide for regular code.
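The threshold check itself is a few lines. A sketch (the 0.9 threshold and mean aggregation are illustrative; wire it to the per-example scores your evaluators return):

```python
def passes_gate(scores: list[float], threshold: float = 0.9) -> bool:
    """True when the mean evaluator score meets the threshold."""
    return bool(scores) and (sum(scores) / len(scores)) >= threshold

# In CI, after the evaluate() run:
#   sys.exit(0 if passes_gate(scores) else 1)
```

An empty score list fails the gate deliberately: zero examples evaluated usually means the run itself broke.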

Common Setup Mistakes

Traces not appearing in the UI: Check that LANGSMITH_TRACING is set to the lowercase string true – other spellings such as "True" or 1 may not be recognized. Also verify your API key starts with lsv2_ (personal access tokens begin with lsv2_pt_, service keys with lsv2_sk_).

Nested traces showing as flat: If child runs appear as separate top-level traces instead of nested under a parent, you are probably creating new threads instead of passing context. Make sure your @traceable functions call each other directly – do not dispatch them through a thread pool or async queue without propagating the LangSmith run context.

“Project not found” errors: The project specified in LANGSMITH_PROJECT gets created automatically on the first trace. But if you typo the name and then look for traces in a different project, you will stare at an empty dashboard wondering why nothing shows up. Check the project dropdown in the UI.