The Full Agent Loop in 30 Seconds

A customer support agent needs three capabilities: answer questions from your docs (RAG), take actions like checking order status (tool calling), and know when to hand off to a human (escalation). Here’s the architecture:

  1. User sends a message
  2. Retrieve relevant docs from ChromaDB
  3. Call OpenAI with the context, conversation history, and tool definitions
  4. If the model calls a tool, execute it and feed the result back
  5. If the model’s answer has low confidence, escalate to a human
  6. Return the response

Let’s build each piece.

Setting Up the Knowledge Base with ChromaDB

First, load your support docs into a vector store. ChromaDB handles embedding and retrieval out of the box.

import chromadb
from chromadb.utils import embedding_functions

# Use OpenAI embeddings for better semantic matching
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small"
)

client = chromadb.PersistentClient(path="./support_kb")
collection = client.get_or_create_collection(
    name="support_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Ingest your docs — chunk them first in production
docs = [
    "Refund policy: Full refunds within 30 days of purchase. After 30 days, store credit only.",
    "Shipping: Standard 5-7 business days. Express 1-2 business days. Free shipping over $50.",
    "Returns: Items must be unused and in original packaging. Start a return at account.example.com/returns.",
    "Account issues: Reset password at account.example.com/reset. Two-factor auth available in settings.",
]

collection.add(
    documents=docs,
    ids=[f"doc_{i}" for i in range(len(docs))],
    metadatas=[{"source": "faq"} for _ in docs]
)

In production, you’d chunk longer documents (aim for 200-500 tokens per chunk), add metadata like last_updated and category, and refresh the index on a schedule. But this gets you running.
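That chunking step can be sketched with a simple word-based splitter. This is a rough stand-in, not a production chunker: `chunk_text` is a hypothetical helper, and ~300 words loosely approximates the 200-500 token range for English prose.

```python
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split a document into overlapping word windows.

    Word counts are a rough proxy for tokens; production chunkers
    usually count real tokens with a tokenizer and split on
    paragraph or sentence boundaries.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

# A long policy document becomes several overlapping chunks,
# each added to the collection with its own id and metadata.
long_doc = "word " * 800  # stand-in for a real support doc
chunks = chunk_text(long_doc)
ids = [f"policy_{i}" for i in range(len(chunks))]
metadatas = [{"source": "policy", "chunk": i} for i in range(len(chunks))]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.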

Defining Tools for the Agent

OpenAI function calling lets the model decide when to use tools. Define the tools as JSON schemas and write the actual implementations.

import json
from openai import OpenAI

client_oai = OpenAI()

# Tool definitions for OpenAI
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order_status",
            "description": "Look up the current status of a customer order by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Transfer the conversation to a human support agent when the issue is complex, sensitive, or you are not confident in your answer",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Why the conversation needs human attention"
                    },
                    "priority": {
                        "type": "string",
                        "enum": ["low", "medium", "high"],
                        "description": "Urgency level"
                    }
                },
                "required": ["reason", "priority"]
            }
        }
    }
]

# Actual tool implementations
def lookup_order_status(order_id: str) -> dict:
    """In production, this hits your order management API."""
    # Simulated response
    fake_orders = {
        "ORD-12345": {"status": "shipped", "tracking": "1Z999AA10123456784", "eta": "2026-02-18"},
        "ORD-67890": {"status": "processing", "tracking": None, "eta": "2026-02-20"},
    }
    order = fake_orders.get(order_id)
    if not order:
        return {"error": f"Order {order_id} not found"}
    return order

def escalate_to_human(reason: str, priority: str) -> dict:
    """In production, this creates a ticket in your helpdesk system."""
    print(f"[ESCALATION] Priority: {priority} | Reason: {reason}")
    return {"ticket_id": "TKT-9001", "status": "queued", "estimated_wait": "5 minutes"}

TOOL_MAP = {
    "lookup_order_status": lookup_order_status,
    "escalate_to_human": escalate_to_human,
}

The escalate_to_human tool is the key safety valve. Giving the model an explicit escalation tool means it can act directly on the criteria in the system prompt, which works far more reliably than trying to parse confidence signals out of the text output.

The Agent Conversation Loop

This is where everything comes together. The agent retrieves context, calls the LLM, handles tool calls in a loop, and detects when to escalate.

SYSTEM_PROMPT = """You are a customer support agent for Example Store. Your job:

1. Answer questions using ONLY the provided knowledge base context. Do not make up policies.
2. Use the lookup_order_status tool when customers ask about their orders.
3. Use the escalate_to_human tool when:
   - The customer is angry or frustrated
   - You don't have enough context to answer confidently
   - The issue involves billing disputes, account security, or legal matters
   - The customer explicitly asks for a human

Be concise, helpful, and honest. If you're unsure, say so and escalate."""


def retrieve_context(query: str, n_results: int = 3) -> str:
    results = collection.query(query_texts=[query], n_results=n_results)
    docs = results["documents"][0]
    return "\n---\n".join(docs)


def run_agent(user_message: str, conversation_history: list) -> str:
    # Step 1: Retrieve relevant knowledge base docs
    context = retrieve_context(user_message)

    # Step 2: Build messages with context injection
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Knowledge base context:\n{context}"},
        *conversation_history,
        {"role": "user", "content": user_message},
    ]

    # Step 3: Call LLM with tools — loop until we get a final text response
    max_tool_rounds = 5  # prevent infinite loops
    for _ in range(max_tool_rounds):
        response = client_oai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.3,  # lower temp for support — you want consistency
        )

        msg = response.choices[0].message

        # If no tool calls, we have our final answer
        if not msg.tool_calls:
            assistant_reply = msg.content
            conversation_history.append({"role": "user", "content": user_message})
            conversation_history.append({"role": "assistant", "content": assistant_reply})
            return assistant_reply

        # Step 4: Execute each tool call
        messages.append(msg)  # add assistant's tool_call message

        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)

            print(f"[TOOL CALL] {fn_name}({fn_args})")

            fn = TOOL_MAP.get(fn_name)
            if fn is None:
                result = {"error": f"Unknown tool: {fn_name}"}
            else:
                result = fn(**fn_args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    return "I'm having trouble processing your request. Let me connect you with a human agent."

A few things worth noting. The max_tool_rounds guard prevents runaway tool loops; you'll hit it maybe once every few thousand conversations, but it matters. Temperature 0.3 keeps answers consistent; bump it up if the agent sounds too robotic. And injecting the retrieved context as a second system message keeps it separate from the behavioral instructions, which in practice gives better results than mixing the two together.

Running a Conversation

history = []

# Customer asks about shipping
reply = run_agent("How long does express shipping take?", history)
print(f"Agent: {reply}")
# Agent: Express shipping takes 1-2 business days. Free shipping on orders over $50.

# Customer asks about an order
reply = run_agent("Where is my order ORD-12345?", history)
print(f"Agent: {reply}")
# [TOOL CALL] lookup_order_status({'order_id': 'ORD-12345'})
# Agent: Your order ORD-12345 has shipped! Tracking number: 1Z999AA10123456784, ETA Feb 18.

# Customer gets frustrated
reply = run_agent("This is ridiculous, I was promised next-day delivery!", history)
print(f"Agent: {reply}")
# [TOOL CALL] escalate_to_human({'reason': 'Customer frustrated about delivery timeline...', 'priority': 'high'})
# Agent: I understand your frustration. I've escalated this to a specialist (ticket TKT-9001)...

Production Hardening

Before shipping this, add these pieces:

  • Rate limiting: Cap tool calls per conversation and per minute. A user spamming order lookups can hammer your API.
  • Input validation: Sanitize order IDs before passing to your backend. The model will pass through whatever the user types.
  • Logging: Store every conversation turn, tool call, and escalation. You need this for debugging and quality review.
  • Guardrails: Add a content filter on both input and output. The system prompt alone won’t stop determined prompt injection.
  • Timeouts: Set a 10-second timeout on tool calls. If your order API is slow, the user shouldn’t wait forever.
  • Conversation length limits: Cap at 20-30 turns per session. Long conversations degrade quality and eat tokens.
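The timeout bullet can be sketched with the standard library's concurrent.futures. Here `run_tool_with_timeout` is a hypothetical wrapper you would call in place of `fn(**fn_args)` in the agent loop, with the 10-second default from above:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)

def run_tool_with_timeout(fn, kwargs: dict, timeout_s: float = 10.0) -> dict:
    """Run a tool function, returning a structured error on timeout or crash.

    The error dict goes back to the model as the tool result, so it
    can apologize or escalate instead of hanging the whole turn.
    """
    future = _executor.submit(fn, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"error": f"Tool timed out after {timeout_s}s"}
    except Exception as exc:  # tool bugs shouldn't crash the agent loop
        return {"error": f"Tool raised {type(exc).__name__}: {exc}"}

# In the loop: result = run_tool_with_timeout(fn, fn_args)
```

Note that a timed-out tool keeps running in its worker thread; the wrapper only unblocks the conversation, so tools should still be safe to abandon mid-flight.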

Common Errors and Fixes

chromadb.errors.InvalidCollectionException – You’re trying to query a collection that doesn’t exist or was created with a different embedding function. Delete the ./support_kb directory and re-ingest.

openai.BadRequestError: 'tool_call_id' is required – You forgot to include the tool_call_id in the tool response message. Every tool result must reference the exact tool_call.id from the assistant’s message. This is the most common mistake when building tool loops.

The model ignores knowledge base context and makes up answers – Your system prompt isn’t strict enough, or the retrieved context isn’t relevant. Check what ChromaDB is returning with collection.query() and inspect the distances. If distances are above 1.0 (ChromaDB cosine distances range from 0 to 2, where 0 is identical), the context probably isn’t relevant. Also try adding “If the context doesn’t contain the answer, say you don’t know” to the system prompt.
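To apply that threshold in code, ask ChromaDB to return distances alongside documents and drop the weak matches. The 1.0 cutoff below is the heuristic from above, not a ChromaDB default, and `filter_relevant` is a hypothetical helper:

```python
def filter_relevant(docs, distances, max_distance=1.0):
    """Keep only docs whose cosine distance suggests real relevance.

    With cosine space, 0.0 means identical; values above ~1.0
    usually mean the match is weak.
    """
    return [d for d, dist in zip(docs, distances) if dist <= max_distance]

# Against the collection built earlier (requires the ChromaDB setup above):
# results = collection.query(
#     query_texts=["how do refunds work"],
#     n_results=3,
#     include=["documents", "distances"],
# )
# docs = filter_relevant(results["documents"][0], results["distances"][0])
```

If filtering leaves you with nothing, that is itself a useful signal: tell the model the context is empty rather than feeding it irrelevant chunks.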

The model never escalates – The escalation tool description needs to be explicit about when to use it. Vague descriptions like “use when appropriate” don’t work. List specific triggers in both the tool description and system prompt.

Tool calls loop forever – This happens when the model calls a tool, gets an error, and retries the same call. The max_tool_rounds guard prevents this, but you should also return clear error messages from your tools so the model can tell the user what went wrong instead of retrying.

json.JSONDecodeError when parsing tool arguments – Rare with GPT-4o but happens with smaller models. Wrap the json.loads call in a try/except and return a structured error to the model so it can retry with valid JSON.
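A sketch of that guard, where `parse_tool_args` is a hypothetical helper you would call in place of the bare json.loads in the loop above:

```python
import json

def parse_tool_args(raw: str):
    """Parse tool-call arguments defensively.

    Returns (args, None) on success, or ({}, error_dict) on failure.
    The error dict goes back to the model as the tool result so it
    can retry with valid JSON instead of crashing the turn.
    """
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, {"error": f"Invalid JSON in tool arguments: {exc}"}
    if not isinstance(args, dict):
        return {}, {"error": "Tool arguments must be a JSON object"}
    return args, None

# In the loop:
# fn_args, err = parse_tool_args(tool_call.function.arguments)
# result = err if err else fn(**fn_args)
```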