Most AI agents forget everything the moment a conversation ends. They have no memory beyond the current context window. That is a problem when you want an agent that learns from past interactions, recalls user preferences, or builds on previous work.

The fix is vector search. You store conversation snippets as embeddings, then retrieve the most relevant memories before each LLM call. The agent gets context it would otherwise lose.
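
As a toy illustration of the idea (hand-made three-dimensional vectors standing in for real embeddings, which have hundreds of dimensions), cosine similarity is just a normalized dot product:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1, orthogonal ones score 0
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.1, 0.9]))  # close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

A vector database does exactly this comparison, just at scale and with an index instead of a linear scan.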

Here is the full approach: a VectorMemory class backed by ChromaDB, OpenAI embeddings for encoding, and an agent loop that pulls relevant memories on every turn.

Install Dependencies

pip install openai chromadb

ChromaDB runs in-memory by default, but you can point it at a persistent directory so memories survive restarts.

The VectorMemory Class

This class handles storing memories, retrieving them by semantic similarity, and enforcing a capacity limit so you do not blow up your context window.

import time
import chromadb
from openai import OpenAI

client = OpenAI()

class VectorMemory:
    def __init__(self, collection_name: str = "agent_memory", max_memories: int = 1000, persist_dir: str = "./agent_memory_db"):
        self.chroma_client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.chroma_client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"},
        )
        self.max_memories = max_memories

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def store(self, content: str, metadata: dict | None = None):
        memory_id = f"mem_{int(time.time() * 1000)}_{self.collection.count()}"
        meta = metadata or {}
        meta["timestamp"] = time.time()
        meta["content_preview"] = content[:200]

        self.collection.add(
            ids=[memory_id],
            embeddings=[self._embed(content)],
            documents=[content],
            metadatas=[meta],
        )

        # Enforce capacity limit by dropping oldest memories
        if self.collection.count() > self.max_memories:
            self._evict_oldest()

    def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> list[dict]:
        if self.collection.count() == 0:
            return []

        results = self.collection.query(
            query_embeddings=[self._embed(query)],
            n_results=min(top_k, self.collection.count()),
            include=["documents", "metadatas", "distances"],
        )

        memories = []
        for doc, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):
            # ChromaDB cosine distance: 0 = identical, 2 = opposite
            # Convert to similarity score: 1 - (distance / 2)
            similarity = 1 - (distance / 2)
            if similarity >= min_score:
                memories.append({
                    "content": doc,
                    "metadata": meta,
                    "similarity": round(similarity, 4),
                })

        return memories

    def _evict_oldest(self):
        all_data = self.collection.get(include=["metadatas"])
        paired = list(zip(all_data["ids"], all_data["metadatas"]))
        paired.sort(key=lambda x: x[1].get("timestamp", 0))

        evict_count = self.collection.count() - self.max_memories
        ids_to_remove = [p[0] for p in paired[:evict_count]]
        self.collection.delete(ids=ids_to_remove)

    def count(self) -> int:
        return self.collection.count()
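
The distance-to-similarity mapping inside retrieve is worth making concrete. The same formula as a standalone helper:

```python
def distance_to_similarity(distance: float) -> float:
    # ChromaDB cosine distance ranges over [0, 2]; map it onto [0, 1]
    return 1 - (distance / 2)

print(distance_to_similarity(0.0))  # identical vectors -> 1.0
print(distance_to_similarity(2.0))  # opposite vectors -> 0.0
```

A distance of 1.2 becomes a similarity of 0.4, which a min_score of 0.5 would filter out.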

A few decisions worth explaining:

  • ChromaDB's default distance metric is squared L2, so the hnsw:space metadata switches this collection to cosine distance, which works well for text similarity.
  • text-embedding-3-small is cheap and fast. Use text-embedding-3-large if you need higher recall on nuanced queries, but the small model handles most agent memory tasks fine.
  • min_score filtering prevents the agent from retrieving irrelevant garbage. A threshold of 0.3 is a reasonable starting point. Raise it if the agent pulls in too much noise.
  • Eviction drops the oldest memories first. You could get fancier with LRU or importance scoring, but oldest-first works for most use cases.

The Agent Loop with Memory

Now wire the memory into an actual agent. Before each LLM call, retrieve relevant memories and inject them into the system prompt.

memory = VectorMemory(max_memories=500)

SYSTEM_PROMPT = """You are a helpful assistant with access to long-term memory.
You remember past conversations and use them to give better answers.
When memories are provided, reference them naturally without announcing that you are recalling them."""

def build_messages(user_input: str, conversation: list[dict]) -> list[dict]:
    # Retrieve relevant memories for this query
    relevant_memories = memory.retrieve(user_input, top_k=5, min_score=0.4)

    system_content = SYSTEM_PROMPT
    if relevant_memories:
        memory_block = "\n\n".join(
            f"[Memory (similarity: {m['similarity']})] {m['content']}"
            for m in relevant_memories
        )
        system_content += f"\n\nRelevant memories from past interactions:\n{memory_block}"

    messages = [{"role": "system", "content": system_content}]
    messages.extend(conversation)
    messages.append({"role": "user", "content": user_input})
    return messages

def store_exchange(user_input: str, assistant_response: str):
    # Store the full exchange as a single memory unit
    snippet = f"User asked: {user_input}\nAssistant answered: {assistant_response}"
    memory.store(snippet, metadata={"type": "conversation"})

def agent_loop():
    conversation = []
    print("Agent ready. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break

        messages = build_messages(user_input, conversation)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.7,
        )

        assistant_message = response.choices[0].message.content
        print(f"\nAgent: {assistant_message}\n")

        # Keep recent conversation for short-term context
        conversation.append({"role": "user", "content": user_input})
        conversation.append({"role": "assistant", "content": assistant_message})

        # Trim short-term conversation to last 10 exchanges
        if len(conversation) > 20:
            conversation = conversation[-20:]

        # Store in long-term memory
        store_exchange(user_input, assistant_message)

if __name__ == "__main__":
    agent_loop()

This gives you two layers of memory. The conversation list acts as short-term memory (last 10 exchanges). The VectorMemory is long-term – it persists across sessions and surfaces past context that is semantically relevant to the current question.

Managing Memory Quality

Not every exchange deserves to be stored. You can filter what goes into long-term memory by adding a relevance check.

def should_store(user_input: str, assistant_response: str) -> bool:
    # Skip trivial exchanges
    trivial_patterns = ["hello", "hi", "thanks", "bye", "ok", "yes", "no"]
    if user_input.lower().strip() in trivial_patterns:
        return False

    # Skip very short responses that likely carry little information
    if len(assistant_response.split()) < 15:
        return False

    return True

def store_exchange(user_input: str, assistant_response: str):
    if not should_store(user_input, assistant_response):
        return

    snippet = f"User asked: {user_input}\nAssistant answered: {assistant_response}"
    memory.store(snippet, metadata={"type": "conversation"})

You can also tag memories with categories and filter on retrieval. ChromaDB supports metadata filtering in queries:

# Store with a topic tag
memory.store(
    "The user prefers Python over JavaScript for backend work.",
    metadata={"type": "preference", "topic": "programming"},
)

# Retrieve only preference memories
results = memory.collection.query(
    query_embeddings=[memory._embed("what language should I use")],
    n_results=5,
    where={"type": "preference"},
    include=["documents", "metadatas", "distances"],
)

Tuning Retrieval Parameters

The three knobs that matter most:

  • top_k: How many memories to retrieve. Start with 5. Going above 10 floods the context window and confuses the model.
  • min_score: The similarity threshold. Set it too low and you get irrelevant results. Too high and the agent forgets useful context. Start at 0.3-0.4 and adjust based on your domain.
  • max_memories: Total storage capacity. 500-1000 is reasonable for a personal agent. For production systems processing thousands of users, shard per user with separate collections.

A good rule of thumb: the total injected memory text should stay under 2000 tokens. You do not want memories eating half your context window.
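
One way to enforce that budget is to trim the retrieved list before injecting it, using the rough four-characters-per-token heuristic for English text (a tokenizer such as tiktoken gives exact counts):

```python
def trim_to_token_budget(memories: list[dict], max_tokens: int = 2000) -> list[dict]:
    # Rough heuristic: ~4 characters per token for English text
    budget_chars = max_tokens * 4
    kept, used = [], 0
    # Keep the highest-similarity memories first
    for m in sorted(memories, key=lambda m: m["similarity"], reverse=True):
        cost = len(m["content"])
        if used + cost > budget_chars:
            break
        kept.append(m)
        used += cost
    return kept
```

Call this on the output of memory.retrieve before building the memory block, and the injected text stays bounded no matter what top_k returns.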

Common Errors and Fixes

ChromaDB DuplicateIDError: You tried to store a memory with an ID that already exists. The VectorMemory class above uses timestamps plus a counter to avoid this, but if you are running concurrent writes, add a UUID:

import uuid

memory_id = f"mem_{uuid.uuid4().hex}"

Empty query results when collection has data: ChromaDB returns an error if n_results exceeds the collection count. The retrieve method above handles this with min(top_k, self.collection.count()), but watch for race conditions if another process is deleting memories.

openai.RateLimitError on embeddings: Embedding calls count against your OpenAI rate limit. If you are storing many memories in a batch, add a small delay or use the batch embedding endpoint:

# Batch embed multiple texts in one call
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["text one", "text two", "text three"],
)
embeddings = [item.embedding for item in response.data]
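
If rate limits still bite, a generic retry-with-backoff wrapper can guard the call. This is a sketch, not an OpenAI-specific helper; production code would catch openai.RateLimitError specifically rather than bare Exception:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    # Retry fn() with exponential backoff plus jitter between attempts
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Wrap the embedding call as with_backoff(lambda: client.embeddings.create(...)).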

Memory retrieval returns low-quality matches: Lower your min_score threshold gradually. If results are still poor, check that you are not storing very long documents. Chunk them into 200-500 word segments before embedding – long texts produce diluted embeddings that match everything weakly.
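
A minimal chunker along those lines, using word-based windows with overlap so context is not cut mid-thought (the 300/50 defaults are starting points, not tuned values):

```python
def chunk_text(text: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    # Split on whitespace and emit overlapping windows of max_words words
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Store each chunk as its own memory; retrieval then surfaces the specific passage that matches rather than a diluted whole-document embedding.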

Persistent storage not working: Make sure you are using chromadb.PersistentClient(path="./agent_memory_db") instead of chromadb.Client(). The in-memory client loses everything when the process exits.