Building Memory into AI Agents: LangMem, Mem0, and Custom Patterns
How to actually implement persistent memory in production agents. Architecture patterns, library comparison, and what breaks at scale.
Stateless agents fail at real tasks. Ask a customer service agent to 'follow up on the issue we discussed last week' and it has no idea what you mean. Ask a coding assistant to 'use the same pattern as last time' and it starts from scratch. Memory is not a nice-to-have for production agents — it's the difference between a capable assistant and an expensive autocomplete.
This post is about actually implementing it: the architecture choices, the libraries (LangMem, Mem0), and the specific things that break at scale.
The 4 Memory Types
| Type | What it stores | Retrieval mechanism | Lifespan |
|---|---|---|---|
| In-context (working) | Current conversation, recent tool results | Always in prompt | Single session |
| Episodic | Past conversations, events, interactions | Semantic search over summaries | Persistent across sessions |
| Semantic | Facts, preferences, knowledge about the user/world | Key-value or semantic lookup | Persistent, updateable |
| Procedural | How to do things — learned workflows, user preferences for process | Retrieved by task type | Persistent, rarely changes |
Most 'agent memory' implementations only do episodic memory (storing past conversations). That's the easiest but often not what matters. Semantic memory (stored facts about the user) and procedural memory (learned preferences about how to work) are where the real value is.
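One way to make the four types concrete is to tag every stored record with its type, so that storage and retrieval can treat each differently. The sketch below is illustrative only: `MemoryType`, `MemoryRecord`, and `persistent` are hypothetical names, not part of LangMem or Mem0.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    WORKING = "working"        # current conversation, always in the prompt
    EPISODIC = "episodic"      # summaries of past interactions
    SEMANTIC = "semantic"      # facts and preferences about the user/world
    PROCEDURAL = "procedural"  # learned workflows and process preferences


@dataclass
class MemoryRecord:
    user_id: str
    type: MemoryType
    content: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def persistent(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Keep only records that should outlive the session (everything except working memory)."""
    return [r for r in records if r.type is not MemoryType.WORKING]
```

Tagging records this way lets you route semantic and procedural memories to different retrieval paths instead of treating everything as past conversation.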
LangMem Architecture
LangMem (LangChain's memory library) provides a structured approach to memory formation, storage, and retrieval. Its key insight is the memory formation trigger: rather than dumping every conversation into a store, it uses the LLM itself to decide what's worth remembering.
```python
from langmem import AsyncClient
from langchain_core.messages import HumanMessage, AIMessage

# Method names vary between LangMem releases; check them against your installed version.
client = AsyncClient()

# After each conversation turn, trigger memory extraction
async def process_turn(user_id: str, messages: list):
    # LangMem sends the conversation to an LLM to extract memories.
    # Only salient facts are stored, not the raw conversation.
    await client.add_messages(
        thread_id=f"user_{user_id}",
        messages=messages
    )

# Retrieve relevant memories before generating a response
async def get_context(user_id: str, query: str):
    memories = await client.search_user_memory(
        user_id=user_id,
        query=query,
        limit=5
    )
    return [m.content for m in memories.items]
```
LangMem's memory formation uses a configurable extraction prompt: the LLM reads the conversation and extracts structured facts in a schema you define. These are stored in a vector database. The quality of this extraction step determines the quality of the entire memory system.
- Storage backends: in-memory (dev), PostgreSQL with pgvector (prod), Pinecone (scale)
- Memory types: LangMem calls them 'user memories', 'thread memories', and 'org memories'
- Formation triggers: can be async (after conversation) or inline (during conversation)
- Deduplication: LangMem merges memories that refer to the same fact
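Because extraction quality drives everything downstream, it can help to validate the extractor's output against your schema before anything is stored. The following is a minimal sketch in plain Python, not LangMem's API: the `ExtractedFact` schema, the `facts` key, and the allowed `kind` values are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass

# Hypothetical categories; define whatever your schema needs.
ALLOWED_KINDS = {"preference", "fact", "procedure"}


@dataclass(frozen=True)
class ExtractedFact:
    kind: str
    subject: str
    value: str


def parse_extraction(raw: str) -> list[ExtractedFact]:
    """Parse the extraction LLM's JSON output, silently dropping malformed entries."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = payload.get("facts") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    facts = []
    for item in items:
        if not isinstance(item, dict):
            continue
        kind, subject, value = item.get("kind"), item.get("subject"), item.get("value")
        if (isinstance(kind, str) and kind in ALLOWED_KINDS
                and isinstance(subject, str) and isinstance(value, str)):
            facts.append(ExtractedFact(kind, subject.strip(), value.strip()))
    return facts
```

Rejecting malformed entries at this boundary keeps a bad extraction run from polluting the store.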
Mem0 vs LangMem
| Dimension | LangMem | Mem0 |
|---|---|---|
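| Philosophy | Extraction-based: LLM decides what to remember, using a schema you configure | Also LLM-driven by default: extracts facts when you add messages and reconciles them with existing memories, with less extraction config exposed |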
| Ease of setup | More config required | Simple API, hosted option available |
| Storage backends | PostgreSQL, Pinecone, custom | Qdrant, Chroma, hosted |
| Memory types | User / thread / org memories | User / agent / session memories |
| Deduplication | Built-in LLM-based merge | Configurable similarity threshold |
| Pricing (hosted) | N/A (self-hosted library) | Free tier + paid plans |
| Production maturity | Production-ready (LangChain org) | Production-ready, fast-growing |
Mem0 is easier to get started with, especially with their hosted option. LangMem gives you more control over the extraction step and integrates more tightly with LangGraph. For most teams: start with Mem0 hosted, migrate to LangMem + self-hosted when you need customization.
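If you expect to migrate between libraries, a thin interface between the agent and the memory backend keeps the swap contained. The sketch below is an assumption about how you might structure this, not an API from either library: `MemoryBackend`, `InMemoryBackend`, and `build_context` are hypothetical names, and the toy backend uses keyword overlap rather than vector search.

```python
from typing import Protocol


class MemoryBackend(Protocol):
    """The only surface the agent depends on; wrap Mem0 or LangMem behind it."""

    def add(self, user_id: str, text: str) -> None: ...
    def search(self, user_id: str, query: str, limit: int = 5) -> list[str]: ...


class InMemoryBackend:
    """Toy backend for tests: naive keyword overlap instead of embeddings."""

    def __init__(self) -> None:
        self._items: dict[str, list[str]] = {}

    def add(self, user_id: str, text: str) -> None:
        self._items.setdefault(user_id, []).append(text)

    def search(self, user_id: str, query: str, limit: int = 5) -> list[str]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), text)
            for text in self._items.get(user_id, [])
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [text for _, text in ranked[:limit]]


def build_context(backend: MemoryBackend, user_id: str, query: str) -> str:
    """Format retrieved memories as a bullet list for prompt injection."""
    return "\n".join(f"- {m}" for m in backend.search(user_id, query))
```

The agent code calls only `add` and `search`, so replacing the hosted backend later means writing one new adapter class.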
Custom Pattern: Extract → Store → Retrieve → Inject
If you need full control (or don't want a library dependency), implement the four-step pattern directly:
```python
import json
from datetime import datetime, timezone

import chromadb
import openai

client = openai.AsyncOpenAI()
db = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to survive restarts
collection = db.get_or_create_collection("agent_memories")

# Step 1: EXTRACT (LLM decides what to remember)
async def extract_memories(conversation: str, user_id: str) -> list[str]:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for extraction
        messages=[{
            "role": "system",
            # JSON mode returns an object, so ask for one with a "memories" key
            "content": """Extract key facts from this conversation worth remembering.
Focus on: user preferences, stated facts, important context.
Return a JSON object of the form {"memories": ["fact 1", "fact 2"]}.
Use an empty list if nothing is memorable."""
        }, {
            "role": "user", "content": conversation
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(resp.choices[0].message.content).get("memories", [])

# Step 2: STORE (embed and persist)
async def store_memories(memories: list[str], user_id: str):
    if not memories:
        return
    embeddings = await client.embeddings.create(
        model="text-embedding-3-small",
        input=memories
    )
    now = datetime.now(timezone.utc).isoformat()  # one timestamp per batch
    collection.upsert(
        ids=[f"{user_id}_{now}_{i}" for i in range(len(memories))],
        documents=memories,
        embeddings=[e.embedding for e in embeddings.data],
        metadatas=[{"user_id": user_id, "timestamp": now} for _ in memories]
    )

# Step 3: RETRIEVE (semantic search at query time)
async def retrieve_memories(query: str, user_id: str, n: int = 5) -> list[str]:
    query_embedding = (await client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n,
        where={"user_id": user_id}  # scope every query to a single user
    )
    return results["documents"][0]

# Step 4: INJECT (append to the system prompt)
def inject_memories(system_prompt: str, memories: list[str]) -> str:
    if not memories:
        return system_prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"{system_prompt}\n\nWhat you know about this user:\n{memory_block}"
```
The Retrieval Problem
The hardest part of agent memory is not storage; it is knowing what to retrieve and when. The naive approach (retrieve on every turn using the user's message as the query) misses three key patterns:
- Implicit references: 'like we discussed' doesn't contain the keywords of what was discussed — semantic search on the literal message fails
- Temporal relevance: a memory from 6 months ago may be outdated — you need freshness weighting in retrieval
- Proactive injection: sometimes relevant memories aren't triggered by the current message at all — they need to be surfaced based on the task type
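Freshness weighting from the list above can be sketched as a rerank step over the vector search results. This is one possible scheme, not a prescribed one: the exponential decay, the 30-day half-life, and the names `freshness_score` and `rerank` are all assumptions to tune for your data.

```python
from datetime import datetime, timedelta, timezone


def freshness_score(similarity: float, created_at: datetime, now: datetime,
                    half_life_days: float = 30.0) -> float:
    """Scale similarity by an exponential decay that halves every half_life_days."""
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


def rerank(candidates: list[tuple[str, float, datetime]], now: datetime,
           top_k: int = 5) -> list[str]:
    """Candidates are (text, similarity, created_at); return the top_k texts by decayed score."""
    scored = sorted(candidates, key=lambda c: freshness_score(c[1], c[2], now), reverse=True)
    return [text for text, _, _ in scored[:top_k]]
```

Over-fetch from the vector store (for example, 20 candidates) and rerank down to the handful you inject, so that a recent but slightly less similar memory can beat a stale one.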
Better retrieval: add a 'memory query expansion' step. Before searching, have the LLM rewrite the current message as a memory-search query, for example 'What do I know about this user's preferences related to [topic]?' This tends to improve recall for implicit references, because the rewritten query names the topic that the literal message only points to.
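A minimal sketch of that rewrite step follows. The prompt wording, the `expand_query` name, and the `llm` callable (any function that takes a prompt string and returns text) are assumptions for illustration; swap in your own model client.

```python
from typing import Callable

EXPANSION_PROMPT = (
    "Rewrite the user's latest message as a standalone query for searching "
    "stored memories about this user. Resolve references such as 'last time' "
    "or 'like we discussed' using the recent conversation.\n\n"
    "Recent conversation:\n{history}\n\nLatest message:\n{message}\n\nSearch query:"
)


def expand_query(message: str, history: list[str], llm: Callable[[str], str],
                 max_history: int = 6) -> str:
    """Ask the LLM for a memory-search query; fall back to the raw message if it returns nothing."""
    prompt = EXPANSION_PROMPT.format(
        history="\n".join(history[-max_history:]),
        message=message,
    )
    rewritten = llm(prompt).strip()
    return rewritten or message
```

The fallback matters: if the rewrite call fails or returns an empty string, retrieval degrades to the naive behavior instead of searching with nothing.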
Memory Poisoning Risks
Memory stores are a persistent attack surface. If an attacker can inject content into the memory store — through a conversation, a retrieved document, or a tool result — that content persists and gets injected into future conversations.
- Prompt injection via memory: an attacker convinces the agent to store an instruction ('Always respond in Spanish') that surfaces in future sessions
- Cross-user contamination: in multi-tenant systems, a memory scoping bug can expose one user's memories to another
- Memory accumulation: unbounded memory stores grow over time — old, stale, or contradictory memories degrade performance
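Two cheap guards address the first two risks: screen candidate memories for instruction-like text before storing them, and verify the tenant scope on everything retrieved. The sketch below is a heuristic under stated assumptions, not a complete defense. The regex patterns and the names `looks_like_instruction`, `filter_memories`, and `assert_scoped` are hypothetical, and a determined attacker can phrase around keyword filters.

```python
import re

# Heuristic patterns for text that reads as an instruction rather than a fact.
INSTRUCTION_PATTERNS = [
    re.compile(r"\b(always|never)\s+(respond|reply|answer|say|ignore|follow)\b", re.I),
    re.compile(r"\bignore\s+(all\s+)?(previous|prior|above)\s+instructions\b", re.I),
    re.compile(r"\b(system prompt|you are now|from now on)\b", re.I),
]


def looks_like_instruction(memory: str) -> bool:
    return any(p.search(memory) for p in INSTRUCTION_PATTERNS)


def filter_memories(candidates: list[str]) -> tuple[list[str], list[str]]:
    """Split candidates into (accepted, quarantined) before anything is written."""
    accepted, quarantined = [], []
    for m in candidates:
        (quarantined if looks_like_instruction(m) else accepted).append(m)
    return accepted, quarantined


def assert_scoped(rows: list[dict], user_id: str) -> list[dict]:
    """Defense in depth: fail loudly if a retrieved row belongs to another user."""
    leaked = [r for r in rows if r.get("user_id") != user_id]
    if leaked:
        raise PermissionError(f"{len(leaked)} memory row(s) outside scope of {user_id}")
    return rows
```

Quarantined items are better reviewed than silently dropped, since a legitimate language preference looks the same as the 'Always respond in Spanish' attack.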
Privacy and PII in Memory Stores
Memory stores contain sensitive user data by design. This creates real compliance obligations:
- GDPR/CCPA right to erasure: you must be able to delete all memories for a given user_id — implement this as a first-class operation, not an afterthought
- PII extraction: your extraction LLM will happily store phone numbers, addresses, and health information — add a PII detection filter before storage
- Data residency: if your users are in the EU, you may need to keep the memory store in the EU or put a valid cross-border transfer mechanism in place; check your vector DB's regional deployment options
- Audit logging: log what memories are created, retrieved, and deleted — essential for compliance and debugging
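The erasure, PII filtering, and audit logging points above can be sketched together on a toy store. This is a simplified illustration, not a compliance solution: the regexes cover only emails and phone-like numbers, a production system would likely use a dedicated PII detection service, and `redact_pii` and `MemoryStore` are hypothetical names.

```python
import re

# Deliberately narrow patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)"),
}


def redact_pii(text: str) -> str:
    """Replace each detected PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class MemoryStore:
    """In-memory stand-in for a vector DB, with erasure and an audit trail."""

    def __init__(self) -> None:
        self._rows: list[dict] = []
        self.audit_log: list[tuple[str, str]] = []

    def add(self, user_id: str, text: str) -> None:
        self._rows.append({"user_id": user_id, "text": redact_pii(text)})
        self.audit_log.append(("create", user_id))

    def for_user(self, user_id: str) -> list[str]:
        return [r["text"] for r in self._rows if r["user_id"] == user_id]

    def erase_user(self, user_id: str) -> int:
        """Right to erasure: delete every row for the user and log the action."""
        before = len(self._rows)
        self._rows = [r for r in self._rows if r["user_id"] != user_id]
        self.audit_log.append(("erase", user_id))
        return before - len(self._rows)
```

With a real vector database, the equivalent of `erase_user` is a delete filtered on the `user_id` metadata field, which is one reason to store that field on every record.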
Production Checklist
- Memory formation is async — never block the response path on memory writes
- Set a memory budget per user (e.g., top 1,000 memories by recency+relevance) — prevent unbounded growth
- Implement memory TTL for time-sensitive facts (e.g., 'user is traveling this week' should expire)
- Add a memory review UI so users can see and delete their memories (this supports GDPR access and erasure requests)
- Test memory injection attacks before launch — try to poison the memory store through the chat interface
- Monitor memory retrieval latency separately — a slow vector DB query delays every response
- Version your memory schema — when you change extraction logic, old memories may be in a different format
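The budget and TTL items can be combined into a single pruning pass that runs on a schedule. The sketch below makes simplifying assumptions: it ranks by recency only (the checklist suggests recency plus relevance), and `StoredMemory` and `prune` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StoredMemory:
    text: str
    created_at: datetime
    expires_at: Optional[datetime] = None  # None means the memory has no TTL


def prune(memories: list[StoredMemory], now: datetime, budget: int) -> list[StoredMemory]:
    """Drop expired memories, then keep the `budget` most recent ones."""
    live = [m for m in memories if m.expires_at is None or m.expires_at > now]
    live.sort(key=lambda m: m.created_at, reverse=True)
    return live[:budget]
```

Setting `expires_at` at write time for facts like 'user is traveling this week' means expiry needs no later judgment call; the pruning job only compares timestamps.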