
Building Memory into AI Agents: LangMem, Mem0, and Custom Patterns

How to actually implement persistent memory in production agents. Architecture patterns, library comparison, and what breaks at scale.

Stateless agents fail at real tasks. Ask a customer service agent to 'follow up on the issue we discussed last week' and it has no idea what you mean. Ask a coding assistant to 'use the same pattern as last time' and it starts from scratch. Memory is not a nice-to-have for production agents — it's the difference between a capable assistant and an expensive autocomplete.

This post is about actually implementing it: the architecture choices, the libraries (LangMem, Mem0), and the specific things that break at scale.

The 4 Memory Types

Type	What it stores	Retrieval mechanism	Lifespan
In-context (working)	Current conversation, recent tool results	Always in prompt	Single session
Episodic	Past conversations, events, interactions	Semantic search over summaries	Persistent across sessions
Semantic	Facts, preferences, knowledge about the user/world	Key-value or semantic lookup	Persistent, updateable
Procedural	How to do things — learned workflows, user preferences for process	Retrieved by task type	Persistent, rarely changes

Most 'agent memory' implementations only do episodic memory (storing past conversations). That's the easiest type to build, but often not the one that matters most. Semantic memory (stored facts about the user) and procedural memory (learned preferences about how to work) are where the real value is.
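One way to keep the four types distinct in code is to tag every record with its type. The sketch below is only an illustration: the `MemoryType`, `MemoryRecord`, and `persistent_records` names are hypothetical and do not come from any library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    WORKING = "working"        # current conversation, always in the prompt
    EPISODIC = "episodic"      # summaries of past interactions
    SEMANTIC = "semantic"      # facts and preferences about the user/world
    PROCEDURAL = "procedural"  # learned workflows and process preferences


@dataclass
class MemoryRecord:
    user_id: str
    type: MemoryType
    content: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def persistent_records(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Working memory stays in the prompt; every other type goes to storage."""
    return [r for r in records if r.type is not MemoryType.WORKING]
```

Tagging records this way lets retrieval treat each type differently, for example semantic search for episodic records and lookup by task type for procedural ones.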

LangMem Architecture

LangMem (LangChain's memory library) provides a structured approach to memory formation, storage, and retrieval. Its key insight is the memory formation trigger: rather than dumping every conversation into a store, it uses the LLM itself to decide what's worth remembering.

from langmem import AsyncClient
from langchain_core.messages import HumanMessage, AIMessage

client = AsyncClient()

# After each conversation turn, trigger memory extraction
async def process_turn(user_id: str, messages: list):
    # LangMem sends the conversation to an LLM to extract memories
    # Only salient facts are stored — not raw conversation
    await client.add_messages(
        thread_id=f"user_{user_id}",
        messages=messages
    )

# Retrieve relevant memories before generating a response
async def get_context(user_id: str, query: str):
    memories = await client.search_user_memory(
        user_id=user_id,
        query=query,
        limit=5
    )
    return [m.content for m in memories.items]

LangMem's memory formation uses a configurable extraction prompt: the LLM reads the conversation and extracts structured facts in a schema you define. These are stored in a vector database. The quality of this extraction step determines the quality of the entire memory system.

Storage backends: in-memory (dev), PostgreSQL with pgvector (prod), Pinecone (scale)
Memory types: LangMem calls them 'user memories', 'thread memories', and 'org memories'
Formation triggers: can be async (after conversation) or inline (during conversation)
Deduplication: LangMem merges memories that refer to the same fact

Mem0 vs LangMem

Dimension	LangMem	Mem0
Philosophy	Extraction-based: LLM decides what to remember	Storage-based: stores everything, retrieves selectively
Ease of setup	More config required	Simple API, hosted option available
Storage backends	PostgreSQL, Pinecone, custom	Qdrant, Chroma, hosted
Memory types	User / thread / org memories	User / agent / session memories
Deduplication	Built-in LLM-based merge	Configurable similarity threshold
Pricing (hosted)	N/A (self-hosted library)	Free tier + paid plans
Production maturity	Production-ready (LangChain org)	Production-ready, fast-growing

Mem0 is easier to get started with, especially with their hosted option. LangMem gives you more control over the extraction step and integrates more tightly with LangGraph. For most teams: start with Mem0 hosted, migrate to LangMem + self-hosted when you need customization.
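The deduplication row in the comparison above contrasts LLM-based merging with a similarity threshold. As a rough illustration of the threshold idea only (this is not either library's implementation, and the `dedupe` helper and its 0.9 default are assumptions), a new memory can be dropped when its embedding is too close to one already kept:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(
    memories: list[tuple[str, list[float]]], threshold: float = 0.9
) -> list[tuple[str, list[float]]]:
    """Keep a memory only if no already-kept memory is at or above the threshold."""
    kept: list[tuple[str, list[float]]] = []
    for text, emb in memories:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return kept
```

A threshold is cheap but cannot reconcile contradictory facts ("lives in Oslo" vs. "moved to Berlin"), which is the case an LLM-based merge is meant to handle.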

Custom Pattern: Extract → Store → Retrieve → Inject

If you need full control (or don't want a library dependency), implement the four-step pattern directly:

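import json
from datetime import datetime

import chromadb
import openai

client = openai.AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
# chromadb.Client() is in-memory; use chromadb.PersistentClient(path=...) to persist across restarts
db = chromadb.Client()
collection = db.get_or_create_collection("agent_memories")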

# Step 1: EXTRACT — LLM decides what to remember
async def extract_memories(conversation: str, user_id: str) -> list[str]:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for extraction
        messages=[{
            "role": "system",
            "content": """Extract key facts from this conversation worth remembering.
            Focus on: user preferences, stated facts, important context.
            Return a JSON object of the form {"memories": ["fact one", "fact two"]}.
            Return {"memories": []} if nothing is memorable."""
        }, {
            "role": "user", "content": conversation
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(resp.choices[0].message.content).get("memories", [])

# Step 2: STORE — embed and persist
async def store_memories(memories: list[str], user_id: str):
    if not memories:
        return
    embeddings = await client.embeddings.create(
        model="text-embedding-3-small",
        input=memories
    )
    collection.upsert(
        ids=[f"{user_id}_{datetime.now().isoformat()}_{i}" for i in range(len(memories))],
        documents=memories,
        embeddings=[e.embedding for e in embeddings.data],
        metadatas=[{"user_id": user_id, "timestamp": datetime.now().isoformat()} for _ in memories]
    )

# Step 3: RETRIEVE — semantic search at query time
async def retrieve_memories(query: str, user_id: str, n: int = 5) -> list[str]:
    query_embedding = (await client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n,
        where={"user_id": user_id}
    )
    return results["documents"][0]

# Step 4: INJECT — prepend to system prompt
def inject_memories(system_prompt: str, memories: list[str]) -> str:
    if not memories:
        return system_prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"{system_prompt}\n\nWhat you know about this user:\n{memory_block}"

The Retrieval Problem

The hardest part of agent memory is not storage; it's knowing what to retrieve and when. The naive approach (retrieve on every turn using the user's message as query) misses three key patterns:

Implicit references: 'like we discussed' doesn't contain the keywords of what was discussed — semantic search on the literal message fails
Temporal relevance: a memory from 6 months ago may be outdated — you need freshness weighting in retrieval
Proactive injection: sometimes relevant memories aren't triggered by the current message at all — they need to be surfaced based on the task type
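Freshness weighting can be sketched as a re-ranking pass over search hits. The example below assumes each hit carries a similarity score and a creation timestamp; the exponential decay and the 90-day half-life are illustrative choices, not values from the libraries discussed above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rerank_by_freshness(
    hits: list[tuple[str, float, datetime]],
    now: Optional[datetime] = None,
    half_life_days: float = 90.0,
) -> list[str]:
    """Sort (text, similarity, created_at) hits by similarity decayed with age."""
    now = now or datetime.now(timezone.utc)

    def score(hit: tuple[str, float, datetime]) -> float:
        _, similarity, created_at = hit
        age_days = max((now - created_at).total_seconds() / 86400, 0.0)
        # The score halves every `half_life_days` days.
        return similarity * 0.5 ** (age_days / half_life_days)

    return [text for text, _, _ in sorted(hits, key=score, reverse=True)]
```

With a 90-day half-life, a 180-day-old memory with similarity 0.90 scores 0.225 and ranks below a day-old memory with similarity 0.80.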

Better retrieval: add a 'memory query expansion' step. Before searching, have the LLM rewrite the current message as a memory-search query, such as 'What do I know about this user's preferences related to [topic]?' Because the rewritten query names the topic explicitly, it tends to recall memories that a search on the literal message would miss.
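A minimal sketch of query expansion follows. It assumes an `llm` callable that takes a prompt and returns text; that callable, the `expand_query` name, and the instruction wording are all hypothetical.

```python
from typing import Callable

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's latest message as a standalone search query over stored "
    "memories about this user. Resolve references like 'last time' or 'as we "
    "discussed' using the recent turns. Return only the query."
)


def expand_query(message: str, recent_turns: list[str], llm: Callable[[str], str]) -> str:
    """Ask the model for a memory-search query; fall back to the raw message."""
    context = "\n".join(recent_turns[-6:])  # only the last few turns
    prompt = (
        f"{REWRITE_INSTRUCTIONS}\n\nRecent turns:\n{context}\n\n"
        f"Latest message: {message}"
    )
    rewritten = llm(prompt).strip()
    return rewritten or message
```

The rewritten query is then embedded in place of the literal message. The extra model call adds latency to every turn, so a small, fast model is the natural choice here.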

Memory Poisoning Risks

Memory stores are a persistent attack surface. If an attacker can inject content into the memory store — through a conversation, a retrieved document, or a tool result — that content persists and gets injected into future conversations.

Prompt injection via memory: an attacker convinces the agent to store an instruction ('Always respond in Spanish') that surfaces in future sessions
Cross-user contamination: in multi-tenant systems, a memory scoping bug can expose one user's memories to another
Memory accumulation: unbounded memory stores grow over time — old, stale, or contradictory memories degrade performance
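One partial defense against stored instructions is to screen candidate memories before they reach the store. The patterns below are a hypothetical heuristic that will produce both false positives and false negatives; treat it as a first filter in front of human or LLM review, not as a complete defense.

```python
import re

# Hypothetical heuristic patterns; a real deployment would tune or replace these.
_INSTRUCTION_PATTERNS = [
    r"^\s*(always|never|ignore|disregard|from now on)\b",
    r"\b(system prompt|previous instructions)\b",
    r"\byou (must|should|will)\b",
]
_INSTRUCTION_RE = re.compile("|".join(_INSTRUCTION_PATTERNS), re.IGNORECASE)


def looks_like_instruction(memory: str) -> bool:
    """True if the text reads like a directive to the agent rather than a fact."""
    return bool(_INSTRUCTION_RE.search(memory))


def filter_memories(candidates: list[str]) -> tuple[list[str], list[str]]:
    """Split candidates into (safe_to_store, quarantined_for_review)."""
    safe: list[str] = []
    quarantined: list[str] = []
    for memory in candidates:
        (quarantined if looks_like_instruction(memory) else safe).append(memory)
    return safe, quarantined
```

Quarantining rather than silently dropping keeps an audit trail of attempted injections.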

Privacy and PII in Memory Stores

Memory stores contain sensitive user data by design. This creates real compliance obligations:

GDPR/CCPA right to erasure: you must be able to delete all memories for a given user_id — implement this as a first-class operation, not an afterthought
PII extraction: your extraction LLM will happily store phone numbers, addresses, and health information — add a PII detection filter before storage
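Data residency: if your users are in the EU, storing their memories outside it triggers GDPR cross-border transfer rules. Keeping the store in-region is the simplest way to comply, so check your vector DB's regional deployment options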
Audit logging: log what memories are created, retrieved, and deleted — essential for compliance and debugging
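A PII filter can sit between extraction and storage. The sketch below uses two illustrative regular expressions for emails and phone numbers; the patterns and the `redact_pii` name are assumptions, and regexes alone will miss names, addresses, and health details, so a dedicated PII detection tool is likely needed in production.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
_PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_pii(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in _PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For erasure, if memories are stored in a Chroma collection with a `user_id` metadata field, a call such as `collection.delete(where={"user_id": user_id})` removes every record for that user in one operation.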

Production Checklist

Memory formation is async — never block the response path on memory writes
Set a memory budget per user (e.g., top 1,000 memories by recency+relevance) — prevent unbounded growth
Implement memory TTL for time-sensitive facts (e.g., 'user is traveling this week' should expire)
Add a memory review UI: let users see and delete their memories (this supports GDPR access and erasure requests)
Test memory injection attacks before launch — try to poison the memory store through the chat interface
Monitor memory retrieval latency separately — a slow vector DB query delays every response
Version your memory schema — when you change extraction logic, old memories may be in a different format
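The budget and TTL items can be combined into one pruning pass. This sketch ranks by recency only, a simplification of the recency+relevance ranking suggested above, and the `StoredMemory` and `prune` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StoredMemory:
    text: str
    created_at: datetime
    expires_at: Optional[datetime] = None  # None means the memory has no TTL


def prune(memories: list[StoredMemory], now: datetime, budget: int = 1000) -> list[StoredMemory]:
    """Drop expired memories, then keep only the `budget` most recent ones."""
    live = [m for m in memories if m.expires_at is None or m.expires_at > now]
    live.sort(key=lambda m: m.created_at, reverse=True)
    return live[:budget]
```

Running this as a scheduled job per user keeps the store bounded without adding work to the response path.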
