GenAI Systems Lab
Agents & Tool Use 10 min read

Building Memory into AI Agents: LangMem, Mem0, and Custom Patterns

How to actually implement persistent memory in production agents. Architecture patterns, library comparison, and what breaks at scale.

Stateless agents fail at real tasks. Ask a customer service agent to 'follow up on the issue we discussed last week' and it has no idea what you mean. Ask a coding assistant to 'use the same pattern as last time' and it starts from scratch. Memory is not a nice-to-have for production agents — it's the difference between a capable assistant and an expensive autocomplete.

This post is about actually implementing it: the architecture choices, the libraries (LangMem, Mem0), and the specific things that break at scale.

The 4 Memory Types

| Type | What it stores | Retrieval mechanism | Lifespan |
|---|---|---|---|
| In-context (working) | Current conversation, recent tool results | Always in prompt | Single session |
| Episodic | Past conversations, events, interactions | Semantic search over summaries | Persistent across sessions |
| Semantic | Facts, preferences, knowledge about the user/world | Key-value or semantic lookup | Persistent, updateable |
| Procedural | How to do things: learned workflows, user preferences for process | Retrieved by task type | Persistent, rarely changes |

Most 'agent memory' implementations only do episodic memory (storing past conversations). That's the easiest but often not what matters. Semantic memory (stored facts about the user) and procedural memory (learned preferences about how to work) are where the real value is.
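
As a rough illustration of the table above, the four types can be modelled as one record type with a `kind` field. This is a sketch under assumptions: `MemoryKind`, `MemoryRecord`, and `upsert_semantic` are hypothetical names, not part of LangMem or Mem0, and the dict-based store stands in for a real database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class MemoryKind(Enum):
    WORKING = "working"        # current conversation, always in the prompt
    EPISODIC = "episodic"      # summaries of past interactions
    SEMANTIC = "semantic"      # facts and preferences, updateable
    PROCEDURAL = "procedural"  # learned workflows, looked up by task type


@dataclass
class MemoryRecord:
    user_id: str
    kind: MemoryKind
    content: str
    key: Optional[str] = None  # fact key (semantic) or task type (procedural)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def persistent(self) -> bool:
        # Only working memory is limited to a single session.
        return self.kind is not MemoryKind.WORKING


def upsert_semantic(store: dict, record: MemoryRecord) -> None:
    """Semantic memories are updateable: same (user, key) replaces the old value."""
    if record.kind is not MemoryKind.SEMANTIC or record.key is None:
        raise ValueError("upsert_semantic expects a semantic record with a key")
    store[(record.user_id, record.key)] = record
```

Keying semantic records by `(user_id, key)` is one way to get the "updateable" behaviour from the table: a changed preference overwrites the old one instead of piling up next to it.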

LangMem Architecture

LangMem (LangChain's memory library) provides a structured approach to memory formation, storage, and retrieval. Its key insight is the memory formation trigger: rather than dumping every conversation into a store, it uses the LLM itself to decide what's worth remembering.

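# Client-style API as written for this post. LangMem's method names have
# changed between releases, so check them against the version you install.
from langmem import AsyncClient
from langchain_core.messages import HumanMessage, AIMessage

client = AsyncClient()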

# After each conversation turn, trigger memory extraction
async def process_turn(user_id: str, messages: list):
    # LangMem sends the conversation to an LLM to extract memories
    # Only salient facts are stored — not raw conversation
    await client.add_messages(
        thread_id=f"user_{user_id}",
        messages=messages
    )

# Retrieve relevant memories before generating a response
async def get_context(user_id: str, query: str):
    memories = await client.search_user_memory(
        user_id=user_id,
        query=query,
        limit=5
    )
    return [m.content for m in memories.items]

LangMem's memory formation uses a configurable extraction prompt: the LLM reads the conversation and extracts structured facts in a schema you define. These are stored in a vector database. The quality of this extraction step determines the quality of the entire memory system.
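
Because everything downstream depends on the extraction step, it can help to validate the extractor's output before anything is stored. The sketch below is not LangMem's API: the `facts` / `category` / `text` / `confidence` schema, `ALLOWED_CATEGORIES`, and `parse_extraction` are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Hypothetical schema categories; define your own to match your extraction prompt.
ALLOWED_CATEGORIES = {"preference", "fact", "context"}


@dataclass(frozen=True)
class ExtractedFact:
    category: str
    text: str
    confidence: float


def parse_extraction(raw: str) -> list[ExtractedFact]:
    """Validate the extractor's JSON output and drop items that do not fit the schema."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output stores nothing rather than storing garbage
    if not isinstance(payload, dict):
        return []
    items = payload.get("facts")
    if not isinstance(items, list):
        return []

    facts = []
    for item in items:
        if not isinstance(item, dict):
            continue
        category = item.get("category")
        text = item.get("text")
        confidence = item.get("confidence", 0.0)
        if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
            continue
        if not isinstance(text, str) or not text.strip():
            continue
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue
        facts.append(ExtractedFact(category, text.strip(), float(confidence)))
    return facts
```

Rejecting bad items silently keeps the store clean, but in practice you would probably also log the rejects, since a rising reject rate is an early sign that the extraction prompt has drifted.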

Mem0 vs LangMem

| Dimension | LangMem | Mem0 |
|---|---|---|
| Philosophy | Extraction-based: LLM decides what to remember | Storage-based: stores everything, retrieves selectively |
| Ease of setup | More config required | Simple API, hosted option available |
| Storage backends | PostgreSQL, Pinecone, custom | Qdrant, Chroma, hosted |
| Memory types | User / thread / org memories | User / agent / session memories |
| Deduplication | Built-in LLM-based merge | Configurable similarity threshold |
| Pricing (hosted) | N/A (self-hosted library) | Free tier + paid plans |
| Production maturity | Production-ready (LangChain org) | Production-ready, fast-growing |

Mem0 is easier to get started with, especially with their hosted option. LangMem gives you more control over the extraction step and integrates more tightly with LangGraph. For most teams: start with Mem0 hosted, migrate to LangMem + self-hosted when you need customization.
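
If a later migration is likely, one option is to keep agent code behind a small interface so the library can be swapped without touching call sites. The sketch below assumes a hypothetical `MemoryBackend` protocol with `add` and `search`; it does not reproduce either library's real API, and `InMemoryBackend` is a toy stand-in that uses keyword overlap instead of vector search.

```python
from typing import Protocol


class MemoryBackend(Protocol):
    """Hypothetical interface; write one thin adapter per library."""

    def add(self, user_id: str, text: str) -> None: ...

    def search(self, user_id: str, query: str, limit: int = 5) -> list[str]: ...


class InMemoryBackend:
    """Toy backend for tests: naive keyword overlap, no embeddings."""

    def __init__(self) -> None:
        self._items: dict[str, list[str]] = {}

    def add(self, user_id: str, text: str) -> None:
        self._items.setdefault(user_id, []).append(text)

    def search(self, user_id: str, query: str, limit: int = 5) -> list[str]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), text)
            for text in self._items.get(user_id, [])
        ]
        # Keep only memories sharing at least one word with the query.
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:limit]]


def build_context(backend: MemoryBackend, user_id: str, query: str) -> str:
    """Agent code depends only on the protocol, never on a specific library."""
    return "\n".join(f"- {m}" for m in backend.search(user_id, query))
```

With this shape, moving from a hosted service to a self-hosted one means writing a second adapter class, and the in-memory version doubles as a test fixture.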

Custom Pattern: Extract → Store → Retrieve → Inject

If you need full control (or don't want a library dependency), implement the four-step pattern directly:

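import json  # needed by extract_memories to parse the model's JSON output
import openai
import chromadb
from datetime import datetime

client = openai.AsyncOpenAI()
# chromadb.Client() keeps data in memory only; use
# chromadb.PersistentClient(path=...) if memories must survive a restart.
db = chromadb.Client()
collection = db.get_or_create_collection("agent_memories")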

# Step 1: EXTRACT — LLM decides what to remember
async def extract_memories(conversation: str, user_id: str) -> list[str]:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for extraction
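        messages=[{
            "role": "system",
            # JSON mode returns an object, so ask for the "memories" key
            # that the parsing code below reads.
            "content": """Extract key facts from this conversation worth remembering.
            Focus on: user preferences, stated facts, important context.
            Return a JSON object of the form {"memories": ["...", "..."]}.
            Use {"memories": []} if nothing is memorable."""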
        }, {
            "role": "user", "content": conversation
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(resp.choices[0].message.content).get("memories", [])

# Step 2: STORE — embed and persist
async def store_memories(memories: list[str], user_id: str):
    if not memories:
        return
    embeddings = await client.embeddings.create(
        model="text-embedding-3-small",
        input=memories
    )
    collection.upsert(
        ids=[f"{user_id}_{datetime.now().isoformat()}_{i}" for i in range(len(memories))],
        documents=memories,
        embeddings=[e.embedding for e in embeddings.data],
        metadatas=[{"user_id": user_id, "timestamp": datetime.now().isoformat()} for _ in memories]
    )

# Step 3: RETRIEVE — semantic search at query time
async def retrieve_memories(query: str, user_id: str, n: int = 5) -> list[str]:
    query_embedding = (await client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n,
        where={"user_id": user_id}
    )
    return results["documents"][0]

# Step 4: INJECT — prepend to system prompt
def inject_memories(system_prompt: str, memories: list[str]) -> str:
    if not memories:
        return system_prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"{system_prompt}\n\nWhat you know about this user:\n{memory_block}"
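
The four steps still need to be wired into a turn loop. Below is a minimal sketch of one way to do that, assuming the step functions are passed in as async callables with the same argument order as in the code above. `handle_turn` and `generate` are hypothetical names, and `inject` repeats the injection step so the block is self-contained.

```python
import asyncio
from typing import Awaitable, Callable

# Same argument order as the step functions above; `Generate` is an assumption.
Extract = Callable[[str, str], Awaitable[list[str]]]   # (conversation, user_id)
Store = Callable[[list[str], str], Awaitable[None]]    # (memories, user_id)
Retrieve = Callable[[str, str], Awaitable[list[str]]]  # (query, user_id)
Generate = Callable[[str, str], Awaitable[str]]        # (system_prompt, message)


def inject(system_prompt: str, memories: list[str]) -> str:
    if not memories:
        return system_prompt
    block = "\n".join(f"- {m}" for m in memories)
    return f"{system_prompt}\n\nWhat you know about this user:\n{block}"


async def handle_turn(
    user_id: str,
    user_message: str,
    system_prompt: str,
    *,
    extract: Extract,
    store: Store,
    retrieve: Retrieve,
    generate: Generate,
) -> str:
    # Retrieve and inject before generating, so the reply can use past context.
    memories = await retrieve(user_message, user_id)
    reply = await generate(inject(system_prompt, memories), user_message)

    # Extract and store after the turn, so this exchange informs later ones.
    transcript = f"User: {user_message}\nAssistant: {reply}"
    await store(await extract(transcript, user_id), user_id)
    return reply
```

Passing the steps in as arguments keeps the loop testable with fakes. In a real deployment the extract-and-store half could probably run as a background task, since the user does not need to wait for it.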

The Retrieval Problem

The hardest part of agent memory is not storage — it's knowing what to retrieve and when. The naive approach (retrieve on every turn using the user's message as query) misses two key patterns:

Better retrieval: add a 'memory query expansion' step. Before searching, have the LLM rewrite the current message as a memory-search query, for example 'What do I know about this user's preferences related to [topic]?' Stored memories are phrased as facts about the user, so a query phrased the same way tends to match them better than the raw message does.
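
A minimal sketch of the expansion step, assuming an `llm` callable (prompt in, text out) and a `search` callable shaped like the retrieve step. Both are hypothetical stand-ins, and the prompt wording is illustrative only. This version searches with both the rewritten query and the raw message, then merges the results.

```python
import asyncio
from typing import Awaitable, Callable

LLM = Callable[[str], Awaitable[str]]                # hypothetical: prompt -> text
Search = Callable[[str, str], Awaitable[list[str]]]  # (query, user_id) -> memories

EXPANSION_PROMPT = (
    "Rewrite the user's message as a short query for searching stored facts "
    "about this user. Return only the query.\n\nMessage: {message}"
)


async def retrieve_with_expansion(
    message: str,
    user_id: str,
    *,
    llm: LLM,
    search: Search,
    limit: int = 5,
) -> list[str]:
    expanded = (await llm(EXPANSION_PROMPT.format(message=message))).strip()

    # Search with both phrasings; skip the rewrite if the model returned nothing.
    queries = [q for q in (expanded, message) if q]
    batches = await asyncio.gather(*(search(q, user_id) for q in queries))

    # Merge in order, dropping duplicates, with expanded-query hits first.
    seen: set[str] = set()
    merged: list[str] = []
    for batch in batches:
        for memory in batch:
            if memory not in seen:
                seen.add(memory)
                merged.append(memory)
    return merged[:limit]
```

Keeping the raw message as a second query is a hedge against a bad rewrite. The cost is one extra LLM call and one extra search per turn, which may or may not be acceptable depending on latency budget.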

Memory Poisoning Risks

Memory stores are a persistent attack surface. If an attacker can inject content into the memory store — through a conversation, a retrieved document, or a tool result — that content persists and gets injected into future conversations.
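
One way to narrow this surface is to gate what gets written and to label what gets injected. The sketch below is an assumption-heavy illustration, not a complete defence: `TRUSTED_SOURCES`, the regex patterns, `admit`, and `render_memories` are all hypothetical, and pattern matching alone will miss many injection attempts.

```python
import re
from dataclasses import dataclass

# Assumption: only statements typed directly by the user may become memories.
TRUSTED_SOURCES = {"user_message"}

# Illustrative patterns only; real deployments would need broader screening.
SUSPICIOUS = [
    re.compile(r"\bignore (all |any )?(previous|prior|above) instructions\b", re.I),
    re.compile(r"\b(system prompt|developer message)\b", re.I),
    re.compile(r"\byou (must|should) (always|never)\b", re.I),
]


@dataclass(frozen=True)
class Candidate:
    text: str
    source: str  # e.g. "user_message", "tool_result", "retrieved_document"


def admit(candidate: Candidate) -> bool:
    """Decide whether a candidate memory may be written to the store."""
    if candidate.source not in TRUSTED_SOURCES:
        return False
    return not any(p.search(candidate.text) for p in SUSPICIOUS)


def render_memories(memories: list[str]) -> str:
    """Inject memories as quoted data, flagged as not being instructions."""
    lines = "\n".join(f"- {m!r}" for m in memories)
    return "Stored notes about this user (data, not instructions):\n" + lines
```

Recording the `source` of every candidate is the part most worth keeping: even if the filter is later replaced by a classifier, provenance is what lets you audit and purge memories that came from a compromised tool or document.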

Privacy and PII in Memory Stores

Memory stores contain sensitive user data by design. This creates real compliance obligations:
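
As a hedged illustration of two mitigations that often come up here, the sketch below redacts obvious PII before a memory is persisted and supports deleting everything held for one user. The regexes are deliberately simple and will miss many formats, and `UserScopedStore` is a hypothetical in-memory stand-in for a real database.

```python
import re

# Simplified patterns for illustration; they will not catch every format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class UserScopedStore:
    """Toy store that redacts on write and can erase one user's data."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = {}

    def add(self, user_id: str, text: str) -> None:
        # Redact before persisting, so raw PII never reaches storage.
        self._by_user.setdefault(user_id, []).append(redact(text))

    def get(self, user_id: str) -> list[str]:
        return list(self._by_user.get(user_id, []))

    def delete_user(self, user_id: str) -> int:
        """Remove every memory for a user; returns how many were deleted."""
        return len(self._by_user.pop(user_id, []))
```

Scoping every record by `user_id`, as the custom pattern's metadata already does, is what makes a per-user delete cheap. In a vector store the same operation would be a metadata-filtered delete.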

Production Checklist

