
The Four Memory Problems Every Agent Has

Short-term, long-term, episodic, and semantic memory each require different storage and retrieval strategies. Why "just use a vector DB" fails, the similar≠relevant problem, and the decision layer that makes agents actually get smarter.

The memory problem no one talks about

Most agent tutorials show you how to call tools. Almost none show you how the agent remembers anything between calls — or across sessions. This is the gap that kills agents in production. The agent forgets the user's name. It repeats a question it asked three turns ago. It retrieves a document from six months ago that was superseded last week. These are not model quality problems. They are memory architecture problems.

There are four distinct memory problems every agent faces. Each one requires a different storage technology and retrieval strategy. Treating them all the same — dumping everything into a vector DB, or stuffing everything into the context window — is why most production agents break in the same three ways.

The four memory types

1. Short-term memory — the context window

Short-term memory is whatever is currently in the context window: the conversation history, retrieved documents, tool outputs, system prompt. It is fast, perfectly accurate, and immediately available. It is also expensive and finite.

The failure mode is cost explosion. A customer support agent that appends every message to the context window hits 100K tokens within a few dozen turns. At $3 per million input tokens, a single call at that size costs $0.30, and because the full history is resent on every turn, the session total is a multiple of that. Acceptable for a demo, catastrophic at 10,000 sessions/day. The naive fix, truncating old messages, introduces a worse failure: the agent forgets critical context from earlier in the conversation.
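To make the arithmetic concrete, here is a small sketch of what append-everything costs across a whole session. The 40 turns and 2,500 tokens per turn are illustrative assumptions chosen so the final call lands at 100K tokens; only the $3/M price comes from the text above.

```python
def append_everything_cost(turns: int, tokens_per_turn: int, usd_per_million: float):
    """Return (total input tokens, USD cost) when the full history is resent every turn."""
    # Call k carries k turns of history, so the session total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    total_tokens = tokens_per_turn * turns * (turns + 1) // 2
    return total_tokens, total_tokens / 1_000_000 * usd_per_million


# Hypothetical session: 40 turns of 2,500 tokens each (the last call is 100K tokens).
tokens, usd = append_everything_cost(turns=40, tokens_per_turn=2_500, usd_per_million=3.0)
print(f"{tokens:,} input tokens, ${usd:.2f} per session")
```

The last call alone is $0.30, but the session as a whole pays for every earlier call too, which is why the growth is quadratic in the number of turns rather than linear.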

Production pattern: maintain a rolling window of the last N turns verbatim, plus a compressed summary of earlier context. Summarize every K turns and keep the summary pinned at the top of the context. In long sessions this can cut input-token cost by 60-80% while the agent retains the thread.
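A minimal sketch of that pattern, assuming a pluggable summarizer. The class name RollingContext, its fields, and the naive_summarize placeholder are all hypothetical; in a real agent the summarizer would be an LLM call rather than string truncation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def naive_summarize(previous_summary: str, turns: List[Turn]) -> str:
    """Placeholder summarizer (assumption): a real one would call an LLM."""
    parts = [previous_summary] if previous_summary else []
    parts += [f"{role}: {text[:40]}" for role, text in turns]
    return " | ".join(parts)


@dataclass
class RollingContext:
    window: int = 4            # N: turns kept verbatim
    summarize_every: int = 2   # K: fold turns into the summary once K have aged out
    summarizer: Callable[[str, List[Turn]], str] = naive_summarize
    summary: str = ""
    recent: List[Turn] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.recent.append((role, text))
        overflow = len(self.recent) - self.window
        if overflow >= self.summarize_every:
            # Move the oldest turns out of the verbatim window and into the summary.
            aged, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarizer(self.summary, aged)

    def build_messages(self, system_prompt: str) -> List[Turn]:
        messages: List[Turn] = [("system", system_prompt)]
        if self.summary:
            # The summary stays pinned directly after the system prompt.
            messages.append(("system", f"Summary of earlier conversation: {self.summary}"))
        return messages + self.recent
```

Between summarization passes the window briefly holds up to N + K - 1 turns; that slack is what lets the summarizer run once per K turns instead of on every message.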

2. Long-term memory — vector retrieval

Long-term memory is the persistent store of facts, documents, and knowledge the agent can draw on across sessions. The standard implementation is a vector DB: embed the content, store the vectors, retrieve by semantic similarity at query time.

The critical failure here is the similar ≠ relevant problem. Vector search returns semantically similar content. Semantically similar is not the same as contextually relevant for the current task. A query about a user's project timeline may retrieve many documents about project management best practices — all semantically close — while missing the specific document that says this user's deadline was moved to Friday.

A second failure: long-term memory has no native notion of recency or authority. A document from two years ago and a document from last week look identical to the retriever. Without explicit filtering on freshness metadata, agents confidently surface stale information.

Long-term memory retrieval should always include: (1) a relevance score threshold: don't inject content below a cutoff such as 0.75 cosine similarity (the right value depends on the embedding model, so tune it on your own data), (2) a recency signal: filter on document date where you can, and surface the date alongside retrieved content so the LLM can reason about staleness, (3) a source authority signal: internal docs outrank web-scraped docs for factual queries.
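Those three checks can be sketched as a post-processing step over whatever the vector store returns. The Hit type, the AUTHORITY ranking, and select_context are hypothetical names for illustration; the similarity scores are assumed to come from the store already.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Hit:
    text: str
    score: float   # cosine similarity reported by the vector store
    updated: date  # last-modified date kept as metadata
    source: str    # e.g. "internal" or "web"


# Hypothetical authority ranking: higher wins when ordering results.
AUTHORITY = {"internal": 1, "web": 0}


def select_context(hits: List[Hit], min_score: float = 0.75, limit: int = 3) -> List[str]:
    # (1) Drop anything below the relevance threshold.
    kept = [h for h in hits if h.score >= min_score]
    # (3) Prefer authoritative sources, then higher similarity.
    kept.sort(key=lambda h: (AUTHORITY.get(h.source, 0), h.score), reverse=True)
    # (2) Surface the date so the model can reason about staleness.
    return [f"[{h.source}, updated {h.updated.isoformat()}] {h.text}" for h in kept[:limit]]
```

With this ordering, an internal note scoring 0.78 is injected ahead of a web article scoring 0.81, and anything under the cutoff never reaches the prompt.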

3. Episodic memory — what happened, exactly

Episodic memory is the record of specific past events: what the user asked last Tuesday, what the agent did in response, what the outcome was. Unlike long-term memory (which stores general knowledge), episodic memory stores particular instances with exact timing and sequencing.

Vector search is wrong for episodic memory. If a user asks 'what did I ask you yesterday?', you do not want semantically similar past queries; you want the exact queries from yesterday, ordered by time. This is a structured lookup, not a similarity search. Postgres or any relational store with a timestamp index is the right tool. For example, to fetch the last 24 hours for one user: SELECT * FROM episodes WHERE user_id = X AND created_at > NOW() - INTERVAL '1 day' ORDER BY created_at.
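Here is a runnable sketch of the same lookup. It uses Python's built-in sqlite3 as a stand-in for Postgres so it runs without a server; the table layout and the helper names (open_store, log_episode, yesterday_bounds, episodes_between) are assumptions for illustration. It queries the previous calendar day rather than the last 24 hours, which is closer to what "yesterday" means to a user.

```python
import sqlite3
from datetime import datetime, timedelta


def open_store() -> sqlite3.Connection:
    """Create an in-memory episode log (SQLite stands in for Postgres here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE episodes (user_id TEXT NOT NULL, created_at TEXT NOT NULL, content TEXT NOT NULL)"
    )
    # Composite index so per-user time-range scans stay cheap.
    conn.execute("CREATE INDEX idx_episodes_user_time ON episodes (user_id, created_at)")
    return conn


def log_episode(conn, user_id: str, created_at: datetime, content: str) -> None:
    # ISO-8601 strings in one fixed format sort in chronological order.
    conn.execute(
        "INSERT INTO episodes VALUES (?, ?, ?)",
        (user_id, created_at.isoformat(timespec="seconds"), content),
    )


def yesterday_bounds(now: datetime):
    """Return [start, end) covering the calendar day before `now`."""
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return today - timedelta(days=1), today


def episodes_between(conn, user_id: str, start: datetime, end: datetime):
    rows = conn.execute(
        "SELECT created_at, content FROM episodes "
        "WHERE user_id = ? AND created_at >= ? AND created_at < ? "
        "ORDER BY created_at",
        (user_id, start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds")),
    )
    return rows.fetchall()
```

The result is exact and ordered: every row for that user in the window, nothing from other users, and no ranking by similarity involved.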

Agents that confabulate about past interactions ('As we discussed earlier...' when no such discussion happened) usually have no episodic memory at all. The agent is generating plausible-sounding history from parametric memory rather than retrieving actual history from a structured store.

4. Semantic memory — learned preferences

Semantic memory is the agent's model of the user: their preferences, working style, recurring needs, communication preferences. Not what happened last Tuesday (episodic), but what is generally true about this person across all their interactions.

This is the hardest memory type to build well. It requires inference — the agent must conclude from many interactions that 'this user prefers concise answers' or 'this user always wants cost estimates alongside recommendations.' That inference must then be stored, updated over time, and retrieved at the start of every session to personalize behavior.

The practical implementation: a key-value store or structured Postgres table of user preferences, updated by a background process that periodically summarizes recent interactions and extracts or updates preference signals. At the start of every session, the stored preferences are loaded and injected into the system prompt.
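A minimal sketch of the read path, assuming a plain dict in place of the key-value store or Postgres table. The function names and the prompt wording are hypothetical; the background process that extracts preferences is not shown and would call upsert_preference with whatever it infers.

```python
from typing import Dict

# user_id -> {preference_key: value}; stands in for a key-value store or Postgres table.
PreferenceStore = Dict[str, Dict[str, str]]


def upsert_preference(store: PreferenceStore, user_id: str, key: str, value: str) -> None:
    """Insert a preference or overwrite the previous value for the same key."""
    store.setdefault(user_id, {})[key] = value


def build_system_prompt(store: PreferenceStore, base_prompt: str, user_id: str) -> str:
    """Append the user's stored preferences to the base system prompt."""
    prefs = store.get(user_id, {})
    if not prefs:
        return base_prompt  # unknown user: no personalization
    lines = [f"- {key}: {value}" for key, value in sorted(prefs.items())]
    return base_prompt + "\n\nKnown user preferences:\n" + "\n".join(lines)
```

Keying preferences by name means a later observation overwrites the earlier one instead of accumulating contradictory rows.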

The production memory stack

The teams building reliable agents at scale typically run a layered stack:

Redis (hot cache) — last 10–20 interactions, sub-millisecond read, evicts on TTL. Feeds short-term context compression.
Postgres (structured store) — episodic memory (full interaction log with timestamps) + semantic memory (user preferences as structured rows). Queryable by time, user, outcome.
Vector DB — long-term knowledge retrieval. Chroma or pgvector for smaller scale, Qdrant or Pinecone for production volume. Filtered by metadata before semantic search.
LLM layer — decides what to fetch, what to write, what to discard. This is where most implementations fail.
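To show what the hot-cache tier is responsible for, here is an in-process sketch of its behavior: a bounded list of recent interactions per user that disappears after a TTL. This is not Redis; the class HotInteractionCache and its parameters are hypothetical, and in Redis the same effect would typically come from a capped list plus key expiry.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class HotInteractionCache:
    """Keeps the last `max_items` interactions per user, dropped after `ttl_seconds` of inactivity."""

    def __init__(self, max_items: int = 20, ttl_seconds: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data: Dict[str, Tuple[float, Deque[str]]] = {}

    def recent(self, user_id: str) -> List[str]:
        entry = self._data.get(user_id)
        if entry is None or entry[0] <= self.clock():
            # Missing or expired: evict lazily on read.
            self._data.pop(user_id, None)
            return []
        return list(entry[1])

    def push(self, user_id: str, interaction: str) -> None:
        # Start from the unexpired history; deque(maxlen=...) drops the oldest item when full.
        items: Deque[str] = deque(self.recent(user_id), maxlen=self.max_items)
        items.append(interaction)
        # Every write refreshes the TTL for this user.
        self._data[user_id] = (self.clock() + self.ttl, items)
```

A cache miss here is not an error. It means the session went cold, and the agent should rebuild context from the structured store instead.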

The real problem is not storage. It is the decision layer: knowing when to remember vs when to forget, what is worth writing to long-term memory vs what should stay ephemeral, and when retrieved memory is stale enough to be harmful. An agent that remembers everything is as broken as one that remembers nothing.

What to remember and what to forget

The decision layer is a small LLM call (Haiku or GPT-4o-mini) that runs at the end of each interaction and answers: should anything from this session be written to long-term or semantic memory? It evaluates: was a new preference expressed? Was a fact established that will matter in future sessions? Was an outcome reached that future sessions should know about?

Without this layer, agents write everything (cost explosion, noise overwhelms signal) or nothing (no persistence across sessions). The decision layer is the difference between an agent that gets smarter with use and one that starts from zero every session.
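A sketch of what that end-of-session call might look like. The prompt wording, the JSON shape, and the three kinds (preference, fact, outcome) are assumptions that mirror the three questions above; llm is any callable that takes a prompt string and returns the model's text, so no particular provider SDK is implied.

```python
import json
from typing import Callable, Dict, List

ALLOWED_KINDS = {"preference", "fact", "outcome"}

# Hypothetical prompt; the JSON contract is an assumption, not a standard.
DECISION_PROMPT = (
    "You review a finished agent session and decide what to persist.\n"
    "Answer three questions: was a new preference expressed, was a fact established "
    "that will matter in future sessions, was an outcome reached that future sessions "
    "should know about?\n"
    'Reply with JSON only: {"writes": [{"kind": "preference|fact|outcome", "content": "..."}]}. '
    "Use an empty list when nothing is worth keeping.\n\nSession transcript:\n"
)


def decide_memory_writes(transcript: str, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Ask a small model what to persist; return only well-formed write requests."""
    raw = llm(DECISION_PROMPT + transcript)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # fail closed: unparseable output writes nothing
    items = parsed.get("writes") if isinstance(parsed, dict) else None
    if not isinstance(items, list):
        return []
    writes = []
    for item in items:
        # Ignore anything outside the agreed kinds or without content.
        if isinstance(item, dict) and item.get("kind") in ALLOWED_KINDS and item.get("content"):
            writes.append({"kind": item["kind"], "content": str(item["content"])})
    return writes
```

Failing closed is a deliberate choice in this sketch: a missed write costs one remembered fact, while a malformed write can pollute every future session.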

The forgetting side is equally important. Semantic memory preferences should decay or be explicitly overridden when contradicted. Episodic memory should be pruned by outcome — resolved interactions matter less than open ones. Long-term knowledge should be invalidated when source documents are updated. Memory without a retention policy is a liability.
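For the preference side of forgetting, one possible retention policy is confidence that decays with age and resets on contradiction. The half-life, the 0.2 reinforcement step, the 0.5 starting confidence, and the 0.3 floor are illustrative assumptions, not recommended values, and all names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Preference:
    value: str
    confidence: float
    last_seen: datetime


def decayed_confidence(pref: Preference, now: datetime, half_life_days: float = 90.0) -> float:
    """Confidence halves every `half_life_days` without a fresh observation."""
    age_days = (now - pref.last_seen).total_seconds() / 86400
    return pref.confidence * 0.5 ** (age_days / half_life_days)


def observe(prefs: Dict[str, Preference], key: str, value: str, now: datetime) -> None:
    current = prefs.get(key)
    if current is not None and current.value == value:
        # Same signal again: reinforce, capped at 1.0.
        boosted = min(1.0, decayed_confidence(current, now) + 0.2)
        prefs[key] = Preference(value, boosted, now)
    else:
        # New or contradicting signal: override and restart at moderate confidence.
        prefs[key] = Preference(value, 0.5, now)


def active_preferences(prefs: Dict[str, Preference], now: datetime, floor: float = 0.3) -> Dict[str, str]:
    """Only preferences still above the floor are injected into the prompt."""
    return {k: p.value for k, p in prefs.items() if decayed_confidence(p, now) >= floor}
```

Under these numbers a preference seen once and never confirmed drops out of the prompt after roughly two months, while one that keeps recurring stays near full confidence.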
