

Agent memory, end to end


Large language models are stateless. Each request stands on its own, and the model only knows what fits inside the current context window. For an agent meant to help across days and dozens of conversations, that’s a real limitation. It has to remember decisions, preferences, and facts gathered weeks ago, then pull the right ones up at the right moment.

The obvious fix is to keep the whole chat history and feed it into every prompt. That gets expensive fast, and quality drops as the window fills with noise. Storing history is the easy part. The hard part comes after: deciding what’s worth keeping, finding the right piece when the wording has changed, and keeping the store from filling up with stale duplicates.

Why full history gets expensive

[Figure: cumulative input tokens plotted against conversation turns, with one line for "replay full history" and one for "bounded memory + retrieval".]

Replaying the whole transcript means every turn re-sends every earlier turn, so the tokens you pay for climb with the square of the conversation length. A bounded store plus retrieval sends a roughly fixed amount each turn.

Illustrative model. This assumes about 600 new tokens per turn and a fixed ~1,500-token bounded prompt.
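The arithmetic behind that model can be checked in a few lines. This is a minimal sketch under the figure's stated assumptions (600 new tokens per turn, a fixed 1,500-token bounded prompt); it counts input tokens only, and the function names are made up for illustration.

```python
NEW_TOKENS_PER_TURN = 600      # assumption taken from the illustrative model
BOUNDED_PROMPT_TOKENS = 1500   # assumption taken from the illustrative model


def replay_cumulative(turns: int) -> int:
    """Total input tokens when turn k re-sends all k turns so far."""
    # Turn k sends 600 * k tokens; summing k = 1..n gives 600 * n * (n + 1) / 2.
    return NEW_TOKENS_PER_TURN * turns * (turns + 1) // 2


def bounded_cumulative(turns: int) -> int:
    """Total input tokens when every turn sends the same fixed-size prompt."""
    return BOUNDED_PROMPT_TOKENS * turns


for n in (10, 50, 100):
    print(n, replay_cumulative(n), bounded_cumulative(n))
# 10 33000 15000
# 50 765000 75000
# 100 3030000 150000
```

Under these assumptions the two lines cross early: at 10 turns replay costs about twice as much as the bounded prompt, and at 100 turns about twenty times as much.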

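Most production agent-memory systems land on a similar shape: a small curated store for durable facts, a searchable log of past conversations, and a background process that keeps both tidy. The vocabulary for it comes from cognitive science, and the working examples are real: Hermes Agent, mem0, and Quarq.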

The vocabulary

Cognitive scientists split human long-term memory into a few kinds, a distinction that traces back to Endel Tulving in the 1970s. People building agents quietly adopted the same words, because they map well onto the problem.

Semantic memory is decontextualised fact: “Paris is the capital of France,” or “this user prefers TypeScript.” You know it without recalling when you learned it.

Episodic memory is tied to a time and place. What happened, and when. A meeting last Tuesday, a decision made 3 weeks ago.

Procedural memory is how to do something: a skill or routine, like the steps to cut a software release.

Above all of these sits working memory, the small amount in active use right now. For an LLM that’s the context window of the current turn. Information moves between the stores through encoding, consolidation, retrieval, and decay (forgetting).

Few agent systems build these as cleanly separated stores. More often the categories emerge from components built for other reasons, and the labels fit after the fact. Here’s how they tend to line up.

The taxonomy, mapped to real components

  • Working memory: what you're holding right now. Component: the context window, this turn.
  • Semantic memory: decontextualised facts. Component: two char-capped files, MEMORY.md and USER.md.
  • Episodic memory: events, with a when. Component: an append-only log of every conversation.
  • Procedural memory: how to do things. Component: skills, loaded on demand.

How information moves between these stores

  • Encoding: writing something into memory
  • Consolidation: merging and tidying entries over time
  • Retrieval: finding the right memory again
  • Decay: dropping what is no longer used

Semantic memory: a small, curated fact store

Semantic memory holds the durable facts an agent should always have on hand: where the user is, what stack they use, how they like their answers formatted. These get read on every turn, so they have to stay small and stay reliable.

One clean approach, used by Hermes Agent from Nous Research, keeps them in 2 short text files: one for facts about the project or environment, one for facts about the user. No database, no embeddings, no retrieval step. The files are read once at the start of a turn and frozen into the system prompt.

The detail that makes this work is a hard size limit. Hermes caps the files at roughly 2,200 and 1,375 characters. The limit is in characters rather than tokens, which keeps the count identical across models and easy to diff in version control.

The cap sounds stingy, and that’s the point. When a file is full the agent can’t just append, so it has to rewrite or drop something to make room. A bounded store forces curation.

Editing a store like this has its own traps. You point at the entry you mean by matching its full text after normalising it (collapsing whitespace, lowercasing, trimming), not by substring, since a loose partial match will eventually delete the wrong line.
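The cap, the duplicate check, and the full-text match fit in a small class. This is a sketch, not Hermes Agent's code: the class name, the return values, and the way the § delimiter is padded with newlines are assumptions made for the example.

```python
import re
from typing import Optional


def normalise(text: str) -> str:
    """Collapse whitespace, trim, and lowercase so near-identical entries compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


class BoundedStore:
    """A character-capped list of entries (hypothetical helper for illustration)."""

    def __init__(self, cap: int, delimiter: str = "\n§\n") -> None:
        self.cap = cap
        self.delimiter = delimiter  # assumed layout of the § separator
        self.entries: list[str] = []

    def size(self, entries: Optional[list[str]] = None) -> int:
        """Length in characters of the serialised file."""
        items = self.entries if entries is None else entries
        return len(self.delimiter.join(items))

    def add(self, entry: str) -> str:
        entry = entry.strip()
        if any(normalise(e) == normalise(entry) for e in self.entries):
            return "duplicate"
        if self.size(self.entries + [entry]) > self.cap:
            return "full"  # caller must rewrite or remove something first
        self.entries.append(entry)
        return "added"

    def remove(self, entry: str) -> bool:
        """Remove only on a full normalised match, never on a substring."""
        target = normalise(entry)
        for i, existing in enumerate(self.entries):
            if normalise(existing) == target:
                del self.entries[i]
                return True
        return False
```

Returning "full" instead of silently truncating is the design choice that matters here: it pushes the decision about what to drop back to the agent.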

Try it: write to memory

MEMORY.md (cap: 2,200 characters)
  • Project uses Node 24, SQLite, and TypeScript.
  • Deploy script needs the --staging flag before the build step.
USER.md (cap: 1,375 characters)
  • Prefers concise, direct answers with concrete examples.

The cap is the feature. When a file is full the only way to add is to remove or rewrite something. In the interactive version, adding a duplicate or an over-long entry shows the dedupe check and the limit rejecting it.

The 2,200 / 1,375-character caps, the § delimiter, and the normalized duplicate check are taken from Hermes Agent's memory store.

Surviving compaction

Eventually a conversation outgrows the context window, and the agent has to compress older turns into a summary to keep going. Summaries are lossy, so any detail the summary drops is gone. If that detail was the one fact worth keeping, the agent acts like it never happened.

A fix popularised by Hermes Agent runs a step before the summary: a memory flush. The system spins up a short-lived agent whose only tool is memory, hands it the slice of conversation about to be compressed, and asks it to save anything durable first. Only then does the summariser run, so durable facts land in the permanent store before the lossy step touches them.

How the flush handles failure is an implementation choice. A robust version makes it best-effort: if the extra model call fails, the system logs it and summarises anyway, since a failed flush is no worse than ordinary compaction. Make it mandatory and a flaky model call can stall the whole conversation, which is worse than missing the odd fact.

A practical detail helps here too: showing the flush the current memory before it writes, so it doesn’t re-save what the agent already knows.
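The ordering and the failure handling can be sketched with two injected callables. Everything here is hypothetical: `flush` stands in for the short-lived memory-only agent and `summarise` for the summariser, and neither is a real Hermes Agent API.

```python
import logging
from typing import Callable

logger = logging.getLogger("memory_flush")


def compact(
    old_turns: list[str],
    current_memory: list[str],
    flush: Callable[[list[str], list[str]], list[str]],
    summarise: Callable[[list[str]], str],
) -> tuple[str, list[str]]:
    """Flush durable facts first, then summarise. Returns (summary, updated memory)."""
    memory = list(current_memory)
    try:
        # The flush is shown the current memory so it can skip known facts.
        for fact in flush(old_turns, memory):
            if fact not in memory:
                memory.append(fact)
    except Exception:
        # Best-effort: log the failure and carry on with ordinary compaction.
        logger.exception("memory flush failed; summarising anyway")
    return summarise(old_turns), memory
```

If `flush` raises, the function still returns a summary and the memory it started with, which is the "no worse than ordinary compaction" property described above.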

Watch a context window compact

[Interactive demo: a 1,200-token context window fills as turns are added. Once it passes the threshold (the dashed line), older turns are replaced by a summary and recent turns are kept verbatim.]

Illustrative model. The window size and per-turn token counts are made up for the demo.

Episodic recall

Where semantic memory holds what the agent knows, episodic memory holds what happened. A common implementation is an append-only log of every past conversation in a local database, with each message and tool call stored as a row and nothing overwritten.
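A log like that needs very little schema. The sketch below uses Python's built-in `sqlite3` module; the table layout, column names, and the triggers that reject edits are assumptions for illustration, not any particular system's schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    role TEXT NOT NULL,            -- e.g. user, assistant, tool
    content TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Enforce append-only at the database level: updates and deletes abort.
CREATE TRIGGER IF NOT EXISTS messages_no_update BEFORE UPDATE ON messages
BEGIN SELECT RAISE(ABORT, 'log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS messages_no_delete BEFORE DELETE ON messages
BEGIN SELECT RAISE(ABORT, 'log is append-only'); END;
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append(conn: sqlite3.Connection, conversation_id: str, role: str, content: str) -> int:
    """Insert one row and return its id; nothing is ever overwritten."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO messages (conversation_id, role, content) VALUES (?, ?, ?)",
            (conversation_id, role, content),
        )
    return cur.lastrowid
```

The triggers are optional, but they turn "we never overwrite" from a convention into something the database checks.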

The interesting problem is retrieval. Keyword search, say SQLite’s FTS5 with bm25 ranking, is precise and cheap, but brittle: it only finds conversations that used the exact words. Vector search embeds the query and finds semantically similar conversations even when the wording differs, though it’s weaker on exact names, dates, and identifiers.

Combining the two, often called hybrid search, is a well-supported pattern: the lexical and vector scores get normalised and blended. Some systems also fold in a recency signal so newer conversations rank a little higher. The weighting is a tuning choice that varies by system; one implementation blends keyword 0.5, vector 0.4, and recency 0.1.
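A blend along those lines might look like the following sketch. The 0.5 / 0.4 / 0.1 weights come from the text; the min-max normalisation, the `Candidate` type, and the field names are assumptions. One caveat worth knowing: SQLite's FTS5 `bm25()` returns lower values for better matches, so the keyword score would need flipping before it reaches this function.

```python
from dataclasses import dataclass

WEIGHTS = {"keyword": 0.5, "vector": 0.4, "recency": 0.1}


@dataclass
class Candidate:
    conversation_id: str
    keyword: float  # raw lexical score, already oriented so higher is better
    vector: float   # cosine similarity to the query embedding
    recency: float  # any monotone freshness signal, higher is newer


def min_max(values: list[float]) -> list[float]:
    """Rescale to [0, 1]; a leg with no spread contributes nothing."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def blend(candidates: list[Candidate]) -> list[tuple[str, float]]:
    """Return (conversation_id, blended score) pairs, best first."""
    if not candidates:
        return []
    legs = {name: min_max([getattr(c, name) for c in candidates]) for name in WEIGHTS}
    scored = [
        (c.conversation_id, sum(WEIGHTS[name] * legs[name][i] for name in WEIGHTS))
        for i, c in enumerate(candidates)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Normalising each leg before weighting matters because the raw scales differ: a BM25 score and a cosine similarity are not comparable until both sit in the same range.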

Hybrid recall: blend the legs

[Interactive demo: past conversations ranked for the query "when did I last talk to Alex about billing?". Each row shows its blended score, with a bar split by how much each leg (keyword, vector, recency) contributed: raw signal × weight.]

Mixed. The 0.5 / 0.4 / 0.1 default weights are real, taken from the hybrid search implementation. The three example conversations and their per-leg scores are illustrative.

Query expansion

Even a good hybrid search has a blind spot: it searches with the user’s exact phrasing. People rarely query with the words a memory was stored under. Someone asks “when did I last talk to Alex about billing?” while the conversation got filed under “the Stripe invoice thread.”

Query expansion closes that gap. Before searching, the system uses a cheap model call to rewrite the query into a few alternate phrasings, each coming at the need from a different angle: the people involved, the action, the timeframe, the topic. Each phrasing runs through the search, and the results get merged and deduplicated. The idea shows up in systems like Quarq’s agent, and it’s standard practice across many retrieval pipelines.

A query fanned into 3 phrasings might look like this:

QUERY: when did I last talk to Alex about billing?
  → Alex billing conversation history
  → search past chats with Alex regarding invoices
  → last mention of billing in messages with Alex

The second phrasing turns “billing” into “invoices” on its own, which is exactly the wording a literal search would miss. Done well, expansion only widens recall. If the rewrite call fails, the system falls back to the original query, so it can’t make search worse than the baseline.
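The fan-out, merge, and fallback can be sketched as one function. Here `rewrite` stands in for the cheap model call and `search` for whatever retrieval the system already has; both are hypothetical callables, and keeping the best score per result is just one reasonable way to merge.

```python
from typing import Callable


def expanded_search(
    query: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[tuple[str, float]]],
) -> list[tuple[str, float]]:
    """Search with the original query plus rewrites, merged and deduplicated."""
    try:
        phrasings = [query] + rewrite(query)
    except Exception:
        phrasings = [query]  # rewrite failed: fall back to the baseline search
    best: dict[str, float] = {}
    for phrasing in phrasings:
        for doc_id, score in search(phrasing):
            # Deduplicate by id, keeping the highest score seen for each result.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Because the original query is always in the list, every result the baseline search would have returned is still present; the rewrites can only add to it.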

Consolidation: keeping memory accurate

Memory shouldn’t only grow. People change jobs, preferences shift, projects start and end. A store that only appends will pile up duplicates and contradictions until retrieval suffers. So mature systems add a maintenance pass that runs in the background, after a response is delivered or while the agent is idle, so it adds no latency to the conversation.

This pass reviews recent conversations and proposes changes to long-term memory: new facts to add, and existing entries to update or remove. A bad memory edit can quietly distort behaviour on every later turn, so some systems keep a human in the loop, queuing proposed changes for review instead of applying them automatically. That costs a few seconds of review per change, against a class of errors that’s otherwise hard to catch.

Add, dedupe, or supersede

Pulling a candidate fact out of a conversation is the easy half. Deciding what to do with it, given everything already stored, is where it gets interesting. mem0 frames this as a small router that, for each new fact, compares it against the most similar existing memories and picks an action: add it as new, drop it as a duplicate, or supersede an existing entry by merging the two into one updated fact.

A useful optimisation gates the expensive step. The system first embeds the new fact and measures cosine similarity against its nearest stored neighbours. If the closest sits below a threshold, around 0.55 in some implementations, the fact is clearly novel and gets added without the router running at all, which saves a model call on the common case. The router only fires when something is similar enough to be ambiguous.

Supersede is the action that keeps memory honest. Instead of letting a corrected fact pile up beside the outdated one, it rewrites the old entry. That’s consolidation in the cognitive sense: a memory updated when it’s revisited rather than duplicated.
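Put together, the gate and the router might look like this sketch. It follows the mem0-style flow described above but is not mem0's code: the store layout, the `router` callable (standing in for the model call), and its return shape are assumptions. The 0.55 threshold is the figure quoted in the text and would need tuning per embedding model.

```python
import math
from typing import Callable

GATE = 0.55  # similarity threshold quoted in the text
TOP_K = 5    # how many nearest neighbours the router is shown

# router(new_fact, neighbour_texts) -> (action, neighbour index, merged text)
Router = Callable[[str, list[str]], tuple[str, int, str]]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reconcile(
    new_fact: str,
    new_vec: list[float],
    store: list[tuple[str, list[float]]],
    router: Router,
) -> str:
    """Update `store` in place and return the action taken."""
    ranked = sorted(
        ((cosine(new_vec, vec), i) for i, (_, vec) in enumerate(store)),
        reverse=True,
    )[:TOP_K]
    if not ranked or ranked[0][0] < GATE:
        store.append((new_fact, new_vec))  # clearly novel: no model call
        return "add"
    neighbours = [store[i][0] for _, i in ranked]
    action, pick, merged = router(new_fact, neighbours)
    if action == "supersede":
        # Simplification: reuse the new fact's vector; a real system would re-embed.
        store[ranked[pick][1]] = (merged, new_vec)
    elif action == "add":
        store.append((new_fact, new_vec))
    return action  # "dedupe" leaves the store untouched
```

The gate is what keeps this cheap: the router only runs when some stored entry is close enough for the answer to be ambiguous.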

The reconciler: add, dedupe, or supersede

new fact: User moved to Vienna in May 2026.
nearest neighbour: Lives in Budapest.
similarity below 0.55: add, no LLM call
similarity 0.55 or above: router decides

Grounded. The ~0.55 cosine gate, the top-5 neighbour comparison, and the add / dedupe / supersede actions follow a mem0-style reconciler. The Budapest to Vienna pair is an example to make the supersede case concrete.

Dreaming

The reconciler decides fact by fact, as memories arrive. There’s a heavier version that takes the whole store at once, and it has a fittingly biological name: dreaming. Memory consolidation in mammals happens largely during sleep, when the hippocampus replays the day’s activity to the cortex and what’s worth keeping gets written into long-term storage. The metaphor has been in machine learning for years: experience replay in reinforcement learning, or generative-replay methods that rehearse synthetic memories so a network doesn’t forget old tasks (Shin et al., 2017, framed explicitly after the hippocampus).

Anthropic took the word literally. Dreams, a research-preview feature for its managed agents, is consolidation as a first-class operation. An agent writes to its memory store incrementally as it works, and across many sessions that store accumulates duplicates, contradictions, and stale entries, the same drift the reconciler fights one fact at a time. A dream is the batch version: it reads the existing store alongside a set of past session transcripts (up to 100 of them) and produces a new, reorganised store, with duplicates merged, contradicted entries replaced by the latest value, and fresh insights pulled out of the transcripts.

The design choice that makes it safe is that the dream never touches its input. It writes the result to a separate store, so the rewrite becomes a candidate you inspect, then adopt or discard: attach the new store to future sessions, or delete it. It’s the same human-in-the-loop review from earlier, except instead of approving edits one at a time, you judge the whole reorganised store at once. The job runs asynchronously, taking minutes to tens of minutes, well outside any live turn, and you can steer it with plain instructions such as “focus on coding-style preferences, ignore one-off debugging notes.”

Dreaming here doesn’t touch the model’s weights the way generative replay does; the model stays fixed. It just reorganises the external store the model reads at inference time. The name is about timing and role, offline work done while nothing’s waiting on it, rather than the mechanism itself.
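The "never touch the input" property is easy to show in miniature. This sketch is not Anthropic's API: `run_dream` and `drop_exact_duplicates` are made-up names, and the consolidator here is a trivial stand-in for what would be a model call. Only the 100-transcript limit comes from the text.

```python
from typing import Callable, Sequence

MAX_TRANSCRIPTS = 100  # per-dream limit mentioned in the text

Consolidator = Callable[[tuple[str, ...], tuple[str, ...]], list[str]]


def run_dream(
    store: Sequence[str],
    transcripts: Sequence[str],
    consolidate: Consolidator,
) -> list[str]:
    """Produce a candidate store; the input store is never modified."""
    if len(transcripts) > MAX_TRANSCRIPTS:
        raise ValueError("too many transcripts for one dream")
    # Tuples are immutable, so the consolidator cannot edit the live store.
    return list(consolidate(tuple(store), tuple(transcripts)))


def drop_exact_duplicates(store: tuple[str, ...], transcripts: tuple[str, ...]) -> list[str]:
    """Stand-in consolidator: keeps the first copy of each entry, in order."""
    return list(dict.fromkeys(store))


live = ["Lives in Budapest.", "Prefers TypeScript.", "Prefers TypeScript.", "Uses pnpm."]
candidate = run_dream(live, ["session 1 transcript"], drop_exact_duplicates)
# `live` is unchanged; adopting the result is a separate, explicit step.
```

Adoption then becomes a plain assignment after review (`live = candidate`), and discarding is simply not doing that.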

Run a dream over a memory store

Input store (unchanged)

  • Lives in Budapest.
  • Moved to Vienna in May 2026.
  • Prefers TypeScript.
  • Prefers TypeScript.
  • Uses pnpm.

Candidate store (review first)

  • (empty until the dream runs and produces a reorganised store)

The store has drifted: a stale fact, a duplicate, nothing tying them together. Run a dream to clean it up.

The input-store-plus-sessions inputs, the separate reviewable output, and the merge / supersede / surface-insight behaviour follow Anthropic's Dreams feature for managed agents. The specific entries are an example.

Procedural memory and external knowledge

The last category, procedural memory, covers know-how rather than facts: reusable routines the agent can invoke. Many frameworks implement this as “skills” or tools, self-contained procedures loaded on demand instead of kept in the prompt the whole time. That mirrors human procedural memory, which stays dormant until a task calls for it. A maintenance pass can even propose new skills from procedures the agent worked out during a task.
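Loading on demand usually means keeping a short index resident and fetching the full procedure only when it is needed. The registry below is a hypothetical sketch of that idea; the class, its method names, and the choice to keep one-line summaries in the prompt are assumptions, not any framework's API.

```python
from typing import Callable


class SkillRegistry:
    """Keeps one-line summaries resident and loads full skill text on demand."""

    def __init__(self) -> None:
        self._skills: dict[str, tuple[str, Callable[[], str]]] = {}
        self._cache: dict[str, str] = {}

    def register(self, name: str, summary: str, loader: Callable[[], str]) -> None:
        """Record a skill without reading its body."""
        self._skills[name] = (summary, loader)

    def index(self) -> str:
        """Short listing, small enough to include in every prompt."""
        return "\n".join(
            f"{name}: {summary}" for name, (summary, _) in sorted(self._skills.items())
        )

    def load(self, name: str) -> str:
        """Return the full procedure, calling the loader only on first use."""
        if name not in self._cache:
            _, loader = self._skills[name]
            self._cache[name] = loader()
        return self._cache[name]
```

A loader would typically read a file from disk; passing it as a callable keeps the registry itself free of any I/O until `load` is called.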

Some knowledge fits none of these stores neatly: a worked-out design, a synthesis of research, a reference document. Agents increasingly keep this in a structured knowledge base, often a set of linked notes or a wiki, with edits gated behind human approval because the content is durable and open-ended. Linking pages to one another turns the knowledge into a graph you can navigate instead of a heap of disconnected notes.

Putting it together

An agent’s memory is a stack of cooperating layers: a context window for the moment, a small curated store for durable facts, a searchable log of past conversations, skills for know-how, and a background process that keeps the whole thing consolidated. None of the individual pieces is exotic. Most come from existing systems and from decades-old ideas about how memory works.

One turn, through the layers

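[Diagram: the user message enters the context window (working memory), which feeds the model, which produces the response. Three stores are read into the prompt: the semantic store of facts (injected every turn), the episodic log of past chats (via retrieval: hybrid + query expansion), and procedural skills (loaded on demand). A background consolidation step (add · dedupe · supersede) writes back to the stores. Legend: read into the prompt; written back in the background.]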

What makes an agent feel like it remembers is the wiring between those layers, and the unglamorous decisions around it: which layer is allowed to fail, what a human gets to review, and when a stale fact gets rewritten instead of left to pile up. Get those right and the model stops feeling stateless, even though it still is.

Sources

The interactive figures run entirely in your browser. Each one notes whether its numbers are measured from a real system or illustrative, with the assumptions stated underneath.

  • The episodic / semantic / procedural split goes back to Endel Tulving, “Episodic and Semantic Memory” (1972).
  • The two-file store and the character caps are from Hermes Agent’s memory docs (Nous Research).
  • Hermes Agent’s context-compression docs cover the compaction itself (head, middle, tail; the middle is summarised). The memory-flush-before-summary refinement is described in write-ups of Hermes such as this explainer; making the flush best-effort and showing it the current memory are implementation choices.
  • The add / dedupe / supersede reconciliation and the cosine gate follow mem0.
  • Dreaming, offline memory consolidation that writes a reviewable store separate from the original, follows Anthropic’s Dreams feature for managed agents.
  • The generative-replay framing of “dreaming” is from Shin et al. (2017), Continual Learning with Deep Generative Replay.
  • Query expansion and the layered memory framing draw on Quarq’s agent.
  • Hybrid retrieval, blending bm25 keyword search with vector cosine similarity, is standard practice across retrieval pipelines; the recency weighting is an implementation choice rather than a norm.