The Token Economy - Engineering an AI's Working Memory
Welcome back. In our previous post, we established the fundamental challenge: Large Language Models are stateless by design. Their conversational abilities are an illusion, carefully crafted through meticulous Context Engineering. Now we’re getting into the first pillar of that illusion: the Short-Term Conversational Memory (STCM).
Think of the STCM as the AI’s workbench. It’s where the immediate ingredients for the current conversational “dish” are laid out, organised, ready for the LLM “chef” to use. Like any workbench, the space is finite and needs constant tidying to stay efficient.
The First Trick: Context Persistence
When you chat with an AI, you expect it to remember what you just said. Ask “What’s the capital of France?” then follow up with “And how about its population?” and you expect “its” to refer to France. That seemingly basic expectation requires us to explicitly inject the previous turn (the user’s question and the model’s answer) into the prompt of the second turn.
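In practice that means the second request carries the first turn along with it. A minimal sketch of the idea (`buildPrompt` is a hypothetical helper; a real application would read the history from storage):

```javascript
// Each new request re-sends the prior turns so the model can resolve
// references like "its". buildPrompt is a hypothetical helper, not
// part of the sample application's codebase.
function buildPrompt(history, newUserMessage) {
  return [...history, { role: 'user', content: newUserMessage }];
}

const history = [
  { role: 'user', content: "What's the capital of France?" },
  { role: 'model', content: 'The capital of France is Paris.' },
];

// The LLM now sees all three messages, so "its" clearly means France.
const prompt = buildPrompt(history, 'And how about its population?');
```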
The simplest approach to maintaining conversational state would be keeping the entire chat history in a JavaScript array in memory. That falls apart quickly:
- Process Restarts: If the application crashes or restarts, the entire conversation history vanishes. The illusion of memory, gone.
- Multi-User Scenarios: For a production system handling multiple simultaneous conversations, an in-memory array per user becomes a resource-intensive headache.
So we need a persistent store. For the sample application (revealed in the final post), I’ve opted for SQLite. I won’t bore you with database internals (that’s a different blog series), but the why matters for our AI system design:
- Embedded and Lightweight: SQLite is a serverless, single-file database. It lives inside our application, making it fast for local access and trivial to deploy. That low overhead is crucial for performance-sensitive AI applications. (In this pet project it matters more than you’d think.)
- Structured Storage for AI Prompts: Unlike a raw JSON file, SQLite provides structured tables. Every piece of a conversational turn (role, content, timestamp, conversation_id) is stored consistently, ready to be reconstructed into the precise format the LLM API expects.
I pointed out in the previous post that any other local or hosted database system could work here. Use your favourite RDBMS or NoSQL database. MongoDB would be a particularly good fit given its flexible document storage, which maps almost 1:1 to the JSON format expected by the LLM API.
Crafting the Prompt: Our STCM Schema
The short-term memory revolves around two tables within our memory.db file: messages and conversation_summaries. The design of these tables is driven by the format modern LLMs expect for conversational turns.
```javascript
// memory/shortTermMemoryStore.js
import sqlite3 from 'sqlite3';
import { open } from 'sqlite';

export class ShortTermMemoryStore {
  // ... constructor ...

  async init() {
    this.db = await open({
      filename: './memory.db',
      driver: sqlite3.Database
    });

    await this.db.exec(`
      -- Stores metadata about each unique conversation
      CREATE TABLE IF NOT EXISTS conversations (
        id TEXT PRIMARY KEY,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
      );

      -- Stores individual messages (user or AI) within a conversation
      CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT,
        role TEXT, -- 'user' or 'model' - CRUCIAL for LLM interaction
        content TEXT,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (conversation_id) REFERENCES conversations (id)
      );

      -- Stores AI-generated summaries of parts of a conversation
      CREATE TABLE IF NOT EXISTS conversation_summaries (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT,
        summary TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (conversation_id) REFERENCES conversations (id)
      );
    `);
  }

  // ...
}
```
The Power of the role Field
Notice the role column in the messages table. That’s not an arbitrary label. It directly reflects how LLMs like Gemini or GPT models are designed to be prompted in conversation: messages from the user are labelled user, the model’s responses are labelled model.
Note that we’re ignoring tool calling responses here as this basic application doesn’t have tool calling capabilities, but the database and code can easily be extended to accommodate that.
When we retrieve messages from SQLite, we reconstruct them into an array of objects like this:
```json
[
  { "role": "user", "content": "What is the capital of France?" },
  { "role": "model", "content": "The capital of France is Paris." },
  { "role": "user", "content": "What currency do they use there?" }
]
```
This structured format is critical. Without it, the LLM can’t differentiate who said what, leading to confusing responses: impersonating the user, repeating its own questions. Our SQLite schema directly supports this foundational requirement.
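Reconstructing that array is mostly a mapping exercise, but the exact wire shape varies by provider: OpenAI-style APIs accept `{ role, content }` objects directly, while Gemini wraps the text in a `parts` array. A sketch of an adapter for the latter (the function name is ours, not from the sample application):

```javascript
// Adapt rows from our messages table to the Gemini-style `contents`
// shape. Our schema's 'user'/'model' roles already match what Gemini
// expects, so only the text needs re-wrapping.
function toGeminiContents(rows) {
  return rows.map(({ role, content }) => ({
    role,
    parts: [{ text: content }],
  }));
}
```

Because the schema stores `role` in the provider's own vocabulary, adapters like this stay trivial.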
The Token Economy: The Expanding Context Window Problem
Even with structured storage, a major challenge remains: the ever-expanding conversation history. Every message bolted on means more tokens sent to the LLM with each subsequent turn. This escalates fast into a problem of token economics.
What Is a Token?
A token isn’t simply a word. It’s a fundamental unit of text that LLMs process. For English, a token can be part of a word, a whole word, or even a few words. More complex words or non-English languages often require more tokens. For example:
- “Hello” = 1 token
- “serendipity” = several tokens (the exact count depends on the model’s tokenizer)
- Emojis and punctuation often consume one or more tokens each; whitespace is usually folded into neighbouring tokens.
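Exact counts come from the model’s tokenizer, so use the provider’s own tool (tiktoken for OpenAI models, for example) when precision matters. For rough budgeting, a common rule of thumb for English is about four characters per token; a sketch:

```javascript
// Rough budgeting heuristic: ~4 characters per token for English text.
// This is an approximation only -- real counts come from the model's
// tokenizer and can differ noticeably for code or non-English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const turns = [
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'model', content: 'The capital of France is Paris.' },
];
const budget = turns.reduce((sum, m) => sum + estimateTokens(m.content), 0);
```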
Why does this matter?
- Context Window Limits: Every LLM has a hard limit on the number of tokens it can process in a single prompt. Exceed it and the request fails, or your application is forced to drop the oldest parts of the conversation. Either way, the AI literally forgets the beginning of the chat because it no longer fits into its working memory.
- Cost: LLM API calls are priced per token (some LLMs price input and output tokens separately). Longer context means a higher bill for every single interaction.
- Latency: Larger prompts take longer to process, increasing response times and degrading user experience.
- Context Rot: The longer the context window gets, the less reliable the LLM’s answers become, as shown in this research.
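To see why this escalates so fast, note that the full history is re-sent on every turn, so cumulative input tokens grow quadratically with conversation length. A back-of-envelope sketch (the per-turn figure is made up for illustration):

```javascript
// If each turn adds ~tokensPerTurn tokens of history and the whole
// history is re-sent on every turn, turn t costs t * tokensPerTurn
// input tokens -- so the cumulative total grows quadratically.
function totalInputTokens(turns, tokensPerTurn) {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += t * tokensPerTurn;
  }
  return total;
}

// 10 turns at ~100 tokens each: 5,500 cumulative input tokens,
// not the 1,000 you might naively expect.
const cumulative = totalInputTokens(10, 100);
```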
That’s the “context window wall.” Hitting it means higher costs, slower responses, and an AI that becomes increasingly incoherent as it loses its past. This is where Context Engineering needs to be particularly clever.
AI to the Rescue: Intelligent Summarisation
The solution to the token economy problem: use the AI itself to manage its own memory. Instead of sending the full, raw history indefinitely, we employ intelligent summarisation.
In the ConversationManager, after every few turns (I’ve set a threshold of 5 messages for demo purposes), a background process kicks off to summarise the recent conversation.
```javascript
// conversationManager.js

// ... in handleMessage method ...
const messageCount = await this.shortTermMemory.getMessageCount(session.conversationId);

if (messageCount > 0 && messageCount % 5 === 0) { // Check if it's time to summarise
  this.performSummarization(session.conversationId);
}
// ...

async performSummarization(conversationId) {
  console.log(`Summarizing conversation ${conversationId}...`);
  // Crucially, this is fire-and-forget. We don't await its completion.
  this.summaryProcessor.summarizeAndCleanup(conversationId);
}
```
The Asynchronous Advantage
Notice this.summaryProcessor.summarizeAndCleanup(conversationId); called without await. That’s a deliberate design choice. Summarisation is an LLM call itself, which introduces latency. We don’t want the user waiting while the AI “thinks” about summarising before they get a response to their latest query. By making it fire-and-forget, the main conversational thread stays snappy. The summary gets created in the background, ready for future turns.
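One caveat worth adding: in Node, a rejected promise that nobody handles can crash the process (unhandled rejections are fatal by default in recent versions). A small wrapper keeps fire-and-forget safe; this is our own sketch, not code from the sample application:

```javascript
// Fire-and-forget safely: attach a catch handler so a failed background
// task (like summarisation) logs an error instead of crashing the process.
function fireAndForget(promise, label) {
  promise.catch((err) => {
    // Best-effort: the raw messages are still in the store, so a
    // failed summarisation can simply be retried on a later turn.
    console.error(`${label} failed:`, err);
  });
}

// Usage inside performSummarization:
// fireAndForget(this.summaryProcessor.summarizeAndCleanup(id), 'summarization');
```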
The SummaryProcessor: An AI for AI Memory Management
The SummaryProcessor is the unsung hero of short-term memory. Its job: transform verbose conversational history into concise, token-efficient summaries. It’s an example of using a “meta-LLM,” an AI-powered process that optimises the performance of the primary AI interaction.
```javascript
// processing/summaryProcessor.js

// ... constructor with shortTermMemory, longTermMemory, and model ...

async summarizeAndCleanup(conversationId) {
  // 1. Retrieve the messages that need summarising
  const recentMessages = await this.shortTermMemory.getMessagesForSummarization(conversationId);
  if (recentMessages.length === 0) {
    return; // Nothing to summarise
  }

  // 2. Craft a prompt for the LLM to perform the summarisation
  //    (Actual prompt construction details would be in this.model.summarize)
  const summary = await this.model.summarize(recentMessages);

  // 3. Store the summary in both short-term (for current context) and
  //    long-term (for cross-conversation memory)
  await this.shortTermMemory.saveSummary(conversationId, summary);
  await this.longTermMemory.storeSummary(conversationId, summary); // More on this in Part 3!

  // 4. Critically: prune the original messages to reclaim context window space
  const messageIds = recentMessages.map(m => m.id);
  await this.shortTermMemory.deleteMessages(messageIds);

  console.log(`Successfully summarised and cleaned up ${messageIds.length} messages for conversation ${conversationId}.`);
}

// ...
```
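The actual prompt lives inside this.model.summarize and isn’t shown here. For illustration, a hedged sketch of what such a prompt builder might look like (the wording is ours, not the sample application’s actual prompt):

```javascript
// Build a summarisation prompt that asks for semantic abstraction,
// not just shorter text: keep names, dates, decisions, preferences.
function buildSummaryPrompt(messages) {
  const transcript = messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');
  return [
    'Summarise the conversation below in a few sentences.',
    'Retain all names, dates, key decisions, and stated user preferences.',
    'Write it as a factual note, not as dialogue.',
    '',
    transcript,
  ].join('\n');
}
```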
Abstraction, Not Just Compression
This isn’t merely shortening text. It’s semantic abstraction. We’re using the LLM’s understanding capabilities to extract the core meaning and salient points from a chunk of dialogue. A good summarisation prompt (more on prompt engineering in Part 4) instructs the LLM not just to reduce word count, but to retain critical information: names, dates, key decisions, user preferences. The summary contains far more “meaning per token” than the raw messages, making it a highly efficient form of context.
By replacing, say, 500 tokens of raw messages with a 50-token summary, we’ve achieved a 90% reduction in context window footprint for that segment, without losing the essential thread. That’s pure gold in the token economy.
If you’re a Claude Code user, you may have seen this in action. Claude Code uses a similar technique to manage the context window, but with more sophisticated summarisation: it triggers when context usage hits 90–95%, which makes it a form of dynamic context management. They call the process compacting.
Another thing to consider: different LLMs can be used for different purposes. You could use a “pro” model for summarisation and a “basic” model for chat. This lets you optimise for different tasks and costs.
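Routing can be as simple as a lookup keyed by task. The model names below are placeholders, not recommendations:

```javascript
// Map each task to a model tier. Names are placeholders -- substitute
// whatever your provider offers for "cheap and fast" vs "higher quality".
const MODEL_FOR_TASK = {
  chat: 'basic-model',        // runs on every turn, latency-sensitive
  summarization: 'pro-model', // background task, quality per token matters
};

function modelFor(task) {
  return MODEL_FOR_TASK[task] ?? MODEL_FOR_TASK.chat;
}
```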
What’s Next? The Limits of the “Now”
We’ve now built a token-efficient system for managing the active conversation. The AI has a keen, tidy working memory. It can chat for extended periods without forgetting what you just said, all while keeping costs in check.
But the STCM, by design, remains a conversational echo chamber. It knows everything about this chat, but nothing about that chat you had last week. To push our AI beyond this limitation, to give it enduring knowledge that spans all interactions, we need a different kind of memory.
In Part 3, we’ll get into the Long-Term Semantic Memory: vector embeddings, cloud-based storage, and searching by meaning. Stay tuned.