The Token Economy - Engineering an AI's Working Memory
Welcome back! In our previous post, we established the fundamental challenge: Large Language Models (LLMs) are, by design, largely stateless. Their impressive conversational abilities are an illusion, carefully crafted by us, the engineers, through meticulous Context Engineering. Today, we’re diving deep into the first pillar of this illusion: the Short-Term Conversational Memory (STCM).
Think of the STCM as your AI’s personal workbench. It’s where the immediate ingredients for the current conversational “dish” are laid out, neatly organised and ready for the LLM “chef” to use. Just like a chef’s workbench, this space is finite and needs constant management to remain efficient.
The Illusionist’s First Trick: Context Persistence
When you chat with an AI, you expect it to remember what you just said. If you ask, “What’s the capital of France?” and then immediately follow up with, “And how about its population?”, you expect “its” to refer to France. Meeting this seemingly basic expectation requires us to explicitly inject the previous turn (both the user’s question, “What’s the capital of France?”, and the model’s answer, “Paris”) into the prompt of the second turn.
The simplest way to maintain this conversational state would be to keep the entire chat history in a JavaScript array in memory. However, this poses immediate problems for robustness and scalability:
- Process Restarts: If our application crashes or is restarted, the entire conversation history would vanish, breaking the illusion of memory.
- Multi-User Scenarios: For a production system handling multiple simultaneous conversations, an in-memory array per user becomes a complex, resource-intensive headache.
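To make the problem concrete, here is a minimal sketch of that naive in-memory approach. The names (history, chat, callLLM) are hypothetical placeholders, not code from the sample application:
// A naive in-memory approach (hypothetical sketch, not the sample app's code)
const history = []; // lives only as long as the Node.js process does

async function chat(userMessage, callLLM) {
  // Every turn, the FULL history is replayed to the stateless LLM
  history.push({ role: 'user', content: userMessage });
  const reply = await callLLM(history);
  history.push({ role: 'model', content: reply });
  return reply;
}
It works for a single user in a single process, but the moment the process restarts or a second user arrives, the cracks listed above start to show.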
So, we need a persistent store. For the sample application (which will be revealed in the final post), I have opted for SQLite. While I won’t bore you with the intricate details of database systems (that’s a different blog series entirely!), it’s important to understand why this choice is relevant to our AI system design:
- Embedded & Lightweight: SQLite is a serverless, single-file database. It lives directly within our application, making it incredibly fast for local access and trivial to deploy. This low overhead is crucial for performance-sensitive AI applications where every millisecond counts. (note that in this pet project this matters a lot)
- Structured Storage for AI Prompts: Unlike a raw JSON file, SQLite provides structured tables. This ensures that every piece of a conversational turn (role, content, timestamp, conversation_id) is stored consistently, ready to be reconstructed into the precise format required by the LLM API.
As I pointed out in the previous post, any other local or hosted database system could have been used; if you prefer, use your favourite RDBMS or NoSQL database. In fact, a document database such as MongoDB would be an excellent fit, because its flexible way of storing data could map 1:1 to the JSON format expected by the LLM API.
Creating the Prompt: Our STCM Schema
Our short term memory primarily revolves around two tables within our memory.db file: messages and conversation_summaries. The design of these tables is driven by the format modern LLMs expect for conversational turns.
// memory/shortTermMemoryStore.js
import sqlite3 from 'sqlite3';
import { open } from 'sqlite';

export class ShortTermMemoryStore {
  // ... constructor ...

  async init() {
    this.db = await open({
      filename: './memory.db',
      driver: sqlite3.Database
    });

    await this.db.exec(`
      -- Stores metadata about each unique conversation
      CREATE TABLE IF NOT EXISTS conversations (
        id TEXT PRIMARY KEY,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
      );

      -- Stores individual messages (user or AI) within a conversation
      CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT,
        role TEXT, -- 'user' or 'model' - CRUCIAL for LLM interaction
        content TEXT,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (conversation_id) REFERENCES conversations (id)
      );

      -- Stores AI-generated summaries of parts of a conversation
      CREATE TABLE IF NOT EXISTS conversation_summaries (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT,
        summary TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (conversation_id) REFERENCES conversations (id)
      );
    `);
  }

  // ...
}

Deep Dive: The Power of the role field
Notice the role column in the messages table. This isn’t just an arbitrary label. It’s a direct reflection of how LLMs like Gemini or the GPT models are designed to be prompted in a conversation: any message sent by the user is labelled user, and the model’s response is labelled model (that’s Gemini’s convention; the OpenAI APIs use assistant for the same role).
Note that we are ignoring tool-calling responses here, as this basic application has no tool-calling capabilities, but the database and the code can easily be extended to accommodate them.
When we retrieve messages from SQLite, we reconstruct them into an array of objects like this:
[
  { "role": "user", "content": "What is the capital of France?" },
  { "role": "model", "content": "The capital of France is Paris." },
  { "role": "user", "content": "What currency do they use there?" }
]

This structured format is absolutely critical. Without it, the LLM struggles to differentiate who said what, leading to confusing or unhelpful responses, such as impersonating the user or repeating its own questions. Our SQLite schema directly supports this foundational requirement of LLM prompting.
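As a rough illustration of that reconstruction step, here is a sketch of what a retrieval method on ShortTermMemoryStore could look like; the method name and query are assumptions for illustration, not the final implementation:
// memory/shortTermMemoryStore.js (sketch - assumed method, not the final code)
async getConversationHistory(conversationId) {
  const rows = await this.db.all(
    `SELECT role, content FROM messages
     WHERE conversation_id = ?
     ORDER BY timestamp ASC`,
    conversationId
  );
  // Map database rows into the { role, content } shape the LLM API expects
  return rows.map(row => ({ role: row.role, content: row.content }));
}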
The Token Economy: The Scourge of the Expanding Context Window
Even with our structured storage, a significant challenge remains: the ever-expanding conversation history. Every message added means more tokens sent to the LLM with each subsequent turn. This rapidly escalates into a problem of token economics.
Deep Dive: What is a Token?
A token isn’t simply a word. It’s the fundamental unit of text that LLMs process. In English, a token can be part of a word, a whole word, or a piece of punctuation. More complex words and non-English languages often require more tokens. For example:
- “Hello” = 1 token
- “serendipity” = typically 3–4 tokens, depending on the tokenizer
- Emojis and punctuation often take one or more tokens of their own, and even whitespace counts towards the total.
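Exact counts depend on the model’s tokenizer, but for budgeting purposes a rough estimate is often enough. Below is a minimal sketch using the common “roughly 4 characters per token” rule of thumb for English; for precise numbers you would use the provider’s own tokenizer or token-counting endpoint:
// Rough token estimate - an approximation, not a real tokenizer
function estimateTokens(text) {
  // ~4 characters per token is a common rule of thumb for English text
  return Math.ceil(text.length / 4);
}

const turns = [
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'model', content: 'The capital of France is Paris.' }
];
const estimated = turns.reduce((sum, msg) => sum + estimateTokens(msg.content), 0);
console.log(`Estimated prompt size: ~${estimated} tokens`);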
Why does this matter?
- Context Window Limits: Every LLM has a hard limit on the number of tokens it can process in a single prompt. Exceed it and something has to give: either the request fails or the oldest parts of your conversation are dropped before they reach the model. The AI literally “forgets” the beginning of the chat because it can no longer fit into its working memory.
- Cost: LLM API calls are priced per token (note some LLMs price input and output tokens separately). A longer context means a higher bill for every single interaction.
- Latency: Larger prompts take longer for the LLM to process, increasing response times and degrading user experience.
- Context Rot: The longer the context grows, the less reliable the LLM’s answers tend to become, as recent research into “context rot” has shown.
This is the “context window wall,” and hitting it means higher costs, slower responses, and an AI that becomes increasingly incoherent as it loses its past. This is where our Context Engineering needs to be particularly clever.
AI to the Rescue: Intelligent Summarisation
Our solution to the token economy problem is to use the AI itself to manage its own memory. Instead of sending the full, raw history indefinitely, we employ a sophisticated strategy of intelligent summarisation.
In our ConversationManager, after every few turns (I have set a threshold of 5 messages for demo purposes), we trigger a background process to summarise the recent conversation.
// conversationManager.js
// ... in handleMessage method ...
const messageCount = await this.shortTermMemory.getMessageCount(session.conversationId);

if (messageCount > 0 && messageCount % 5 === 0) { // Check if it's time to summarise
  this.performSummarization(session.conversationId);
}

// ...

async performSummarization(conversationId) {
  console.log(`Summarizing conversation ${conversationId}...`);
  // Crucially, this is fire-and-forget. We don't await its completion.
  this.summaryProcessor.summarizeAndCleanup(conversationId);
}

Deep Dive: The Asynchronous Advantage & AI UX
Notice the this.summaryProcessor.summarizeAndCleanup(conversationId); call without an await. This is a deliberate AI UX design choice. Summarisation is an LLM call itself, which can introduce latency. We don’t want the user to wait for the AI to “think” about summarising before getting a response to their latest query. By making it “fire-and-forget,” we keep the main conversational thread snappy and responsive, enhancing the perceived performance for the user. The summary is created in the background, ready for future turns.
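One practical caveat with fire-and-forget promises: if the background summarisation rejects, Node.js will report an unhandled rejection (and, on recent versions, may terminate the process). A defensive variant, sketched here as an assumption rather than the sample app’s exact code, simply attaches a catch:
// conversationManager.js (sketch - defensive wrapper around the background call)
performSummarization(conversationId) {
  console.log(`Summarizing conversation ${conversationId}...`);
  // Deliberately not awaited, but the catch ensures a failed summarisation
  // never crashes the main conversational flow.
  this.summaryProcessor
    .summarizeAndCleanup(conversationId)
    .catch(err => console.error(`Summarisation failed for ${conversationId}:`, err));
}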
The SummaryProcessor: An AI for AI Memory Management
Our SummaryProcessor is the unsung hero of the short-term memory. Its mission is to transform verbose conversational history into concise, token-efficient summaries. It’s an example of using a “meta-LLM”: an AI-powered process to optimise the performance of the primary AI interaction.
// processing/summaryProcessor.js
// ... constructor with shortTermMemory, longTermMemory, and model ...

async summarizeAndCleanup(conversationId) {
  // 1. Retrieve the messages that need summarising
  const recentMessages = await this.shortTermMemory.getMessagesForSummarization(conversationId);
  if (recentMessages.length === 0) {
    return; // Nothing to summarise
  }

  // 2. Craft a prompt for the LLM to perform the summarisation
  //    (Actual prompt construction details would be in this.model.summarize)
  const summary = await this.model.summarize(recentMessages);

  // 3. Store the summary in both short-term (for current context) and long-term (for cross-convo memory)
  await this.shortTermMemory.saveSummary(conversationId, summary);
  await this.longTermMemory.storeSummary(conversationId, summary); // More on this in Part 3!

  // 4. Critically: Prune the original messages to reclaim context window space
  const messageIds = recentMessages.map(m => m.id);
  await this.shortTermMemory.deleteMessages(messageIds);

  console.log(`Successfully summarised and cleaned up ${messageIds.length} messages for conversation ${conversationId}.`);
}

// ...

Deep Dive: Abstraction, Not Just Compression
This isn’t merely about shortening text. It’s about semantic abstraction. We’re leveraging the LLM’s understanding capabilities to extract the core meaning and salient points from a chunk of dialogue. A good summarisation prompt (which we’ll explore in more detail when we discuss prompt engineering in Part 4) instructs the LLM not just to reduce word count, but to retain critical information: names, dates, key decisions, user preferences. This means the summary contains far more “meaning per token” than the raw messages, making it a highly efficient form of context.
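To make that concrete ahead of Part 4, here is one possible shape for this.model.summarize. It’s a hedged sketch only: the prompt wording and the generate() call are placeholders for whatever LLM client you actually use, not the prompt we’ll build in Part 4:
// model/summarizer.js (sketch - hypothetical prompt, not the one from Part 4)
async summarize(messages) {
  const transcript = messages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');

  const prompt = `Summarise the conversation below in a few sentences.
Retain names, dates, key decisions and stated user preferences.
Omit greetings and filler.

Conversation:
${transcript}`;

  // generate() stands in for the actual LLM client call
  return this.generate(prompt);
}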
By replacing, say, 500 tokens of raw messages with a 50-token summary, we’ve achieved a 90% reduction in context window footprint for that segment, without losing the essential thread of the conversation. This is pure gold in the token economy.
If you are a user of Claude Code, you may have seen this in action. Claude Code uses a similar technique to manage its context window, but with more sophisticated summarisation: it kicks in when context usage reaches roughly 90-95% of the window, a great approach because it makes the context management dynamic. They call the process compacting.
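If you wanted to mimic that dynamic behaviour instead of a fixed message count, the trigger could be based on estimated context usage rather than a hard-coded modulo check. A sketch, reusing the rough estimateTokens helper and the getConversationHistory method sketched earlier, with an assumed context window size:
// conversationManager.js (sketch - dynamic trigger, an alternative to the fixed threshold)
const CONTEXT_WINDOW_TOKENS = 32000;   // assumed model limit
const COMPACTION_THRESHOLD = 0.9;      // compact at ~90% estimated usage

const history = await this.shortTermMemory.getConversationHistory(session.conversationId);
const usedTokens = history.reduce((sum, msg) => sum + estimateTokens(msg.content), 0);

if (usedTokens > CONTEXT_WINDOW_TOKENS * COMPACTION_THRESHOLD) {
  this.performSummarization(session.conversationId);
}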
Another thing to consider is that different LLMs can be used for different purposes. For example, you could use a “pro” model for summarisation and a “basic” model for chat. This allows you to optimise for different tasks and costs.
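A minimal sketch of what that split could look like as configuration; the model names below are placeholders, not recommendations:
// config/models.js (sketch - placeholder model names)
export const MODELS = {
  chat: 'basic-model-v1',        // fast and cheap: handles the live conversation
  summarisation: 'pro-model-v1'  // stronger: distils history into dense summaries
};
// e.g. the SummaryProcessor would be constructed with a client for MODELS.summarisation,
// while the ConversationManager talks to a client for MODELS.chat.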
What’s Next? The Limits of the “Now”
We’ve now built a robust, token-efficient system for managing the active conversation. Our AI has a keen, tidy working memory. It can chat for extended periods without forgetting what you just said, all while keeping costs in check.
However, this STCM, by its very design, remains a conversational echo chamber. It knows everything about this chat, but nothing about that chat you had last week. To elevate our AI beyond this limitation, to give it true, enduring knowledge that spans across all interactions, we need a different kind of memory.
In Part 3, we’ll embark on a journey into the vast, semantic landscape of Long-Term Semantic Memory, where we’ll explore vector embeddings, cloud-based storage, and the revolutionary concept of searching by meaning. Stay tuned!