The Grand Orchestration - Engineering a Dual-Memory AI for Enduring Conversations
Picture this. You pour someone a glass of milk. They remind you they’re lactose intolerant. You apologise. Next time you meet them, you do it again.
That’s what it feels like when a conversational AI agent forgets everything between sessions. In this article (and the follow-up series) I’ll walk you through a dual-memory system. We’ll cover the why and how with the detail that Context Engineering properly deserves.
If you’ve chatted with a large language model, you’ve probably been impressed by its fluency, creativity, and sheer breadth of knowledge. It can summarise books, write poetry, explain quantum physics. Yet ask it about something you said five turns ago and it draws a blank. That’s not a bug. It’s a fundamental characteristic of how these models work.
The Stateless Problem: Why LLMs Forget
At their core, most LLMs are stateless. Every interaction, every tool call, is treated as an independent event. The model doesn’t inherently “remember” previous turns. It’s like talking to someone with complete amnesia between conversations.
So how do we make them appear conversational? We feed them the entire conversation history with every prompt. Copy and paste everything said so far into the input window for each new query. That’s the simplest form of “memory,” and it works. Until it doesn’t.
This naive approach slams into several real, costly problems:
- The Context Window Wall: LLMs have a finite input length, the context window. If your conversation exceeds this limit (8k, 16k, 32k, or even 128k tokens), you’re forced to truncate older messages. The AI literally forgets the beginning of the chat.
- The Token Tax: Every word you send costs money (or compute cycles). A longer history means a larger prompt, more tokens processed, a higher bill. You can’t ignore this in practical AI system design.
- Latency and Performance: Enormous prompts take longer to transmit and longer to process. A sluggish AI is a frustrating AI.
- Cognitive Overload (for the AI): Even within the context window, a massive amount of raw, unorganised text dilutes the AI’s focus. It has to sift through everything to find the relevant bits, potentially producing less coherent responses. This is often called “context rot”.
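To make the naive approach concrete, here is a minimal sketch (all names are illustrative, and token counting is crudely approximated at roughly four characters per token; real systems use the model's own tokenizer): concatenate every turn into one prompt and, once a token budget is exceeded, silently drop the oldest messages.

```javascript
// Naive "memory": resend the whole history every turn, truncating from the
// front when it no longer fits the model's context window.
const approxTokens = (text) => Math.ceil(text.length / 4);

function buildNaivePrompt(history, maxTokens) {
  // Walk backwards from the newest message so recent turns survive.
  const kept = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const line = `${history[i].role}: ${history[i].content}`;
    const cost = approxTokens(line);
    if (used + cost > maxTokens) break; // the context window wall
    kept.unshift(line);
    used += cost;
  }
  return kept.join('\n');
}
```

Everything older than the cut-off simply vanishes, including "I'm lactose intolerant" from turn one.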
That’s the heart of the Context Engineering challenge. The goal: craft an illusion of continuous memory, not by overloading the AI, but by intelligently curating and injecting the most relevant information at the most opportune moment.
The Grand Vision: A Dual-Memory AI System
The application we’ll build mimics human memory more closely through a dual-memory architecture. It goes beyond storing data. It’s about intelligent memory management that feeds precisely what the AI needs, when it needs it.
Two parts:
- Short-Term Conversational Memory (“STCM”): The AI’s “working memory.” A local, fast store (using SQLite) that holds the immediate, turn-by-turn history of the current conversation. Built for rapid access to recent interactions, keeping the conversational flow natural. Think of it as the AI’s active scratchpad.
- Long-Term Semantic Memory (“LTSM”): The AI’s “personal library.” A cloud-based, semantically searchable store (using Google Cloud Firestore with vector embeddings) that contains crucial information extracted from all past conversations. This memory works with concepts and meanings, not just keywords, letting the AI recall relevant knowledge from its entire history regardless of when or in which conversation it was learned.
As a sidenote, you can use different databases for both the STCM and LTSM. You could use a local or hosted Postgres instance that supports both relational data and vector embeddings, or MongoDB, or a mix of both. I picked SQLite and Firestore because I wanted to experiment with these two options.
This division of labour is critical. The STCM keeps the AI current with the immediate context. The LTSM provides enduring knowledge that spans sessions, overcoming the stateless nature of the underlying LLM.
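The "semantic" half of this split boils down to comparing embedding vectors. Here is a sketch of the underlying maths (in production, Firestore's vector search performs the equivalent ranking server-side, so you would not normally hand-roll this):

```javascript
// Rank stored memories by cosine similarity to a query embedding.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topMemories(queryEmbedding, memories, k = 3) {
  return memories
    .map((m) => ({ ...m, score: cosineSimilarity(queryEmbedding, m.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because similarity is computed over meanings encoded in the vectors, a query about "what to drink" can surface "lactose intolerant" even though the two share no keywords.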
Here’s a visualisation of the overall design:
+-----------------------------+
| USER INTERACTION (CLI) |
| |
| User Input --> ChatCLI |
+-----------------------------+
|
v
+-----------------------------+
| AI CORE ORCHESTRATION |
| |
| ConversationManager |
| | |
| v |
| SessionManager |
| | |
| v |
| LLM API Call |
| | |
| v |
| AI Response |
+-----------------------------+
|
v
+-----------------------------+
| MEMORY SYSTEMS |
| |
| Background Processors |
| | |
| +--> Short-Term |
| | (SQLite) |
| | |
| +--> Long-Term |
| (Firestore) |
+-----------------------------+
The diagram illustrates the flow of AI context. The SessionManager dynamically assembles the LLM prompt by drawing from both STCM and LTSM, guided by the ConversationManager’s orchestration. Background processors (summarisers, fact extractors) work asynchronously to enrich these memories, forming a continuous learning loop.
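The prompt the SessionManager hands to the LLM might be assembled along these lines (a hypothetical sketch, not the actual SessionManager API): stable long-term facts first, then the recent short-term turns, then the new user message.

```javascript
// Hypothetical prompt assembly: long-term facts (stable background),
// recent turns (immediate context), then the new user message.
function assemblePrompt({ longTermFacts, recentTurns, userMessage }) {
  const sections = [];
  if (longTermFacts.length > 0) {
    sections.push(
      'Known facts about the user:\n' +
        longTermFacts.map((f) => `- ${f}`).join('\n')
    );
  }
  if (recentTurns.length > 0) {
    sections.push(
      'Recent conversation:\n' +
        recentTurns.map((t) => `${t.role}: ${t.content}`).join('\n')
    );
  }
  sections.push(`user: ${userMessage}`);
  return sections.join('\n\n');
}
```

The ordering is deliberate: facts retrieved from the LTSM frame everything that follows, so the model reads the current exchange already knowing who it is talking to.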
The Orchestrator: ConversationManager
The application’s entry point, app.js, is lean. It sets up the command-line interface and kicks off the main loop. The real orchestration lives inside the ConversationManager. This class acts as the central nervous system, connecting all memory components, processors, and the LLM itself.
Here’s its handleMessage method, the core processing loop for every user input:
// conversationManager.js
// ... various imports for memory stores, services, processors ...

export class ConversationManager {
  constructor() {
    // Initialise all our AI-centric services and memory interfaces
    this.embeddingService = new EmbeddingService();
    this.longTermMemory = new LongTermMemoryStore(this.embeddingService);
    this.shortTermMemory = new ShortTermMemoryStore();
    this.sessionManager = new SessionManager(this.shortTermMemory, this.longTermMemory);
    this.model = new GeminiModel(); // Our interface to the LLM
    this.summaryProcessor = new SummaryProcessor(this.shortTermMemory, this.longTermMemory, this.model);
    this.criticalInfoProcessor = new CriticalInfoProcessor(this.longTermMemory, this.model);
  }

  async handleMessage(message, conversationId) {
    // Step 1: Context Engineering in action: gather all relevant history and memories.
    const { session, isNewConversation } = await this.sessionManager.getSession(conversationId, message);

    // Step 2: Invoke the LLM with the carefully constructed prompt.
    const aiResponse = await this.model.generateResponse(session.getHistory());

    // Step 3: Update short-term memory with the latest turn.
    await this.shortTermMemory.saveMessage('user', message, session.conversationId);
    await this.shortTermMemory.saveMessage('ai', aiResponse, session.conversationId);

    // Step 4: Kick off asynchronous, AI-driven memory enrichment processes.
    // These run fire-and-forget; the .catch stops a background failure from
    // surfacing as an unhandled promise rejection.
    this.criticalInfoProcessor
      .processUserMessage(message, session.conversationId)
      .catch((err) => console.error('Critical-info extraction failed:', err));

    const messageCount = await this.shortTermMemory.getMessageCount(session.conversationId);
    // Trigger summarisation every five messages - another AI-driven task.
    // The messageCount > 0 guard prevents summarising empty new conversations.
    if (messageCount > 0 && messageCount % 5 === 0) {
      this.performSummarization(session.conversationId)
        .catch((err) => console.error('Summarisation failed:', err));
    }

    return { response: aiResponse, conversationId: session.conversationId };
  }

  // ... performSummarization and other methods ...
}
Technical Callout: The ConversationManager as an AI State Machine
The handleMessage method isn’t just shuffling data around. It’s a precisely engineered state machine for AI interaction:
- It first prepares the AI’s state (by building the session context).
- It then executes the AI’s core function (generating a response).
- Finally, it updates the AI’s internal memory state (saving messages, triggering processors) based on the new interaction.
Every user message produces a response and contributes to the continuous learning and memory-building of the system. The non-blocking nature of the background processors (criticalInfoProcessor and performSummarization, called fire-and-forget from the main thread) is a deliberate design choice. It keeps the user experience responsive, which is critical in AI product development.
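One caveat with fire-and-forget promises in Node.js: a rejection nobody handles becomes an unhandled promise rejection, which in recent Node versions terminates the process. A small defensive helper (a sketch, assuming the processors return promises) makes the detachment explicit and keeps failures in the logs:

```javascript
// Detach a background task safely: log failures instead of letting an
// unhandled promise rejection bubble up and kill the process.
function fireAndForget(promise, label) {
  Promise.resolve(promise).catch((err) => {
    console.error(`[background:${label}]`, err);
  });
}

// Usage inside handleMessage (illustrative):
// fireAndForget(this.criticalInfoProcessor.processUserMessage(message, id), 'critical-info');
```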
What’s Next? The Token Economy and Short-Term Memory
We’ve established the architectural pillars of our AI system. Next, we’ll strip back the layers of the Short-Term Conversational Memory. We’ll confront the “context window wall” head-on and explore how we manage the token economy of our conversations, keeping the AI focused and the budget intact.
Watch this space!