
The Grand Orchestration - Engineering a Dual-Memory AI for Enduring Conversations


Hello there, fellow AI enthusiast! I’m glad that you are here. Let me welcome you with this tasty glass of milk. Oh, wait, what? You are lactose intolerant? I didn’t know that. Excuse me? You told me already? Hmmm, that’s strange.

Okay, the exchange above may be a little exaggerated, but it illustrates a very real experience: a conversational AI agent forgetting the conversation history. In this article (and in the follow-up series) I will introduce you to the concept of a dual-memory system, and we will unpack the why and the how with the meticulous detail that the art of Context Engineering truly deserves.

If you’ve ever had a chat with a large language model (LLM), you’ve likely marvelled at its fluency, its creativity, its sheer informational breadth. It can summarise entire books, write poetry, and explain quantum physics with aplomb. Yet, ask it about something you said just five turns ago, and it might well draw a blank. This isn’t a bug; it’s a fundamental characteristic of how these powerful models operate.

The Stateless Enigma: Why LLMs Forget

At their core, most LLMs are stateless. Each interaction, each tool call, is treated as an independent event. The model doesn’t inherently “remember” previous turns of a conversation. It’s like talking to someone who has complete amnesia between sentences, or rather, between conversations.

So, how do we make them appear conversational? We feed them the entire conversation history with every prompt. We literally copy and paste everything that’s been said so far into the input window for each new query. This is the simplest form of “memory,” and it works… until it doesn’t.
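The naive approach can be sketched in a few lines. This is a toy illustration, not the article’s implementation: `callLLM` is a placeholder standing in for a real model API call, and `history` is the ever-growing transcript replayed on every turn.

```javascript
// Placeholder for a real LLM API call -- the key point is that the
// model only "knows" whatever is inside `messages`.
function callLLM(messages) {
  return `(reply to: ${messages[messages.length - 1].content})`;
}

const history = [];

function chat(userInput) {
  history.push({ role: 'user', content: userInput });
  // The ENTIRE history is re-sent on every single turn --
  // this replay is the only "memory" the stateless model has.
  const reply = callLLM(history);
  history.push({ role: 'ai', content: reply });
  return reply;
}

chat('My name is Ada.');
chat('What is my name?');
console.log(history.length); // the transcript grows with every turn
```

Note that nothing here is remembered by the model itself; drop a message from `history` and, from the model’s point of view, it never happened.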

This naive approach quickly runs into several very real, very costly problems:

  1. The Context Window Wall: LLMs have a finite input length, known as the context window. If your conversation exceeds this limit (e.g., 8k, 16k, 32k, or even 128k tokens), you’re forced to truncate older messages. The AI literally “forgets” the beginning of the chat.
  2. The Token Tax: Every word you send to the AI, and every word it sends back, costs you money (or compute cycles). A longer conversation history means a larger prompt, which means more tokens processed, and thus, a higher bill. It’s an economic consideration that cannot be ignored in practical AI system design.
  3. Latency & Performance: Sending enormous prompts takes longer to transmit and longer for the model to process. A sluggish AI is a frustrating AI.
  4. Cognitive Overload (for the AI): Even within the context window, a massive amount of raw, unorganised text can dilute the AI’s focus. It has to sift through everything to find the relevant bits, potentially leading to less coherent or accurate responses. This is often referred to as “context rot”.
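To make the context window wall concrete, here is a toy sketch of budget-driven truncation. It uses a crude word count as a stand-in for a real tokenizer, and the `fitToWindow` helper is purely illustrative:

```javascript
// Crude approximation of token counting -- real systems use the
// model's actual tokenizer.
function approxTokens(text) {
  return text.split(/\s+/).length;
}

// Keep only the newest messages that fit inside the token budget.
function fitToWindow(messages, budget) {
  const kept = [];
  let used = 0;
  // Walk backwards from the newest message, keeping what fits.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = approxTokens(messages[i].content);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

const transcript = [
  { role: 'user', content: 'I am lactose intolerant, please remember that' },
  { role: 'ai', content: 'Noted, no dairy recommendations from me' },
  { role: 'user', content: 'Suggest a drink for me' },
];

// With a tight budget, the oldest (and most important!) message
// is the one that gets dropped.
console.log(fitToWindow(transcript, 12).length);
```

Notice the failure mode: truncation always sacrifices the oldest messages, which is exactly where facts like “I am lactose intolerant” tend to live.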

This is the heart of the Context Engineering challenge. Our goal is to craft an illusion of continuous memory, not by overloading the AI, but by intelligently curating and injecting the most relevant information at the most opportune moment.

The Grand Vision: A Dual-Memory AI System

The application that we will learn about is a system that mimics human memory more closely: a dual-memory architecture. It’s not just about storing data; it’s about intelligent memory management that feeds precisely what the AI needs, when it needs it.

It has the following two parts:

  1. Short-Term Conversational Memory (“STCM”): This is our AI’s “working memory” - a local, fast store (using SQLite) that holds the immediate, turn-by-turn history of the current conversation. It’s designed for rapid access to recent interactions, keeping the conversational flow natural and immediate. Think of it as the AI’s active scratchpad.

  2. Long-Term Semantic Memory (“LTSM”): This is the AI’s “personal library” - a cloud-based, semantically searchable store (using Google Cloud Firestore with vector embeddings) that contains crucial information extracted from all past conversations. This memory is about concepts and meanings, not just keywords, allowing the AI to recall relevant knowledge from its entire history, regardless of when or in which conversation it was learned.

As a sidenote, you can use different databases for both the STCM and LTSM. You could use a local or a hosted Postgres instance that supports both relational data as well as vector embeddings, or MongoDB, or a mix of both - you get the idea. I picked SQLite and Firestore because I wanted to experiment with these two options.

This intelligent division of labour is paramount. The STCM ensures the AI is always up-to-date with the immediate context, while the LTSM provides the depth of enduring knowledge, overcoming the stateless nature of the underlying LLM.
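To build intuition for the “concepts and meanings, not just keywords” recall that the LTSM enables, here is a toy illustration of vector-based retrieval. Real embeddings come from an embedding model; the tiny hand-made vectors and the `recall` helper below are purely illustrative, not the Firestore implementation:

```javascript
// Cosine similarity: how aligned two vectors are, independent of length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Each stored memory carries an embedding vector alongside its text.
const memories = [
  { text: 'User is lactose intolerant', vector: [0.9, 0.1, 0.0] },
  { text: 'User enjoys hiking', vector: [0.0, 0.2, 0.9] },
];

// Retrieval ranks memories by similarity to the query embedding.
function recall(queryVector) {
  return memories
    .map((m) => ({ ...m, score: cosineSimilarity(queryVector, m.vector) }))
    .sort((a, b) => b.score - a.score)[0].text;
}

// A query whose embedding sits near the "dairy" memory recalls it,
// even though no keyword overlaps.
console.log(recall([0.8, 0.2, 0.1]));
```

This is why the LTSM can surface “the user is lactose intolerant” when asked to suggest a drink, with no shared keywords required.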

Let’s visualise this grand design:


+-----------------------------+
|   USER INTERACTION (CLI)    |
|                             |
|   User Input --> ChatCLI    |
+-----------------------------+
               |
               v
+-----------------------------+
|   AI CORE ORCHESTRATION     |
|                             |
|   ConversationManager       |
|         |                   |
|         v                   |
|   SessionManager            |
|         |                   |
|         v                   |
|   LLM API Call              |
|         |                   |
|         v                   |
|   AI Response               |
+-----------------------------+
               |
               v
+-----------------------------+
|      MEMORY SYSTEMS         |
|                             |
|   Background Processors     |
|         |                   |
|         +--> Short-Term     |
|         |    (SQLite)       |
|         |                   |
|         +--> Long-Term      |
|              (Firestore)    |
+-----------------------------+

This diagram is more than just a box-and-arrow chart; it illustrates the flow of AI context. The SessionManager is the critical piece that dynamically assembles the LLM prompt by intelligently drawing from both STCM and LTSM, guided by the orchestration of the ConversationManager. Background processors (like summarisers and fact extractors) work asynchronously to enrich these memories, forming a continuous learning loop.
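As a rough sketch of what SessionManager-style prompt assembly could look like, the snippet below combines recent turns from short-term memory with semantically relevant facts from long-term memory. This is a hypothetical simplification, not the article’s actual SessionManager; the `buildPrompt` helper and its inputs are illustrative stand-ins:

```javascript
// Assemble the LLM prompt from both memory systems:
// long-term facts first, then the recent short-term transcript.
function buildPrompt(recentTurns, relevantFacts, userMessage) {
  const factsBlock = relevantFacts.length
    ? `Known facts about the user:\n- ${relevantFacts.join('\n- ')}\n\n`
    : '';
  const historyBlock = recentTurns
    .map((t) => `${t.role}: ${t.content}`)
    .join('\n');
  return `${factsBlock}${historyBlock}\nuser: ${userMessage}`;
}

const prompt = buildPrompt(
  [
    { role: 'user', content: 'Hi!' },
    { role: 'ai', content: 'Hello!' },
  ],
  ['The user is lactose intolerant'], // retrieved from LTSM
  'Suggest a drink for me'
);
console.log(prompt);
```

The point of the design is visible in the output: the model receives a small, curated prompt in which a fact learned long ago sits right next to the live conversation.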

The Orchestrator: ConversationManager

The application’s entry point, app.js, is quite lean, primarily setting up the command-line interface and kicking off the main loop. The real AI-centric orchestration begins within the ConversationManager. This class acts as the central nervous system for our AI, connecting all the memory components, processors, and the LLM itself.

Let’s examine its handleMessage method, which defines the core AI processing loop for every user input:

// conversationManager.js

// ... various imports for memory stores, services, processors ...

export class ConversationManager {
  constructor() {
    // Initialise all our AI-centric services and memory interfaces
    this.embeddingService = new EmbeddingService();
    this.longTermMemory = new LongTermMemoryStore(this.embeddingService);
    this.shortTermMemory = new ShortTermMemoryStore();
    this.sessionManager = new SessionManager(this.shortTermMemory, this.longTermMemory);
    this.model = new GeminiModel(); // Our interface to the LLM
    this.summaryProcessor = new SummaryProcessor(this.shortTermMemory, this.longTermMemory, this.model);
    this.criticalInfoProcessor = new CriticalInfoProcessor(this.longTermMemory, this.model);
  }

  async handleMessage(message, conversationId) {
    // Step 1: Context Engineering in action: Gather all relevant history and memories.
    const { session, isNewConversation } = await this.sessionManager.getSession(conversationId, message);

    // Step 2: Invoke the LLM with the carefully constructed prompt.
    const aiResponse = await this.model.generateResponse(session.getHistory());

    // Step 3: Update Short-Term Memory with the latest turn.
    await this.shortTermMemory.saveMessage('user', message, session.conversationId);
    await this.shortTermMemory.saveMessage('ai', aiResponse, session.conversationId);

    // Step 4: Kick off asynchronous, AI-driven memory enrichment processes.
    // These tasks enhance long-term memory without blocking the user experience.
    this.criticalInfoProcessor.processUserMessage(message, session.conversationId);
    
    const messageCount = await this.shortTermMemory.getMessageCount(session.conversationId);
    // Trigger summarisation based on message count - another AI-driven task.
    if (messageCount > 0 && messageCount % 5 === 0) { // Using messageCount > 0 to prevent summarising empty new convos
      this.performSummarization(session.conversationId);
    }

    return { response: aiResponse, conversationId: session.conversationId };
  }

  // ... performSummarization and other methods ...
}

Technical Callout: The ConversationManager as an AI State Machine

Notice how the handleMessage method isn’t just about passing data around. It’s a precisely engineered state machine for AI interaction:

  • It first prepares the AI’s state (by building the session context).
  • It then executes the AI’s core function (generating a response).
  • Finally, it updates the AI’s internal memory state (saving messages, triggering processors) based on the new interaction.

This loop ensures that every user message not only receives a response but also contributes to the continuous learning and memory-building of the AI system. The non-blocking nature of the background processors is a key design choice: criticalInfoProcessor.processUserMessage and performSummarization are invoked without being awaited, fire-and-forget style, so memory enrichment never delays the user-facing response. Maintaining that responsiveness is a critical aspect of AI product development.
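The fire-and-forget pattern deserves one caveat. Here is a minimal sketch (with `enrichMemory` as a stand-in for summarisation or fact extraction): the promise is deliberately not awaited, but a .catch() is attached so a failed background task cannot become an unhandled promise rejection.

```javascript
// Stand-in for background memory-enrichment work
// (summarisation, fact extraction, etc.).
async function enrichMemory(message) {
  return `extracted facts from: ${message}`;
}

function handleTurn(message) {
  // Deliberately NOT awaited: the enrichment runs in the background.
  // The .catch() keeps a failure from crashing the process as an
  // unhandled rejection.
  enrichMemory(message).catch((err) => {
    console.error('memory enrichment failed:', err);
  });
  // The user gets their response immediately.
  return 'immediate response';
}

console.log(handleTurn('I am lactose intolerant'));
```

The trade-off is eventual consistency: the long-term memory may lag the conversation by a turn or two, which is usually an acceptable price for a snappy reply.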

What’s Next? The Token Economy and Short-Term Memory

We’ve established the architectural pillars of our AI system. Now, it’s time to delve into the intricate details of each. In our next post, we’ll strip back the layers of the Short-Term Conversational Memory. We’ll confront the very real challenge of the “context window wall” and explore how we manage the token economy of our conversations, ensuring our AI can keep its focus and its budget.

Watch this space!