
Semantic Horizons - Engineering an AI's Enduring Long-Term Memory


Right. In our last instalment, we tamed the “now” of conversation. We engineered a Short-Term Conversational Memory (STCM) that keeps our AI responsive and fiscally responsible. We covered the token economy and how AI-driven summarisation keeps the LLM’s working memory lean and focused.

But the STCM is fundamentally limited to a single conversation. It’s an expert on the current chat, utterly clueless about discussions from yesterday, last week, or last year. To give the AI enduring knowledge that spans all interactions, we need something different: the Long-Term Semantic Memory (LTSM).

This is where things get interesting: vector embeddings and semantic search, the cornerstones of advanced Context Engineering for AI.

Imagine the AI has had hundreds of conversations. It’s stored thousands of summaries and critical facts. If a user asks “How’s that project going?”, how does the AI know which project? A keyword search for “project” would return a flood of irrelevant data.

The human brain doesn’t search by keywords; it searches by meaning. If you say “my furry friend,” I don’t need the word “dog” to understand you’re probably talking about your canine companion (hopefully). We need our AI to do the same.

Semantic search solves this. Instead of matching text strings, it matches the underlying meaning of text.

Vector Embeddings: The Language of Meaning

The engine behind semantic search is vector embeddings. This is a core AI concept that transforms qualitative text data into quantitative numerical data that machines can process and compare.

  • What they are: A vector embedding is a high-dimensional list of numbers (e.g., 768 or 1536 floating-point numbers). Each number represents a semantic feature of the original text.
  • The “Meaning Space”: Think of these numbers as coordinates in a vast, abstract “meaning space.” Text snippets that are semantically similar (“fast car,” “speedy vehicle,” “rapid automobile”) will have vectors that sit close to each other. Text with different meanings will be far apart.
  • How they’re created: Dedicated AI models (embedding models) are trained to perform this transformation. They learn to map words, phrases, sentences, and entire documents into this meaning space.

When a new query comes in, we turn it into a vector embedding. Then, to find relevant past information, we look for existing embeddings in our database that are closest to the query’s embedding. That “closeness” is calculated using mathematical distance metrics (like cosine similarity).
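To make that "closeness" concrete, here's a minimal, self-contained sketch of cosine similarity in plain JavaScript. This is purely illustrative (real embeddings have 768 dimensions, and in our system the database computes the distance natively):

```javascript
// Cosine similarity: the cosine of the angle between two vectors.
// 1 = same direction (same meaning), 0 = unrelated, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (invented values for illustration):
const fastCar = [0.9, 0.8, 0.1];
const speedyVehicle = [0.85, 0.82, 0.15];
const chocolateCake = [0.1, 0.05, 0.95];

console.log(cosineSimilarity(fastCar, speedyVehicle).toFixed(3)); // close to 1
console.log(cosineSimilarity(fastCar, chocolateCake).toFixed(3)); // much lower
```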

Our Embedding Engine: gemini-embedding-001

For this transformation, we rely on gemini-embedding-001. The model supports configurable output sizes; we have it produce high-quality, 768-dimensional vectors.

The EmbeddingService is a thin wrapper around Google’s Generative AI SDK, focused solely on providing this capability:

// services/embeddingService.js

export class EmbeddingService {
  constructor(genAI, embeddingDimension = 768) {
    this.genAI = genAI;
    this.embeddingDimension = embeddingDimension;
  }

  async createEmbedding(text) {
    const result = await this.genAI.models.embedContent({
      model: 'gemini-embedding-001',
      contents: text,
      config: {
        outputDimensionality: this.embeddingDimension,
      },
    });

    // Optional chaining guards against an empty embeddings array,
    // which would otherwise throw a TypeError instead of our own error.
    if (result.embeddings?.[0]?.values) {
      return result.embeddings[0].values;
    }
    throw new Error('No valid embedding returned by SDK');
  }
}

Technical Callout: Configurable Output Dimensionality

The outputDimensionality parameter controls the size of the resulting vector. The default 768 dimensions strike a good balance between semantic richness and computational efficiency. Higher dimensions capture more nuance but demand more storage and processing power. This configurability lets you tune the trade-off for your specific use case.
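A rough back-of-the-envelope for the storage side of that trade-off (assuming 4-byte floats for the raw vector data; actual database and index overhead will differ):

```javascript
// Approximate raw vector storage for N memories at a given dimensionality,
// assuming 32-bit (4-byte) floats. Index overhead comes on top of this.
function vectorStorageBytes(numVectors, dimensions, bytesPerFloat = 4) {
  return numVectors * dimensions * bytesPerFloat;
}

const memories = 100_000;
const mb = (bytes) => (bytes / (1024 * 1024)).toFixed(1);

console.log(`768 dims:  ~${mb(vectorStorageBytes(memories, 768))} MB`);  // ~293.0 MB
console.log(`1536 dims: ~${mb(vectorStorageBytes(memories, 1536))} MB`); // exactly double
```

Doubling the dimensionality doubles both storage and the per-comparison compute, so the gains in semantic nuance have to earn their keep.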

Building the AI’s Library: Firestore with Vector Indexing

With our embedding service sorted, we need a place to store these vectors and search them efficiently. For the LTSM, I’ve chosen Google Cloud Firestore, augmented with its vector indexing capabilities.

I’ve pointed this out in previous articles, but you can pick any other database that supports vector indices.

Firestore provides a scalable, flexible NoSQL document database. Each “document” represents a piece of knowledge (a summary, a critical fact). Firestore lets us store our generated embedding vectors alongside this knowledge and build vector indexes. These indexes are specialised structures that allow super-fast “nearest neighbour” searches, finding vectors semantically closest to a query vector.
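As a sketch of what one such knowledge document might look like before the write, here's a small helper. The field names are illustrative, not the exact schema of longTermMemoryStore.js, and in the real write the embedding array would be wrapped with the Admin SDK's FieldValue.vector() so the vector index can use it:

```javascript
// Build the payload for one long-term memory document (illustrative schema).
// In production, `embedding` would be wrapped with FieldValue.vector(...)
// from firebase-admin before writing, so Firestore treats it as a vector field.
function buildMemoryDoc(userId, text, embedding, { type = 'fact', importance = 0.7, category = null } = {}) {
  return {
    userId,            // scopes every search to a single user
    type,              // 'summary' or 'fact'
    category,
    importance,        // pre-assigned weight, used at ranking time
    content: text,
    embedding,         // 768-dimensional vector from the EmbeddingService
    createdAt: Date.now(),
  };
}

const doc = buildMemoryDoc('user-42', 'User is allergic to peanuts', new Array(768).fill(0), {
  type: 'fact',
  importance: 0.95,
});
// await db.collection('memoryVectors').add(doc); // the actual write, sketched
```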

Our LongTermMemoryStore handles all interaction with this intelligent archive.

The Pay-off: How getRelevantContext Works

Storing data is one thing. Retrieving it intelligently is where the system comes alive. The vectorSearchWithEmbedding method in longTermMemoryStore.js is a masterclass in practical, multi-faceted AI information retrieval.

Here’s how it breaks down:

// memory/longTermMemoryStore.js (Simplified for clarity)

  async vectorSearchWithEmbedding(queryEmbedding, options = {}) {
    const {
      limit = 10,
      type = null,
      similarityThreshold = 0.3,
      // ... other filters like category, minImportance ...
    } = options;

    let query = this.db.collection(this.memoryVectorsCollection).where('userId', '==', this.userId);

    // ... code to add type and category filters ...

    // 1. Build the core vector query using Firestore's native capabilities
    const vectorQueryOptions = {
      vectorField: 'embedding',
      queryVector: queryEmbedding,
      limit: Math.min(limit * 3, 50), // Fetch more to filter down later
      distanceMeasure: 'COSINE',
      distanceResultField: 'vector_distance', // Ask Firestore to calculate and return the distance
    };

    const vectorQuery = query.findNearest(vectorQueryOptions);
    const snapshot = await vectorQuery.get();

    const results = [];
    snapshot.forEach((doc) => {
      // 2. Natively retrieve the distance and convert to similarity
      const distance = doc.get('vector_distance') || 0;
      const similarity = 1 - distance; // The key conversion from distance to similarity

      // 3. Apply a multi-layered filtering logic
      if (similarity < similarityThreshold) {
        return; // Exclude results that aren't semantically close enough
      }
      // ... more filtering based on importance, confidence ...

      // 4. Calculate a final score and collect results
      const data = doc.data();
      results.push({
        // ... doc data ...
        similarity: similarity,
        score: similarity * (data.importance || 0.7), // Weighted scoring
      });
    });

    // 5. Sort by the combined, weighted score and return the final list
    return results.sort((a, b) => b.score - a.score).slice(0, limit);
  }

Here’s what a semantic search looks like when the AI retrieves relevant context from long-term memory:

[Terminal output: semantic search retrieving context from long-term memory]

Technical Callout: From Distance to Similarity in Production

The implementation uses Firestore’s native findNearest capability, which is highly optimised.

  1. distanceResultField: 'vector_distance': This is the crucial instruction. We tell Firestore not only to find the nearest vectors but also to calculate their distance (COSINE distance) and return that value in a field called vector_distance. This offloads the expensive mathematical computation to the optimised database layer.
  2. similarity = 1 - distance: COSINE distance ranges from 0 (identical vectors) to 2 (opposite vectors). Subtracting the distance from 1 converts this into a “similarity score” ranging from 1 (identical) down to -1 (opposite). That’s the practical, production-grade way to handle vector search output.
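The conversion itself is a one-liner; a quick sanity check of the ranges:

```javascript
// Convert a COSINE distance (0..2) returned by the database into a
// similarity score (1 = identical, -1 = opposite in meaning).
const toSimilarity = (distance) => 1 - distance;

console.log(toSimilarity(0));   // 1   -> identical meaning
console.log(toSimilarity(0.5)); // 0.5 -> reasonably related
console.log(toSimilarity(2));   // -1  -> opposite meaning
```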

Multi-Layered Filtering and Scoring

What makes this implementation stand out is that it doesn’t stop at raw similarity. It applies further layers of logic:

  • Thresholds: Results below a similarityThreshold get tossed out, preventing weakly related memories from polluting the context.
  • Metadata Filtering: It can filter by type (summary vs. fact) or category before the vector search, narrowing the search space.
  • Weighted Scoring: The final ranking isn’t purely semantic similarity. It’s a weighted score combining similarity with the fact’s pre-assigned importance. This lets the AI prioritise memories that are both relevant and known to be important. A powerful lever for improving response quality.

We’ll dig into what category is, and how it’s used, in a later post.
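Putting thresholding and weighted scoring together, here's a minimal in-memory sketch of the ranking logic. The candidates (and their similarity and importance values) are invented stand-ins for the documents Firestore returns:

```javascript
// Rank candidate memories: drop weak matches, then weight similarity
// by each fact's pre-assigned importance.
function rankMemories(candidates, { similarityThreshold = 0.3, limit = 10 } = {}) {
  return candidates
    .filter((c) => c.similarity >= similarityThreshold)
    .map((c) => ({ ...c, score: c.similarity * (c.importance ?? 0.7) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const candidates = [
  { content: 'Project Atlas deadline is Friday', similarity: 0.82, importance: 0.9 },
  { content: 'User mentioned liking coffee',     similarity: 0.85, importance: 0.4 },
  { content: 'Weather small talk last Tuesday',  similarity: 0.2,  importance: 0.8 },
];

console.log(rankMemories(candidates).map((m) => m.content));
// The Atlas fact wins (0.82 * 0.9 = 0.738) over the slightly more similar
// but less important coffee note (0.85 * 0.4 = 0.34); the weather chat
// never makes it past the similarity threshold.
```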

AI-Driven Deduplication

The storeCriticalFact method includes an AI-driven deduplication step. Before storing a new fact, it runs a quick vector search for highly similar existing facts. If a near-duplicate turns up, it skips the store. This prevents the LTSM from getting clogged with redundant information, keeping the long-term knowledge base clean, efficient, and semantically diverse.

That’s a far cry from a simple database INSERT. It’s an intelligent, self-regulating memory curation process.
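An in-memory sketch of that deduplication check. The real storeCriticalFact runs the same idea as a vector search against Firestore, and the 0.9 threshold here is an illustrative choice, not the production value:

```javascript
// Skip storing a new fact if an existing one is nearly identical in meaning.
function isNearDuplicate(newEmbedding, existingEmbeddings, threshold = 0.9) {
  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return existingEmbeddings.some((e) => cosine(newEmbedding, e) >= threshold);
}

// Toy vectors: the first stored fact points almost the same way as the new one.
const stored = [[0.99, 0.1, 0.0], [0.0, 0.0, 1.0]];
console.log(isNearDuplicate([1, 0.12, 0.01], stored));  // true  -> skip the write
console.log(isNearDuplicate([0.1, 0.95, 0.2], stored)); // false -> store it
```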

What’s Next? Bringing It All Together

With a clear understanding of our LTSM, we can appreciate the full toolkit. We have a fast STCM and a deep, efficient, self-cleaning LTSM. The question now: how do we bring them together?

In Part 4, we’ll look at the central nervous system of our Context Engineering strategy, the SessionManager. We’ll explore how this component dynamically engineers the perfect prompt for the LLM at every single turn, blending the “now” with the “then” to achieve coherent, deeply personalised AI interactions. That’s where things click into place.