
Semantic Horizons - Engineering an AI's Enduring Long-Term Memory


Right. In our last instalment, we tamed the “now” of conversation. We engineered a Short-Term Conversational Memory (STCM) that keeps our AI responsive and fiscally responsible. We covered the token economy and how AI-driven summarisation keeps the LLM’s working memory lean and focused.

But the STCM is fundamentally limited to a single conversation. It’s an expert on the current chat, utterly clueless about discussions from yesterday, last week, or last year. To give the AI enduring knowledge that spans all interactions, we need something different: the Long-Term Semantic Memory (LTSM).

This is where things get interesting: vector embeddings and semantic search, the cornerstones of advanced Context Engineering for AI.

Imagine the AI has had hundreds of conversations. It’s stored thousands of summaries and critical facts. If a user asks “How’s that project going?”, how does the AI know which project? A keyword search for “project” would return a flood of irrelevant data.

The human brain doesn’t search by keywords; it searches by meaning. If you say “my furry friend,” I don’t need the word “dog” to understand you’re probably talking about your canine companion (hopefully). We need our AI to do the same.

Semantic search solves this. Instead of matching text strings, it matches the underlying meaning of text.

Vector Embeddings: The Language of Meaning

The engine behind semantic search is vector embeddings. This is a core AI concept that transforms qualitative text data into quantitative numerical data that machines can process and compare.

  • What they are: A vector embedding is a high-dimensional list of numbers (e.g., 768 or 1536 floating-point numbers). Each number represents a semantic feature of the original text.
  • The “Meaning Space”: Think of these numbers as coordinates in a vast, abstract “meaning space.” Text snippets that are semantically similar (“fast car,” “speedy vehicle,” “rapid automobile”) will have vectors that sit close to each other. Text with different meanings will be far apart.
  • How they’re created: Dedicated AI models (embedding models) are trained to perform this transformation. They learn to map words, phrases, sentences, and entire documents into this meaning space.

When a new query comes in, we turn it into a vector embedding. Then, to find relevant past information, we look for existing embeddings in our database that are closest to the query’s embedding. That “closeness” is calculated using mathematical distance metrics (like cosine similarity).
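To make that "closeness" concrete, here's a minimal, self-contained sketch of cosine similarity in plain JavaScript. This is purely illustrative (real embeddings have 768 dimensions, and in our system the database computes the distance natively):

```javascript
// Cosine similarity: the cosine of the angle between two vectors.
// 1 = same direction (same meaning), 0 = unrelated, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (invented values for illustration):
const fastCar = [0.9, 0.8, 0.1];
const speedyVehicle = [0.85, 0.82, 0.15];
const chocolateCake = [0.1, 0.05, 0.95];

console.log(cosineSimilarity(fastCar, speedyVehicle).toFixed(3)); // close to 1
console.log(cosineSimilarity(fastCar, chocolateCake).toFixed(3)); // much lower
```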

Our Embedding Engine: gemini-embedding-001

For this transformation, we rely on gemini-embedding-001. The model supports configurable output sizes; we have it produce high-quality, 768-dimensional vectors.

The EmbeddingService is a thin wrapper around Google’s Generative AI SDK, focused solely on providing this capability:

// services/embeddingService.js

export class EmbeddingService {
  constructor(genAI, embeddingDimension = 768) {
    this.genAI = genAI;
    this.embeddingDimension = embeddingDimension;
  }

  async createEmbedding(text) {
    const result = await this.genAI.models.embedContent({
      model: 'gemini-embedding-001',
      contents: text,
      config: {
        outputDimensionality: this.embeddingDimension,
      },
    });

    // Optional chaining guards against an empty embeddings array,
    // which would otherwise throw a TypeError instead of our own error.
    if (result.embeddings?.[0]?.values) {
      return result.embeddings[0].values;
    }
    throw new Error('No valid embedding returned by SDK');
  }
}

Technical Callout: Configurable Output Dimensionality

The outputDimensionality parameter controls the size of the resulting vector. The default 768 dimensions strike a good balance between semantic richness and computational efficiency. Higher dimensions capture more nuance but demand more storage and processing power. This configurability lets you tune the trade-off for your specific use case.
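A rough back-of-the-envelope for the storage side of that trade-off (assuming 4-byte floats for the raw vector data; actual database and index overhead will differ):

```javascript
// Approximate raw vector storage for N memories at a given dimensionality,
// assuming 32-bit (4-byte) floats. Index overhead comes on top of this.
function vectorStorageBytes(numVectors, dimensions, bytesPerFloat = 4) {
  return numVectors * dimensions * bytesPerFloat;
}

const memories = 100_000;
const mb = (bytes) => (bytes / (1024 * 1024)).toFixed(1);

console.log(`768 dims:  ~${mb(vectorStorageBytes(memories, 768))} MB`);  // ~293.0 MB
console.log(`1536 dims: ~${mb(vectorStorageBytes(memories, 1536))} MB`); // exactly double
```

Doubling the dimensionality doubles both storage and the per-comparison compute, so the gains in semantic nuance have to earn their keep.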

Building the AI’s Library: Firestore with Vector Indexing

With our embedding service sorted, we need a place to store these vectors and search them efficiently. For the LTSM, I’ve chosen Google Cloud Firestore, augmented with its vector indexing capabilities.

I’ve pointed this out in previous articles, but you can pick any other database that supports vector indices.

Firestore provides a scalable, flexible NoSQL document database. Each “document” represents a piece of knowledge (a summary, a critical fact). Firestore lets us store our generated embedding vectors alongside this knowledge and build vector indexes. These indexes are specialised structures that allow super-fast “nearest neighbour” searches, finding vectors semantically closest to a query vector.
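As a sketch of what one such knowledge document might look like before the write, here's a small helper. The field names are illustrative, not the exact schema of longTermMemoryStore.js, and in the real write the embedding array would be wrapped with the Admin SDK's FieldValue.vector() so the vector index can use it:

```javascript
// Build the payload for one long-term memory document (illustrative schema).
// In production, `embedding` would be wrapped with FieldValue.vector(...)
// from firebase-admin before writing, so Firestore treats it as a vector field.
function buildMemoryDoc(userId, text, embedding, { type = 'fact', importance = 0.7, category = null } = {}) {
  return {
    userId,            // scopes every search to a single user
    type,              // 'summary' or 'fact'
    category,
    importance,        // pre-assigned weight, used at ranking time
    content: text,
    embedding,         // 768-dimensional vector from the EmbeddingService
    createdAt: Date.now(),
  };
}

const doc = buildMemoryDoc('user-42', 'User is allergic to peanuts', new Array(768).fill(0), {
  type: 'fact',
  importance: 0.95,
});
// await db.collection('memoryVectors').add(doc); // the actual write, sketched
```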

Our LongTermMemoryStore handles all interaction with this intelligent archive.

The Pay-off: How getRelevantContext Works

Storing data is one thing. Retrieving it intelligently is where the system comes alive. The vectorSearchWithEmbedding method in longTermMemoryStore.js is a masterclass in practical, multi-faceted AI information retrieval.

Here’s how it breaks down:

// memory/longTermMemoryStore.js (Simplified for clarity)

  async vectorSearchWithEmbedding(queryEmbedding, options = {}) {
    const {
      limit = 10,
      type = null,
      similarityThreshold = 0.3,
      // ... other filters like category, minImportance ...
    } = options;

    let query = this.db.collection(this.memoryVectorsCollection).where('userId', '==', this.userId);

    // ... code to add type and category filters ...

    // 1. Build the core vector query using Firestore's native capabilities
    const vectorQueryOptions = {
      vectorField: 'embedding',
      queryVector: queryEmbedding,
      limit: Math.min(limit * 3, 50), // Fetch more to filter down later
      distanceMeasure: 'COSINE',
      distanceResultField: 'vector_distance', // Ask Firestore to calculate and return the distance
    };

    const vectorQuery = query.findNearest(vectorQueryOptions);
    const snapshot = await vectorQuery.get();

    const results = [];
    snapshot.forEach((doc) => {
      // 2. Natively retrieve the distance and convert to similarity
      const distance = doc.get('vector_distance') || 0;
      const similarity = 1 - distance; // The key conversion from distance to similarity

      // 3. Apply a multi-layered filtering logic
      if (similarity < similarityThreshold) {
        return; // Exclude results that aren't semantically close enough
      }
      // ... more filtering based on importance, confidence ...

      // 4. Calculate a final score and collect results
      const data = doc.data();
      results.push({
        // ... doc data ...
        similarity: similarity,
        score: similarity * (data.importance || 0.7), // Weighted scoring
      });
    });

    // 5. Sort by the combined, weighted score and return the final list
    return results.sort((a, b) => b.score - a.score).slice(0, limit);
  }

Here’s what a semantic search looks like when the AI retrieves relevant context from long-term memory:

[Terminal output: semantic search retrieving context from long-term memory]

Technical Callout: From Distance to Similarity in Production

The implementation uses Firestore’s native findNearest capability, which is highly optimised.

  1. distanceResultField: 'vector_distance': This is the crucial instruction. We tell Firestore not only to find the nearest vectors but also to calculate their distance (COSINE distance) and return that value in a field called vector_distance. This offloads the expensive mathematical computation to the optimised database layer.
  2. similarity = 1 - distance: COSINE distance ranges from 0 (identical vectors) to 2 (opposite vectors). Subtracting the distance from 1 converts this into a “similarity score” ranging from 1 (identical) down to -1 (opposite). That’s the practical, production-grade way to handle vector search output.
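The conversion itself is a one-liner; a quick sanity check of the ranges:

```javascript
// Convert a COSINE distance (0..2) returned by the database into a
// similarity score (1 = identical, -1 = opposite in meaning).
const toSimilarity = (distance) => 1 - distance;

console.log(toSimilarity(0));   // 1   -> identical meaning
console.log(toSimilarity(0.5)); // 0.5 -> reasonably related
console.log(toSimilarity(2));   // -1  -> opposite meaning
```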

Multi-Layered Filtering and Scoring

What makes this implementation stand out is that it doesn’t stop at raw similarity. It applies further layers of logic:

  • Thresholds: Results below a similarityThreshold get tossed out, preventing weakly related memories from polluting the context.
  • Metadata Filtering: It can filter by type (summary vs. fact) or category before the vector search, narrowing the search space.
  • Weighted Scoring: The final ranking isn’t purely semantic similarity. It’s a weighted score combining similarity with the fact’s pre-assigned importance. This lets the AI prioritise memories that are both relevant and known to be important. A powerful lever for improving response quality.

We’ll dig into what category is, and how it’s used, in a later post.
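Putting thresholding and weighted scoring together, here's a minimal in-memory sketch of the ranking logic. The candidates (and their similarity and importance values) are invented stand-ins for the documents Firestore returns:

```javascript
// Rank candidate memories: drop weak matches, then weight similarity
// by each fact's pre-assigned importance.
function rankMemories(candidates, { similarityThreshold = 0.3, limit = 10 } = {}) {
  return candidates
    .filter((c) => c.similarity >= similarityThreshold)
    .map((c) => ({ ...c, score: c.similarity * (c.importance ?? 0.7) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const candidates = [
  { content: 'Project Atlas deadline is Friday', similarity: 0.82, importance: 0.9 },
  { content: 'User mentioned liking coffee',     similarity: 0.85, importance: 0.4 },
  { content: 'Weather small talk last Tuesday',  similarity: 0.2,  importance: 0.8 },
];

console.log(rankMemories(candidates).map((m) => m.content));
// The Atlas fact wins (0.82 * 0.9 = 0.738) over the slightly more similar
// but less important coffee note (0.85 * 0.4 = 0.34); the weather chat
// never makes it past the similarity threshold.
```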

AI-Driven Deduplication

The storeCriticalFact method includes an AI-driven deduplication step. Before storing a new fact, it runs a quick vector search for highly similar existing facts. If a near-duplicate turns up, it skips the store. This prevents the LTSM from getting clogged with redundant information, keeping the long-term knowledge base clean, efficient, and semantically diverse.

That’s a far cry from a simple database INSERT. It’s an intelligent, self-regulating memory curation process.
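An in-memory sketch of that deduplication check. The real storeCriticalFact runs the same idea as a vector search against Firestore, and the 0.9 threshold here is an illustrative choice, not the production value:

```javascript
// Skip storing a new fact if an existing one is nearly identical in meaning.
function isNearDuplicate(newEmbedding, existingEmbeddings, threshold = 0.9) {
  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return existingEmbeddings.some((e) => cosine(newEmbedding, e) >= threshold);
}

// Toy vectors: the first stored fact points almost the same way as the new one.
const stored = [[0.99, 0.1, 0.0], [0.0, 0.0, 1.0]];
console.log(isNearDuplicate([1, 0.12, 0.01], stored));  // true  -> skip the write
console.log(isNearDuplicate([0.1, 0.95, 0.2], stored)); // false -> store it
```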

What’s Next? Bringing It All Together

With a clear understanding of our LTSM, we can appreciate the full toolkit. We have a fast STCM and a deep, efficient, self-cleaning LTSM. The question now: how do we bring them together?

In Part 4, we’ll look at the central nervous system of our Context Engineering strategy, the SessionManager. We’ll explore how this component dynamically engineers the perfect prompt for the LLM at every single turn, blending the “now” with the “then” to achieve coherent, deeply personalised AI interactions. That’s where things click into place.