Semantic Horizons - Engineering an AI's Enduring Long-Term Memory
Right. In our last instalment, we tamed the “now” of conversation. We engineered a Short-Term Conversational Memory (STCM) that keeps our AI responsive and fiscally responsible. We covered the token economy and how AI-driven summarisation keeps the LLM’s working memory lean and focused.
But the STCM is fundamentally limited to a single conversation. It’s an expert on the current chat, utterly clueless about discussions from yesterday, last week, or last year. To give the AI enduring knowledge that spans all interactions, we need something different: the Long-Term Semantic Memory (LTSM).
This is where things get interesting: vector embeddings and semantic search, the cornerstone of advanced Context Engineering for AI.
The Problem: Beyond Keyword Search
Imagine the AI has had hundreds of conversations. It’s stored thousands of summaries and critical facts. If a user asks “How’s that project going?”, how does the AI know which project? A keyword search for “project” would return a flood of irrelevant data.
The human brain doesn’t search by keywords; it searches by meaning. If you say “my furry friend,” I don’t need the word “dog” to understand you’re probably talking about your canine companion (hopefully). We need our AI to do the same.
Semantic search solves this. Instead of matching text strings, it matches the underlying meaning of text.
Vector Embeddings: The Language of Meaning
The engine behind semantic search is vector embeddings. This is a core AI concept that transforms qualitative text data into quantitative numerical data that machines can process and compare.
- What they are: A vector embedding is a high-dimensional list of numbers (e.g., 768 or 1536 floating-point numbers). Each number represents a semantic feature of the original text.
- The “Meaning Space”: Think of these numbers as coordinates in a vast, abstract “meaning space.” Text snippets that are semantically similar (“fast car,” “speedy vehicle,” “rapid automobile”) will have vectors that sit close to each other. Text with different meanings will be far apart.
- How they’re created: Dedicated AI models (embedding models) are trained to perform this transformation. They learn to map words, phrases, sentences, and entire documents into this meaning space.
When a new query comes in, we turn it into a vector embedding. Then, to find relevant past information, we look for existing embeddings in our database that are closest to the query’s embedding. That “closeness” is calculated using mathematical distance metrics (like cosine similarity).
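To make that concrete, here's a toy sketch (illustrative only, not our production code) of cosine similarity, using tiny three-dimensional vectors as stand-ins for real 768-dimensional embeddings:

```javascript
// Toy illustration of the metric behind semantic search.
// Cosine similarity = dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented coordinates in a pretend "meaning space".
const fastCar = [0.9, 0.1, 0.0];
const speedyVehicle = [0.85, 0.15, 0.05];
const stockMarket = [0.0, 0.2, 0.95];

console.log(cosineSimilarity(fastCar, speedyVehicle)); // close to 1: similar meaning
console.log(cosineSimilarity(fastCar, stockMarket));   // close to 0: unrelated meaning
```

Real embedding models do the hard part: placing text at coordinates where this simple geometry reflects meaning.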
Our Embedding Engine: gemini-embedding-001
For this transformation, we rely on gemini-embedding-001, which we configure to produce high-quality, 768-dimensional vectors.
The EmbeddingService is a thin wrapper around Google’s Generative AI SDK, focused solely on providing this capability:
// services/embeddingService.js
export class EmbeddingService {
  constructor(genAI, embeddingDimension = 768) {
    this.genAI = genAI;
    this.embeddingDimension = embeddingDimension;
  }

  async createEmbedding(text) {
    const result = await this.genAI.models.embedContent({
      model: 'gemini-embedding-001',
      contents: text,
      config: {
        outputDimensionality: this.embeddingDimension,
      },
    });

    // Optional chaining guards against an empty embeddings array as well
    if (result.embeddings?.[0]?.values) {
      return result.embeddings[0].values;
    } else {
      throw new Error('No valid embedding returned by SDK');
    }
  }
}
Technical Callout: Configurable Output Dimensionality
The outputDimensionality parameter controls the size of the resulting vector. Our default of 768 dimensions strikes a good balance between semantic richness and computational efficiency. Higher dimensions capture more nuance but demand more storage and processing power. This configurability lets you tune the trade-off for your specific use case.
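Some back-of-envelope arithmetic makes the trade-off tangible. Assuming each dimension is stored as an 8-byte double (worth verifying for your datastore; raw float32 would halve this):

```javascript
// Rough storage cost for stored embeddings, assuming 8 bytes per number.
const bytesPerVector = (dims, bytesPerNumber = 8) => dims * bytesPerNumber;
const gibPerMillion = (dims) => (bytesPerVector(dims) * 1_000_000) / (1024 ** 3);

console.log(bytesPerVector(768));            // 6144 bytes per 768-dim vector
console.log(gibPerMillion(768).toFixed(2));  // ~5.72 GiB for a million memories
console.log(gibPerMillion(1536).toFixed(2)); // doubling dimensions doubles the bill
```

At a million stored memories, the difference between 768 and 1536 dimensions is several gigabytes of storage, plus proportionally more work per similarity comparison.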
Building the AI’s Library: Firestore with Vector Indexing
With our embedding service sorted, we need a place to store these vectors and search them efficiently. For the LTSM, I’ve chosen Google Cloud Firestore, augmented with its native vector indexing capabilities.
I’ve pointed this out in previous articles, but you can pick any other database that supports vector indices.
Firestore provides a scalable, flexible NoSQL document database. Each “document” represents a piece of knowledge (a summary, a critical fact). Firestore lets us store our generated embedding vectors alongside this knowledge and build vector indexes. These indexes are specialised structures that allow super-fast “nearest neighbour” searches, finding vectors semantically closest to a query vector.
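For reference, a vector index over the embedding field can be created with the gcloud CLI. At the time of writing the command looks roughly like this (the collection name here is illustrative; check the current Firestore vector search documentation for exact flag syntax):

```shell
# Create a vector index on the 'embedding' field of our memory collection.
# Collection name and dimension are illustrative; verify the flag syntax
# against the current Firestore vector search docs before running.
gcloud firestore indexes composite create \
  --collection-group=memory_vectors \
  --query-scope=COLLECTION \
  --field-config=field-path=embedding,vector-config='{"dimension":"768","flat":"{}"}'
```

Note that the dimension baked into the index must match the outputDimensionality your EmbeddingService produces.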
Our LongTermMemoryStore handles all interaction with this intelligent archive.
The Pay-off: How getRelevantContext Works
Storing data is one thing. Retrieving it intelligently is where the system comes alive. The vectorSearchWithEmbedding method in longTermMemoryStore.js is a masterclass in practical, multi-faceted AI information retrieval.
Here’s how it breaks down:
// memory/longTermMemoryStore.js (Simplified for clarity)
async vectorSearchWithEmbedding(queryEmbedding, options = {}) {
  const {
    limit = 10,
    type = null,
    similarityThreshold = 0.3,
    // ... other filters like category, minImportance ...
  } = options;

  let query = this.db
    .collection(this.memoryVectorsCollection)
    .where('userId', '==', this.userId);
  // ... code to add type and category filters ...

  // 1. Build the core vector query using Firestore's native capabilities
  const vectorQueryOptions = {
    vectorField: 'embedding',
    queryVector: queryEmbedding,
    limit: Math.min(limit * 3, 50), // Fetch more to filter down later
    distanceMeasure: 'COSINE',
    distanceResultField: 'vector_distance', // Ask Firestore to calculate and return the distance
  };
  const vectorQuery = query.findNearest(vectorQueryOptions);
  const snapshot = await vectorQuery.get();

  const results = [];
  snapshot.forEach((doc) => {
    // 2. Natively retrieve the distance and convert to similarity
    const distance = doc.get('vector_distance') || 0;
    const similarity = 1 - distance; // The key conversion from distance to similarity

    // 3. Apply a multi-layered filtering logic
    if (similarity < similarityThreshold) {
      return; // Exclude results that aren't semantically close enough
    }
    // ... more filtering based on importance, confidence ...

    // 4. Calculate a final score and collect results
    const data = doc.data();
    results.push({
      // ... doc data ...
      similarity: similarity,
      score: similarity * (data.importance || 0.7), // Weighted scoring
    });
  });

  // 5. Sort by the combined, weighted score and return the final list
  return results.sort((a, b) => b.score - a.score).slice(0, limit);
}
Here’s what a semantic search looks like when the AI retrieves relevant context from long-term memory:
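Below is an illustrative, invented run (the memory contents, distances, and importance values are made up) showing the shape of the post-processing that turns raw Firestore distances into a ranked context list:

```javascript
// Mock results in the shape Firestore's findNearest might return them:
// nearest neighbours with a COSINE distance in 'vector_distance'.
// Imagined query: "How's that project going?"
const mockResults = [
  { text: 'Summary: user is building a home automation project', vector_distance: 0.12, importance: 0.9 },
  { text: 'Fact: user has a dog named Rex', vector_distance: 0.55, importance: 0.8 },
  { text: 'Summary: user asked about pasta recipes', vector_distance: 0.85, importance: 0.5 },
];

function rankMemories(docs, { similarityThreshold = 0.3, limit = 10 } = {}) {
  return docs
    .map((d) => {
      const similarity = 1 - d.vector_distance; // distance -> similarity
      return { ...d, similarity, score: similarity * (d.importance ?? 0.7) };
    })
    .filter((d) => d.similarity >= similarityThreshold) // threshold filter
    .sort((a, b) => b.score - a.score)                  // weighted ranking
    .slice(0, limit);
}

const ranked = rankMemories(mockResults);
console.log(ranked.map((r) => r.text));
// The project summary ranks first; the pasta memory (similarity 0.15)
// falls below the 0.3 threshold and is dropped entirely.
```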
Technical Callout: From Distance to Similarity in Production
The implementation uses Firestore’s native findNearest capability, which is highly optimised.
- distanceResultField: 'vector_distance': This is the crucial instruction. We tell Firestore not only to find the nearest vectors but also to calculate their COSINE distance and return that value in a field called vector_distance. This offloads the expensive mathematical computation to the optimised database layer.
- similarity = 1 - distance: COSINE distance ranges from 0 (identical vectors) to 2 (opposite vectors). Subtracting the distance from 1 converts this into a “similarity score,” where 1 means identical and lower values mean less similar. That’s the practical, production-grade way to handle vector search output.
Multi-Layered Filtering and Scoring
What makes this implementation stand out is that it doesn’t stop at raw similarity. It applies further layers of logic:
- Thresholds: Results below a similarityThreshold get tossed out, preventing weakly related memories from polluting the context.
- Metadata Filtering: It can filter by type (summary vs. fact) or category before the vector search, narrowing the search space.
- Weighted Scoring: The final ranking isn’t purely semantic similarity. It’s a weighted score combining similarity with the fact’s pre-assigned importance, letting the AI prioritise memories that are both relevant and known to be important. A powerful lever for improving response quality.
In a later post, we’ll discuss what category is.
AI-Driven Deduplication
The storeCriticalFact method includes an AI-driven deduplication step. Before storing a new fact, it runs a quick vector search for highly similar existing facts. If a near-duplicate turns up, it skips the store. This prevents the LTSM from getting clogged with redundant information, keeping the long-term knowledge base clean, efficient, and semantically diverse.
That’s a far cry from a simple database INSERT. It’s an intelligent, self-regulating memory curation process.
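The exact implementation lives in longTermMemoryStore.js; here's a sketch of the core gate, with the helper names (store.add, storeFactIfNovel) and the 0.9 threshold being my assumptions for illustration rather than the article's actual code:

```javascript
// Sketch of an AI-driven dedup gate: before storing a new fact, run a
// high-threshold vector search for near-identical existing memories and
// skip the write if one turns up.
function isNearDuplicate(searchResults, threshold = 0.9) {
  return searchResults.some((r) => r.similarity >= threshold);
}

async function storeFactIfNovel(store, fact, embedding) {
  // Reuse the same vector search used for retrieval, with a high bar.
  const matches = await store.vectorSearchWithEmbedding(embedding, {
    limit: 3,
    similarityThreshold: 0.9, // assumed "near-duplicate" threshold
  });
  if (isNearDuplicate(matches)) {
    return { stored: false, reason: 'near-duplicate' };
  }
  await store.add(fact, embedding); // hypothetical write helper
  return { stored: true };
}
```

The interesting design choice is that deduplication is semantic, not textual: “the user owns a dog” and “the user has a pet dog” collide even though the strings differ.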
What’s Next? Bringing It All Together
With a clear understanding of our LTSM, we can appreciate the full toolkit. We have a fast STCM and a deep, efficient, self-cleaning LTSM. The question now: how do we bring them together?
In Part 4, we’ll look at the central nervous system of our Context Engineering strategy, the SessionManager. We’ll explore how this component dynamically engineers the perfect prompt for the LLM at every single turn, blending the “now” with the “then” to achieve coherent, deeply personalised AI interactions. That’s where things click into place.