Filesystem as Context: Building an AI Detective with bash-tool
If you’ve spent any time building AI agents, you’ve hit the same wall: context windows fill up fast. The instinct is to cram everything into the prompt and hope the model sorts it out. But tokens are finite, attention degrades with length, and you’re paying for every character. It doesn’t scale.
There’s a better pattern. Instead of bringing the data to the model, hand the model a filesystem and let it go find what it needs.
Last Christmas I spent a ridiculous amount of time playing Cluedo, which planted the seed for this project. I’ll walk you through building a murder mystery detective agent that uses Vercel’s bash-tool to investigate case files, interrogate evidence, and crack a crime. The other spark came from Vercel’s own Call Summary Agent template. I took the same architectural pattern and pointed it at something more dramatic (and more fun) than sales calls.
If you’d like to see the full source code, check out the repo.
The Problem with Prompt Stuffing
Say you’ve got 17 markdown files across multiple directories: suspect profiles, forensic reports, witness statements, location descriptions, a timeline. You could concatenate them all and drop them into a single prompt. For 17 files, you’d probably get away with it.
But the technique falls apart fast. What happens at 170 files? Or 1,700? What happens when the files aren’t all equally relevant and you’re burning tokens on a garden shed description when the answer was buried in the forensic report?
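Back-of-envelope arithmetic makes the scaling problem concrete. The 500-tokens-per-file figure below is an assumption for illustration, not a measurement:

```typescript
// Rough token cost of prompt stuffing. tokensPerFile is an invented
// but plausible average for a short markdown case file.
const tokensPerFile = 500;

for (const n of [17, 170, 1700]) {
  console.log(`${n} files ≈ ${n * tokensPerFile} tokens`);
}
// At 1,700 files you're approaching a million tokens before the
// agent has read a single word.
```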
The filesystem-based approach sidesteps this entirely. You give the agent:
- A set of files mounted into a virtual filesystem
- Tools to explore that filesystem (bash, readFile, writeFile)
- A task and the autonomy to investigate
The agent decides what to read, when to read it, and how to cross-reference findings. It pulls small, targeted slices of context rather than swallowing everything upfront.
The Stack
The project runs on a lean set of dependencies:
- AI SDK (ai): the agentic loop, tool calling, and structured output
- bash-tool: filesystem access via bash commands, readFile, and writeFile
- @ai-sdk/google: Gemini as the underlying model
- Zod: schema validation for the agent’s structured verdict
No framework, no orchestration layer, no database. The entire agent is under 50 lines of code.
Setting Up the Filesystem
First step: load the case files into memory. These are plain markdown files organised in a directory structure that mirrors a real case file:
case-files/
├── case-summary.md
├── timeline.md
├── victim/
│   └── profile.md
├── suspects/
│   ├── alice-chen.md
│   ├── bob-martinez.md
│   ├── carol-thompson.md
│   └── david-kim.md
├── evidence/
│   ├── forensics.md
│   ├── weapon.md
│   └── physical/
│       ├── torn-letter.md
│       ├── muddy-shoes.md
│       └── wine-glass.md
├── witnesses/
│   ├── neighbor.md
│   ├── housekeeper.md
│   └── business-partner.md
└── locations/
    ├── crime-scene.md
    └── garden-shed.md
At startup, we walk this directory and load every .md file into a Record<string, string> where the key is the virtual path:
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

async function loadCaseFiles(dir: string): Promise<Record<string, string>> {
  const files: Record<string, string> = {};

  // Recursively walk the directory, collecting every .md file.
  async function walk(currentDir: string, basePath: string) {
    const entries = await readdir(currentDir, { withFileTypes: true });
    for (const entry of entries) {
      const fullPath = join(currentDir, entry.name);
      const relativePath = join(basePath, entry.name);
      if (entry.isDirectory()) {
        await walk(fullPath, relativePath);
      } else if (entry.name.endsWith(".md")) {
        const content = await readFile(fullPath, "utf-8");
        files[`/case-files/${relativePath}`] = content;
      }
    }
  }

  await walk(dir, "");
  return files;
}
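The result is a flat map from virtual paths to raw markdown. A hypothetical slice of it (the file contents here are invented for illustration) looks like this:

```typescript
// Hypothetical slice of the loaded map — contents invented for illustration.
const files: Record<string, string> = {
  "/case-files/case-summary.md": "# Case Summary\n\nVictim found in the study...",
  "/case-files/suspects/alice-chen.md": "# Alice Chen\n\nBusiness associate of the victim...",
  "/case-files/evidence/forensics.md": "# Forensics Report\n\nTime of death estimated...",
};

// Keys are virtual absolute paths; values are raw markdown strings.
console.log(Object.keys(files));
```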
These files then get mounted into bash-tool’s virtual filesystem. The agent never touches the real filesystem. Everything runs in an in-memory sandbox.
Creating the Agent
The agent setup is surprisingly compact. We pass the loaded files to bash-tool, which hands back a set of tools the AI can call:
import { createBashTool } from 'bash-tool';
import { google } from '@ai-sdk/google';
import { generateText, stepCountIs, Output } from 'ai';

export async function createDetectiveAgent(files: Record<string, string>) {
  const { tools } = await createBashTool({
    files,
    destination: '/',
  });

  return {
    tools,
    model: google('gemini-3-flash-preview'),
  };
}
Note the model choice: gemini-3-flash-preview, a fast Gemini model that supports both tool calling and structured output, which this agent relies on in a single run.
The createBashTool call does the heavy lifting. It takes a flat map of file paths to contents and mounts them into a virtual filesystem. The returned tools object exposes bash, readFile, and writeFile, all standard AI SDK tools that the model can invoke during its agentic loop.
The Investigation Loop
Here’s where things get interesting. The investigate function kicks off the agentic loop using the AI SDK’s generateText with tool calling:
export async function investigate(files: Record<string, string>): Promise<Verdict> {
  const { tools, model } = await createDetectiveAgent(files);

  const { output } = await generateText({
    model,
    tools,
    output: Output.object({ schema: verdictSchema }),
    stopWhen: stepCountIs(50),
    system: systemPrompt,
    prompt: taskPrompt,
  });

  return output as Verdict;
}
A few design decisions worth noting.
Output.object with a Zod schema acts as the termination signal. The agent loops through tool calls (reading files, running bash commands) until it’s ready to produce a structured verdict. The schema is a contract: the loop won’t end until the model produces valid JSON matching verdictSchema. Far more reliable than hoping the model says “I’m done” in plain text.
stepCountIs(50) is the safety net. If the agent spirals or gets stuck in an unproductive loop, it stops after 50 steps. Think of it as a budget. In practice, the agent typically cracks the case in 15 to 25 steps.
No explicit iteration logic. The AI SDK handles the loop internally. The model calls a tool, gets a result, decides what to do next, calls another tool, and so on. We don’t write while loops or manage state. The loop emerges from the model’s reasoning.
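To make that concrete, here is roughly the loop the SDK runs for you, sketched with stub functions. modelStep and runTool are stand-ins for the real model and tools, not part of any API:

```typescript
// Hand-rolled version of the loop the AI SDK hides. modelStep and runTool
// are stubs invented for illustration.
type ToolCall = { name: string; args: string };

function modelStep(history: string[]): ToolCall | null {
  // A real model decides from its reasoning; this stub "investigates" twice.
  if (history.length === 0) return { name: "bash", args: "ls /case-files" };
  if (history.length === 1) return { name: "readFile", args: "/case-files/case-summary.md" };
  return null; // ready to emit the structured verdict
}

function runTool(call: ToolCall): string {
  return `stub result of ${call.name} ${call.args}`;
}

const history: string[] = [];
let steps = 0;
while (steps < 50) { // the stepCountIs(50) budget
  const call = modelStep(history);
  if (call === null) break; // model stops calling tools and produces output
  history.push(runTool(call));
  steps++;
}
```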
Guiding the Detective
The system prompt establishes methodology without micromanaging execution:
export const systemPrompt = `You are Detective Monsieur Grey Cells, a brilliant criminal investigator
with decades of experience solving complex murder cases.
Your investigation methodology:
1. First, get an overview of the case by reading the case summary
2. Study the victim's profile to understand who they were and potential motives
3. Review the timeline to understand the sequence of events
4. Examine each suspect's profile, noting motives, alibis, and inconsistencies
5. Analyze all physical evidence and forensic reports
6. Interview witnesses through their statements
7. Visit locations to understand the crime scene and surroundings
8. Cross-reference evidence with alibis to find contradictions
9. Build a chain of evidence that points to the true killer
Use bash commands to explore the case files. The case files are located in /case-files/.`;
This is the filesystem-as-context pattern in action. We don’t paste in the case files. We tell the agent where they are and how to approach them. The agent then uses ls, cat, find, and other bash commands to navigate the filesystem at its own pace.
The task prompt reinforces this by laying out what’s available without revealing contents:
export const taskPrompt = `Investigate the murder case in the case files directory.
Use bash commands like ls, cat, and find to explore the files and uncover the truth.
Start by listing the contents of /case-files/ to see what's available, then systematically
investigate:
- Read the case summary
- Study the victim
- Examine the timeline
- Review each suspect
- Analyze all evidence
- Check witness statements
- Explore locations`;
Structured Output as the Verdict
The agent doesn’t just output free text. It produces a structured verdict defined by a Zod schema:
export const verdictSchema = z.object({
  verdict: z.object({
    murderer: z.string(),
    confidence: z.number().min(0).max(100),
    motive: z.string(),
  }),
  evidenceChain: z.array(
    z.object({
      item: z.string(),
      implicates: z.string(),
      significance: z.string(),
    })
  ),
  suspectRankings: z.array(
    z.object({
      name: z.string(),
      suspicionScore: z.number().min(0).max(100),
      alibiStatus: z.enum(["verified", "unverified", "broken"]),
      motive: z.string().nullable(),
    })
  ),
  keyDeductions: z.array(z.string()),
});
This forces the agent to commit. It must name a murderer, assign a confidence score, rank every suspect, and lay out its evidence chain. No hedging, no “it could be anyone.” The schema is the accountability mechanism.
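Written out as plain TypeScript (the type that z.infer<typeof verdictSchema> would derive for you), the verdict shape looks like this. The sample verdict is invented to show what the agent must commit to, not a real case output:

```typescript
// The shape verdictSchema enforces, as a plain interface. The sample below
// is invented for illustration.
interface Verdict {
  verdict: { murderer: string; confidence: number; motive: string };
  evidenceChain: { item: string; implicates: string; significance: string }[];
  suspectRankings: {
    name: string;
    suspicionScore: number;
    alibiStatus: "verified" | "unverified" | "broken";
    motive: string | null;
  }[];
  keyDeductions: string[];
}

const sample: Verdict = {
  verdict: { murderer: "Alice Chen", confidence: 92, motive: "Business dispute" },
  evidenceChain: [
    { item: "wine glass", implicates: "Alice Chen", significance: "places her at the scene" },
  ],
  suspectRankings: [
    { name: "Alice Chen", suspicionScore: 92, alibiStatus: "broken", motive: "Business dispute" },
    { name: "Bob Martinez", suspicionScore: 20, alibiStatus: "verified", motive: null },
  ],
  keyDeductions: ["The broken alibi contradicts the timeline."],
};
```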
Watching the Agent Think
One of the most satisfying parts of this project is watching the investigation unfold in real time. The onStepFinish callback logs every tool call as the run progresses.
You can see the agent following its methodology: overview first, then victim, then suspects, then cross-referencing physical evidence with targeted grep commands. It’s not reading files randomly. It’s investigating.
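One way to get that visibility is a small logger passed as onStepFinish. The step shape here (an object carrying a toolCalls array with toolName fields) is an assumption about the AI SDK's callback payload, so check it against your SDK version:

```typescript
// Minimal tool-call logger. StepLike is our assumption about the shape
// onStepFinish receives — verify against your AI SDK version.
type StepLike = { toolCalls?: { toolName: string }[] };

const lines: string[] = [];
function logStep(step: StepLike, log: (s: string) => void = (s) => lines.push(s)) {
  for (const call of step.toolCalls ?? []) {
    log(`→ tool call: ${call.toolName}`);
  }
}

// e.g. generateText({ ..., onStepFinish: (step) => logStep(step, console.log) })
logStep({ toolCalls: [{ toolName: "bash" }, { toolName: "readFile" }] });
```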
The structured verdict it produces comes with a full evidence chain and suspect rankings.
Why This Pattern Matters
The murder mystery is fun, but the underlying pattern (filesystem-as-context) applies broadly:
- Sales call analysis: load transcripts as files, let the agent grep and cross-reference across calls.
- Codebase exploration: mount a repository, let the agent navigate with find and cat to answer architectural questions.
- Legal document review: case files, contracts, exhibits. The agent reads what’s relevant rather than ingesting everything.
- Customer support: mount conversation histories, knowledge base articles, product docs. The agent pulls context as needed.
The key insight is that agents are better at retrieving their own context than we are at pre-selecting it for them. When you stuff everything into a prompt, you’re making the retrieval decision. When you give the agent a filesystem and tools, the agent makes the retrieval decision, and it can adapt based on what it finds.
This is also fundamentally different from RAG. With RAG, you pre-compute embeddings, run a similarity search, and inject the top-k results. The agent has no say in what gets retrieved. With filesystem-based retrieval, the agent formulates its own queries and follows threads dynamically.
Wrapping Up
The entire project (agent, prompts, schema, runner) is under 100 lines of TypeScript. The case files are just markdown. There’s no database, no embedding pipeline, no retrieval infrastructure. And yet the agent reliably investigates, cross-references, and solves the case.
If you’re building agents that need to reason over documents, consider reaching for a filesystem before reaching for a vector database. Sometimes the simplest retrieval mechanism is cat.