Filesystem as Context: Building an AI Detective with bash-tool

If you’ve spent any time building AI agents, you’ve likely hit the same wall: context windows fill up fast. The instinct is to stuff everything into the prompt - paste in all the documents, all the data, all the context - and hope the model figures it out. But this approach doesn’t scale. Tokens are finite, attention degrades with length, and you’re paying for every character.

There’s a better pattern emerging, and it flips the script entirely: instead of bringing the data to the model, give the model a filesystem and let it go find what it needs - autonomously.

This past Christmas, I spent quite some time playing Cluedo, which served as the inspiration for this post. I’ll walk you through building a murder mystery detective agent that uses Vercel’s bash-tool to investigate case files, interrogate evidence, and ultimately solve a crime. The other piece of inspiration came from Vercel’s own Call Summary Agent template - I took the same architectural pattern and applied it to something a bit more dramatic (and fun) than sales calls.

If you would like to see the full source code, check out the repo.

The Problem with Prompt Stuffing

Let’s say you have 17 markdown files across multiple directories - suspect profiles, forensic reports, witness statements, location descriptions, a timeline. You could concatenate them all and drop them into a single prompt. For 17 files, you’d probably get away with it.

But the technique doesn’t hold up. What happens when it’s 170 files? Or 1,700? What happens when the files aren’t all equally relevant, and you’re burning tokens on a garden shed description when the answer was in the forensic report?

The filesystem-based approach solves this elegantly. You give the agent:

  1. A set of files mounted into a virtual filesystem
  2. Tools to explore that filesystem (bash, readFile, writeFile)
  3. A task and the autonomy to investigate

The agent decides what to read, when to read it, and how to cross-reference findings. It retrieves small, targeted slices of context rather than consuming everything upfront.

The Stack

The project uses a lean set of dependencies:

  • AI SDK (ai): the agentic loop, tool calling, and structured output
  • bash-tool: filesystem access via bash commands, readFile, and writeFile
  • @ai-sdk/google: Gemini as the underlying model
  • Zod: schema validation for the agent’s structured verdict

That’s it. No framework, no orchestration layer, no database. The entire agent is under 50 lines of code.

Setting Up the Filesystem

The first step is loading the case files into memory. These are plain markdown files organised in a directory structure that mirrors a real case file:

case-files/
├── case-summary.md
├── timeline.md
├── victim/
│   └── profile.md
├── suspects/
│   ├── alice-chen.md
│   ├── bob-martinez.md
│   ├── carol-thompson.md
│   └── david-kim.md
├── evidence/
│   ├── forensics.md
│   ├── weapon.md
│   └── physical/
│       ├── torn-letter.md
│       ├── muddy-shoes.md
│       └── wine-glass.md
├── witnesses/
│   ├── neighbor.md
│   ├── housekeeper.md
│   └── business-partner.md
└── locations/
    ├── crime-scene.md
    └── garden-shed.md

At startup, we walk this directory and load every .md file into a Record<string, string> where the key is the virtual path:

async function loadCaseFiles(dir: string): Promise<Record<string, string>> {
  const files: Record<string, string> = {};

  async function walk(currentDir: string, basePath: string) {
    const entries = await readdir(currentDir, { withFileTypes: true });

    for (const entry of entries) {
      const fullPath = join(currentDir, entry.name);
      const relativePath = join(basePath, entry.name);

      if (entry.isDirectory()) {
        await walk(fullPath, relativePath);
      } else if (entry.name.endsWith(".md")) {
        const content = await readFile(fullPath, "utf-8");
        files[`/case-files/${relativePath}`] = content;
      }
    }
  }

  await walk(dir, "");
  return files;
}

These files then get mounted into bash-tool’s virtual filesystem. The agent never touches the real filesystem - everything runs in an in-memory sandbox.
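The resulting map is nothing more than virtual paths to contents. For a two-file case it would look something like this (file contents abbreviated to placeholders for illustration):

```typescript
// Shape of the map returned by loadCaseFiles: virtual path -> file contents.
// The contents here are abbreviated placeholders, not the real case files.
const files: Record<string, string> = {
  "/case-files/case-summary.md": "# Case Summary\n...",
  "/case-files/suspects/carol-thompson.md": "# Carol Thompson\n...",
};

// Every key carries the /case-files/ prefix the agent sees inside its sandbox.
const allPrefixed = Object.keys(files).every((p) => p.startsWith("/case-files/"));
console.log(allPrefixed); // true
```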

Creating the Agent

The agent setup is remarkably straightforward. We pass the loaded files to bash-tool, which gives us back a set of tools the AI can call:


import { createBashTool } from 'bash-tool';
import { google } from '@ai-sdk/google';
import { generateText, stepCountIs, Output } from 'ai';

export async function createDetectiveAgent(files: Record<string, string>) {
  const { tools } = await createBashTool({
    files,
    destination: '/',
  });

  return {
    tools,
    model: google('gemini-3-flash-preview'),
  };
}

Note the use of gemini-3-flash-preview: it’s the first Gemini model that supports both tool calling and structured output.

The createBashTool call is doing the heavy lifting here. It takes a flat map of file paths to contents and mounts them into a virtual filesystem. The returned tools object exposes bash, readFile, and writeFile - standard AI SDK tools that the model can invoke during its agentic loop.

The Investigation Loop

This is where things get interesting. The investigate function kicks off the agentic loop using the AI SDK’s generateText with tool calling:

export async function investigate(files: Record<string, string>): Promise<Verdict> {
  const { tools, model } = await createDetectiveAgent(files);

  const { output } = await generateText({
    model,
    tools,
    output: Output.object({ schema: verdictSchema }),
    stopWhen: stepCountIs(50),
    system: systemPrompt,
    prompt: taskPrompt,
  });

  return output as Verdict;
}

There are a few key design decisions here worth unpacking.

Output.object with a Zod schema - this is the termination signal. The agent loops through tool calls (reading files, running bash commands) until it’s ready to produce a structured verdict. The schema acts as a contract: the loop doesn’t end until the model produces valid JSON matching verdictSchema. This is far more reliable than hoping the model says “I’m done” in natural language.

stepCountIs(50) - the safety net. If the agent spirals or gets stuck in an unproductive loop, it stops after 50 steps. Think of it as a budget. In practice, the agent typically solves the case in 15–25 steps.

No explicit iteration logic - the AI SDK handles the loop internally. The model calls a tool, gets a result, decides what to do next, calls another tool, and so on. We don’t write while loops or manage state. The loop emerges from the model’s reasoning.
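Conceptually, the loop the SDK runs on our behalf looks something like this simplified sketch. The `Step` type and `runModel` callback are hypothetical stand-ins, not the AI SDK’s actual internals:

```typescript
// A simplified model of the agentic loop that generateText runs internally.
// Step and runModel are illustrative stand-ins, not AI SDK types.
type Step =
  | { type: "tool-call"; tool: string; args: unknown }
  | { type: "final-output"; output: unknown };

function runAgentLoop(
  runModel: (history: Step[]) => Step,
  maxSteps: number
): { steps: Step[]; output: unknown } {
  const steps: Step[] = [];

  for (let i = 0; i < maxSteps; i++) {
    const step = runModel(steps); // the model sees prior steps and picks the next action
    steps.push(step);

    // Valid structured output is the termination signal (Output.object's role).
    if (step.type === "final-output") {
      return { steps, output: step.output };
    }
    // Otherwise the tool result feeds back in and the loop continues.
  }

  // stepCountIs(50)'s role: the budget ran out before a verdict was produced.
  return { steps, output: null };
}

// A toy "model" that reads two files, then commits to a verdict.
const script: Step[] = [
  { type: "tool-call", tool: "bash", args: "ls -R /case-files/" },
  { type: "tool-call", tool: "readFile", args: "/case-files/case-summary.md" },
  { type: "final-output", output: { murderer: "?" } },
];
const result = runAgentLoop((history) => script[history.length], 50);
console.log(result.steps.length); // 3
```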

Guiding the Detective

The system prompt establishes methodology without micromanaging execution:

export const systemPrompt = `You are Detective Monsieur Grey Cells, a brilliant criminal investigator
with decades of experience solving complex murder cases.

Your investigation methodology:
1. First, get an overview of the case by reading the case summary
2. Study the victim's profile to understand who they were and potential motives
3. Review the timeline to understand the sequence of events
4. Examine each suspect's profile, noting motives, alibis, and inconsistencies
5. Analyze all physical evidence and forensic reports
6. Interview witnesses through their statements
7. Visit locations to understand the crime scene and surroundings
8. Cross-reference evidence with alibis to find contradictions
9. Build a chain of evidence that points to the true killer

Use bash commands to explore the case files. The case files are located in /case-files/.`;

This is the filesystem-as-context pattern in action. We don’t paste in the case files. We tell the agent where they are and how to approach them. The agent then uses ls, cat, find, and other bash commands to navigate the filesystem at its own pace.

The task prompt reinforces this by laying out what’s available without revealing contents:

export const taskPrompt = `Investigate the murder case in the case files directory.
Use bash commands like ls, cat, and find to explore the files and uncover the truth.

Start by listing the contents of /case-files/ to see what's available, then systematically
investigate:
- Read the case summary
- Study the victim
- Examine the timeline
- Review each suspect
- Analyze all evidence
- Check witness statements
- Explore locations`;

Structured Output as the Verdict

The agent doesn’t just output free text - it produces a structured verdict defined by a Zod schema:

export const verdictSchema = z.object({
  verdict: z.object({
    murderer: z.string(),
    confidence: z.number().min(0).max(100),
    motive: z.string(),
  }),
  evidenceChain: z.array(
    z.object({
      item: z.string(),
      implicates: z.string(),
      significance: z.string(),
    })
  ),
  suspectRankings: z.array(
    z.object({
      name: z.string(),
      suspicionScore: z.number().min(0).max(100),
      alibiStatus: z.enum(["verified", "unverified", "broken"]),
      motive: z.string().nullable(),
    })
  ),
  keyDeductions: z.array(z.string()),
});

This forces the agent to commit. It must name a murderer, assign a confidence score, rank every suspect, and lay out its evidence chain. No hedging, no “it could be anyone.” The schema is the accountability mechanism.

Watching the Agent Think

One of the most satisfying parts of this project is watching the investigation unfold in real time. The onStepFinish callback logs every tool call. Here’s a sample of the output:

Loading case files... Loaded 17 files

Detective is investigating…

[1] $ ls -R /case-files/
[2] read: /case-files/case-summary.md
[3] read: /case-files/victim/profile.md
[4] read: /case-files/timeline.md
[5] read: /case-files/suspects/alice-chen.md
[6] read: /case-files/suspects/bob-martinez.md
[7] read: /case-files/suspects/carol-thompson.md
[8] read: /case-files/suspects/david-kim.md
[9] read: /case-files/evidence/weapon.md
[10] read: /case-files/evidence/physical/wine-glass.md
[11] read: /case-files/evidence/physical/torn-letter.md
[12] read: /case-files/evidence/physical/muddy-shoes.md
[13] read: /case-files/evidence/forensics.md
[14] read: /case-files/witnesses/housekeeper.md
[15] read: /case-files/witnesses/business-partner.md
[16] read: /case-files/witnesses/neighbor.md
[17] read: /case-files/locations/crime-scene.md
[18] $ grep -r "Midnight Rose" /case-files/
[19] $ grep -ri "perfume" /case-files/
[20] read: /case-files/locations/garden-shed.md
[21] $ grep -ri "Midnight Rose" /case-files/
[22] $ grep -ri "Rose" /case-files/
     grep -ri "Midnight" /case-files/
[23] $ grep -ri "scent" /case-files/
     grep -ri "smell" /case-files/
     grep -ri "fragrance" /case-files/

DETECTIVE’S FINAL VERDICT

MURDERER: Carol Thompson
CONFIDENCE: 98%
MOTIVE: To prevent Richard Thornwood from testifying in a fraud lawsuit that would have resulted in her bankruptcy and the loss of her reputation. She also sought to retrieve or destroy incriminating evidence he held against her.

EVIDENCE CHAIN:

  1. Navy blue cardigan - Implicates: Carol Thompson - Significance: Found hidden in the garden shed with a tear matching a navy blue thread found on the murder weapon. Ownership was confirmed via dry-cleaning tags.

  2. Crimson Passion lipstick on wine glass - Implicates: Carol Thompson - Significance: Found on a wine glass in the library, directly contradicting Carol’s claim that she never entered the room that evening.

  3. Side garden mud - Implicates: Carol Thompson - Significance: Mud found on Carol’s shoes matches the specific soil and fertilizer composition of the side garden path leading to the library, contradicting her claim that she never went outside.

  4. Torn letter fragments - Implicates: Carol Thompson - Significance: Handwriting and fingerprints on a threatening letter found at the scene match Carol, revealing she had demanded a private meeting to ‘settle’ the trial matters.

  5. Witness sighting - Implicates: Carol Thompson - Significance: A neighbor saw a female figure at the library’s French doors at 9:10 PM, which aligns with Carol’s 20-minute absence from the parlor.

SUSPECT RANKINGS:

  1. Carol Thompson - Score: 95/100 - Alibi: broken - Motive: Facing bankruptcy and criminal exposure due to a lawsuit where the victim was the star witness.
  2. David Kim - Score: 25/100 - Alibi: unverified - Motive: Resentment over being fired; however, he was the one who originally alerted the victim to Carol’s corporate espionage.
  3. Bob Martinez - Score: 15/100 - Alibi: verified - Motive: Inheritance of $15 million, but physical evidence (mud) supports his alibi of being in the front garden, not near the crime scene.
  4. Alice Chen - Score: 10/100 - Alibi: unverified - Motive: Unpaid debts owed by the victim, though his death actually makes debt collection more difficult.

KEY DEDUCTIONS:

  1. The killer entered the library through the side garden’s French doors to avoid detection by the housekeeper and other guests.
  2. The murder was premeditated, as evidenced by the killer wearing gloves (smudged prints) and hiding the navy blue cardigan in the garden shed after the crime.
  3. The victim’s back was turned because he was distracted by the legal documents on his desk, likely believing he was in a safe negotiation with Carol.
  4. The ‘Midnight Rose’ perfume and ‘Crimson Passion’ lipstick traces are classic signatures of Carol Thompson’s high-end lifestyle, further placing her at the immediate scene of the crime.
  5. The time of death (approx. 9:15-9:20 PM) perfectly matches the window when Carol was missing from the parlor and seen by the neighbor near the library.

You can see the agent following its methodology: overview first, then victim, then suspects, then cross-referencing physical evidence. It’s not reading files randomly - it’s investigating.
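The numbered log lines above are produced by the onStepFinish callback. A minimal formatter in that spirit might look like this - note the step shape here is a deliberate simplification, not the AI SDK’s actual step type:

```typescript
// Simplified tool-call logger in the spirit of an onStepFinish callback.
// ToolCall is a sketch; the real AI SDK step object carries much more detail.
type ToolCall = { toolName: string; args: { command?: string; path?: string } };

function formatToolCall(index: number, call: ToolCall): string {
  if (call.toolName === "bash") {
    return `[${index}] $ ${call.args.command}`;
  }
  const verb = call.toolName === "readFile" ? "read" : "write";
  return `[${index}] ${verb}: ${call.args.path}`;
}

console.log(formatToolCall(1, { toolName: "bash", args: { command: "ls -R /case-files/" } }));
// [1] $ ls -R /case-files/
console.log(formatToolCall(2, { toolName: "readFile", args: { path: "/case-files/case-summary.md" } }));
// [2] read: /case-files/case-summary.md
```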

Why This Pattern Matters

The murder mystery is fun, but the underlying pattern - filesystem-as-context - applies broadly:

  • Sales call analysis - Vercel’s original use case. Load transcripts as files, let the agent grep and cross-reference across calls.
  • Codebase exploration - mount a repository, let the agent navigate with find and cat to answer architectural questions.
  • Legal document review - case files, contracts, exhibits. The agent reads what’s relevant rather than ingesting everything.
  • Customer support - mount conversation histories, knowledge base articles, product docs. The agent pulls context as needed.

The key insight is that agents are better at retrieving their own context than we are at pre-selecting it for them. When you stuff everything into a prompt, you’re making the retrieval decision. When you give the agent a filesystem and tools, the agent makes the retrieval decision - and it can adapt based on what it finds.

This is also fundamentally different from RAG. With RAG, you pre-compute embeddings, run a similarity search, and inject the top-k results. The agent has no say in what gets retrieved. With filesystem-based retrieval, the agent formulates its own queries and follows threads dynamically.

Wrapping Up

The entire project - agent, prompts, schema, runner - is under 100 lines of TypeScript. The case files are just markdown. There’s no database, no embedding pipeline, no retrieval infrastructure. And yet the agent reliably investigates, cross-references, and solves the case.

If you’re building agents that need to reason over documents, consider reaching for a filesystem before reaching for a vector database. Sometimes the simplest retrieval mechanism is cat.