# Filesystem as Context: Building an AI Detective with bash-tool

Source: https://tpiros.dev/blog/filesystem-as-context-building-an-ai-detective-with-bash-tool

If you've spent any time building AI agents, you've hit the same wall: context windows fill up fast. The instinct is to cram everything into the prompt and hope the model sorts it out. But tokens are finite, [attention degrades with length](https://research.trychroma.com/context-rot), and you're paying for every character. It doesn't scale.

There's a better pattern. Instead of bringing the data to the model, hand the model a filesystem and let it go find what it needs.

Last Christmas I spent a ridiculous amount of time playing [Cluedo](https://en.wikipedia.org/wiki/Cluedo), which planted the seed for this project. I'll walk you through building a murder mystery detective agent that uses **Vercel's [bash-tool](https://vercel.com/changelog/introducing-bash-tool-for-filesystem-based-context-retrieval)** to investigate case files, interrogate evidence, and crack a crime. The other spark came from Vercel's own [Call Summary Agent template](https://vercel.com/templates/next.js/call-summary-agent). I took the same architectural pattern and pointed it at something more dramatic (and more fun) than sales calls.

> If you'd like to see the full source code, check out the [repo](https://github.com/tpiros/murder-mystery-agent/).

## The Problem with Prompt Stuffing

Say you've got 17 markdown files across multiple directories: suspect profiles, forensic reports, witness statements, location descriptions, a timeline. You _could_ concatenate them all and drop them into a single prompt. For 17 files, you'd probably get away with it.

But the technique falls apart fast. What happens at 170 files? Or 1,700? What happens when the files aren't all equally relevant and you're burning tokens on a garden shed description when the answer was buried in the forensic report?
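To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes the common heuristic of about four characters per token and a hypothetical average file size of 3,000 characters. Neither figure comes from the project, and real tokenizers vary.

```typescript
// Rough heuristic, not a real tokenizer: ~4 characters per token (assumption).
const CHARS_PER_TOKEN = 4;

function estimateTokens(charCount: number): number {
  return Math.ceil(charCount / CHARS_PER_TOKEN);
}

// Hypothetical average size of one case file, in characters (assumption).
const AVG_FILE_CHARS = 3_000;

// Estimated cost of concatenating every file into a single prompt.
function stuffedPromptTokens(fileCount: number): number {
  return estimateTokens(fileCount * AVG_FILE_CHARS);
}

for (const count of [17, 170, 1_700]) {
  console.log(`${count} files: about ${stuffedPromptTokens(count)} tokens`);
}
```

Under these assumptions, 17 files cost about 12,750 tokens per request, 170 files about 127,500, and 1,700 files about 1.3 million, and that's before the model has done any reasoning.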

The filesystem-based approach sidesteps this entirely. You give the agent:

1. A set of files mounted into a virtual filesystem
2. Tools to explore that filesystem (`bash`, `readFile`, `writeFile`)
3. A task and the autonomy to investigate

The agent decides what to read, when to read it, and how to cross-reference findings. It pulls small, targeted slices of context rather than swallowing everything upfront.
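To make "small, targeted slices" concrete, here's a toy sketch of what a `grep` over an in-memory file map returns. It is not how `bash-tool` works internally; the `grep` helper and the file contents are invented for illustration, and only the file names mirror the case files.

```typescript
// A toy stand-in for a mounted filesystem: virtual path -> file contents.
// The contents below are invented for this sketch.
type VirtualFs = Record<string, string>;

const caseFiles: VirtualFs = {
  "/case-files/evidence/forensics.md":
    "Cause of death: blunt trauma.\nMud traces found near the study window.",
  "/case-files/locations/garden-shed.md":
    "The shed stores tools.\nThe path to the shed is often muddy.",
  "/case-files/timeline.md": "21:00 Dinner ends.\n22:30 Body discovered.",
};

interface Match {
  path: string;
  line: number;
  text: string;
}

// Minimal grep: return only the lines that mention the term, with their location.
function grep(files: VirtualFs, term: string): Match[] {
  const needle = term.toLowerCase();
  const matches: Match[] = [];
  for (const [path, content] of Object.entries(files)) {
    content.split("\n").forEach((text, index) => {
      if (text.toLowerCase().includes(needle)) {
        matches.push({ path, line: index + 1, text });
      }
    });
  }
  return matches;
}

const hits = grep(caseFiles, "mud");
for (const hit of hits) {
  console.log(`${hit.path}:${hit.line}: ${hit.text}`);
}
```

The search returns two lines out of three files. Those two lines are what would land in the model's context, rather than every file in full.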

## The Stack

The project runs on a lean set of dependencies:

- **[AI SDK](https://ai-sdk.dev/)** (`ai`): the agentic loop, tool calling, and structured output
- **[bash-tool](https://www.npmjs.com/package/bash-tool)**: filesystem access via bash commands, `readFile`, and `writeFile`
- **[@ai-sdk/google](https://www.npmjs.com/package/@ai-sdk/google)**: Gemini as the underlying model
- **[Zod](https://zod.dev/)**: schema validation for the agent's structured verdict

No framework, no orchestration layer, no database. The entire agent is under 50 lines of code.

## Setting Up the Filesystem

First step: load the case files into memory. These are plain markdown files organised in a directory structure that mirrors a real case file:

```
case-files/
├── case-summary.md
├── timeline.md
├── victim/
│   └── profile.md
├── suspects/
│   ├── alice-chen.md
│   ├── bob-martinez.md
│   ├── carol-thompson.md
│   └── david-kim.md
├── evidence/
│   ├── forensics.md
│   ├── weapon.md
│   └── physical/
│       ├── torn-letter.md
│       ├── muddy-shoes.md
│       └── wine-glass.md
├── witnesses/
│   ├── neighbor.md
│   ├── housekeeper.md
│   └── business-partner.md
└── locations/
    ├── crime-scene.md
    └── garden-shed.md
```

At startup, we walk this directory and load every `.md` file into a `Record<string, string>` where the key is the virtual path:

```typescript
import { readdir, readFile } from "node:fs/promises";
import { join, posix } from "node:path";

// Recursively load every markdown file under `dir` into a map of
// virtual path -> file contents.
async function loadCaseFiles(dir: string): Promise<Record<string, string>> {
  const files: Record<string, string> = {};

  async function walk(currentDir: string, basePath: string) {
    const entries = await readdir(currentDir, { withFileTypes: true });

    for (const entry of entries) {
      // Real path on disk (uses the OS-specific separator).
      const fullPath = join(currentDir, entry.name);
      // Virtual path always uses forward slashes, even on Windows.
      const relativePath = posix.join(basePath, entry.name);

      if (entry.isDirectory()) {
        await walk(fullPath, relativePath);
      } else if (entry.name.endsWith(".md")) {
        const content = await readFile(fullPath, "utf-8");
        files[`/case-files/${relativePath}`] = content;
      }
    }
  }

  await walk(dir, "");
  return files;
}
```

These files then get mounted into `bash-tool`'s virtual filesystem. The agent never touches the real filesystem. Everything runs in an in-memory sandbox.

## Creating the Agent

The agent setup is surprisingly compact. We pass the loaded files to `bash-tool`, which hands back a set of tools the AI can call:

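```typescript
import { google } from "@ai-sdk/google";
import { createBashTool } from "bash-tool";

async function createDetectiveAgent(files: Record<string, string>) {
  // Mount the in-memory case files into bash-tool's sandboxed filesystem.
  const { tools } = await createBashTool({
    files,
    destination: '/',
  });

  return {
    tools,
    model: google('gemini-3-flash-preview'),
  };
}
```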

> Note the use of `gemini-3-flash-preview`. Gemini 3 models are the first Gemini models that [support using tool calling **and** structured output together](https://ai.google.dev/gemini-api/docs/structured-output?lang=javascript&example=recipe#structured_outputs_with_tools) in the same request.

The `createBashTool` call does the heavy lifting. It takes a flat map of file paths to contents and mounts them into a virtual filesystem. The returned `tools` object exposes `bash`, `readFile`, and `writeFile`, all standard AI SDK tools that the model can invoke during its agentic loop.
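A flat map has no directories of its own; they are implied by the path prefixes. The sketch below shows that idea with a hypothetical `listDirectory` helper. It illustrates the data structure only and makes no claim about how `bash-tool` implements its filesystem.

```typescript
// Derive the immediate children of a directory from a flat path -> content map.
function listDirectory(files: Record<string, string>, dir: string): string[] {
  const prefix = dir.endsWith("/") ? dir : `${dir}/`;
  const children = new Set<string>();

  for (const path of Object.keys(files)) {
    if (!path.startsWith(prefix)) continue;
    const rest = path.slice(prefix.length);
    const slash = rest.indexOf("/");
    // A remaining slash means the entry is a subdirectory; mark it like `ls -p` does.
    children.add(slash === -1 ? rest : `${rest.slice(0, slash)}/`);
  }

  return Array.from(children).sort();
}

// Keys mirror the case file layout; contents are irrelevant for listing.
const mounted: Record<string, string> = {
  "/case-files/case-summary.md": "",
  "/case-files/suspects/alice-chen.md": "",
  "/case-files/suspects/bob-martinez.md": "",
  "/case-files/evidence/physical/torn-letter.md": "",
};

const topLevel = listDirectory(mounted, "/case-files");
console.log(topLevel.join("\n"));
```

Listing `/case-files` yields `case-summary.md`, `evidence/`, and `suspects/`, which is the kind of answer the agent gets from its first `ls`.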

## The Investigation Loop

Here's where things get interesting. The `investigate` function kicks off the agentic loop using the AI SDK's `generateText` with tool calling:

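```typescript
import { generateText, Output, stepCountIs } from "ai";

async function investigate(files: Record<string, string>): Promise<Verdict> {
  const { tools, model } = await createDetectiveAgent(files);

  const { output } = await generateText({
    model,
    tools,
    // The run ends once the model emits an object matching the schema...
    output: Output.object({ schema: verdictSchema }),
    // ...or after 50 steps, whichever comes first.
    stopWhen: stepCountIs(50),
    system: systemPrompt,
    prompt: taskPrompt,
  });

  return output as Verdict;
}
```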

A few design decisions worth noting.

**`Output.object` with a Zod schema** acts as the termination signal. The agent loops through tool calls (reading files, running bash commands) until it's ready to produce a structured verdict. The schema is a contract: short of hitting the step limit, the run ends only when the model produces valid JSON matching `verdictSchema`. That's far more reliable than hoping the model says "I'm done" in plain text.

**`stepCountIs(50)`** is the safety net. If the agent spirals or gets stuck in an unproductive loop, it stops after 50 steps. Think of it as a budget. In practice, the agent typically cracks the case in 15 to 25 steps.

**No explicit iteration logic.** The AI SDK handles the loop internally. The model calls a tool, gets a result, decides what to do next, calls another tool, and so on. We don't write `while` loops or manage state. The loop emerges from the model's reasoning.
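If you're curious what such a loop boils down to, here's a conceptual sketch with a scripted stand-in for the model. Every name in it (`runLoop`, `scriptedModel`, `ModelTurn`) is hypothetical, and the AI SDK's real implementation is considerably more involved. It only shows the shape: call the model, run the requested tool, feed the result back, and stop on a final answer or when the step budget runs out.

```typescript
// Conceptual sketch only; not the AI SDK's implementation.
type ModelTurn =
  | { type: "tool-call"; toolName: string; input: string }
  | { type: "final"; answer: string };

type Tool = (input: string) => string;
type Model = (history: string[]) => ModelTurn;

// A fake "model" that replays a fixed script and repeats its last turn forever.
function scriptedModel(script: ModelTurn[]): Model {
  let turn = 0;
  return () => script[Math.min(turn++, script.length - 1)];
}

function runLoop(
  model: Model,
  tools: Record<string, Tool>,
  maxSteps: number,
): { answer: string | null; steps: number } {
  const history: string[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const turn = model(history);
    if (turn.type === "final") {
      return { answer: turn.answer, steps: step };
    }
    // Run the requested tool and feed its result back as context for the next step.
    const tool = tools[turn.toolName];
    const result = tool ? tool(turn.input) : `unknown tool: ${turn.toolName}`;
    history.push(`${turn.toolName}(${turn.input}) -> ${result}`);
  }
  // Step budget exhausted without a final answer.
  return { answer: null, steps: maxSteps };
}

const demoTools: Record<string, Tool> = {
  bash: (cmd) => (cmd === "ls /case-files" ? "case-summary.md timeline.md" : ""),
};

// One tool call, then a final answer: finishes in two steps.
const solved = runLoop(
  scriptedModel([
    { type: "tool-call", toolName: "bash", input: "ls /case-files" },
    { type: "final", answer: "verdict ready" },
  ]),
  demoTools,
  50,
);

// A model that never stops calling tools is cut off by the step budget.
const stuck = runLoop(
  scriptedModel([{ type: "tool-call", toolName: "bash", input: "ls /case-files" }]),
  demoTools,
  3,
);

console.log(solved, stuck);
```

The second run is the case `stepCountIs(50)` guards against: without a budget, a model that keeps requesting tools would never return.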

## Guiding the Detective

The system prompt establishes methodology without micromanaging execution:

```typescript
// The opening words up to "detective" are assumed; the rest of the prompt is unchanged.
const systemPrompt = `You are a detective
with decades of experience solving complex murder cases.

Your investigation methodology:
1. First, get an overview of the case by reading the case summary
2. Study the victim's profile to understand who they were and potential motives
3. Review the timeline to understand the sequence of events
4. Examine each suspect's profile, noting motives, alibis, and inconsistencies
5. Analyze all physical evidence and forensic reports
6. Interview witnesses through their statements
7. Visit locations to understand the crime scene and surroundings
8. Cross-reference evidence with alibis to find contradictions
9. Build a chain of evidence that points to the true killer

Use bash commands to explore the case files. The case files are located in /case-files/.`;
```

This is the filesystem-as-context pattern in action. We don't paste in the case files. We tell the agent _where they are_ and _how to approach them_. The agent then uses `ls`, `cat`, `find`, and other bash commands to navigate the filesystem at its own pace.

The task prompt reinforces this by laying out what's available without revealing contents:

```typescript
// Excerpt of the task prompt; its opening lines are not shown.
const taskPrompt = `Use bash commands like ls, cat, and find to explore the files and uncover the truth.

Start by listing the contents of /case-files/ to see what's available, then systematically
investigate:
- Read the case summary
- Study the victim
- Examine the timeline
- Review each suspect
- Analyze all evidence
- Check witness statements
- Explore locations`;
```

## Structured Output as the Verdict

The agent doesn't just output free text. It produces a structured verdict defined by a Zod schema:

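```typescript
import { z } from "zod";

const verdictSchema = z.object({
  // The final accusation.
  verdict: z.object({
    murderer: z.string(),
    confidence: z.number().min(0).max(100),
    motive: z.string(),
  }),
  // Each piece of evidence and who it points to.
  evidenceChain: z.array(
    z.object({
      item: z.string(),
      implicates: z.string(),
      significance: z.string(),
    })
  ),
  // Every suspect, scored and with an alibi assessment.
  suspectRankings: z.array(
    z.object({
      name: z.string(),
      suspicionScore: z.number().min(0).max(100),
      alibiStatus: z.enum(["verified", "unverified", "broken"]),
      motive: z.string().nullable(),
    })
  ),
  keyDeductions: z.array(z.string()),
});

// Static type derived from the schema (standard Zod idiom).
type Verdict = z.infer<typeof verdictSchema>;
```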

This forces the agent to commit. It must name a murderer, assign a confidence score, rank every suspect, and lay out its evidence chain. No hedging, no "it could be anyone." The schema is the accountability mechanism.
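A schema checks shape and ranges, not whether the verdict agrees with itself. As an optional extra, here's a sketch of a hypothetical `findInconsistencies` helper that is not part of the project. It uses a plain TypeScript interface that mirrors the schema so the example stays dependency-free, and the sample data uses placeholder names.

```typescript
// Plain TypeScript mirror of the verdict shape, kept dependency-free for this sketch.
interface VerdictShape {
  verdict: { murderer: string; confidence: number; motive: string };
  evidenceChain: { item: string; implicates: string; significance: string }[];
  suspectRankings: {
    name: string;
    suspicionScore: number;
    alibiStatus: "verified" | "unverified" | "broken";
    motive: string | null;
  }[];
  keyDeductions: string[];
}

// Hypothetical helper: flag verdicts that are schema-valid but internally inconsistent.
function findInconsistencies(v: VerdictShape): string[] {
  const problems: string[] = [];
  const accused = v.suspectRankings.find((s) => s.name === v.verdict.murderer);

  if (!accused) {
    problems.push("named murderer is missing from suspectRankings");
  } else {
    const topScore = Math.max(...v.suspectRankings.map((s) => s.suspicionScore));
    if (accused.suspicionScore < topScore) {
      problems.push("named murderer is not the top-ranked suspect");
    }
    if (accused.alibiStatus === "verified") {
      problems.push("named murderer has a verified alibi");
    }
  }

  if (!v.evidenceChain.some((e) => e.implicates === v.verdict.murderer)) {
    problems.push("no evidence item implicates the named murderer");
  }
  return problems;
}

// Placeholder data, invented for this sketch.
const consistent: VerdictShape = {
  verdict: { murderer: "Suspect A", confidence: 90, motive: "inheritance" },
  evidenceChain: [{ item: "torn letter", implicates: "Suspect A", significance: "shows motive" }],
  suspectRankings: [
    { name: "Suspect A", suspicionScore: 95, alibiStatus: "broken", motive: "inheritance" },
    { name: "Suspect B", suspicionScore: 40, alibiStatus: "verified", motive: null },
  ],
  keyDeductions: ["The alibi does not survive the timeline."],
};

// Same verdict, but accusing the lower-ranked suspect with the verified alibi.
const contradictory: VerdictShape = {
  ...consistent,
  verdict: { ...consistent.verdict, murderer: "Suspect B" },
};

console.log(findInconsistencies(consistent));
console.log(findInconsistencies(contradictory));
```

The first verdict passes cleanly. The second is still valid against the schema, yet it trips all three checks that apply to a ranked suspect.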

## Watching the Agent Think

One of the most satisfying parts of this project is watching the investigation unfold in real time. The `onStepFinish` callback logs every tool call as the agent makes it.

In a typical run, you can see the agent following its methodology: overview first, then victim, then suspects, then cross-referencing physical evidence with targeted `grep` commands. It's not reading files randomly. It's investigating.

The run ends with the structured verdict, complete with a full evidence chain and suspect rankings.
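The step logging could be wired up roughly like this. This is a sketch rather than the project's actual logger: `StepLike` is a minimal stand-in type, and it assumes each step result exposes `toolCalls` entries with a `toolName` and an `input`, so check the field names against the AI SDK version you have installed.

```typescript
// Minimal structural type covering only the fields this sketch reads.
// Assumption: step results expose `toolCalls` entries with `toolName` and `input`.
interface StepLike {
  toolCalls: { toolName: string; input: unknown }[];
}

const MAX_INPUT_CHARS = 80;

// Turn one finished step into log lines, one per tool call.
function formatStep(stepNumber: number, step: StepLike): string[] {
  return step.toolCalls.map((call) => {
    const raw = JSON.stringify(call.input) ?? "";
    // Keep long inputs (for example, file contents being written) readable.
    const shown =
      raw.length > MAX_INPUT_CHARS ? `${raw.slice(0, MAX_INPUT_CHARS)}...` : raw;
    return `[step ${stepNumber}] ${call.toolName} ${shown}`;
  });
}

// Hypothetical wiring: keep a counter outside the callback and log each step.
let stepCounter = 0;
const onStepFinish = (step: StepLike): void => {
  stepCounter += 1;
  for (const line of formatStep(stepCounter, step)) console.log(line);
};

// Simulated step, standing in for what the SDK would pass to the callback.
onStepFinish({
  toolCalls: [{ toolName: "bash", input: { command: "ls /case-files" } }],
});
```

A callback like this would be passed to `generateText` alongside the other options. Keeping the formatting in a pure function makes it easy to test without running the model.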

## Why This Pattern Matters

The murder mystery is fun, but the underlying pattern (filesystem-as-context) applies broadly:

- **Sales call analysis**: load transcripts as files, let the agent grep and cross-reference across calls.
- **Codebase exploration**: mount a repository, let the agent navigate with `find` and `cat` to answer architectural questions.
- **Legal document review**: case files, contracts, exhibits. The agent reads what's relevant rather than ingesting everything.
- **Customer support**: mount conversation histories, knowledge base articles, product docs. The agent pulls context as needed.

The key insight is that **agents are better at retrieving their own context than we are at pre-selecting it for them**. When you stuff everything into a prompt, you're making the retrieval decision. When you give the agent a filesystem and tools, the agent makes the retrieval decision, and it can adapt based on what it finds.

This is also fundamentally different from RAG. With RAG, you pre-compute embeddings, run a similarity search, and inject the top-k results. The agent has no say in what gets retrieved. With filesystem-based retrieval, the agent formulates its own queries and follows threads dynamically.
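Here's a deliberately tiny toy that contrasts the two retrieval styles. Keyword overlap stands in for embedding similarity, the documents are invented, and `topK` and `followThread` are hypothetical helpers, so treat it as an illustration of the difference in control flow rather than a fair benchmark of RAG.

```typescript
// Toy corpus; the contents are invented for this sketch.
const docs: Record<string, string> = {
  "evidence/torn-letter.md":
    "Torn letter found in the study. Handwriting analysis: see forensics.md.",
  "evidence/forensics.md": "Handwriting on the letter matches Suspect B.",
  "locations/garden-shed.md": "The shed holds garden tools.",
};

function words(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z]+/).filter(Boolean));
}

// Static retrieval: score every document once against the query and keep the top k.
// Keyword overlap is a crude stand-in for embedding similarity.
function topK(corpus: Record<string, string>, query: string, k: number): string[] {
  const queryWords = Array.from(words(query));
  return Object.entries(corpus)
    .map(([path, text]) => {
      const docWords = words(text);
      return { path, score: queryWords.filter((w) => docWords.has(w)).length };
    })
    .filter((d) => d.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((d) => d.path);
}

// Dynamic retrieval: read a file, then follow any other file it mentions by name.
function followThread(
  corpus: Record<string, string>,
  start: string,
  maxHops: number,
): string[] {
  const visited: string[] = [];
  let current: string | undefined = start;
  while (current !== undefined && visited.length < maxHops) {
    visited.push(current);
    const text = corpus[current] ?? "";
    current = Object.keys(corpus).find(
      (path) => !visited.includes(path) && text.includes(path.split("/").pop() ?? path),
    );
  }
  return visited;
}

const staticResult = topK(docs, "torn letter", 1);
const dynamicResult = followThread(docs, "evidence/torn-letter.md", 5);
console.log(staticResult, dynamicResult);
```

With `k = 1`, the static search returns only the torn letter. The thread-following version reads the letter, notices the pointer to `forensics.md`, and goes there next, which is the behaviour an agent with `cat` and `grep` gets for free.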

## Wrapping Up

The entire project (agent, prompts, schema, runner) is under 100 lines of TypeScript. The case files are just markdown. There's no database, no embedding pipeline, no retrieval infrastructure. And yet the agent reliably investigates, cross-references, and solves the case.

If you're building agents that need to reason over documents, consider reaching for a filesystem before reaching for a vector database. Sometimes the simplest retrieval mechanism is `cat`.
