
Agent Experience (AX) and the Agent Experience Interface (AXI): a working demo


In an earlier post I talked about Agent Experience (AX) and the Agent Experience Interface (AXI): what they mean, where they come from, and what I’d ask clients to change. That piece is the theory. This one is the thing you can poke at.

The argument there was simple. Agents now use your product on a human’s behalf, and if your product only behaves well when a person is sitting in front of the keyboard, the agent the user “hired” either fails or burns tokens flailing. The four areas worth designing for were Access, Context, Tools, and Orchestration. All sensible, and all a bit abstract until you watch an agent hit the wall in real time.

There are plenty of examples where an agent cannot complete a task because the tool, service, or product it is using requires an action from the user, and I am sure you have come across a few yourself. But what better way to see the problem than to build something and then fix it step by step?

One CLI, built twice

Meet dbx, a command-line tool for a fictional managed-Postgres provider. The product is made up. The behaviour isn’t: it’s a real Node CLI you can run, and I wrote it twice:

  • v0 is the version most tools quietly are today. You authenticate with dbx login, which opens a browser and waits for you to paste a code. You create a database with dbx db create, which prompts you interactively for a region, a plan, and a yes/no confirmation. Output is a nicely drawn ASCII box. Errors are friendly prose. None of these choices is wrong when a human at a keyboard drives the entire process.
  • v1 is the same tool, same commands, rebuilt against the AXI principles. Auth comes from a DBX_TOKEN environment variable. db create takes --region, --plan, and --yes flags. Output is structured. Errors are machine-readable with real exit codes. Retries are idempotent.
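
To make the v1 bullet concrete, here is a minimal TypeScript sketch of what a non-interactive db create handler could look like. It is not the code from the repo: the createDatabase function, the CliResult shape, the error names, the exit-code values, and the in-memory store are all hypothetical, chosen only to illustrate token-from-environment auth, flags instead of prompts, structured errors with non-zero exit codes, and idempotency on name.

```typescript
// Hypothetical sketch of a v1-style `db create` handler; not the repo's code.
type Env = Record<string, string | undefined>;

interface Flags {
  region?: string;
  plan?: string;
  yes?: boolean;
}

interface Database {
  name: string;
  region: string;
  plan: string;
}

interface CliResult {
  exitCode: number; // 0 on success, non-zero on any failure (values are assumptions)
  body:
    | { ok: true; created: boolean; database: Database }
    | { ok: false; error: string; hint: string };
}

function fail(exitCode: number, error: string, hint: string): CliResult {
  return { exitCode, body: { ok: false, error, hint } };
}

function createDatabase(
  env: Env,
  name: string,
  flags: Flags,
  store: Map<string, Database>, // stand-in for the provider's backend
): CliResult {
  const { region, plan, yes } = flags;

  // Access: the credential is provisioned ahead of time, never typed mid-run.
  if (!env.DBX_TOKEN) {
    return fail(2, "auth_missing", "set the DBX_TOKEN environment variable");
  }
  // Tools: everything a prompt would have asked for arrives as a flag.
  if (!region || !plan) {
    return fail(3, "missing_flag", "pass --region and --plan");
  }
  if (!yes) {
    return fail(3, "confirmation_required", "pass --yes to confirm");
  }
  // Idempotent on name: a retry returns the existing database instead of failing.
  const existing = store.get(name);
  if (existing) {
    return { exitCode: 0, body: { ok: true, created: false, database: existing } };
  }
  const database: Database = { name, region, plan };
  store.set(name, database);
  return { exitCode: 0, body: { ok: true, created: true, database } };
}

// Usage: the second call is a safe retry of the first.
const store = new Map<string, Database>();
const env: Env = { DBX_TOKEN: "example-token" };
const flags: Flags = { region: "eu-west", plan: "hobby", yes: true };
console.log(JSON.stringify(createDatabase(env, "analytics", flags, store).body));
console.log(JSON.stringify(createDatabase(env, "analytics", flags, store).body));
```

The design choice worth noticing is that every failure path returns a machine-readable error name plus a hint, so an agent can branch on the result rather than parse prose.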

Then I pointed a small agent at each. It’s about forty lines of Google’s Agent Development Kit for TypeScript, with one tool: run a dbx command and hand back what it printed. The agent’s goal, in both runs, is identical (create a database called analytics), and it’s told to work on its own, only stopping if it genuinely can’t continue without a person.

Find the hypothetical dbx tool here: https://github.com/tpiros/dbx-ax-demo.

Switch between the two versions below and run it.


The same ADK agent, same goal (create a database called analytics), pointed at each version of the dbx CLI. Every dbx output shown here is the real output the CLI returns; only the agent's narration is abbreviated.

The v0 run ends the way every human-only tool ends: with the agent stopping to write you a polite paragraph explaining what it couldn’t do. The login opened a browser it can’t see, there’s no token flag to fall back on, and db create won’t proceed without a session it was never able to establish. The flags it passed were simply ignored. It has nowhere to go.

The v1 run never asks you for anything. It lists, sees the account is empty, creates the database with flags, confirms by listing again, and reports back. Same agent, same instructions, same task. The only thing that changed is the shape of the tool.

To be precise about what “never asks you for anything” means: the token still comes from somewhere. A person or a secrets store sets DBX_TOKEN once, before the agent starts. That is setup done ahead of time, not someone pasting a code into a prompt mid-run while the agent waits, and that is the distinction the Access rung turns on. v0 demands a human in the loop while the agent is working; v1 needs a credential provisioned beforehand and nothing during the run.

What broke

The failure is spread across the AX areas. v0 trips on each one a little:

| AX area | What v0 does | What v1 does |
|---|---|---|
| Access | login opens a browser and blocks on stdin; no token, no device flow | DBX_TOKEN from the environment; the agent is in immediately |
| Context | bare dbx prints a help screen; nothing machine-readable | bare dbx shows live data; ships an llms.txt that shortcuts the agent’s discovery |
| Tools | interactive prompts, ASCII output, non-idempotent, prose errors at exit 0 | flags not prompts, structured output, idempotent on name, errors with real exit codes |

Access is the one that actually stops the agent, and it’s the gate: you cannot document your way past a browser-only login. No amount of lovely llms.txt helps an agent that can’t get a session in the first place. If you only fix one rung, fix that one.

The other two issues don’t block the agent outright, but they tax it. On a CLI this small the llms.txt does little more than let the agent skip a round of --help-and-guess, and v1 would probably succeed without it; its real payback shows up on larger tools, where it can state things a help screen never would, like which operations are idempotent and safe to retry after a failure.

Which brings me to the part I found most telling.

The output tax

In v1, the default output isn’t JSON. It’s TOON, a tabular notation that drops the braces, quotes, and per-field commas that JSON spends tokens on, while staying unambiguous to a model. The agent can still ask for --json when it wants it.

I measured it on a ten-row db list, in bytes. Holding the columns fixed and changing only the notation, the same four fields cost about 49% less in TOON than in JSON. TOON declares the field names once in a header, then lists the values, so it skips the braces, quotes, and repeated "field": keys that JSON pays for on every row. (You can save more again by returning fewer columns by default, but that’s a separate decision from the notation, so I’m leaving it out of the number here.)

Output shape: JSON vs TOON

  • JSON, 4 fields: 736 bytes
  • TOON (dbx v1 default), 4 fields: 377 bytes

About 49% smaller on the wire for the same four fields, from the notation alone.

A 10-row dbx db list, the same four fields in each notation, measured in bytes (a fair proxy for tokens). TOON drops the braces, quotes, and repeated "field": keys that JSON spends tokens on. Reproduce with npm run demo:v1 in the repo.
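
The header-once idea can be sketched in a few lines of TypeScript. This is a simplified, hypothetical encoder that only mimics TOON's tabular form for flat, uniform rows; it is not the official TOON library, it skips quoting and nesting, and the field names and sample rows are made up for illustration. It will not reproduce the exact byte counts above.

```typescript
// Simplified sketch of a TOON-style tabular encoding; not the official encoder.
type Row = Record<string, string | number>;

function toToonTable(key: string, rows: Row[]): string {
  const first = rows[0];
  if (first === undefined) return `${key}[0]:`;
  const fields = Object.keys(first);
  // Field names are declared once, in the header, instead of on every row.
  const header = `${key}[${rows.length}]{${fields.join(",")}}:`;
  // Each row is just its values, in header order (assumes no commas in values).
  const lines = rows.map(
    (row) => "  " + fields.map((field) => String(row[field])).join(","),
  );
  return [header, ...lines].join("\n");
}

// Hypothetical sample data; the real dbx columns may differ.
const rows: Row[] = [
  { name: "analytics", region: "eu-west", plan: "hobby", status: "ready" },
  { name: "billing", region: "us-east", plan: "pro", status: "ready" },
];

const asJson = JSON.stringify({ databases: rows });
const asToon = toToonTable("databases", rows);

console.log(asToon);
// The sample is ASCII-only, so character count equals byte count here.
console.log(`JSON: ${asJson.length} bytes, TOON-style: ${asToon.length} bytes`);
```

The saving grows with the row count, because JSON repeats every key on every row while the tabular form pays for the keys once.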

Bytes aren’t tokens, but they track closely enough, and the direction is the thing. Half the payload, give or take, for the same information, paid back on every single call the agent makes. Multiply that across a task with dozens of list operations and the interface shape becomes a first-order driver of what the agent costs to run.

Did the agent actually try?

Here’s the moment that made me add instrumentation. On one v0 run, the agent ended with a thorough explanation that included this:

The dbx db create <name> command is implemented to strictly prompt the user interactively via standard input for the deployment region, the plan, and a y/N confirmation. It completely ignores CLI options/flags like --region, --plan, or --yes.

That description is correct. The trouble is the agent couldn’t have learned it the way it implies. In v0, db create refuses to do anything until you’re logged in, and the agent never managed to log in. It never reached the prompts. So it never observed the flags being ignored. It either read my source code (its file-reading tool can open the script) or it inferred the behaviour from convention and stated the inference as fact.

I added a trace log to the agent’s tools so I could see exactly what it ran rather than trust its summary. But the lesson was already there in the structure. The blocking is only half of it. When the tool won’t tell the agent what it can do, the agent fills the gap itself, and some of what it fills in will be wrong. A user reading that tidy final paragraph would have no way to know which parts were observed and which were invented. That ambiguity is its own AX failure, and it’s one you won’t catch in your own testing, because you know how your tool works.
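
A trace of this kind can be sketched as a thin wrapper around the tool function. This is an assumption about the general shape, not the repo's implementation: withTrace, CommandTool, TraceEntry, and the fakeDbx stand-in (including its output strings) are all hypothetical names used for illustration.

```typescript
// Hypothetical trace wrapper; names and shapes are illustrative, not the repo's.
interface ToolResult {
  stdout: string;
  exitCode: number;
}

type CommandTool = (args: string[]) => ToolResult;

interface TraceEntry {
  args: string[];
  exitCode: number;
}

function withTrace(tool: CommandTool, trace: TraceEntry[]): CommandTool {
  return (args) => {
    const result = tool(args);
    // Record what was actually executed, independent of the agent's own summary.
    trace.push({ args: [...args], exitCode: result.exitCode });
    return result;
  };
}

// Stand-in for the real CLI call, mimicking v0's prose-at-exit-0 behaviour.
const fakeDbx: CommandTool = (args) =>
  args[0] === "login"
    ? { stdout: "Opening browser...", exitCode: 0 }
    : { stdout: "You are not logged in.", exitCode: 0 };

const trace: TraceEntry[] = [];
const tracedDbx = withTrace(fakeDbx, trace);
tracedDbx(["login"]);
tracedDbx(["db", "create", "analytics", "--region", "eu-west"]);

// Every command the agent really ran, in order.
console.log(trace.map((entry) => entry.args.join(" ")));
```

Comparing a log like this against the agent's final paragraph is what separates the claims it observed from the ones it inferred.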

Build it yourself

The whole thing is small enough to read in one sitting: two versions of the CLI, a shared llms.txt, and the ADK agent.

cd agent
npm install
cp .env.example .env          # add a Gemini API key
npm run demo:v0               # watch it stall
npm run demo:v1               # watch it finish
npm run trace                 # see every command it actually ran

The agent uses @google/adk and a single function tool that shells out to whichever version of dbx you point it at. There’s nothing clever in it. That’s the point. A trivially simple agent succeeds or fails entirely on the ergonomics of the tool you hand it.
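
For a sense of how little there is to such a tool, here is a minimal sketch of a shell-out function using Node's built-in child_process module. It is not the repo's code, and the ADK wiring is deliberately left out: runCli, CommandResult, and the ten-second timeout are hypothetical choices for illustration.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape handed back to the agent.
interface CommandResult {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null if the process was killed or never started
}

// Runs one CLI invocation and returns what it printed plus its exit code.
function runCli(binary: string, args: string[], timeoutMs = 10000): CommandResult {
  const result = spawnSync(binary, args, {
    encoding: "utf8",
    timeout: timeoutMs, // a hung command fails instead of stalling the run
    stdio: ["ignore", "pipe", "pipe"], // no stdin: nobody is there to answer a prompt
  });
  return {
    stdout: result.stdout ?? "",
    stderr: result.stderr ?? "",
    exitCode: result.status,
  };
}

// Usage: any executable works; here Node itself stands in for dbx.
const version = runCli(process.execPath, ["--version"]);
console.log(version.exitCode, version.stdout.trim());
```

Passing the arguments as an array rather than a shell string avoids quoting problems when the model produces unusual input, and returning the exit code alongside the text lets the agent tell a real failure from friendly prose.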

If you ship a CLI, an SDK, or an MCP server, the exercise is worth running on your own tool. Open a coding agent, point it at your product cold, give it a normal first task, and watch where it stops. The place it stops is your AX backlog, in priority order.

And if you want the conceptual map behind all of this (where AX came from, how AXI fits inside it, and what the four areas actually ask of you), that’s in the companion piece.
