
What is Loop Engineering


On 7 June, Peter Steinberger posted: “Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” The replies split into two camps. One announced that prompt engineering was dead. The other asked what a loop actually is, and the thread mostly ignored them.

The practice is real, though. Boris Cherny, who created Claude Code, describes his own workflow the same way in an interview: “I don’t prompt Claude anymore. I have loops that are running. They’re the ones that are prompting Claude and figuring out what to do. My job is to write loops.”

So here’s the definition the thread was missing. Loop engineering is building a system that prompts the agent so you don’t have to. You design something that finds work, hands it to an agent, checks the result, records what happened, and decides what comes next, then you let it run on a schedule without you in the room. The unit of work moves from the prompt to the loop, and so does the skill.

What the slogan leaves out is the more interesting half: the loop is the easy bit. Scheduling an agent and piping a prompt into it has been a bash one-liner for a year (Geoffrey Huntley calls his version ralph). The hard part is what a loop does to you once it’s running, and I’ll come back to it. First, what’s actually inside one.

Around the loop

  1. Trigger fires — schedule or condition
  2. Find work — CI failures, open issues, the state file
  3. Hand to an agent — maker, in a worktree
  4. Check the result — checker, a different model
  5. Record + decide — write the state file, queue what's next
  6. ↻ the next run picks up from the state file

You're not on this ring. It took over the three jobs you used to do by hand: deciding when to run (scheduler), what to work on (dispatcher), and remembering across runs (memory).

What’s in a loop

Strip away the branding and every serious loop is the same handful of parts. A trigger gives it a reason to run that isn’t you pressing enter: a fixed schedule, or a condition it works toward until the tests pass. Each task gets its own isolated checkout, a git worktree, so several agents can run at once without editing over each other’s files. The agent loads your skills on the way in: the build steps, the right test command, the constraints you learned the hard way, written down once so it doesn’t re-derive the project from scratch every morning and guess at the gaps. And it reaches the outside world through connectors to the issue tracker, CI and chat, so it can open the PR and link the ticket rather than describe a fix in a transcript it then throws away.

That last move, throwing the transcript away, matters more than it sounds. The model forgets everything between runs; each run starts with an empty context. So the loop’s memory has to live outside it, somewhere durable that every run reads first and writes last. A markdown file in the repo is enough. That file is the spine of the whole thing. It’s where one run leaves notes the next one picks up:

## 2026-06-09
- [merged] #412 null check in webhook retry, PR #418, CI green
- [rejected] #409 checker: patch makes the test pass by widening the type,
  root cause untouched. Re-queued with that note.
- [parked] #415 needs a product decision on rate-limit defaults -> human queue
- watch: auth suite flake correlates with the frozen-clock mock, unconfirmed

The last part is the one that makes leaving it alone tolerable: the agent that writes the code is not allowed to grade it. A model reviewing its own work in the same context will almost always declare success (ask anyone who’s approved their own PR five minutes after opening it). So a separate checker, with different instructions and ideally a different model, reads the diff against the original task. Two readers with different blind spots catch different things. It roughly doubles the cost per task, and it’s worth it for anything that merges while you’re not looking.
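The separation between maker and checker can be sketched like this. It is a sketch under assumptions, not any platform's API: `Verdict`, `Checker` and `review` are hypothetical names, the model identifiers are placeholders, and the "ideally a different model" advice is hardened into a rule here purely to make the point visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    objection: str = ""


# A checker sees only the original task and the diff, never the maker's transcript.
Checker = Callable[[str, str], Verdict]


def review(task: str, diff: str, maker_model: str, checker_model: str,
           checker: Checker) -> Verdict:
    """Grade a diff with a checker that is not the maker."""
    if checker_model == maker_model:
        # Assumption of this sketch: refuse outright instead of merely warning.
        raise ValueError("checker must not run on the maker's model")
    return checker(task, diff)
```

The important part is the signature: the checker gets the task and the diff, so it judges the change against what was asked rather than against the maker's own account of it.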

A loop, end to end

Put those parts in motion and a realistic first loop looks like this. At 7 each morning a trigger starts a triage run. It pulls overnight CI failures and new issues through connectors, reads yesterday’s state file, and writes a short plan. For each task it opens a worktree and hands it to a maker agent with your skills loaded. A checker reviews the result on a different model: a pass opens the PR and updates the ticket, a rejection goes back into the state file with the objection attached. Anything ambiguous lands in a human queue with a note on what it’s waiting for. Tomorrow’s run reads the file and carries on.
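The "opens a worktree" step is plain git. A minimal sketch, assuming one branch and one directory per task: the `agent/` branch prefix, the directory layout and both helper names are illustrative choices, while `git worktree add -b <branch> <path> <base>` is the standard command.

```python
import subprocess
from pathlib import Path


def worktree_command(task_id: str, root: Path, base: str = "main") -> list[str]:
    """Build the git command that gives one task its own checkout.

    The branch prefix and directory layout are assumptions for illustration.
    """
    branch = f"agent/{task_id}"
    path = root / task_id
    return ["git", "worktree", "add", "-b", branch, str(path), base]


def open_worktree(task_id: str, root: Path, base: str = "main") -> Path:
    """Create the worktree; must be run from inside an existing git repository."""
    subprocess.run(worktree_command(task_id, root, base), check=True)
    return root / task_id
```

Each maker then runs with its working directory set to the returned path, so two agents never edit the same files on disk.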

One loop, end to end

  1. 07:00, scheduled trigger: nobody typed anything
  2. Triage run: reads the state file (yesterday's notes), pulls CI failures + new issues via connectors, writes today's plan (3 tasks, 1 parked)
  3. One isolated worktree per task: worktree A and worktree B each get a maker, then a checker with a different model and different instructions
  4. Parked: ambiguous task → human review queue, with the open question attached
  5. Checker passes: PR opened · ticket linked · one line to chat via connectors
  6. Checker rejects: objection written to the state file, re-queued or escalated
  7. ↻ tomorrow's 07:00 run reads the state file first and continues

Every box was designed once and lives in version control. The only step a human appears in is the review queue, and the loop put it there on purpose.

Every step there was set up once, in advance. Nobody typed anything that morning.
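The routing in that run fits in a few lines. This is a sketch with the agents stubbed out: `Task`, `RunLog` and `triage_run` are hypothetical names, `make` and `check` stand in for the maker and checker calls, and a checker result of `None` is assumed to mean a pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    issue: int
    title: str
    ambiguous: bool = False


@dataclass
class RunLog:
    opened_prs: list[int] = field(default_factory=list)
    requeued: dict[int, str] = field(default_factory=dict)
    human_queue: list[int] = field(default_factory=list)


def triage_run(
    tasks: list[Task],
    make: Callable[[Task], str],                   # maker: task -> diff
    check: Callable[[Task, str], Optional[str]],   # checker: objection, or None to pass
) -> RunLog:
    """Route every task to exactly one outcome: PR, re-queue, or human queue."""
    log = RunLog()
    for task in tasks:
        if task.ambiguous:
            log.human_queue.append(task.issue)     # parked with its open question
            continue
        diff = make(task)
        objection = check(task, diff)
        if objection is None:
            log.opened_prs.append(task.issue)
        else:
            log.requeued[task.issue] = objection   # goes back into the state file
    return log
```

What gets written to the state file at the end of the run is essentially the contents of `RunLog`.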

It’s basically cron, and that’s fine

The honest objection is that this is a cron job with extra steps, and half of it is true: the scheduling really is cron. What cron never had is the bit in the middle. A cron job runs a fixed script and only branches where you wrote a branch. A loop hands the current state to a model that picks an action you never pre-specified, does it, and judges the result. Everything else, the checker and the state file and the budget, is scaffolding around that one act of judgement to stop it running off a cliff.
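In code the difference is small and sits in one place. A sketch, with the model replaced by an injected `choose` function; every name here is made up for illustration.

```python
from typing import Callable


def cron_tick(state: dict) -> str:
    """A fixed script: every branch was written by a programmer in advance."""
    if state.get("ci_failed"):
        return "rerun-tests"
    return "noop"


def agent_tick(
    state: dict,
    choose: Callable[[dict], str],       # stand-in for the model call
    execute: Callable[[str], str],       # runs the chosen action, returns its output
    check: Callable[[str, str], bool],   # separate judgement of the result
) -> dict:
    """One loop tick: the action comes from `choose`, not from a branch in this file."""
    action = choose(state)
    output = execute(action)
    return {**state, "last_action": action, "last_ok": check(action, output)}
```

`cron_tick` can only ever return the two strings written into it. `agent_tick` can return whatever `choose` comes up with, which is why it needs `check` and `cron_tick` does not.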

One tick, side by side

cron job

  1. timer fires
  2. runs a fixed script
  3. takes only branches a programmer wrote in advance
  4. exits with a code

agent loop

  1. timer fires
  2. model reads the current state
  3. chooses an action nobody pre-specified, executes it
  4. checker judges the result; state file updated; budget checked

The shared step, the timer, is 1970s technology and that's fine. The argument for the loop is the part in the middle, where a model chooses an action nobody pre-specified, and only when you couldn't have written those branches yourself.

That cuts the other way too. If you can write the branches yourself, do, because cron is cheaper and more predictable. The loop earns its cost only on the work where you can’t. One thing settles the “is this just branding” question: every major platform shipped these same parts independently, under names that now collide (Claude Code’s routines and /goal, the Codex and Cursor Automations, worktrees and skills and subagents everywhere). When four teams with no shared plan build the same six pieces, the pieces are real. Build around them and a platform switch costs you a few connectors instead of a rewrite.

The loop is the easy part

Which brings me back to the hard part, and none of it is technical.

“Done” from a loop is a claim, not a proof. The failure that bites hardest is the loop that makes the claim cheerfully on a half-finished job: the agent emits its completion signal early and the whole thing exits satisfied. Huntley named this one the Ralph Wiggum loop, and the only defence against it is a check the agent can’t talk its way past. The checker lowers the rate of false claims; it doesn’t move the responsibility for them off you. A wrong patch that merged at 7:15 has been load-bearing for hours by the time you look at it.

Worse, your understanding quietly decays. Working interactively forces you to at least skim everything; a loop removes even that. The gap between what’s in the repo and what you actually understand grows at the speed of the loop, and six weeks of green ticks can leave you owning a system you can no longer change by hand. Nothing in the loop will warn you this is happening.

And it costs real money, though not for everyone equally. Steinberger, whose post opened this piece, ran $1.3 million of tokens through about a hundred agents in a single month and was relaxed about it, because OpenAI, where he now works, was footing the bill; he described it as research into how building changes once token cost stops being a constraint. For anyone on a metered plan it hasn’t stopped being one. Schedules multiply runs, the checker doubles each task, a condition-checker bills on every turn. Uber capped its engineers at $1,500 per tool per month after burning a year’s AI budget in four months. Give a loop a hard stop, an iteration cap and a budget ceiling, before you ever give it a schedule.
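A hard stop of that kind can be sketched as a small guard around the loop body. Everything here is an assumption for illustration: the `Guard` and `run_guarded` names, the dollar figures in the usage note, and the idea that each turn can report its own cost.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Guard:
    max_iterations: int
    budget_usd: float
    iterations: int = 0
    spent_usd: float = 0.0

    def can_continue(self) -> bool:
        """True only while both the iteration cap and the budget have room."""
        return self.iterations < self.max_iterations and self.spent_usd < self.budget_usd

    def record(self, cost_usd: float) -> None:
        self.iterations += 1
        self.spent_usd += cost_usd


def run_guarded(step: Callable[[], tuple[bool, float]], guard: Guard) -> str:
    """Call `step` until it reports done or the guard says stop.

    `step` returns (done, cost_usd) for one agent turn.
    """
    while guard.can_continue():
        done, cost_usd = step()
        guard.record(cost_usd)
        if done:
            return "done"
    return "stopped"
```

With `Guard(max_iterations=10, budget_usd=2.0)` and a step that costs $0.50 and never finishes, the loop stops after four turns. Because the check happens before each turn, the overshoot is at most one turn's cost.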

The costs at least have a dial. The outcome doesn’t. Two people can run the identical loop and get opposite outcomes. One uses it to go faster at work they understand, and reads what ships. The other uses it to avoid understanding anything, and ships anyway. The loop cannot tell the difference, and no feature ever will.

That’s why this is harder than prompt engineering, not easier. When you prompt, you correct course every turn. When you write a loop, you front-load the judgement: what’s worth doing, what counts as done, when to stop, all decided in a design you commit before any of the work exists. Get it right and you’ve multiplied yourself. Get it wrong and you’ve automated the production of code no one has read.

Before any of that, run the test the hype skips. A loop only pays off when four things are true at once: the task recurs often enough to amortise the setup, so a one-off prompt isn’t simply cheaper; something automated can fail the work without you in the room, a test or type check or build rather than a second agent’s good opinion; your budget can absorb the retries and re-reads a loop burns whether or not it ships anything; and the agent can run what it writes and watch it break. Miss one and you’re back in the chair reading every diff, which was the job the loop was meant to remove. Most work fails at least one of these today, and for that work a good prompt still wins.
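Written down as a checklist, the test is an all-or-nothing conjunction. A sketch only; the field names paraphrase the four conditions above and are not from any tool.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoopCandidate:
    recurs_often_enough: bool
    automated_check_can_fail_it: bool
    budget_absorbs_retries: bool
    agent_can_run_what_it_writes: bool


def failed_conditions(candidate: LoopCandidate) -> list[str]:
    """Names of the conditions this piece of work does not meet."""
    return [name for name, ok in asdict(candidate).items() if not ok]


def worth_a_loop(candidate: LoopCandidate) -> bool:
    """All four must hold at once; a single miss means a prompt still wins."""
    return not failed_conditions(candidate)
```

Returning the failed names rather than a bare boolean keeps the answer useful: it says which condition to fix before trying again.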

If it passes, build a small one. A single scheduled triage task with a state file, nothing merging unattended. It will teach you more about what’s actually safe to hand off than any thread full of hot takes. The layer below this, the harness that drives a single run, and the orchestration patterns above one run, are both worth reading next. But the loop itself you learn by running one.