What is agentic AI? A working definition for 2026

“Agentic AI” appears in vendor pitches, analyst reports, and board decks with surprisingly little agreement on what it means. Most explanations either pad out a chatbot demo with the word “agent” or stay so abstract they could describe a calculator.

This is a working definition, with specific examples of what’s actually in production in 2026, an honest list of what isn’t, the technical primitives that matter, and the five things worth building first.

A working definition

An agentic AI system does five things:

  1. Takes a goal stated in natural language.
  2. Plans a sequence of steps.
  3. Calls tools (APIs, functions, search, code execution) to act in the world.
  4. Observes the results and adapts.
  5. Decides when it’s done, when it’s stuck, and when to escalate.

The minimum bar is a language model, a set of tools, and a loop. Everything else (the orchestration patterns, the memory layers, the reasoning frameworks) gets layered on top.
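
In code, that minimum bar looks something like the sketch below. It isn't any particular framework's API: llm() is a stub standing in for whichever tool-calling chat model you use, and the single tool is a placeholder.

```python
# The minimum bar, in code: a model, a set of tools, and a loop.
# llm() is a stub; swap in any chat API that supports tool calling.

def search_docs(query: str) -> str:
    """Placeholder tool. In practice: an API call, a SQL query, a shell command."""
    return f"results for {query!r}"

TOOLS = {"search_docs": search_docs}
MAX_TURNS = 20  # part of "decides when it's stuck": a hard cap, then escalate

def llm(history: list[dict], tools: dict) -> dict:
    """Stub for a tool-calling model. A real one returns either
    {"type": "tool_call", "name": ..., "arguments": {...}}
    or {"type": "final_answer", "text": ...}."""
    return {"type": "final_answer", "text": "stub response"}

def run_agent(goal: str) -> str:
    history = [{"role": "user", "content": goal}]   # 1. goal in natural language
    for _ in range(MAX_TURNS):
        action = llm(history, TOOLS)                # 2. plan the next step
        if action["type"] == "final_answer":        # 5. decide when it's done
            return action["text"]
        try:
            result = TOOLS[action["name"]](**action["arguments"])  # 3. act
        except Exception as exc:
            result = f"tool error: {exc}"           # 4. observe and adapt
        history.append({"role": "tool", "name": action["name"],
                        "content": str(result)})
    raise RuntimeError("turn budget exhausted; escalate to a human")

print(run_agent("find the refund policy for damaged goods"))
```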

This separates agentic systems from two adjacent things they’re often confused with. A chatbot does single-turn responses with no real-world side effects. A workflow automation runs deterministic, pre-coded steps. An agent decides which steps to take and adapts when they fail.

If your “agent” doesn’t decide and adapt, it’s a workflow with an LLM stitched into the middle. That can still be useful, but the label matters.

The reality check

The 2026 numbers are stark.

Roughly 95% of generative AI pilots never reach production. Over 80% of enterprise AI investments fail to deliver business value. McKinsey’s 2025 State of AI survey reported 88% of organisations using AI in some capacity, with only one-third scaling any AI use enterprise-wide. The 2026 follow-ups show the gap widening.

Why? Three things have to be in place, in order of how often their absence is the actual cause:

  1. Tools the agent can trust. Most internal APIs were built for humans, not agents. Error messages tend to be ambiguous, retries aren’t idempotent, and unrelated concerns are bundled into single endpoints. Wrapping them as agent tools surfaces every flaw.
  2. Bounded scope. Agents work in narrow domains where the planning required is shallow. They fail when the task requires open-ended reasoning over many steps.
  3. Honest evaluation. Most teams ship without test suites that can catch regressions. The agent that works on Monday breaks on Friday, and no one notices until users complain.

When those three are missing, the agent looks impressive in a demo and falls over in production. That demo-to-production gap is the single largest issue companies face with AI in 2026.

Where agents actually work

Five domains where deployed agents are doing real, measurable work today.

1. Coding agents

Claude Code, Cursor, Aider, Continue, Sourcegraph’s Cody. These let an agent read code, run tests, edit files, and ship pull requests.

They work because:

  • The feedback loop is tight. Compile, run tests, observe failures, iterate.
  • The output is verifiable. Tests pass or they don’t.
  • A human reviews the diff before merge.

Anthropic’s own engineering team uses Claude Code daily as a primary interface. Cursor reached significant ARR before “agentic” became the term for what it does.

2. Customer support triage

Routing tickets, drafting responses, escalating edge cases. Klarna publicly claimed in February 2024 that their AI assistant handled 2.3 million conversations in its first month, doing the work of around 700 full-time agents. By May 2025 they were partly reversing course: bringing humans back into the loop and investing in “empathy, expertise, and real conversations” alongside the AI.

The pattern that survives the walk-back: agents handle the 60-70% of cases that fit a template, humans handle the rest, and the cost savings are real on the routine traffic. Anyone selling end-to-end automation for the whole queue is selling a slide deck.

3. Research and synthesis

Perplexity-style query → multi-source retrieval → synthesised answer. OpenAI’s Deep Research mode. Anthropic’s Computer Use for browser-based research tasks (still rough, improving).

These work where the user can verify the output (citations, source links). Where verification is hard, hallucinations slip through and erode trust over time.

4. Internal data agents

“Show me Q3 sales by region from Snowflake”-style queries. Agents that translate business questions into SQL, run the query, and explain the result.

The trick is access control and data scoping. A naive agent that can run any SQL on production data is a security incident waiting to happen. The teams that ship these in production have spent meaningful effort on permission models, query allowlists, and audit logs.
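
A minimal sketch of that scoping, using SQLite as a stand-in for the warehouse. The two-table allowlist and the keyword screen are illustrative; a production version also sits behind per-user permissions rather than one shared policy.

```python
import re
import sqlite3  # stand-in for the warehouse client

ALLOWED_TABLES = {"sales", "regions"}  # hypothetical scoping policy
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|grant)\b", re.I)

def run_readonly_query(sql: str) -> list[tuple]:
    """Agent-facing SQL tool: read-only, scoped, audited."""
    if FORBIDDEN.search(sql):
        raise PermissionError("write statements are not allowed")
    referenced = set(re.findall(r"\b(?:from|join)\s+(\w+)", sql, re.I))
    if not referenced <= ALLOWED_TABLES:
        raise PermissionError(
            f"tables outside allowlist: {referenced - ALLOWED_TABLES}")
    print(f"AUDIT: {sql}")  # real systems: a structured audit log
    conn = sqlite3.connect("file:warehouse.db?mode=ro", uri=True)  # read-only
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```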

5. Operations agents

Reading PRs, summarising standups, scheduling, expense reports, document generation. Internal tools where the cost of being wrong is low and a human reviews the output before any external action.

These are unglamorous and they work. They also pay back fast because they automate the work no one enjoys doing anyway.

What doesn’t work yet

A separate list, equally important. As of 2026:

  • High-stakes autonomous decisions. Agents making capital allocation, medical diagnoses, or legal filings without human review. The reasoning isn’t reliable enough at the tail.
  • Long-running tasks past a few hours of human-equivalent work. METR’s Time Horizon 1.1 measurements (January 2026) put the strongest agent, Claude Opus 4.5, at a 50% success rate on tasks that take humans about 5 hours, with confidence intervals stretching from under 3 to over 12 hours. GPT-5 lands near 3.5 hours, o3 near 2. The textbook compounding-error sketch (95% per step compounds to 36% by step 20) is the right intuition, and the measured curves drop off at least that hard once tasks cross the model’s horizon.
  • Multi-agent systems with no human in the loop. Pretty in demos, brittle in production. Most actual production systems are single-agent with bounded scope, or orchestrator-worker with humans at decision gates.
  • General-purpose agents. Specialised agents with clear scope ship and run for years. The “do anything” pitch rarely makes it past a polished demo.

If a vendor is pitching one of these, expect heavy human supervision under the hood, or a demo that hasn’t been stress-tested against real production conditions.

The technical primitives that matter

If you’re building or evaluating agentic systems, these are the building blocks worth knowing.

Tool calling

The model receives a list of available tools (functions with JSON schemas) and decides when to call them. OpenAI, Anthropic, Google all support this natively. Reliability depends on how the tools are described, how unambiguous the trigger conditions are, and how clearly errors are surfaced back to the model.
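
For concreteness, here is the shape of a tool definition in Anthropic’s format (OpenAI’s and Google’s envelopes differ, but carry the same parts). The invoice tools are hypothetical.

```python
# A tool definition as the Anthropic Messages API expects it:
# a name, a description, and a JSON schema for the arguments.
get_invoice = {
    "name": "get_invoice",
    "description": (
        "Fetch a single invoice by its ID. Use only when the user names a "
        "specific invoice; for searches across invoices, use search_invoices."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice ID, e.g. 'INV-2026-00042'",
            }
        },
        "required": ["invoice_id"],
    },
}
```

Note that the description does double duty: it is where the trigger conditions live, and ambiguity there is the most common cause of misfired tool calls.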

Model Context Protocol (MCP)

The open standard for connecting AI systems to tools and data, originally introduced by Anthropic in November 2024 and donated to the Agentic AI Foundation (Linux Foundation) in December 2025. OpenAI adopted it across the Agents SDK and ChatGPT in March 2025; Google announced support in April 2025. MCP defines how tools, resources, and prompts get exposed to any MCP-compatible agent.

The economics: integration cost goes from O(N×M) to O(N+M), where N is the number of agents and M is the number of tools. For a 10-agent, 20-tool world, that’s 200 integrations becoming 30.
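
A minimal MCP server in the official Python SDK looks like this; the expenses service and its tool are hypothetical stand-ins.

```python
# Requires the official Python SDK: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("expenses")  # hypothetical internal service

@mcp.tool()
def get_expense_report(report_id: str) -> str:
    """Fetch one expense report by ID."""
    # Real implementation: call the internal expenses API.
    return f"report {report_id}: 3 line items, total £142.10"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio; any MCP-compatible agent can connect
```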

Agent harnesses

The shell around the model. Manages the loop, the context window, persistent state, and integration with the rest of the world. Claude Code, Cursor, Aider, ChatGPT agent (the successor to OpenAI’s Operator, which was retired in August 2025).

The harness is often more important than the model itself. A bad harness wraps a great model in a frustrating experience. A good harness makes a smaller model feel capable. Most of the user-visible quality of an agent comes from harness design.

I’ve written separately about building a local agent harness.

Orchestration patterns

The five patterns that show up in production:

  • Sequential: each step’s output feeds the next. Simplest. Used for pipelines.
  • Routing: classify the task, send it to the right specialist. Used when tasks vary widely.
  • Orchestrator-worker: a planning agent delegates subtasks to specialist agents. Used for compound work. See coordinator agent pattern.
  • Reflection loop: the agent critiques its own output before responding. Improves quality at the cost of latency and tokens. See building with reflection.
  • Parallelisation: run independent tool calls or subagents simultaneously. Cuts latency on independent work. See parallelisation as an agentic workflow.

Anthropic’s “Building Effective Agents” post (December 2024) is still the cleanest reference for these. Most production systems use one or two of the patterns, not all five.
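
A sketch of routing, usually the first pattern teams adopt. classify() is a stub where a small, cheap model with constrained output would sit.

```python
# Routing: a cheap classification step picks the specialist for the task.

def classify(task: str) -> str:
    """Stub router. Real systems use a small model returning one label."""
    return "billing" if "invoice" in task.lower() else "general"

def billing_agent(task: str) -> str:
    return f"[billing specialist] handling: {task}"

def general_agent(task: str) -> str:
    return f"[generalist] handling: {task}"

SPECIALISTS = {"billing": billing_agent, "general": general_agent}

def route(task: str) -> str:
    label = classify(task)
    return SPECIALISTS.get(label, general_agent)(task)

print(route("Why was invoice INV-2026-00042 charged twice?"))
```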

Evaluation

The least glamorous and most important part of agentic systems. Eval suites, golden test cases, structured grading. Without these, agentic systems drift quietly: the model provider updates a snapshot, your tools change, your prompts get edited, and by the time you notice, you’ve been shipping degraded behaviour for weeks.

Most teams underinvest here and pay for it later. The teams that ship reliable agents over time have eval infrastructure as substantial as their agent code.
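
A minimal sketch of what that infrastructure starts as: golden cases and a grader. The checks here are substring matches, but LLM-graded rubrics slot into the same shape.

```python
# Golden cases, a grading function, a score to track week over week.
# run_agent() is a stub for the system under test.

GOLDEN_CASES = [
    {"input": "Q3 sales by region", "must_contain": "EMEA"},
    {"input": "refund policy for damaged goods", "must_contain": "14 days"},
]

def run_agent(prompt: str) -> str:
    """Stub for the agent under test."""
    return "EMEA led Q3 sales"

def grade(case: dict, output: str) -> bool:
    return case["must_contain"].lower() in output.lower()

def run_suite() -> float:
    passed = sum(grade(c, run_agent(c["input"])) for c in GOLDEN_CASES)
    score = passed / len(GOLDEN_CASES)
    print(f"{passed}/{len(GOLDEN_CASES)} passed ({score:.0%})")
    return score  # a drop in this number is a regression

run_suite()
```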

The real barriers

The consultant-fluff list (“data quality, governance, change management”) is real but generic. The specific barriers in 2026:

Tool quality

Most internal APIs surface every flaw when wrapped for agents. Errors that humans tolerate (404 with no body, ambiguous status codes, mutating retries) become reliability problems. Fixing this is unglamorous infrastructure work, and the returns build up over time as the same hardened tools get reused across every agent you ship.
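
What hardening looks like, sketched against a hypothetical internal refunds endpoint: safe retries via an idempotency key, and errors rewritten into something the model can act on.

```python
import uuid
import requests  # assumes the 'requests' package

def create_refund(order_id: str, amount_pence: int,
                  idempotency_key: str | None = None) -> dict:
    """Agent-facing wrapper around a hypothetical internal refunds API.

    Two fixes over calling the raw endpoint: retries are safe (idempotency
    key), and failures come back as messages the model can act on.
    """
    key = idempotency_key or str(uuid.uuid4())
    resp = requests.post(
        "https://internal.example/refunds",  # hypothetical endpoint
        json={"order_id": order_id, "amount_pence": amount_pence},
        headers={"Idempotency-Key": key},
        timeout=10,
    )
    if resp.status_code == 404:
        # Not a bare 404: say what was missing and what to do instead.
        return {"ok": False,
                "error": f"order {order_id!r} not found; check the ID "
                         "with an order lookup before retrying"}
    resp.raise_for_status()
    return {"ok": True, "refund": resp.json(), "idempotency_key": key}
```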

Context management

A 200K context fills up in 30-50 tool turns of meaningful work. Compaction strategies (summarisation, vector retrieval, scratchpads) are essential, and most of them are still immature. A bigger context window buys you more runway before compaction kicks in, and that’s all it does.
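
A sketch of the simplest compaction strategy, summarising old turns once the transcript nears a budget. The token counter is crude and summarise() stands in for a cheap model call.

```python
# Compaction by summarisation: fold the oldest turns into a summary,
# keep the recent turns verbatim.

CONTEXT_BUDGET = 200_000   # tokens, e.g. a 200K window
KEEP_RECENT = 10           # turns kept verbatim

def count_tokens(messages: list[dict]) -> int:
    """Crude stand-in; use the provider's tokenizer in practice."""
    return sum(len(m["content"]) // 4 for m in messages)

def summarise(messages: list[dict]) -> str:
    """Stub: a real system calls a small model here."""
    return f"[summary of {len(messages)} earlier turns]"

def compact(messages: list[dict]) -> list[dict]:
    if count_tokens(messages) < int(CONTEXT_BUDGET * 0.8):
        return messages  # plenty of runway left
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = {"role": "user", "content": summarise(old)}
    return [summary, *recent]
```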

Cost

A meaningful agent run can be $0.50-$5 in token spend. Multiply by usage volume and the unit economics often don’t work. Caching, smaller models for routing, and human checkpoints help. Building agents as if every step needs the most expensive model is the most common cost mistake.
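
The arithmetic is worth doing explicitly. A sketch with placeholder prices, not any provider’s rate card:

```python
# Back-of-envelope unit economics for one agent run.
PRICE_IN_PER_MTOK = 3.00    # $ per million input tokens (placeholder)
PRICE_OUT_PER_MTOK = 15.00  # $ per million output tokens (placeholder)

def run_cost(turns: int, in_tok_per_turn: int, out_tok_per_turn: int) -> float:
    input_cost = turns * in_tok_per_turn / 1e6 * PRICE_IN_PER_MTOK
    output_cost = turns * out_tok_per_turn / 1e6 * PRICE_OUT_PER_MTOK
    return input_cost + output_cost

# 30 turns, 20K input tokens per turn (the context grows), 500 output tokens:
print(f"${run_cost(30, 20_000, 500):.2f}")  # about $2 per run, inside the $0.50-$5 band
```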

Trust and observability

Stakeholders won’t approve an agent acting on their behalf without understanding why it does what it does. Trace logs, decision rationales, human approval gates. Skipping these means the agent gets pulled the first time something goes wrong, however rare.
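
A sketch of an approval gate with trace logging. The risky-tool list is a hypothetical policy, and input() stands in for whatever approval UI you actually use.

```python
# Low-risk tool calls run straight through; anything on the risky list
# blocks until a person signs off. The model's rationale is logged either
# way, which is most of what "observability" means in practice.
import json, time

RISKY_TOOLS = {"create_refund", "send_email"}  # hypothetical policy

def ask_human(tool: str, args: dict, rationale: str) -> bool:
    print(f"APPROVE {tool}({args})? rationale: {rationale}")
    return input("y/n> ").strip().lower() == "y"

def call_with_gate(tool_name: str, fn, args: dict, rationale: str):
    record = {"ts": time.time(), "tool": tool_name,
              "args": args, "rationale": rationale}
    if tool_name in RISKY_TOOLS and not ask_human(tool_name, args, rationale):
        record["outcome"] = "rejected by human"
        print("TRACE", json.dumps(record))
        return None
    result = fn(**args)
    record["outcome"] = "executed"
    print("TRACE", json.dumps(record))
    return result
```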

Governance (EU AI Act Article 4)

If your agent acts on behalf of staff in the EU, your organisation is a deployer under the AI Act. You owe staff sufficient AI literacy training. Article 4 has been in force since February 2025; national enforcement begins August 2026. Treat it as a hard requirement with fines attached, because that’s what it is.

I’ve written separately about what Article 4 actually requires.

Workforce impact, honestly

Two narratives dominate the discourse: “AI replaces jobs” and “AI is just a productivity tool.” Both are partial truths.

What’s actually happening in 2026:

  • Junior tasks compress. Code scaffolding, document drafting, data extraction, first-pass analysis. The labour content of these is shrinking fast.
  • Senior judgement still matters. Deciding what to build, evaluating outputs, handling edge cases, owning responsibility for results. These compress slowly or not at all.
  • The bar rises across roles. Less tolerance for mediocre work because the floor is higher.

If your job is “produce the obvious output to a brief,” the floor under you is shifting. If your job is “decide what’s worth doing,” your judgement carries more weight than it did last year.

The middle (“execute the brief skilfully”) is shrinking. This isn’t playing out evenly. Software engineering, customer support, and content production are several quarters into the curve. Healthcare, finance compliance, and legal are slower, because the cost of being wrong is higher and verification is harder.

What to actually do

Five concrete things, ordered by what gives results fastest.

1. Pick one workflow with clear ROI

Skip the “transform the company” framing. Pick one task done by 50 or more people with measurable cost. Build an agent for that. Ship it. Measure the result against a baseline. The case studies that hold up read like this. The PowerPoint roadmaps fall apart the moment a CFO asks for the numbers.

2. Build evals before you build the agent

Twenty test cases that represent your real workflow. Score the agent against them weekly. Without this, you can’t tell if changes help or hurt. With it, you can ship improvements with confidence.

3. Wrap your existing tools properly

Most of the production-grade AI integration work goes into making your APIs, data, and systems agent-readable, long before anyone calls the model. MCP servers are the cleanest way to expose them. The investment compounds because the same tools work with whatever agent you use today and whatever agent ships next year.

4. Pay for a harness

Whether that’s Claude Code or Cursor for engineering work, an internal harness for a specific domain, or one of the browser-control agents. The harness is where the user experience lives. Building your own from scratch is rarely justified.

5. Train your staff (literally)

EU AI Act Article 4 sets the regulatory floor. Beyond compliance, teams that understand how agents fail use them better. The training that builds that understanding is role-specific and hands-on. Generic e-learning won’t get you there.

The next 18 months

Three predictions, all of which I’d happily revise as new evidence comes in.

  1. Coding agents become unremarkable. By late 2026, “I shipped this with an agent” is as ordinary as “I shipped this with autocomplete” was in 2023. The teams that aren’t using them at all start to look slow.
  2. MCP wins. The economics of standardisation are too strong. By end of 2026, MCP servers exist for every major SaaS and most major data platforms. Bespoke per-tool integration becomes a smell.
  3. The “AI agent platform” startups consolidate. Most are wrappers around models. Models with better native agent capabilities will absorb the value. The ones that survive will have proprietary tools, data, or domain expertise that the model can’t replicate.

What stays uncertain: the long-running, autonomous, multi-step agents that business audiences want. Those work in narrow domains and break at the edges. The next breakthrough on this is an open question, possibly a different reasoning architecture entirely, possibly just more reliable tools and better harnesses on the current curve.

Further reading