Jaypore Labs
Engineering

Plan vs. act: the agent loop everyone gets wrong

Most agent failures trace to one missing step — the explicit plan before action. The two-step loop is the design.

Yash Shah · March 25, 2026 · 6 min read

I've been called in to audit twelve broken production agents over the last eighteen months. Different companies. Different domains — billing, HR, support, ops. Different vendors and frameworks underneath. Eleven of the twelve failed in the same way.

The user asks for something complex. The agent immediately starts taking actions. Halfway through, it realises step three should have been step five. Or it does step four, which conflicts with step two. Or it does six steps and forgets why it started in the first place. Then it does a seventh step that undoes the first.

The root cause isn't the model. It's the architecture. Agents that work separate planning from acting. Agents that fail conflate them.

The two-step loop

The pattern that survives production:

Step 1 — Plan. Given the user's request and the current state, the agent produces a plan. The plan is explicit, ordered, and reviewable. It says what will happen, in what order, with what tools, and what conditions cause it to revise.

Step 2 — Act. The agent executes the plan one step at a time. After each step, it checks: did this work? Do I need to revise the plan? Am I still on track?

Both steps use the model. They use it differently. Planning is reflective; acting is procedural. Mixing them produces the failure modes that destroy production agents. In code, the loop is roughly this:

async def run_agent(user_request: str, ctx: RunContext) -> AgentResult:
    plan = await generate_plan(user_request, ctx)
    await ctx.persist_plan(plan)  # for audit, replay, and human review
    await ctx.surface_for_review(plan)  # if the plan needs human signoff

    step_idx = 0
    while step_idx < len(plan.steps):  # while, not for: replanning can change plan.steps
        result = await execute_step(plan.steps[step_idx], ctx)
        await ctx.persist_step_result(step_idx, result)

        if result.is_terminal_failure:
            return AgentResult.failed(reason=result.reason, last_step=step_idx)

        if result.requires_replan:
            # the revised plan keeps completed steps and may rewrite the rest
            plan = await revise_plan(plan, step_idx, result, ctx)
            await ctx.persist_plan(plan)
            await ctx.surface_for_review(plan)

        if plan.is_complete(step_idx):
            return AgentResult.complete(outputs=result.outputs)

        step_idx += 1

    return AgentResult.complete()

Two model calls per request, minimum. The first builds the plan. Subsequent calls execute or revise. The structure is the gain — not the cleverness of any single prompt.

Why one-shot fails

The "ask the model to do X" pattern fails because:

  • Hidden complexity. "X" usually has dependencies the model only discovers mid-execution. Without a plan, the agent can't reason about the dependencies before starting.
  • Forgetting. Long action sequences exhaust context. The agent forgets why it's doing what it's doing.
  • Drift. Each action reveals new information. Without a plan to revise, the agent can't react coherently.
  • No reviewability. Without a plan, there's nothing for a human to review before action begins.
  • No auditability. When something goes wrong, there's no artifact to inspect. You're left re-reading a transcript trying to reverse-engineer what the agent thought it was doing.

A plan changes all of this. Even a bad plan beats no plan, because a bad plan is visible.
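For contrast, the anti-pattern looks roughly like this: a single loop in which the model decides and acts in the same breath. This is a minimal sketch; `decide` and `run_tool` are hypothetical stand-ins for a model call and tool execution, not a real API.

```python
from typing import Callable

def run_one_shot(
    request: str,
    decide: Callable[[str, list[str]], str],   # stand-in for a model call
    run_tool: Callable[[str], str],            # stand-in for tool execution
    max_steps: int = 10,
) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        action = decide(request, history)      # decide *and* act, interleaved
        if action == "done":
            break
        history.append(run_tool(action))       # nothing recorded but the transcript
    return history                             # no plan to review, no artifact to audit
```

The only record of what happened is the transcript itself, which is exactly the debugging problem described above.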

Plan-as-doc

The plan is a document. Not a vibes-level intention. The format that works in production:

{
  "plan_id": "pln_a8c3f0",
  "request_id": "req_4471",
  "goal": "Find all customer accounts that haven't been billed correctly this month and draft refund emails.",
  "steps": [
    {
      "n": 1,
      "verb": "query",
      "tool": "billing_search",
      "inputs": {"period": "2026-04", "status": "invoiced"},
      "expected_output": "list of invoices",
      "depends_on": []
    },
    {
      "n": 2,
      "verb": "filter",
      "tool": "logic",
      "inputs": {"rule": "line items mismatch contract"},
      "expected_output": "list of mismatched invoices",
      "depends_on": [1]
    },
    {
      "n": 3,
      "verb": "draft",
      "tool": "email_template",
      "inputs": {"template": "refund_explanation", "per": "mismatched_invoice"},
      "expected_output": "list of email drafts",
      "depends_on": [2]
    },
    {
      "n": 4,
      "verb": "pause",
      "reason": "human review required before sending refund emails"
    }
  ],
  "decision_points": [
    {"after_step": 2, "condition": "0 mismatches", "action": "complete with no-op"}
  ],
  "stop_conditions": {
    "success": "step 4 reached, drafts persisted, human notified",
    "failure": "any step returns terminal_failure"
  },
  "estimated_cost_usd": 0.18,
  "estimated_duration_s": 90
}

That's a plan a human can read and approve in about 30 seconds. It's also a plan a regulator can read in three years if asked. Producing it takes one model call, and executing against it is significantly more reliable than executing freeform.

Act-as-tool

The acting step is mechanical:

  • Execute the next planned step.
  • Read the result.
  • Compare to expectation.
  • Revise the plan if needed.
  • Repeat.
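The per-step mechanics above can be sketched as follows. The shapes of `Step` and `StepResult`, and the `run_tool` and `matches` callables, are assumptions for illustration, not a fixed API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    tool: str
    inputs: dict
    expected_output: str  # e.g. "list of invoices"

@dataclass
class StepResult:
    output: Any
    matches_expectation: bool
    requires_replan: bool

def execute_step(
    step: Step,
    run_tool: Callable[[str, dict], Any],
    matches: Callable[[Any, str], bool],
) -> StepResult:
    output = run_tool(step.tool, step.inputs)   # execute the planned step
    ok = matches(output, step.expected_output)  # compare result to expectation
    # a mismatch doesn't abort the run; it flags the plan for revision
    return StepResult(output=output, matches_expectation=ok, requires_replan=not ok)
```

The key design choice is that the act phase never silently swallows a divergence: a result that doesn't match the expectation is surfaced as a replan signal rather than improvised around.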

The model's role here is narrower than in planning. It executes the step using tools. It doesn't deliberate. The structure is what makes long-horizon work tractable. A subtle but important point: the act phase often uses a smaller, faster, cheaper model than the plan phase. The plan phase needs reasoning capacity. The act phase needs reliability and speed.

def select_model_for_phase(phase: Literal["plan", "act"]) -> str:
    if phase == "plan":
        return "claude-opus-4-7"        # smarter, more expensive
    return "claude-haiku-4-5"           # faster, cheaper, sufficient for execution

For the billing example above, we'd pay full rate for the planning call (one call, ~3K input tokens, ~1K output) and a much lower rate for the four execution calls. The total cost of a typical run lands around $0.18, two-thirds of which is the plan.
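The split falls out of simple arithmetic. A back-of-envelope sketch; the per-token rates and execution-call token counts here are placeholders, not published pricing:

```python
PLAN_IN, PLAN_OUT = 3_000, 1_000   # tokens, from the example above
ACT_IN, ACT_OUT = 5_000, 2_000     # tokens per execution call, assumed

# $/token, hypothetical rates chosen only to illustrate the ratio
PLAN_RATE_IN, PLAN_RATE_OUT = 15 / 1e6, 75 / 1e6
ACT_RATE_IN, ACT_RATE_OUT = 1 / 1e6, 5 / 1e6

plan_cost = PLAN_IN * PLAN_RATE_IN + PLAN_OUT * PLAN_RATE_OUT    # one planning call
act_cost = 4 * (ACT_IN * ACT_RATE_IN + ACT_OUT * ACT_RATE_OUT)   # four execution calls
total = plan_cost + act_cost
```

Under these assumed rates the run costs about $0.18, with the single planning call accounting for roughly two-thirds of it, which is why routing execution to a cheaper model class pays off.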

A real loop

The billing example, run end to end:

[plan]   Step 1: query billing for April invoices    → 47 results
[act]    Step 2: filter against contracts            → 3 mismatches
[act]    Step 3: draft 3 refund emails               → 3 drafts
[pause]  Step 4: human review

Total cost: $0.21
Total duration: 84s
Human review time: 4m 12s
Sends approved by: priya@example.com at 14:33 UTC

Without the plan, the agent might have started drafting emails while still discovering mismatches. Edge cases would have been handled inconsistently. The reviewer would have had to reverse-engineer what the agent did. With the plan, every action is traceable to a numbered step in a versioned document.

When something goes wrong — and something always eventually goes wrong — the plan is what you debug. You read it. You see where reality diverged from expectation. You fix the plan template. You re-run the eval. You ship the change. Without the plan, you debug a transcript and guess.

Where most teams plateau

Many teams' first agents are one-shot. They demo well. They survive the simple cases. They fail in production when the cases get complex, and the team blames the model. The blame is misplaced. The model is fine. The architecture skipped a step.

The shift to plan-then-act is more a discipline than a technical change. The model is the same. The architecture is different. The reliability gain is large — typically a 3-5x reduction in catastrophic failure rate, in our experience across deployments.

Close

The plan-vs-act loop is the architectural decision that makes agents production-grade. The plan is reviewable. The acting is procedural. The model is used twice — for different work, with different prompts, often with different model classes. The reliability gain compounds.

If you're building an agent, write down the plan-vs-act split before writing any code. The architecture follows.

We build AI-enabled software and help businesses put AI to work. If you're building production agents, we'd love to hear about it. Get in touch.

Tagged
AI Agents · Agent Architecture · Engineering · Building Agents