Loops: What Every AI Engineer Needs to Know in 2026

Source breakdown of “Loops: What Every AI Engineer Needs to Know in 2026” by Rahul (@sairahul1), which amplifies Peter Steinberger’s “design the loops that prompt your agents” and Addy Osmani’s loop engineering post. Cover image is the article’s own card, pulled from the X preview. Cross-checked against The New Stack’s reporting and Firecrawl’s loop-engineering guide. Everything below is my read as someone who has shipped AI systems into production games for eight years.

🤔 Curiosity: If You’re Still Typing Prompts, What Are You Actually Doing?

One quote in Rahul’s article stopped me mid-scroll. He cites Boris Cherny, the person who built Claude Code at Anthropic, saying:

“I don’t prompt Claude anymore. I write loops, and the loops do the work. My job is to write loops.”

Peter Steinberger (creator of OpenClaw, now at OpenAI) put the same idea more bluntly the day before: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” By that weekend, Google’s Addy Osmani had given the pattern a name — loop engineering — and the whole AI-dev timeline lit up.

Here’s why this hit a nerve for me. In game production, I’ve watched the exact same migration happen with NPC AI. We started by scripting every behavior by hand (if player_near: attack). Then we moved to behavior trees — we stopped writing the actions and started designing the structure that selects actions. The designer’s job changed from “write what the guard does” to “write the loop the guard runs every tick.” Loop engineering is that same promotion, applied to coding agents.

Curiosity: Everyone clipped Cherny’s quote. Almost nobody built what he described. So what is a “loop,” concretely, and what does it actually take to make one that’s safe to leave running while you sleep?

Let me retrieve the actual machinery.


📚 Retrieve: What a “Loop” Actually Is

For two years, working with a coding agent meant typing one prompt, reading the result, typing the next prompt. You were the scheduler and you were the quality gate. A loop inverts that arrangement. Osmani’s framing is the cleanest: loop engineering is “replacing yourself as the person who prompts the agent.”

The closest analogy is moving from operating a lathe to designing the production line the lathe sits on. A loop:

  1. Discovers work on a schedule (instead of you noticing it),
  2. Executes in an isolated workspace,
  3. Verifies the result with a second agent,
  4. Persists state to a file so tomorrow’s run resumes where today’s stopped.
flowchart LR
    Cron["⏰ Trigger<br/>(cron / webhook / file change)"] --> Read["📖 Read state<br/>(CLAUDE.md, JSON, board)"]
    Read --> Discover["🔍 Discover work<br/>(triage, scan, query)"]
    Discover --> Plan["🧠 Plan next action<br/>(model reads state, decides)"]
    Plan --> Exec["🛠️ Execute<br/>(isolated git worktree)"]
    Exec --> Verify{"✅ Verifier sub-agent<br/>(different instructions)"}
    Verify -->|Pass| Commit["💾 Commit + append state"]
    Verify -->|Fail| Plan
    Commit --> Stop{"🎯 Stopping<br/>condition met?"}
    Stop -->|No| Discover
    Stop -->|Yes| Done["🏁 Exit"]

    style Verify fill:#4ecdc4,stroke:#0a9396,stroke-width:3px,color:#000
    style Plan fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style Stop fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000

The 18-month escalator

The thing that makes this feel inevitable is the trajectory. We went up a full abstraction ladder in under a year and a half:

| Era | The unit of work | What you tune | Who pulls the trigger |
|---|---|---|---|
| Prompt engineering | A single message | Wording, examples | You, every turn |
| Context engineering | The window | What’s in scope (RAG, files) | You, every turn |
| Harness engineering | The tool/permission surface | MCP, hooks, allow-lists | You, per session |
| Loop engineering | The whole run | Schedule, verifier, memory | The loop |

Retrieve: A cron job runs a fixed script. A loop runs a model that reads the current state and chooses its next action. That decision logic in the middle is the entire difference — and the reason a loop can’t just be replaced by a shell script.
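One way to see that difference in code, as a minimal sketch. The names `fixed_script`, `choose_next_action`, and the action list are hypothetical, and `toy_policy` is a plain function standing in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical action names, used only for illustration.
ACTIONS = ("fix_failing_test", "update_dependency", "stop")

def fixed_script() -> list[str]:
    """What cron gives you: the same steps, in the same order, every run."""
    return ["fix_failing_test", "update_dependency"]

def choose_next_action(state: dict, policy: Callable[[dict], str]) -> str:
    """What a loop adds: a policy (normally a model call) reads state and decides."""
    action = policy(state)
    # Do not trust the policy blindly: anything outside the allow-list means stop.
    return action if action in ACTIONS else "stop"

def toy_policy(state: dict) -> str:
    """Stand-in for a model, so the sketch runs without any API."""
    if state["failing_tests"] > 0:
        return "fix_failing_test"
    if state["outdated_deps"] > 0:
        return "update_dependency"
    return "stop"

if __name__ == "__main__":
    print(fixed_script())
    print(choose_next_action({"failing_tests": 0, "outdated_deps": 3}, toy_policy))
```

The allow-list check is the design choice worth keeping when you swap in a real model: the decision is flexible, but the set of things it can decide is not.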

Osmani’s six primitives (and why both vendors already shipped them)

The reason the conversation exploded in June 2026 is that the building blocks were already finished. Osmani maps six primitives onto both OpenAI Codex and Claude Code, and the mappings are nearly identical: the parts were already assembled, and someone just had to point it out.

| Primitive | Job in the loop | Codex | Claude Code |
|---|---|---|---|
| Automations | Scheduled discovery + triage | Automations tab / triage inbox | Scheduled tasks, /loop, hooks, GitHub Actions |
| Worktrees | Isolate parallel agents | Built-in worktree per thread | git worktree, subagent isolation |
| Skills | Codify project knowledge | SKILL.md agent skills | SKILL.md agent skills |
| Connectors | Reach external tools | MCP connectors + plugins | MCP servers + plugins |
| Sub-agents | Separate the maker from the checker | .codex/agents/ | .claude/agents/, agent teams |
| Memory | Persist state between runs | AGENTS.md, Memories, Linear | CLAUDE.md, auto memory, Linear |

Both products even expose a /goal command that keeps an agent working until a verifiable stopping condition holds — with Claude Code using a separate model to grade the result.

Open loop vs. closed loop — pick on purpose

This is the distinction Rahul’s article underweights and the one that will save your token budget:

| | Open loop | Closed loop |
|---|---|---|
| Constraints | Loose: a goal, wide latitude | Bounded: defined path + eval per step |
| Best for | Discovery, research, “find everything” | Production work, recurring maintenance |
| Cost | Brutal on tokens | Cheaper, predictable |
| Stopping | Rubric satisfied (fuzzy) | Hard check passes (crisp) |
| Default? | Only when you mean it | ✅ Yes |
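The "Stopping" row is the easiest one to make concrete. Here is a rough sketch of the two kinds of stopping test; the function names, the rubric keys, and the 0.8 threshold are all assumptions chosen for illustration, not anything taken from Osmani's or Rahul's posts.

```python
def closed_loop_done(checks: dict[str, bool]) -> bool:
    """Crisp stop: every hard check (tests, lint, build) must pass."""
    # An empty set of checks is treated as "not done", never as success.
    return bool(checks) and all(checks.values())

def open_loop_done(rubric_scores: dict[str, float], threshold: float = 0.8) -> bool:
    """Fuzzy stop: the mean rubric score clears a threshold you picked."""
    if not rubric_scores:
        return False
    return sum(rubric_scores.values()) / len(rubric_scores) >= threshold

if __name__ == "__main__":
    print(closed_loop_done({"tests": True, "lint": False}))
    print(open_loop_done({"coverage": 0.9, "relevance": 0.9}))
```

The closed version has nothing to argue about, which is why it is cheaper to run unattended. The open version moves the judgment into the threshold and into whoever produces the scores.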

The verifier is the part that earns trust

If you remember one thing: a model grading its own output is too generous. The single most consequential design choice in a loop is splitting the agent that writes the code from the agent that checks it. A second agent with different instructions catches the failures the first one reasoned itself into.

Osmani’s canonical example: a morning automation triages yesterday’s CI failures, dispatches one sub-agent to draft each fix in an isolated worktree, and a second sub-agent reviews each fix against the project’s tests. Ramp’s Inspect built exactly this shape as bespoke infrastructure six months earlier — now it’s a first-party feature in both ecosystems.
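The shape of that example (one maker per failure, one separate checker per fix) can be sketched in a few lines. This is my own illustration under stated assumptions, not Osmani's or Ramp's code: `triage`, `toy_maker`, and `toy_checker` are hypothetical names, and the two toy functions stand in for agents that would be given different instructions.

```python
from typing import Callable

Maker = Callable[[str], str]          # failure id -> proposed patch text
Checker = Callable[[str, str], bool]  # (failure id, patch) -> accept?

def triage(failures: list[str], maker: Maker, checker: Checker) -> dict[str, list[str]]:
    """Draft one fix per failure, then let a separate checker accept or reject it."""
    report: dict[str, list[str]] = {"accepted": [], "rejected": []}
    for failure in failures:
        patch = maker(failure)
        bucket = "accepted" if checker(failure, patch) else "rejected"
        report[bucket].append(failure)
    return report

# Toy stand-ins so the sketch runs offline; real ones would call two agents.
def toy_maker(failure: str) -> str:
    return f"patch for {failure}"

def toy_checker(failure: str, patch: str) -> bool:
    # Pretend the project's tests only go green for non-flaky failures.
    return "flaky" not in failure

if __name__ == "__main__":
    print(triage(["test_login", "test_flaky_upload"], toy_maker, toy_checker))
```

Passing the maker and checker in as separate callables keeps the split structural: the checker never sees the maker's reasoning, only the failure and the patch.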

A minimal loop you can actually read

Here’s the skeleton stripped of vendor magic. The model is a subroutine inside your loop, not a chat partner. Note the four non-negotiables: a stopping condition, a no-progress guard, a budget cap, and a separate verifier.

# Curiosity: what's the smallest honest loop? Discover -> Act -> Verify -> Persist.
import json
import pathlib
import subprocess
from dataclasses import asdict, dataclass

STATE = pathlib.Path("loop_state.json")
MAX_ITERS = 8          # stopping condition: hard cap (persisted, so it spans restarts)
TOKEN_BUDGET = 200_000 # stopping condition: token budget, a proxy for dollars
NO_PROGRESS_LIMIT = 2  # stopping condition: bail if stuck

@dataclass
class LoopState:
    iteration: int = 0
    tokens_used: int = 0
    stale_rounds: int = 0
    done: bool = False
    last_signature: str = ""

def load() -> LoopState:
    # Memory: survive a crash, a context reset, a 3 a.m. process kill.
    if STATE.exists():
        return LoopState(**json.loads(STATE.read_text()))
    return LoopState()

def save(s: LoopState) -> None:
    STATE.write_text(json.dumps(asdict(s), indent=2))

def run_agent(role: str, instructions: str) -> dict:
    """Placeholder: wire this to your agent CLI or SDK and return its parsed result."""
    # In practice: claude -p "..."  /  codex exec "..."  inside a git worktree.
    raise NotImplementedError("connect run_agent to your agent runner")

def maker(state: LoopState) -> dict:
    """Agent #1: read repo state, pick ONE unit of work, produce a patch."""
    return run_agent(role="maker", instructions="Fix the top failing test. Output a diff.")

def verifier(patch: dict) -> bool:
    """Agent #2: DIFFERENT instructions. Never let the maker grade itself."""
    review = run_agent(
        role="verifier",
        # Point the verifier at the actual diff it is supposed to judge.
        instructions=f"Run the suite against the diff at {patch['diff_path']}. Pass only if all green.",
    )
    return review["all_tests_pass"] and not review["touches_unrelated_files"]

def loop():
    s = load()
    while not s.done and s.iteration < MAX_ITERS and s.tokens_used < TOKEN_BUDGET:
        s.iteration += 1
        patch = maker(s)
        s.tokens_used += patch["tokens"]

        if verifier(patch):                 # the trust gate
            subprocess.run(["git", "apply", patch["diff_path"]], check=True)
            sig = patch["signature"]
            s.stale_rounds = 0 if sig != s.last_signature else s.stale_rounds + 1
            s.last_signature = sig
            # Only a verified round is allowed to declare the goal met.
            s.done = bool(patch["goal_satisfied"])
        else:
            s.stale_rounds += 1             # no-progress guard

        save(s)                             # persist EVERY round, including the last one
        if s.stale_rounds >= NO_PROGRESS_LIMIT and not s.done:
            print("No progress: exiting before we burn the budget.")
            break

    print(f"Loop stopped at iter={s.iteration}, tokens={s.tokens_used}, done={s.done}")

if __name__ == "__main__":
    loop()

And the trigger is gloriously boring — a 2026 loop’s scheduling layer is just cron:

# Every weekday at 7am: triage overnight CI, let the loop draft + verify fixes.
0 7 * * 1-5  cd /repo && /usr/bin/python3 loop.py >> logs/loop.log 2>&1

💡 Innovation: What This Means If You Ship Games (or Anything Live)

Coding agents got the headlines, but the loop pattern is really about any system with recurring, verifiable work — which describes a live game almost perfectly. Here’s where I’d point it first.

Game-production loops worth building

flowchart TB
    subgraph LiveOps["🎮 Live-Ops Balance Loop"]
        A1["⏰ Nightly trigger"] --> A2["📊 Read yesterday's<br/>telemetry (win rates, churn)"]
        A2 --> A3["🧠 Maker: propose<br/>tuning deltas"]
        A3 --> A4["🤖 Verifier: re-sim<br/>10k matches"]
        A4 -->|Within guardrails| A5["📝 Draft PR for<br/>human sign-off"]
        A4 -->|Breaks guardrails| A3
    end

    style A4 fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#000
    style A3 fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff

| Loop | Trigger | Maker does | Verifier checks | Stops when |
|---|---|---|---|---|
| Balance triage | Nightly telemetry | Propose tuning deltas | Re-simulate matches vs. guardrails | Win-rate band restored |
| Content QA | New asset committed | Generate edge-case test scenes | Render + diff against golden frames | Zero visual regressions |
| Crash-fix | Sentry spike webhook | Draft fix in a worktree | Repro test passes, no new crashes | Crash rate normalizes |
| Localization | String table change | Translate + lore-check | Back-translation + length-fit check | All locales pass |

The balance-tuning loop is the one I’d ship first, because it has a crisp verifier: re-running the match simulator is a closed, automatable eval. That’s the whole game — if you can’t write the check, you don’t have a loop, you have a runaway agent.
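To show what "crisp" means here, a toy version of that verifier follows. Everything in it is an assumption for illustration: the simulator is a seeded coin flip rather than a real match simulator, `attack_bonus` is a made-up tuning knob, and the 45% to 55% band is a placeholder guardrail.

```python
import random

def simulate_win_rate(attack_bonus: float, matches: int = 10_000, seed: int = 7) -> float:
    """Toy simulator: faction A wins each match with probability 0.5 + attack_bonus."""
    rng = random.Random(seed)  # seeded so the verifier gives the same answer every run
    p_win = min(max(0.5 + attack_bonus, 0.0), 1.0)
    wins = sum(rng.random() < p_win for _ in range(matches))
    return wins / matches

def within_guardrails(win_rate: float, low: float = 0.45, high: float = 0.55) -> bool:
    """Crisp check: the proposed tuning keeps the win rate inside the band."""
    return low <= win_rate <= high

def verify_delta(attack_bonus: float) -> bool:
    """Accept a proposed tuning delta only if the re-simulated win rate stays in band."""
    return within_guardrails(simulate_win_rate(attack_bonus))

if __name__ == "__main__":
    for bonus in (0.0, 0.02, 0.2):
        print(bonus, verify_delta(bonus))
```

Seeding the simulator matters more than it looks: a verifier that flips between pass and fail on the same input gives the maker nothing stable to converge on.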

Innovation: In game AI we already trust self-running loops — that’s what a behavior tree tick is. Loop engineering just swaps the hand-authored selector for a model that reads state and chooses. The discipline that keeps NPCs from glitching — bounded ticks, clear transitions, a watchdog — is the same discipline that keeps an agent loop from burning your token budget at 3 a.m.

The honest tradeoffs

Osmani is more cautious than the hype he triggered, and so am I:

  • Runaway cost. Without an iteration cap, a no-progress check, and a dollar budget, an open loop can burn a billing cycle over a weekend. All three guards are in the skeleton above for a reason.
  • Comprehension debt. This is the sharp one. A loop ships code you never read. “Two engineers can run an identical loop and get opposite outcomes — one moving faster on work they understand, the other avoiding understanding altogether.” The verifier protects correctness; nothing protects comprehension except you reading the diffs.
  • It might just be cron in a hat. The skeptics aren’t wrong about trivial loops. The value only appears when the middle step needs judgment — when a fixed script genuinely can’t decide what to do next.
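For the runaway-cost point, the three guards can live in one small object that the loop consults every round. A minimal sketch, assuming a flat price per thousand tokens; `BudgetGuard`, its field names, and the example rate are hypothetical, so substitute your provider's real pricing.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BudgetGuard:
    max_dollars: float
    dollars_per_1k_tokens: float  # hypothetical flat rate, not a real price
    max_iterations: int
    max_stale_rounds: int
    spent: float = 0.0
    iterations: int = 0
    stale_rounds: int = 0

    def record(self, tokens: int, made_progress: bool) -> None:
        """Account for one finished round of the loop."""
        self.iterations += 1
        self.spent += tokens / 1000 * self.dollars_per_1k_tokens
        self.stale_rounds = 0 if made_progress else self.stale_rounds + 1

    def stop_reason(self) -> Optional[str]:
        """Return why the loop must stop, or None if it may continue."""
        if self.spent >= self.max_dollars:
            return "budget"
        if self.iterations >= self.max_iterations:
            return "iterations"
        if self.stale_rounds >= self.max_stale_rounds:
            return "no_progress"
        return None

if __name__ == "__main__":
    guard = BudgetGuard(max_dollars=1.0, dollars_per_1k_tokens=0.01,
                        max_iterations=5, max_stale_rounds=2)
    guard.record(tokens=10_000, made_progress=False)
    guard.record(tokens=10_000, made_progress=False)
    print(guard.stop_reason())
```

Returning a reason string rather than a bare boolean means the log line tells you which guard tripped, which is the first thing you want to know on Monday morning.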

Key takeaways

| Insight | Implication | Next step |
|---|---|---|
| The model got demoted to a subroutine | Stop optimizing prompts; optimize the loop | Pick one recurring task with a crisp check |
| The verifier earns the trust | Never let the maker grade itself | Write the eval before the loop |
| Memory is boring on purpose | A markdown checklist or JSON file is enough | Have every run read at start, append at end |
| Closed loops are the default | Open loops are for funded exploration only | Add caps: iterations, no-progress, budget |
| Portability is the open question | Loop definitions aren’t standardized yet | Keep your loop logic vendor-agnostic |
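The "read at start, append at end" habit from the memory row needs very little code. A small sketch, assuming a JSON Lines file as the store; the file name `runs.jsonl` and the record fields are placeholders of mine.

```python
import json
import pathlib
import tempfile

def read_history(path: pathlib.Path) -> list[dict]:
    """Read every prior run record; a missing file just means this is the first run."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def append_run(path: pathlib.Path, record: dict) -> None:
    """Append one JSON line per run so earlier history is never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    log = pathlib.Path(tempfile.mkdtemp()) / "runs.jsonl"
    history = read_history(log)                       # read at start
    append_run(log, {"run": len(history) + 1, "fixed": ["test_login"]})  # append at end
    print(read_history(log))
```

Append-only is the point: a run that crashes halfway can lose its own record, but it cannot corrupt what earlier runs wrote.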

New questions this raises

  • The New Stack notes the vendor that makes loop definitions portable takes pole position. Will we get an AGENTS.md-style standard for loops, or will each of us keep hand-rolling loop.py?
  • For games specifically: what’s the right human-in-the-loop checkpoint for a balance loop — sign off every PR, or only when deltas exceed a threshold? Where does autonomy stop being a feature and start being a liability to your players’ trust?
  • If the verifier is what earns trust, who verifies the verifier? When the checker is also a model, the comprehension-debt problem just moves up one level.

This post is licensed under CC BY 4.0 by the author.