khairold

Autopilot: unattended AI execution

A shell loop that runs an AI agent repeatedly against a plan — each iteration gets a fresh context window, does one thing, verifies the build, and exits. No human needed.

Automation · AI Collaboration · Workflow · Protocol

The .plan/ protocol solved context continuity. Four files that let any AI session pick up exactly where the last left off. But there was still a bottleneck: me.

Every session, I’d open the terminal, tell the agent to read the plan files, confirm its scope, watch it work, then tell it to close the session. For a 10-item phase, that’s 10 rounds of me sitting there saying “yes, proceed” and “okay, close it out.” I was the human-in-the-loop for work that didn’t need a human in the loop.

So I removed myself.

The insight

The technique is called “Ralph Wiggum,” coined by Geoffrey Huntley and written up well by AI Hero: instead of one long AI session that degrades as context fills up, run many short sessions where each one starts fresh, does one thing, and exits cleanly.

The .plan/ files already provided the shared memory between sessions. The execute-phase skill already defined the start/execute/close protocol. All that was missing was a loop.

┌──────────────────────────────────────────────┐
│             autopilot.sh (loop)              │
│                                              │
│  while unchecked items remain:               │
│    ┌────────────────────────────────────┐    │
│    │  pi -p (fresh context window)      │    │
│    │                                    │    │
│    │  1. Read .plan/ files              │    │
│    │  2. Pick next item(s)              │    │
│    │  3. Execute                        │    │
│    │  4. Verify build                   │    │
│    │  5. Update .plan/ files            │    │
│    │  6. Exit                           │    │
│    └────────────────────────────────────┘    │
│    git commit                                │
│    build gate (if fails → stop)              │
│    stuck detection (3x → stop)               │
│    sleep, loop                               │
└──────────────────────────────────────────────┘

Each invocation of pi gets a completely fresh context window. No accumulated junk. No degraded attention. The agent reads the plan files — which are updated and accurate because the previous iteration just wrote them — orients itself in seconds, and starts working.

The plan files are the communication channel between iterations. One iteration’s closing step is the next iteration’s briefing.
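The loop itself is only a few lines. A minimal sketch of its shape — not the real autopilot.sh; `PI_CMD`, the commit message, and the grep-based item counting are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the autopilot loop. Assumes `pi -p` runs one agent session
# to completion; PI_CMD / BUILD_CMD can be overridden for other setups.
set -u

PI_CMD="${PI_CMD:-pi -p}"
BUILD_CMD="${BUILD_CMD:-bun run build}"

# Count unchecked "- [ ]" items in the plan (illustrative format).
count_unchecked() {
  [ -f .plan/PLAN.md ] || { echo 0; return; }
  grep -c '^- \[ \]' .plan/PLAN.md || true
}

while [ "$(count_unchecked)" -gt 0 ]; do
  # Fresh context window: the session orients itself from .plan/ alone.
  $PI_CMD "Follow the autopilot-execute skill: one iteration, then exit."

  # Checkpoint first, so a failed build gate can still be rolled back.
  git add -A && git commit -q -m "autopilot: iteration checkpoint"

  # Build gate: don't trust the agent's self-reported success.
  if ! $BUILD_CMD; then
    echo "Build failed — stopping for human review." >&2
    exit 1
  fi
  sleep 2
done
echo "No unchecked items remain."
```

The real wrapper adds stuck detection, phase boundaries, and notifications on top of this skeleton, but the core is exactly this: run a fresh session, commit, verify, repeat.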

Two layers

The system has two layers that do very different jobs.

The skill: what the agent does

The autopilot-execute skill is a set of instructions the agent follows every iteration. It’s written as direct address to the agent — imperative, autonomous, no ambiguity:

You are running unattended. No human is present. Be fully autonomous.
Do not ask questions. Do not wait for confirmation. Make decisions and log them.

The skill defines five steps:

Orient — Read PLAN.md, MEMORY.md, DRIFT.md. Know where things stand. Print a status line so the logs are scannable.

Scope — Look at the next unchecked item. If it’s small, combine it with the next one or two. If it’s large, do just that item. The agent makes this judgment call itself.

Execute — Do the work. Follow patterns from MEMORY.md. If something unexpected happens, make a decision and log it in DRIFT.md. No stopping to ask.

Verify — Run the build command. If it passes, proceed. If it fails, try to fix it (two attempts). If still broken, leave the item unchecked, log the failure, and exit. Don’t make things worse.

Close — Update all plan files. Check off completed items. Add new decisions to MEMORY.md. Write a session log entry with a handoff note. This step is mandatory — skip it and the next iteration starts blind.
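The verify step’s two-attempt policy can be sketched in shell. This is illustrative, not the skill’s actual wording — `try_fix` is a hypothetical stand-in for the agent reading the build error and patching code:

```shell
# Sketch of the verify step: build, retry a fix at most twice,
# then give up cleanly. BUILD_CMD defaults to the project's build gate.
verify_with_retries() {
  local attempts=0
  until ${BUILD_CMD:-bun run build}; do
    attempts=$((attempts + 1))
    if [ "$attempts" -gt 2 ]; then
      echo "Still broken after 2 fix attempts — leaving item unchecked." >&2
      return 1   # exit cleanly; the next iteration or a human picks it up
    fi
    try_fix   # hypothetical: agent inspects the error, attempts a patch
  done
  echo "Build passed."
}
```

The point of the cap is the last line of the skill’s instruction: don’t make things worse. Two attempts bounds the blast radius of a bad fix.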

The shell wrapper: what keeps things safe

autopilot.sh is a bash script that runs the loop. It handles everything the agent shouldn’t think about:

Build gate. After each iteration, the wrapper runs the build command independently — it doesn’t trust the agent’s self-reported success. If the build fails, autopilot stops and tells you where to look.

Stuck detection. If three consecutive iterations complete zero new items, something’s wrong. The agent is probably trying and failing at the same task. Autopilot stops and says “human intervention needed.”

Git checkpoints. Before and after each iteration, git commit. Every iteration is a clean commit. If iteration 7 breaks something, git reset --hard back to iteration 6. You never lose more than one iteration of work.

Phase boundaries. The wrapper stops at the end of each phase. Phases are natural review points — you glance at what was built, confirm it looks right, then restart for the next phase. This is the one human checkpoint in the whole system.

Notifications. macOS say and notification center alerts when something needs attention. You can be in another room.
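Of these, stuck detection is the least obvious and the cheapest: it’s just a counter in the wrapper. A sketch, with illustrative names:

```shell
# Sketch of stuck detection: stop after 3 consecutive iterations that
# complete zero new items. Counter and function names are illustrative.
STUCK_LIMIT=3
stuck_count=0

check_progress() {
  local before="$1" after="$2"   # completed-item counts around one iteration
  if [ "$after" -gt "$before" ]; then
    stuck_count=0                # any progress resets the counter
  else
    stuck_count=$((stuck_count + 1))
  fi
  if [ "$stuck_count" -ge "$STUCK_LIMIT" ]; then
    echo "⛔ $stuck_count iterations with no progress — human intervention needed." >&2
    # (the real wrapper also fires macOS `say` / notification center here)
    return 1
  fi
}
```

The counter resets on any progress, so a single slow iteration doesn’t trip the breaker — only a genuine plateau does.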

Configuration is one file:

# autopilot.config
BUILD_CMD="bun run build"
PROJECT_DIR="."

The wrapper reads autopilot.config for project-specific settings. That’s the only configuration. No editing the wrapper. No editing the skill. One config file, two values.
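Since the config is plain shell variable assignments, loading it is nothing more than sourcing it over defaults — roughly:

```shell
# Sketch of config loading: set defaults, then let autopilot.config
# override them. Not the wrapper's exact code.
load_config() {
  BUILD_CMD="bun run build"   # defaults, overridden by the config file
  PROJECT_DIR="."
  if [ -f autopilot.config ]; then
    # shellcheck source=/dev/null
    . ./autopilot.config
  fi
}
```

No parser, no schema, no YAML. If the file is absent, the defaults stand.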

Running right now

As I write this, autopilot is running on a real project: a CRA-to-Astro migration of a commercial TV streaming portal. 69 plan items across 5 phases. 138 pages. 22 React islands.

The STATUS file updates in real time:

09:17:53 ═══ AUTOPILOT STARTED ═══
09:17:53 Progress: 42/69 (27 remaining)
09:17:54 🔄 Iteration 1/30 — Phase 4 — 17 items left in phase
09:17:54   🚀 Launching pi (iteration 1)...
09:22:49   ✅ Build passed (138 pages in 63.17s)
09:22:49   📦 Completed 1 item(s) (total: 43)
09:22:54 🔄 Iteration 2/30 — Phase 4 — 16 items left in phase
09:22:54   🚀 Launching pi (iteration 2)...
09:28:27   ✅ Build passed (138 pages in 29.25s)
09:28:27   📦 Completed 2 item(s) (total: 45)

Each iteration: about 5 minutes. Orient, scope, execute, verify, close. The agent ported an OAuth PKCE library in one iteration, then created 4 nanostore files plus an API client in the next — batching small items together because it judged they were related.

Earlier today, autopilot ran Phase 3 of the same project. Four iterations. Eight items completed. The agent batched every iteration into pairs — UpickPage + CatchupPage, AboutPage + BusinessPackPage, FreeViewing + ActivateAppPage — because it recognized they shared similar structure. Zero build failures across all iterations.

I wasn’t watching. I was writing this post.

The scoping judgment

The most interesting design choice is letting the agent decide its own scope. The skill says:

If the item is small (create a helper function, copy assets, add a config value)
→ combine it with the next 1-2 unchecked items in the same phase

If the item is medium (port a single page, implement a feature)
→ do just that item

If the item is large (complex multi-file task)
→ do just that item

In practice, the agent is good at this. During the TV migration, it consistently combined small items (UpickPage + CatchupPage are structurally identical — the agent noticed) while giving complex items like the OAuth auth library their own iteration.

The decisions get logged. Every choice appears in SESSION-LOG.md or MEMORY.md. Here’s a real handoff from one iteration to the next:

## Session 21 — Port UpickPage + CatchupPage (2026-03-05)

**What happened:**
- Created VodRailsSection.tsx — a reusable React island
- Ported UpickPage.js → upick.astro
- Ported CatchupPage.js → catchup.astro

**Issues encountered:**
- None. Both pages share identical structure, making them
  good candidates for combining into one session.

**Handoff to next session:**
- Next: 3.7 — Port AboutPage.js → newunifitv.astro
- 6 items remain in Phase 3

The next iteration reads this, knows exactly where to start, and never asks “what should I do?” It reads the plan, sees item 3.7 unchecked, and starts working.

The dashboard

In a second terminal, autopilot-dashboard.sh shows real-time progress:

  ╔═══════════════════════════════════════════════╗
  ║          🤖  AUTOPILOT DASHBOARD              ║
  ╚═══════════════════════════════════════════════╝

  Overall Progress
  [█████████████████████████████░░░░░░░░░░░░░░░░░] 45/69 (65%)

  Phase Breakdown
  ✅ Phase 1 Scaffold & Layout          15/15 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓]
  ✅ Phase 2 Core Marketing Pages       15/15 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓]
  ✅ Phase 3 Remaining Pages & Support  12/12 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓]
  🔄 Phase 4 Auth & Transactional       3/17 [▓▓▓░░░░░░░░░░░░░░░░░]
  ⬜ Phase 5 Polish & Deploy             0/10 [░░░░░░░░░░░░░░░░░░░░]

  Next Item: 4.4 — Create src/lib/taas.ts

Progress bars. Phase status. Build results. Recent activity log. It refreshes every 2 seconds. You can watch an entire phase get built in real time without touching anything.
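The bar itself is simple integer arithmetic over the plan’s item counts. A sketch of how a dashboard like this can draw it (not the actual autopilot-dashboard.sh):

```shell
# Sketch of a progress-bar renderer. done/total would come from
# counting checked vs. all items in PLAN.md; width is the bar length.
progress_bar() {
  local done="$1" total="$2" width="${3:-20}"
  local filled=$(( done * width / total ))
  local pct=$(( done * 100 / total ))
  local bar="" i
  for (( i = 0; i < width; i++ )); do
    if (( i < filled )); then bar+="▓"; else bar+="░"; fi
  done
  printf '[%s] %d/%d (%d%%)\n' "$bar" "$done" "$total" "$pct"
}
```

Redrawing that every couple of seconds inside a `watch`-style loop is the whole dashboard.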

Why fresh context windows matter

The temptation with AI coding is to keep one long session running. “The agent already has all the context — why restart?” Because context windows degrade.

After 50,000 tokens of conversation history, the agent’s attention fragments. It forgets instructions from early in the session. It gets confused about which version of a file is current. It starts making mistakes it wouldn’t make at the beginning.

Fresh context windows sidestep this entirely. Each iteration starts with ~2,000 tokens of plan file context (PLAN.md + MEMORY.md + DRIFT.md) and a full, clean attention budget for the actual work. The plan files are a compressed, curated summary — not a raw conversation history full of debugging tangents and dead ends.

The TV migration’s MEMORY.md is 92 decisions deep. That sounds like a lot, but the file is structured — tables, categories, searchable patterns. The agent reads it in seconds and has full context for every decision made across 26 previous sessions. No conversation history could match that density.

What the agent decides on its own

The skill gives the agent real autonomy:

Scope. “This item is small — I’ll combine it with the next two.” Or: “This is complex — just this one.”

Approach. MEMORY.md says “follow the patterns from Phase 1.” The agent reads previous implementations and replicates the approach. No one tells it which file to model.

Recovery. Build fails? The agent reads the error, attempts a fix. Two tries. If it can’t fix it, it marks the item with a warning, logs what it tried, and exits cleanly. The next iteration — or a human — can pick it up.

Drift. Something doesn’t match the plan? The agent notes it in DRIFT.md and keeps going. In the TV migration, the agent discovered that FreeViewing.js was a transactional auth page, not the “mostly static” page the plan described. It logged the drift, created a shell page instead, and moved on. No stopping to ask “should I change the plan?” — it made the judgment, did the work, and documented the deviation.

The decisions get logged. Every choice appears in SESSION-LOG.md or MEMORY.md. Autonomy doesn’t mean opacity — it means deciding now and explaining later.

Safety as architecture

I don’t trust AI to never make mistakes. The system is designed around that assumption.

The build gate is the hard boundary. The agent can write whatever code it wants. But if bun run build fails, the iteration doesn’t count. The wrapper stops. Nothing ships broken. In the TV migration — 6 autopilot iterations, 6 successful builds. 113 pages, then 115, then 138. The page count itself tells the story of progress.

Git checkpoints are the undo button. Every iteration is bracketed by commits. The worst case is rolling back one iteration. You never lose significant work.

Stuck detection is the circuit breaker. Three failed iterations in a row means the problem is beyond what the agent can solve alone. Instead of burning tokens on infinite retries, it stops and says “help.”

Phase boundaries are the review points. The system doesn’t run from start to finish unattended. It pauses at each phase boundary. You review what was built, confirm it’s correct, then start the next phase. The granularity of human review matches the granularity of meaningful progress.

The safety isn’t a single mechanism — it’s layered. Build gate catches broken code. Stuck detection catches repeating failures. Git checkpoints catch anything else. Phase boundaries catch architectural drift. Each layer catches what the others miss.

The evolution

This system didn’t appear fully formed. It evolved through three stages, each removing a bottleneck:

Stage 1: .plan/ protocol. Manual session management. I start each session, tell the agent what to do, close the session, update the files. The bottleneck is me doing the ceremony.

Stage 2: execute-phase skill. The agent manages the session protocol itself — reads plan files at start, updates them at end. But I still watch each session and confirm scope. The bottleneck is me approving things that don’t need approval.

Stage 3: autopilot. The wrapper runs the loop. The agent runs unattended. I review at phase boundaries. The bottleneck is now the phase itself — the natural unit of reviewable progress.

Each stage automated the least-valuable human action from the previous stage. Stage 1 automated context recall. Stage 2 automated session management. Stage 3 automated the start/confirm/close loop.

The TV migration shows this evolution in a single day. Phases 1 and 2 used execute-phase — I watched each session, confirmed scope, reviewed results. By Phase 3, the patterns were established, the MEMORY.md was rich, and I trusted the process. I switched to autopilot and went to do other things. Four iterations later, Phase 3 was done. Now Phase 4 is running while I write about how it works.

This is the “work on the system” principle applied recursively: the system for building things was itself the bottleneck, so I built a system for running the system.

Setup

The whole thing is three files. No dependencies beyond pi and bash.

# 1. Create a phased plan (once per project)
pi "create a phased plan for this project"

# 2. Add autopilot config
echo 'BUILD_CMD="bun run build"' > autopilot.config

# 3. Run
./autopilot.sh --phase 3

# In a second terminal, watch progress
./autopilot-dashboard.sh

It works with any build system — npm, cargo, pytest, make. Change the build command and the rest is the same. The agent doesn’t care what language you’re using. The wrapper doesn’t care what the agent does. The plan files don’t care who reads them.
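For instance, pointing the same wrapper at a hypothetical Rust project would just mean a different config (values illustrative):

```shell
# autopilot.config for a hypothetical Rust project —
# same wrapper, same skill, different build gate.
BUILD_CMD="cargo build --all-targets"
PROJECT_DIR="."
```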

That decoupling is the point. The three layers — plan files, agent skill, shell wrapper — each do one thing. They compose into unattended execution without any of them being complex.


The best automation isn’t the kind that replaces you. It’s the kind that lets you be absent for the parts that don’t need you, so you can be present for the parts that do. Autopilot runs the routine. Phase boundaries are where I show up.