Scenario Playbook

Use the workshop techniques on real web projects

Oracle-grounded user stories for the toolkit topics. These are practical situations a web developer can hand to Codex, with browser-proof moves, implementation references, and definitions of done.

Part 1 - Morning

Lesson 01 - Escape the Token Ceiling

Oracle handoff for a huge codebase question

Toolkit artifact

Working web developers routinely hit questions that are too broad for a local coding-agent thread: auth architecture, route ownership, stale data bugs, migration blast radius, or a tangled UI flow. Oracle handoff keeps Codex focused on local implementation while a browser ChatGPT Pro session reads the huge PackX bundle and returns a plan the agent can verify in the running app.

01

Unblock a tangled Next.js auth redirect bug

As a senior full-stack web developer owning a Next.js app

A production-like local Next.js app has a login redirect loop that only appears after mixing middleware, server components, API route auth checks, and client-side session refresh. The issue spans many files, so asking Codex to read everything directly would burn the interactive session budget.

As a senior full-stack web developer, I want Codex to offload the broad auth-flow investigation through Oracle and PackX, so that Codex can spend its local context on implementing and verifying the fix instead of reading the whole repo.

Agent brief

You are my coding agent in this repo. Investigate the auth redirect loop by using the Oracle codebase handoff workflow, then implement only the locally verifiable fix. Do not skip the PackX preview. Create `.notes/` if needed. First run `packx --preview -s "Next.js auth redirect middleware session login callback API route"` and inspect the matched files and token estimate. If the scope is reasonable, run `packx -s "Next.js auth redirect middleware session login callback API route" -f markdown --no-interactive -o .notes/handoff-bundle.md`. Then send the bundle through Oracle browser mode with a 3-5 word kebab slug, and make the prompt start with the same bracketed slug on line 1: `ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[auth-redirect-loop]\n\nWe have a Next.js auth redirect loop. Read the attached PackX bundle and produce: 1) the likely root cause, 2) the exact files/functions to inspect first, 3) a minimal implementation plan, 4) local verification steps using the dev server/browser, and 5) risks or false leads to avoid. Do not propose unrelated rewrites." --slug "auth-redirect-loop" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md`. After Oracle returns, read `.notes/oracle-plan.md`, summarize the recommended fix in `.notes/auth-redirect-receipt.md`, implement the smallest safe change, then run the repo's existing typecheck/test/build command discovered from package scripts. Finally start the local dev server and verify the login flow in a browser.

Browser move

Use the running local app as proof, not source-code confidence. In a CMUX browser pane or browser-agent workflow, open the dev server, capture an accessibility snapshot of the login page, click the sign-in control, fill the test credentials or scripted local auth fields, submit, and evaluate `window.location.pathname` plus visible authenticated UI text. Save a screenshot or browser transcript showing the flow reaches the expected post-login route without bouncing back to `/login`.

Implementation refs
  • workshop-followup/toolkit/prompts/01-oracle-codebase-handoff-prompt.md
  • workshop-followup/toolkit/codex-skills/oracle-codebase-handoff/SKILL.md
  • workshop-followup/toolkit/bundles/code-oracle-packx.md
  • packx --preview -s "Next.js auth redirect middleware session login callback API route"
  • packx -s "Next.js auth redirect middleware session login callback API route" -f markdown --no-interactive -o .notes/handoff-bundle.md
  • ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[auth-redirect-loop]\n\n<question>" --slug "auth-redirect-loop" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md

Done when
  • .notes/oracle-plan.md exists and contains the browser-side Oracle recommendation
  • .notes/auth-redirect-receipt.md records the selected fix, files changed, and why alternatives were rejected
  • Repo verification command passes, such as test/typecheck/build from package scripts
  • Browser proof shows login reaches the expected authenticated route without a redirect loop
  • Git diff is limited to the auth-related fix and does not include `.notes/` artifacts

Failure modes
  • Mistake: using Oracle before previewing the bundle, causing a bloated or irrelevant handoff. Mitigation: require `packx --preview` first and adjust the search topic before writing `.notes/handoff-bundle.md`.
  • Mistake: implementing the whole Oracle plan as if it were guaranteed correct. Mitigation: treat `.notes/oracle-plan.md` as a plan, then implement the smallest locally verifiable change and prove it in the browser.
  • Mistake: using a slug that Oracle rejects. Mitigation: use a slug of 3-5 kebab-case words and start the prompt with `[same-slug]` followed by a blank line.
  • Mistake: trusting unit tests for a browser-only redirect bug. Mitigation: require browser proof against the running dev server.
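The slug rule and prompt prefix above can be checked before Oracle ever runs. The following is a minimal POSIX shell sketch, not part of the toolkit: `valid_slug` and `build_prompt` are hypothetical helper names, and reading "kebab words" as lowercase letters and digits joined by hyphens is an assumption. It uses `printf` to emit real newlines, because the source does not say whether the `oracle` CLI expands a literal `\n` inside a double-quoted shell string.

```shell
# Hypothetical helpers; not part of the oracle CLI or the toolkit.

# Succeeds when the slug is 3-5 lowercase kebab-case words,
# for example auth-redirect-loop.
valid_slug() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+){2,4}$'
}

# Prints "[slug]" on line 1, a blank line, then the question.
# Fails without printing a prompt when the slug breaks the rule.
build_prompt() {
  if ! valid_slug "$1"; then
    echo "invalid slug: $1" >&2
    return 1
  fi
  printf '[%s]\n\n%s\n' "$1" "$2"
}
```

One way to use it is to pass `"$(build_prompt auth-redirect-loop "$question")"` as the `-p` value and reuse the same slug for `--slug`, so the bracketed prefix and the slug cannot drift apart.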
02

Find the real owner of a checkout state bug

As a product engineer maintaining a checkout funnel

A React or Remix checkout page loses cart state when the user changes shipping method, navigates back, or refreshes after payment intent creation. The logic is split across route loaders/actions, client state, server API handlers, and payment integration code.

As a product engineer, I want a huge-context Oracle pass to map the checkout state ownership and failure points, so that my local agent can make a targeted fix and prove the checkout flow in the browser.

Agent brief

You are my coding agent. Use Oracle + PackX only for the broad read-heavy diagnosis, then return to local implementation and verification. Create `.notes/` if needed. Run `packx --preview -s "checkout cart state shipping method payment intent loader action API"`. Review the preview for checkout routes, cart/session modules, API handlers, and payment-related files. If relevant, run `packx -s "checkout cart state shipping method payment intent loader action API" -f markdown --no-interactive -o .notes/handoff-bundle.md`. Then run `ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[checkout-state-owner]\n\nRead the attached checkout-related codebase bundle. Identify which layer should own checkout/cart state, where the current flow can lose state during shipping changes, browser back navigation, refresh, or payment intent creation, and propose the smallest fix. Return exact file/function references, a step-by-step implementation plan, and browser verification steps. Avoid broad architecture rewrites unless the current code makes a minimal fix unsafe." --slug "checkout-state-owner" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md`. Read the returned plan, write `.notes/checkout-handoff-receipt.md` with the chosen path, implement the minimal fix, then run existing tests/build. If the repo has no browser test, use the dev server plus a browser-agent or CMUX browser pane to verify the funnel manually and record evidence.

Browser move

Use browser verification because this bug is about real user navigation. In the local app, open the checkout route, capture a snapshot of cart line items and total, click through shipping method changes, use back/forward navigation, refresh once after the payment-intent step if safe in the local/test environment, and evaluate the visible cart total plus any checkout state exposed in the page. Save a screenshot or transcript proving the cart survives the exact sequence.

Implementation refs
  • workshop-followup/toolkit/codex-skills/oracle-codebase-handoff/SKILL.md
  • workshop-followup/toolkit/bundles/oracle-codebase-handoff-skill.bundle.txt
  • workshop-followup/toolkit/bundles/code-oracle-packx.md
  • packx --preview -s "checkout cart state shipping method payment intent loader action API"
  • packx -s "checkout cart state shipping method payment intent loader action API" -f markdown --no-interactive -o .notes/handoff-bundle.md
  • ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[checkout-state-owner]\n\n<question>" --slug "checkout-state-owner" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md

Done when
  • PackX preview was inspected before bundle creation
  • .notes/oracle-plan.md contains a file/function-level checkout state diagnosis
  • .notes/checkout-handoff-receipt.md names the selected minimal fix and verification sequence
  • Existing repo checks pass
  • Browser proof shows the checkout state survives shipping change, back navigation, and refresh in the local/test flow

Failure modes
  • Mistake: asking Oracle for a generic checkout redesign. Mitigation: phrase the Oracle prompt around state ownership, exact failure paths, minimal fix, and browser proof.
  • Mistake: including live secrets, customer data, or real payment credentials in logs or screenshots. Mitigation: use local/test data only and redact artifacts before saving.
  • Mistake: letting Codex use Oracle for every small follow-up. Mitigation: use Oracle once for the broad read-heavy map, then continue locally unless another huge-codebase question appears.
  • Mistake: verifying only the happy path. Mitigation: require the back/refresh/shipping-change sequence that reproduced the bug.

03

Map a legacy migration before touching code

As a tech lead planning a risky front-end migration

A large web app is moving one feature area from legacy pages to a newer app-router or component architecture. The team needs to know which files, routes, shared utilities, and tests are involved before assigning implementation work.

As a tech lead, I want an Oracle handoff to produce a migration map from the current codebase, so that the implementation agent can work from a grounded plan and avoid guessing across many files.

Agent brief

You are my coding agent. This is a read-heavy migration-mapping task, so use Oracle handoff first and do not edit source files until the returned plan is reviewed. Create `.notes/` if needed. Run `packx --preview -s "legacy feature migration routes components shared utilities tests"`. If the preview includes the expected legacy feature area, run `packx -s "legacy feature migration routes components shared utilities tests" -f markdown --no-interactive -o .notes/handoff-bundle.md`. Then run `ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[legacy-migration-map]\n\nRead the attached PackX bundle for a large web app migration. Produce an implementation-ready migration map: current entry points, route/page ownership, shared components/utilities, test/build commands to run, risky coupling points, safe first slice, and a plan -> implement -> verify sequence. Do not invent tools or recommend a broad rewrite; stay grounded in the attached code." --slug "legacy-migration-map" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md`. After Oracle returns, create `.notes/migration-slice-receipt.md` with the first safe slice, explicit files to edit, verification commands, and browser proof needed. Implement only that first slice if it is small enough; otherwise stop with the migration map and receipt for human review.

Browser move

Browser proof is useful only after a concrete migration slice exists. For the planning-only phase, the right proof is `.notes/oracle-plan.md` plus a migration receipt. If implementing the first slice, open old and new routes in a local browser or CMUX browser pane, capture snapshots/screenshots of the same user-visible state, click the primary controls, and confirm the migrated route preserves the expected behavior.

Implementation refs
  • workshop-followup/oracle/01-references.json
  • workshop-followup/toolkit/prompts/01-oracle-codebase-handoff-prompt.md
  • workshop-followup/toolkit/codex-skills/oracle-codebase-handoff/SKILL.md
  • workshop-followup/toolkit/bundles/code-oracle-packx.md
  • packx --preview -s "legacy feature migration routes components shared utilities tests"
  • packx -s "legacy feature migration routes components shared utilities tests" -f markdown --no-interactive -o .notes/handoff-bundle.md
  • ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[legacy-migration-map]\n\n<question>" --slug "legacy-migration-map" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md

Done when
  • .notes/oracle-plan.md exists and maps entry points, route ownership, shared utilities, risks, and first safe slice
  • .notes/migration-slice-receipt.md converts the Oracle recommendation into a local implementation decision
  • No source files are changed unless the first slice is explicitly small and verifiable
  • If a slice is implemented, existing checks pass and browser comparison proof exists for old versus new route behavior
  • The final review artifact names what remains out of scope

Failure modes
  • Mistake: turning a migration map into a giant unreviewed rewrite. Mitigation: require a first safe slice and stop for review if the change is not small and verifiable.
  • Mistake: packing the entire repo without topic discipline. Mitigation: preview with a migration-specific search phrase and narrow before writing the bundle.
  • Mistake: treating Oracle's output as fresh repo truth after code changes. Mitigation: once implementation begins, inspect current files locally and verify against the running app.
  • Mistake: using browser proof too early. Mitigation: use browser proof after a concrete UI slice exists; use the Oracle plan and receipt as planning proof before edits.

Agent references

Oracle codebase handoff prompt - Use this as the copy-paste workflow shape for PackX preview, PackX bundle creation, Oracle browser handoff, returned-plan file, and local verification loop.
workshop-followup/toolkit/prompts/01-oracle-codebase-handoff-prompt.md

Oracle codebase handoff Codex skill - Use this when the local agent needs the strict runnable version: preview first, slug rules, prompt prefix, `--write-output`, `--file`, and plan -> implement -> verify.
workshop-followup/toolkit/codex-skills/oracle-codebase-handoff/SKILL.md

Verified Oracle + PackX command surface - Use this to avoid inventing flags. It includes the verified `oracle --engine browser --browser-thinking-time extended` command, `ORACLE_MAX_FILE_SIZE_BYTES`, PackX preview/write commands, token-limit notes, and slug rules.
workshop-followup/toolkit/bundles/code-oracle-packx.md

Lesson 01 validated references - Use this as the taught-source boundary for the Oracle topic: Codex-to-browser quota offloading, ChatGPT Pro Extended as a high-context browser surface, PackX context bundling, MCP/browser bridge, low concurrency, and plan -> implement -> verify.
workshop-followup/oracle/01-references.json

PackX preview command - Run before writing any bundle so the agent can check matched files and token estimate instead of blindly sending irrelevant context.
packx --preview -s "<TOPIC>"

PackX bundle command - Write the relevant codebase context into a model-readable markdown file that Oracle can attach.
packx -s "<TOPIC>" -f markdown --no-interactive -o .notes/handoff-bundle.md

Oracle browser handoff command - Send the PackX bundle to browser ChatGPT with extended thinking and save only the final assistant message to a local plan file.
ORACLE_MAX_FILE_SIZE_BYTES=12000000 oracle --engine browser --browser-thinking-time extended -p "[<3-5-word-kebab-slug>]\n\n<the big analysis question>" --slug "<3-5-word-kebab-slug>" --write-output .notes/oracle-plan.md --file .notes/handoff-bundle.md

CMUX browser/local verification surface - Use a browser pane or browser-agent-style workflow to prove web app behavior after local implementation, especially login, checkout, navigation, and route migration flows.
cmux --json tree and cmux read-screen, plus the local dev server URL from the repo's package scripts

Implementation notes

  • Use Oracle only for read-heavy, large-codebase analysis. Small local edits should stay in Codex without the browser handoff.
  • Always preview PackX output before writing `.notes/handoff-bundle.md`; the preview is the quality gate for scope and token cost.
  • The Oracle slug must be 3-5 kebab words, and the prompt must start with `[same-slug]`, then a blank line, then the question.
  • Keep browser concurrency low: no more than two or three Oracle browser requests at once.
  • Treat `.notes/oracle-plan.md` as a planning artifact, not a commandment. The local agent still has to inspect current files, implement the smallest safe slice, and verify with tests/build/browser proof.
  • Keep `.notes/` artifacts local working files. Do not commit Oracle bundles or plans unless the repo deliberately wants them.
  • For browser proof, prefer user-visible receipts: screenshot, accessibility snapshot, a run with no console errors, route/path evaluation, or a short transcript of click/fill/evaluate steps.
  • Do not put secrets, customer data, real payment credentials, or private cookies into PackX bundles, logs, browser screenshots, or `.notes/` receipts.
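The rule that `.notes/` artifacts stay out of commits can be checked mechanically. This is a minimal POSIX shell sketch under stated assumptions: `assert_no_notes` is a hypothetical helper name, it reads one changed path per line on standard input, and the paths are relative to the repo root (as `git diff --name-only` prints them).

```shell
# Hypothetical helper; not part of the toolkit.
# Reads changed paths on stdin and fails if any lives under .notes/.
assert_no_notes() {
  offending=$(grep '^\.notes/' || true)
  if [ -n "$offending" ]; then
    printf 'scratch artifacts in the change set:\n%s\n' "$offending" >&2
    return 1
  fi
}

# Usage inside a repo, checking what is currently staged:
#   git diff --cached --name-only | assert_no_notes
```

Feeding it the staged file list keeps the check independent of how the ignore rules are configured, so it still catches a file that was force-added.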
Lesson 02 - Command Your Terminal Like an Orchestra

Command a CMUX manager/worker layout

Toolkit artifact

Web developers already juggle dev servers, test runners, browser verification, and agent chats; this pattern turns that chaos into one visible Manager pane coordinating narrow worker panes. It matters most when a local app needs code changes plus runtime proof, because the Manager can inspect panes, send targeted tasks, read worker output, and keep the human focused on decisions instead of pane babysitting.

01

Triage a broken localhost checkout flow

As a full-stack Next.js developer shipping a checkout fix

A Next.js storefront runs locally, the checkout route intermittently loses cart state after refresh, and the developer needs one agent to coordinate source inspection, dev-server logs, and browser proof without manually switching between terminals.

As a full-stack Next.js developer, I want one CMUX Manager pane to spin up focused workers for server logs, cart-state source review, and browser validation, so that I can fix the checkout bug with visible evidence instead of trusting a single agent transcript.

Agent brief

You are the Manager agent in my focused CMUX pane. Do not implement yet. First inspect live CMUX state with `cmux --json tree` and `cmux list-panes`. Build a visible pane layout using only real CMUX commands: create worker panes with `cmux new-split right` and `cmux new-pane`, then name the workspace tab with `cmux rename-tab "checkout-triage"`. Create three worker roles: `server-log-worker`, `source-worker`, and `browser-proof-worker`. Send each worker a narrow task using `cmux send -- "<task>"`: server-log-worker runs or watches the local dev server and captures errors; source-worker inspects cart and checkout route code only; browser-proof-worker verifies the localhost checkout flow in a browser or browser-agent workflow and reports exact observed behavior. After each worker responds, read its visible terminal output with `cmux read-screen`, summarize conflicts between source, logs, and browser behavior, then propose the smallest fix. Expected output: a Manager report containing pane refs from `cmux --json tree`, each worker assignment, each worker result, the chosen fix plan, and the browser/dev-server proof shape required before merging.

Browser move

Use the browser-proof-worker for runtime proof: open the local checkout URL, take an accessibility-style snapshot of the page state, click through cart → checkout → refresh/back navigation, fill only safe test fields, and report URL, visible cart count, console/runtime errors, and whether state survives refresh. If a CMUX browser pane is available, keep that proof visible beside the Manager; otherwise use the project's existing browser automation or local browser and paste the observation back into the worker pane.

Implementation refs
  • workshop-followup/toolkit/prompts/02-cmux-manager-worker-layout.md
  • workshop-followup/toolkit/codex-skills/cmux-orchestrate/SKILL.md
  • workshop-followup/toolkit/bundles/code-cmux-cli.md
  • cmux --json tree
  • cmux list-panes
  • cmux new-split right
  • cmux new-pane
  • cmux send -- "<task>"
  • cmux read-screen
  • cmux rename-tab "checkout-triage"

Done when
  • Manager report includes the live CMUX pane refs from `cmux --json tree` and the named roles assigned to each worker.
  • Dev-server or test-runner output is copied from a worker via `cmux read-screen`, not summarized from memory.
  • Browser proof includes the exact localhost route, user actions, visible before/after state, and any console/runtime error text.
  • A final review artifact lists the minimal code area to change and the verification commands or browser steps to rerun.

Failure modes
  • The Manager delegates before inspecting live state; mitigate by requiring `cmux --json tree` and `cmux list-panes` before every send/read cycle.
  • Workers drift into broad refactors; mitigate by assigning each worker one narrow responsibility and having the Manager reject unrelated changes.
  • The browser worker reports success without proof; mitigate by requiring a concrete URL, visible UI state, action sequence, and error/log evidence.

02

Coordinate a design-drift review before touching CSS

As a frontend engineer maintaining a dashboard UI

A dashboard page has visual drift across cards, filters, and responsive layout after several agent-generated edits, and the developer wants separate workers to inspect current code, compare live UI behavior, and propose a bounded cleanup.

As a frontend engineer, I want a CMUX Manager to coordinate code, browser, and review workers before editing CSS, so that cleanup is based on current source and visible UI proof rather than vague design taste.

Agent brief

You are the Manager agent in my focused CMUX pane. Stay the only pane the human interacts with. Use `cmux --json tree` and `cmux list-panes` first. Create a visible Manager/worker layout with `cmux new-split right` and `cmux new-pane`; name the tab `dashboard-drift-review` with `cmux rename-tab "dashboard-drift-review"`. Create workers named `current-code-worker`, `responsive-browser-worker`, and `cleanup-plan-worker`. Use `cmux send -- "<task>"` to assign: current-code-worker identifies the current dashboard route, component files, and styling sources; responsive-browser-worker opens the local dashboard in a browser and records desktop and narrow-width observations; cleanup-plan-worker waits for the first two worker outputs, then drafts a small numbered cleanup plan with options A/B/C and a recommended pick. Read worker panes with `cmux read-screen`. Expected output: a numbered implementation plan that says which files are source of truth, what the browser showed, which visual issues are real, and which cleanup option the Manager recommends before implementation.

Browser move

Use a CMUX browser pane or local browser verification to inspect the dashboard at desktop and narrow widths. The browser worker should capture a snapshot-style summary of headings, card order, filter controls, overflow, empty states, and any console errors; click filters and navigation only enough to prove whether drift is visual, state-related, or route-specific.

Implementation refs
  • workshop-followup/oracle/02-references.json
  • workshop-followup/toolkit/codex-skills/cmux-orchestrate/SKILL.md
  • workshop-followup/toolkit/bundles/cmux-manager-worker-layout.bundle.txt
  • workshop-followup/toolkit/bundles/code-cmux-cli.md
  • cmux --json tree
  • cmux list-panes
  • cmux new-split right
  • cmux new-pane
  • cmux send -- "<task>"
  • cmux read-screen
  • cmux rename-tab "dashboard-drift-review"

Done when
  • Manager names the CMUX tab and can list all worker panes with `cmux list-panes`.
  • Current-code-worker identifies the actual source files instead of relying on memory.
  • Responsive-browser-worker provides desktop and narrow-width observations with route and visible-state evidence.
  • Cleanup-plan-worker produces numbered options with a recommended pick, so the human can steer with a terse choice such as `2B`.

Failure modes
  • The agent treats design.md or memory as more authoritative than the current code; mitigate by making current-code-worker inspect the live repo first.
  • The browser check becomes a full QA swarm; mitigate by limiting it to the dashboard route, two viewport sizes, and the specific drift symptoms.
  • The Manager hides worker uncertainty; mitigate by requiring each worker output to include what it verified and what it did not verify.
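Terse plan refs such as `2B` only help if the agent reads them unambiguously. The following is a minimal POSIX shell sketch of one way to validate and split such a ref; `parse_plan_ref` is a hypothetical helper name, and the accepted shape (a positive item number followed by exactly one of A, B, or C) is an assumption drawn from the A/B/C options described above.

```shell
# Hypothetical helper; not part of the toolkit.
# Splits a plan ref such as 2B into its item number and option letter.
# Note: sets the shell variables item and option as a side effect.
parse_plan_ref() {
  if printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*[ABC]$'; then
    item=${1%[ABC]}      # strip the trailing option letter
    option=${1#"$item"}  # strip the leading item number
    printf 'item=%s option=%s\n' "$item" "$option"
  else
    echo "not a plan ref: $1" >&2
    return 1
  fi
}
```

Rejecting anything outside that shape gives the Manager a clear point at which to ask the human again instead of guessing which option was meant.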
03

Run a bounded AFK manager check-in on parallel workers

As a senior developer supervising several local agent tasks

Three workers are already running in CMUX on separate subtasks: one is fixing a flaky unit test, one is checking a build failure, and one is investigating docs or dependency usage. The developer wants the Manager to periodically inspect panes and answer simple blocking questions, but not run forever.

As a senior developer, I want a CMUX Manager to check worker panes at a fixed interval with a hard stop condition, so that parallel work keeps moving while I step away without creating an unbounded autonomous loop.

Agent brief

You are the Manager agent in my focused CMUX pane. Start by recording a timestamp in your notes and inspecting live CMUX state with `cmux --json tree` and `cmux list-panes`. Do not create new broad goals. Identify the existing worker panes by role, cwd, visible command, or recent terminal text. Every check-in cycle, use `cmux read-screen` to read each worker pane, summarize status, and if a worker is blocked by a simple question, choose the safest reasonable answer only when it stays inside that worker's assigned task. Use `cmux send -- "<answer or nudge>"` only after confirming the target pane from `cmux --json tree`. Stop after the explicit bound I give you, or after 3 check-in cycles if no bound is provided. Expected output: a check-in ledger with timestamp, pane refs, worker status, any answers sent, any panes needing human attention, and final stop reason.

Browser move

Browser proof is only appropriate if one of the existing workers is assigned to a web-app runtime task. In that case, the Manager should ask that worker for the local URL, exact browser action taken, and visible result; otherwise the correct proof is terminal evidence from `cmux read-screen`, test logs, and the check-in ledger.

Implementation refs
  • workshop-followup/oracle/02-references.json
  • workshop-followup/oracle/02-validation.json
  • workshop-followup/toolkit/prompts/02-cmux-manager-worker-layout.md
  • workshop-followup/toolkit/codex-skills/cmux-orchestrate/SKILL.md
  • cmux --json tree
  • cmux list-panes
  • cmux read-screen
  • cmux send -- "<answer or nudge>"
  • cmux notify

Done when
  • The Manager records a start timestamp and a hard stop condition before the loop begins.
  • Each cycle includes pane refs, worker status, and terminal evidence read from the worker pane.
  • Any `cmux send -- "<answer or nudge>"` action is tied to a verified target pane from `cmux --json tree`.
  • The final ledger states whether each worker is done, blocked, still running, or needs human review, plus the stop reason.

Failure modes
  • The loop runs without a bound; mitigate by requiring a duration, target time, iteration cap, or default 3-cycle limit before starting.
  • The Manager answers high-risk questions for a worker; mitigate by only answering simple in-scope blockers and escalating destructive, production, credential, or broad architecture decisions to the human.
  • The Manager sends a message to the wrong pane; mitigate by rechecking `cmux --json tree` immediately before each `cmux send`.
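The bound described above can be sketched as a loop that cannot run past its cycle cap. This is a minimal POSIX shell sketch, not the toolkit's implementation: `bounded_checkin` is a hypothetical helper name, the 300-second default interval is an assumed value (the source gives none), and the check-in command is whatever the caller passes in, for example `cmux read-screen` (how to target a specific pane is not shown in the source and is left out here).

```shell
# Hypothetical helper; not part of cmux or the toolkit.
# $1 = max cycles (empty string falls back to 3, the brief's default)
# $2 = seconds between cycles (empty string falls back to 300, an assumed value)
# remaining arguments = the check-in command to run each cycle
bounded_checkin() {
  max_cycles=${1:-3}
  interval=${2:-300}
  shift 2
  cycle=1
  while [ "$cycle" -le "$max_cycles" ]; do
    # Ledger line: UTC timestamp plus cycle counter.
    printf '%s cycle %s/%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$cycle" "$max_cycles"
    "$@" || echo "check-in command failed on cycle $cycle" >&2
    # Sleep between cycles only, not after the last one.
    if [ "$cycle" -lt "$max_cycles" ]; then
      sleep "$interval"
    fi
    cycle=$((cycle + 1))
  done
  echo "stop reason: reached $max_cycles cycles"
}

# Usage: three cycles, five minutes apart.
#   bounded_checkin 3 300 cmux read-screen
```

Because the cap is checked in the loop condition, a failing check-in command is logged and counted rather than retried forever, and the final line doubles as the stop reason for the ledger.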
Agent references

Existing CMUX manager/worker prompt - Use this as the copy-paste baseline for Manager behavior, worker delegation, live-state inspection, and the human-led planning guardrail.
workshop-followup/toolkit/prompts/02-cmux-manager-worker-layout.md

CMUX orchestration skill - Use this as the reusable Codex skill contract for triggering CMUX orchestration only inside a CMUX workspace and for staying on the real command surface.
workshop-followup/toolkit/codex-skills/cmux-orchestrate/SKILL.md

Verified CMUX CLI command map - Use this to avoid invented flags and to keep pane creation, state inspection, send/read, focus, and tab naming grounded in the verified commands.
workshop-followup/toolkit/bundles/code-cmux-cli.md

Lesson 02 validated references - Use this as the taught-concept source for agent-first tooling, natural-language terminal orchestration, cross-pane reading, manager-worker topology, and bounded AFK loops.
workshop-followup/oracle/02-references.json

Lesson 02 validation notes - Use this to stay inside the transcript-validated boundary and avoid reintroducing removed or merely mentioned tools as taught workflows.
workshop-followup/oracle/02-validation.json

Implementation notes

  • The Manager should inspect with `cmux --json tree` before acting and again before sending to a target pane; stale pane refs are how agents nudge the wrong worker.
  • Keep worker tasks narrow: one worker for logs, one for source, one for browser/runtime proof is usually better than three agents all editing the same files.
  • Prefer panes over tab groups for visible parallel work, because hidden tab groups are easy to lose track of during agent-heavy sessions.
  • Use `cmux read-screen` as evidence. Do not let the Manager summarize worker progress from memory or from optimistic worker claims.
  • Browser verification should prove live web behavior when the task is a web-app task; for non-browser tasks, terminal logs, test output, and a Manager ledger are stronger proof.
  • AFK-style loops need an explicit interval and stop condition. Without a bound, the workflow becomes token-burning theater.

Lesson 03 - Give Your Agents a Memory

Agent-managed memory folders

Toolkit artifact

Web developers already run local dev servers, browser checks, and coding agents against real repos. Agent-managed memory folders give those agents a visible place to plan, track, and resume work without polluting source control or turning scratch notes into stale project documentation.

01

Plan and track a risky checkout-flow fix

As a senior frontend engineer responsible for a Next.js checkout flow

A production-like checkout bug only appears after several UI states: shipping form edits, coupon entry, back navigation, and payment-step refresh. The developer wants Codex to plan before touching code, keep progress visible, and leave no agent scratch files in Git.

As a senior frontend engineer, I want the agent to create local memory folders, write a numbered implementation plan, and maintain a live tracker while fixing the checkout flow, so that I can steer the work precisely and review progress without rereading the whole agent transcript.

Agent brief

Set up local agent-managed memory for this repo, then use it for the checkout fix. Run: `mkdir -p .notes .goals`. Configure global ignores without editing this repo's shared `.gitignore`: `git config --global core.excludesFile ~/.gitignore_global`; then ensure `.notes/` and `.goals/` are present in `~/.gitignore_global` without duplicating lines. Inspect the current checkout code before planning; do not trust old notes as source of truth. Write `.goals/checkout-flow-plan.md` as a numbered plan where each major item has options A/B/C and your recommended pick. Open a live tracker with `cmux markdown open .notes/tracker.md --direction right`. Keep `.notes/tracker.md` updated with current step, files inspected, commands run, browser proof status, and remaining risks. After I choose plan refs like `2B` or `5C`, implement only the approved path. Expected output: changed source files only where needed, `.goals/checkout-flow-plan.md`, `.notes/tracker.md`, and a final summary that separates committed-source changes from local scratch artifacts.

Browser move

Use the running local app, preferably in a CMUX browser pane or browser-agent workflow. Start at the checkout page, take an accessibility snapshot, fill the shipping fields, apply a coupon, navigate back and forward, refresh on the payment step, and capture console output plus a screenshot or short written browser receipt. If using agent-browser-style tooling, use snapshot/click/fill/evaluate steps and write the result into `.notes/tracker.md`.

Implementation refs
  • workshop-followup/oracle/03-references.json
  • workshop-followup/toolkit/prompts/03-agent-memory-folders-setup.md
  • workshop-followup/toolkit/codex-skills/agent-memory-folders/SKILL.md
  • mkdir -p .notes .goals
  • git config --global core.excludesFile ~/.gitignore_global
  • cmux markdown open .notes/tracker.md --direction right
  • cmux --json tree
Done when
  • `.notes/` and `.goals/` exist locally and are globally ignored, not added to the repo's shared `.gitignore`.
  • `.goals/checkout-flow-plan.md` contains numbered items with A/B/C options and clear recommended picks.
  • `.notes/tracker.md` is usable as a live progress receipt and lists the exact files touched, commands run, and browser verification performed.
  • The checkout flow is verified in the browser by replaying the failing sequence, and the final source diff excludes `.notes/` and `.goals/`.
Failure modes
  • The agent commits `.notes/` or `.goals/`; mitigate by checking `git status --ignored` and confirming those folders are ignored globally before implementation.
  • The agent writes a vague prose plan that is hard to steer; mitigate by requiring numbered items with lettered options and waiting for human refs like `3B` before coding.
02

Resume a paused redesign without stale memory

As a product-minded full-stack developer iterating on a dashboard UI

A dashboard redesign was started yesterday, but the app has changed since then. The developer wants to resume from local agent notes while forcing the agent to re-check current code and verify the UI in the browser.

As a full-stack developer, I want the agent to read prior local goals but treat the current code and running browser as truth, so that yesterday's plan helps me resume without misleading today's implementation.
Agent brief

Use the agent-memory-folder pattern to resume work safely. First inspect `.goals/` and `.notes/` if they exist, but treat them as local scratch context only. Then inspect the current dashboard files and summarize where the existing notes agree or disagree with the current code. If folders do not exist, create them with `mkdir -p .notes .goals` and ensure `.notes/` and `.goals/` are in the global gitignore via `git config --global core.excludesFile ~/.gitignore_global`. Write `.goals/dashboard-resume-plan.md` with numbered implementation items and A/B/C options. Mark any stale assumptions explicitly. Open or update the live tracker using `cmux markdown open .notes/tracker.md --direction right`. Implement only the approved plan choices. Expected output: a resume plan, a stale-assumption section, a live tracker update, a source diff, and browser evidence from the running dashboard.

Browser move

Open the dashboard in a local browser or CMUX browser pane, compare the current rendered UI against the resumed plan, and record proof in `.notes/tracker.md`: current route, visible dashboard sections, any console errors, and a screenshot or browser-agent accessibility snapshot. Use click/fill/evaluate where the redesign affects interactive filters, tabs, or responsive state.

Implementation refs
  • workshop-followup/oracle/03-references.json
  • workshop-followup/oracle/03-validation.json
  • workshop-followup/toolkit/codex-skills/agent-memory-folders/SKILL.md
  • workshop-followup/toolkit/bundles/code-cmux-cli.md
  • cmux markdown open .notes/tracker.md --direction right
  • cmux list-panes
  • cmux read-screen
Done when
  • The agent explicitly distinguishes prior local notes from current source-of-truth code.
  • `.goals/dashboard-resume-plan.md` includes a `Stale or revalidated assumptions` section.
  • The live tracker shows what was resumed, what was discarded, and what was implemented.
  • Browser verification proves the dashboard still renders and the changed interaction works after the source diff.
Failure modes
  • The agent blindly follows yesterday's `.goals/` file; mitigate by requiring a current-code inspection and an agreement/disagreement summary before edits.
  • The live tracker becomes a dumping ground; mitigate by keeping it short: current step, evidence, next action, blockers, and files touched.
03

Keep agent planning out of a shared library repo

As a maintainer of a shared component library used across several web apps

Multiple agents are about to investigate a flaky component test and accessibility regression. The maintainer wants reusable local planning files, but the repo is shared by a team and must not gain private scratch folders or noisy `.gitignore` changes.

As a component-library maintainer, I want each agent to write plans and findings into globally ignored local folders, so that exploratory work stays close to the repo but never leaks into the package or team Git history.
Agent brief

Prepare this shared repo for local-only agent planning. Create `.notes/` and `.goals/` with `mkdir -p .notes .goals`. Use the global ignore boundary: `git config --global core.excludesFile ~/.gitignore_global`; ensure `.notes/` and `.goals/` are present there exactly once. Do not edit the repo `.gitignore` unless I explicitly ask. Write `.goals/component-a11y-investigation.md` as a numbered plan with A/B/C options and recommended picks. Use `.notes/tracker.md` to record each agent's assigned area, commands run, and findings. If CMUX is available, open the tracker with `cmux markdown open .notes/tracker.md --direction right`; otherwise leave the Markdown file ready for any preview pane. Expected output: local planning artifacts, no committed scratch files, a concise investigation plan, and a review summary I can use to decide which findings become real tests or source changes.

Browser move

Browser proof is useful only if the component library has a local preview, demo route, or app fixture. If available, run the local preview, inspect the component in a browser, capture an accessibility snapshot or screenshot, and record the result in `.notes/tracker.md`. If there is no browser fixture, use test output and the rendered component documentation as proof instead of inventing a browser check.

Implementation refs
  • workshop-followup/oracle/03-references.json
  • workshop-followup/toolkit/bundles/agent-memory-folders-setup.bundle.txt
  • workshop-followup/toolkit/codex-skills/agent-memory-folders/SKILL.md
  • git status --ignored
  • git config --global core.excludesFile ~/.gitignore_global
  • cmux markdown open .notes/tracker.md --direction right
Done when
  • Global Git ignore contains `.notes/` and `.goals/`, and `git status --ignored` confirms the folders are local-only.
  • The shared repo `.gitignore` is unchanged unless explicitly approved.
  • `.goals/component-a11y-investigation.md` gives steerable numbered options rather than a single vague recommendation.
  • `.notes/tracker.md` contains a clean investigation receipt that can be reviewed without committing the scratch files.
Failure modes
  • The agent treats local memory as team documentation; mitigate by labeling every file in `.notes/` and `.goals/` as local working memory and excluding it from source diffs.
  • The agent over-invests in permanent tests for unstable UI exploration; mitigate by using the notes to triage first, then promote only stable, valuable checks into real tests.
Agent references

Validated lesson 03 references - Use this as the ground truth for what was taught: local `.notes/` and `.goals/`, global gitignore partitioning, plan-first grounding, numbered option planning, and live Markdown side panes.
workshop-followup/oracle/03-references.json

Existing setup prompt - Use this when generating copy-paste setup instructions for a coding agent.
workshop-followup/toolkit/prompts/03-agent-memory-folders-setup.md

Existing Codex skill - Use this when implementing or checking the repo-local skill for agent-memory folders.
workshop-followup/toolkit/codex-skills/agent-memory-folders/SKILL.md

CMUX live Markdown command - Use this to open the live-refreshing tracker pane beside the agent session.
cmux markdown open .notes/tracker.md --direction right

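Global gitignore setup - Use this to keep local agent scratch folders out of every repo without modifying shared `.gitignore` files. The command overwrites any existing `core.excludesFile` value, so if one is already set, add the entries to the file it points to instead of repointing the setting.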
git config --global core.excludesFile ~/.gitignore_global

Verify ignored scratch folders - Use this before final review to prove `.notes/` and `.goals/` are not part of the source diff.
git status --ignored
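
`git status --ignored` lists ignored paths for a human to read. For a scripted pass/fail check, `git check-ignore` and `git ls-files` can answer the same question. This is a minimal sketch, assuming it runs from the root of the repo's working tree and that the folders already exist; `scratch_is_local` is a hypothetical helper name.

```shell
# scratch_is_local DIR...
# Hypothetical helper: succeed only if every DIR is ignored by Git
# and no file under it is tracked in the index.
scratch_is_local() {
  rc=0
  for dir in "$@"; do
    # check-ignore exits 0 when an ignore rule (repo, info/exclude, or global) matches.
    # The directory must exist so a pattern such as ".notes/" can match it.
    if ! git check-ignore -q "$dir"; then
      echo "not ignored: $dir" >&2
      rc=1
    fi
    # ls-files prints tracked paths; any output means scratch files were staged or committed.
    if [ -n "$(git ls-files -- "$dir")" ]; then
      echo "tracked files under: $dir" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A final review step could then run `scratch_is_local .notes .goals` and stop if it returns non-zero.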

Implementation notes

  • Keep `.notes/` and `.goals/` boring, local, and close to the repo. They are working files for agents, not production docs.
  • Have the agent inspect current code before implementation. Memory folders help with continuity, but code remains the source of truth when details matter.
  • Use numbered items with lettered options for plans that need human steering. A reply like `5B` is faster and less error-prone than quoting a paragraph back to the agent.
  • Use the live Markdown tracker as a status view, not a transcript clone. The useful shape is current task, evidence, commands, files touched, blockers, and next step.
  • For web app work, pair the tracker with browser evidence: a screenshot, accessibility snapshot, console result, Playwright result, or concise manual browser receipt.
  • Before finalizing, require a source diff review that excludes `.notes/` and `.goals/`.
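
The tracker shape described in the notes above (current task, evidence, commands, files touched, blockers, next step) can be seeded once so every session starts from the same headings. This is a minimal sketch; `init_tracker` and the exact heading wording are illustrative assumptions, not a format defined by the workshop materials.

```shell
# init_tracker [FILE]
# Hypothetical helper: create a short tracker skeleton.
# An existing tracker is left untouched so live notes are never overwritten.
init_tracker() {
  file=${1:-.notes/tracker.md}
  if [ -e "$file" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$file")"
  cat > "$file" <<'EOF'
# Tracker

## Current task

## Evidence

## Commands run

## Files touched

## Blockers

## Next step
EOF
}
```

Calling `init_tracker` with no argument writes `.notes/tracker.md`; the agent then fills in the sections instead of inventing a new layout each session.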

Part 2 - Midday

Lesson 04 - Source of Truth Over Stale Memory

A reusable design.md standard

Toolkit artifact

Web developers already ship UI inside real repos, so a design.md gives their AI agent a concrete visual contract instead of vague taste prompts. It keeps design direction reusable while still forcing implementation decisions to come from the current codebase.

01

Rescue a drifting SaaS dashboard redesign

As a frontend lead maintaining a Next.js SaaS dashboard

A Next.js app has accumulated one-off AI-generated cards, empty states, tables, and settings screens. The current dashboard works, but every new agent pass makes the UI feel less coherent because there is no shared design contract.

As a frontend lead, I want the agent to inspect the current dashboard code, author a design.md from selected generated inspiration, and refactor one screen from that contract, so that future UI work follows the same visual standard instead of adding more drift.
Agent brief

You are my coding agent in this repo. Treat the current code as the source of truth. First inspect the existing app structure, routes, shared components, styling setup, and one representative dashboard screen. Do not rely on stale memory. Then create or update design.md with these sections: references, style DNA, color, typography, layout, composition, shape, motion, and constraints. Use only the generated inspiration references I provide or the selected direction I name; do not scrape or clone live websites. After design.md exists, refactor exactly one dashboard route or component to follow it. Run the repo's normal lint/typecheck/test command if available, start the local dev server, and verify the edited screen in a browser. Expected output: design.md, a small focused diff to the selected screen/components, terminal test output, and a short note explaining which design.md rules drove the visual decisions.
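
One way to make the required section list checkable is a small script that reports any missing section. This is a sketch under assumptions: it treats each section as a Markdown heading whose text contains the section name (case-insensitive), and `check_design_md` is a hypothetical helper name.

```shell
# check_design_md [FILE]
# Hypothetical helper: print each required section that has no matching
# Markdown heading, and return non-zero if any are missing.
check_design_md() {
  file=${1:-design.md}
  missing=0
  for section in "references" "style dna" "color" "typography" "layout" \
                 "composition" "shape" "motion" "constraints"; do
    # Matches heading lines such as "## Style DNA" or "# 3. Color palette".
    if ! grep -qiE "^#+ .*${section}" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_design_md design.md` before the refactor step gives the agent a concrete gate: an empty result means all nine sections are present.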

Browser move

Use the running local app, ideally in a CMUX browser pane or an agent-browser-style flow. Open the edited dashboard route, take an accessibility snapshot, verify the main landmarks/headings/buttons are still present, click through at least one existing interactive control, and capture a screenshot or browser note showing the screen follows design.md. If using Playwright, script the same route load and one interaction rather than relying on visual claims only.

Implementation refs
  • workshop-followup/oracle/04-references.json
  • workshop-followup/toolkit/prompts/04-generate-design-md-standard.md
  • workshop-followup/toolkit/codex-skills/design-md-standard/SKILL.md
  • workshop-followup/toolkit/bundles/design-md-standard-skill.bundle.txt
  • https://inspiration-board.pages.dev/
  • local commands: inspect package.json, then run the repo's existing lint/typecheck/test and dev-server scripts
Done when
  • design.md exists and includes references, style DNA, color, typography, layout, composition, shape, motion, and constraints
  • the changed dashboard screen visibly follows design.md and does not introduce unrelated route or architecture changes
  • lint/typecheck/test or the closest repo-local verification command has been run and logged
  • browser proof includes the local URL, the edited route, an accessibility snapshot or equivalent DOM evidence, and a screenshot or concise visual review note
Failure modes
  • The agent writes a generic design.md without inspecting the repo first; mitigate by requiring a short source-of-truth evidence section listing the files/components/routes it inspected before editing
  • The agent treats design.md like permanent stale memory and overrides current implementation constraints; mitigate by stating that code remains the source of truth and design.md is only the visual brief
  • The agent makes broad design-system rewrites; mitigate by limiting implementation to one route or one component family and requiring a small focused diff
02

Pick a landing page direction with disposable route variants

As an indie product developer launching a marketing page

A new product landing page needs a strong visual direction, but the developer does not want to set up Storybook or maintain a gallery of experiments. The app already has a local dev server and route-based pages.

As an indie product developer, I want the agent to turn selected generated inspiration into design.md, generate several temporary landing-page route variants, let me compare them in the browser, and delete the losers, so that I can choose a direction quickly without creating maintenance baggage.
Agent brief

You are my coding agent. Use the current app code as the source of truth for routing, styling, and component conventions. First create design.md from my selected generated inspiration with the required sections: references, style DNA, color, typography, layout, composition, shape, motion, and constraints. Then generate 3 to 5 temporary route-level variants of the same landing page inside the running app. Add simple left/right navigation between variants if it fits the app, such as links or arrow-key handling. Do not create a permanent Storybook-style catalog. Start the dev server, open the variants in the browser, and produce a comparison note that names the strongest variant and why. After I choose a winner, promote that implementation to the real landing route and delete the rejected temporary routes. Expected output: design.md, temporary variants only during review, a promoted final route, deleted losers, and browser proof from the local app.

Browser move

Use a local browser session to compare each temporary route in the real app. In CMUX, keep the dev server visible and open a browser pane on the variant index or first variant. Use browser snapshot/click/evaluate actions to move through left/right navigation, confirm each variant renders, and record the winning route. A screenshot grid or one screenshot per variant is useful proof here.

Implementation refs
  • workshop-followup/oracle/04-references.json
  • workshop-followup/oracle/04-interactive.json
  • workshop-followup/toolkit/prompts/04-generate-design-md-standard.md
  • workshop-followup/toolkit/codex-skills/design-md-standard/SKILL.md
  • workshop-followup/toolkit/bundles/throwaway-route-variants-skill.bundle.txt
  • https://inspiration-board.pages.dev/
  • local commands: run the repo's dev server and verify the temporary variant routes in browser
Done when
  • design.md is created before variants and each variant visibly follows the same contract
  • 3 to 5 temporary route-level variants render in the local app with simple navigation or clear links between them
  • a comparison note identifies the winning variant using concrete design.md criteria
  • the final route is promoted and rejected temporary routes are deleted
  • browser proof shows the variants were reviewed in the running app, not just inferred from source
Failure modes
  • The agent keeps all variants and leaves a maintenance mess; mitigate by making deletion of rejected routes part of the definition of done
  • The agent builds variants before writing design.md, producing inconsistent styles; mitigate by requiring design.md first and asking each variant to cite which contract rules it uses
  • The browser review only checks the happy-path screenshot and misses broken navigation; mitigate by requiring click or keyboard navigation through every variant
03

Standardize AI UI work for an existing component library

As a design-minded full-stack developer maintaining a shared React component set

A repo has reusable components but no durable visual language. Agents keep adding new button, card, modal, and form styles that technically compile but do not feel like one product.

As a full-stack developer, I want the agent to extract stable visual rules from current components and selected generated inspiration into design.md, then update one component example page from that contract, so that future AI-assisted UI work has a practical standard to follow.
Agent brief

You are my coding agent. First inspect the current component files, styling conventions, route examples, and package scripts. Summarize the source-of-truth evidence before writing anything. Then create or update design.md as a design contract with references, style DNA, color, typography, layout, composition, shape, motion, and constraints. The contract should preserve stable conventions already present in the code unless they conflict with the chosen generated inspiration direction. Next update exactly one example/demo route or one small component cluster, such as Button/Card/Form, to follow design.md. Run the repo's available lint/typecheck/test command and start the local dev server. Expected output: design.md, one focused component/example diff, verification logs, and a browser review of the demo route.

Browser move

Open design.md in a live CMUX Markdown pane while the dev server and browser are visible. In the browser, open the component example/demo route, use an accessibility snapshot to confirm labels and roles survived the visual pass, then click/fill one interactive component such as a button, input, select, or modal trigger. Browser proof matters because component styling can compile while still breaking focus states, labels, or layout.

Implementation refs
  • workshop-followup/oracle/04-references.json
  • workshop-followup/oracle/04-validation.json
  • workshop-followup/toolkit/codex-skills/design-md-standard/SKILL.md
  • workshop-followup/toolkit/bundles/generate-design-md-standard.bundle.txt
  • workshop-followup/toolkit/bundles/code-cmux-cli.md
  • cmux markdown open design.md --direction right --focus false
  • local commands: inspect package.json, run existing lint/typecheck/test, run the dev server
Done when
  • the agent records which current component files and routes were inspected before writing design.md
  • design.md captures both selected inspiration and existing repo constraints without becoming a stale architecture memory file
  • one demo route or component cluster is updated from the contract with a focused diff
  • lint/typecheck/test or closest available verification command passes or has a documented failure with exact output
  • browser proof includes a rendered demo route plus at least one accessibility or interaction check
Failure modes
  • The agent invents a design system from scratch and ignores existing components; mitigate by requiring source-of-truth evidence from current component files before design.md
  • The agent edits too many components at once; mitigate by limiting the first pass to one component cluster or one demo route
  • The agent verifies only with tests and skips browser behavior; mitigate by requiring a browser snapshot plus one click/fill interaction on the changed example
Agent references

Validated lesson 04 references - Use this as the ground truth for taught concepts: code as source of truth, lightweight Markdown memory, generated inspiration, design.md, and throwaway route variants.
workshop-followup/oracle/04-references.json

Design.md standard prompt - Use this as the copy-paste implementation prompt shape for authoring design.md and building UI from it.
workshop-followup/toolkit/prompts/04-generate-design-md-standard.md

Design.md Codex skill - Use this when packaging the workflow for Codex so the agent knows when to create or maintain design.md and when not to treat it as stale memory.
workshop-followup/toolkit/codex-skills/design-md-standard/SKILL.md

Throwaway route variants skill bundle - Use this for the anti-Storybook move: generate route-level variants, compare in the app, keep one, and delete the rest.
workshop-followup/toolkit/bundles/throwaway-route-variants-skill.bundle.txt

Inspiration Board - Use this when the human needs generated visual references before asking the agent to write design.md.
https://inspiration-board.pages.dev/

CMUX CLI reference - Use this only for the local workflow support taught in the workshop, such as opening a live Markdown design.md pane or inspecting panes while the dev server/browser run.
workshop-followup/toolkit/bundles/code-cmux-cli.md

Implementation notes

  • Require the agent to inspect current repo files before writing or applying design.md; implementation details come from live code, not memory.
  • Keep design.md human-readable and stable: it should describe visual decisions, not duplicate fast-changing component internals.
  • Browser verification is part of the web-dev workflow here. A compile pass is not enough when the task changes layout, interaction, focus, or responsive behavior.
  • When exploring visual direction, prefer temporary route/page variants inside the app and delete rejected variants after selection.
  • Use generated inspiration as direction-setting material. Do not ask the agent to scrape or clone live websites.
  • Keep the first implementation slice small: one route, one page, or one component cluster. A design standard earns trust by proving itself on a narrow diff first.

Lesson 04 - Source of Truth Over Stale Memory

Throwaway route variants

Toolkit artifact

Web developers often need to compare UI directions in the real app, against real routing, styling, auth, responsive behavior, and browser state. Throwaway route variants give an AI agent a concrete way to generate options quickly, prove them on localhost, promote one winner, and remove the experimental surface before it becomes maintenance debt.

01

Landing page direction sprint

As a frontend product engineer working in a Next.js-style marketing app

The team has a design.md for a new homepage, but the visual direction is still unsettled before a stakeholder review. The app already has shared typography, buttons, cards, and layout primitives, and the developer wants to compare several real routes on localhost instead of asking for more static mockups or setting up Storybook.

As a frontend product engineer, I want an agent to create disposable homepage route variants from the current code and design.md, so that I can review real browser-rendered options, pick one, and ship without keeping a permanent variant catalog.
Agent brief

Use the throwaway-route-variants skill only. First inspect the current routing structure, package.json scripts, existing homepage route, shared components, and design.md if present. Do not trust stale memory about the app structure. Create 5 temporary route-level variants of the homepage inside the app using the existing routing convention, for example /variant-a through /variant-e or the equivalent framework-native paths. Reuse existing components where practical, but keep variant-specific layout/styling isolated to the temporary routes so cleanup is simple. Add simple left/right navigation between variants, including visible Previous/Next controls and keyboard ArrowLeft/ArrowRight handling if that fits the app. Run the repo's actual dev command from package.json, usually npm run dev, pnpm dev, or bun dev. Prove every variant renders in the browser at localhost. After the human chooses a winner, promote only that implementation to the real homepage route, delete every losing variant route and temporary navigation wrapper, then run the relevant lint/typecheck/build command from package.json. Expected output: a short comparison table of the 5 variants, the chosen route, browser proof notes, commands run, and a final git diff summary showing only the promoted implementation remains.
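
The "actual dev command" depends on which package manager the repo uses. A common heuristic, sketched here as an assumption rather than a workshop rule, is to infer it from the lockfile and then confirm the script name in package.json. `detect_pm` is a hypothetical helper name; the lockfile names are the standard ones written by pnpm, Yarn, Bun, and npm.

```shell
# detect_pm [DIR]
# Hypothetical helper: print the package manager implied by the lockfile in DIR.
detect_pm() {
  dir=${1:-.}
  if [ -f "$dir/pnpm-lock.yaml" ]; then
    echo pnpm
  elif [ -f "$dir/yarn.lock" ]; then
    echo yarn
  elif [ -f "$dir/bun.lockb" ] || [ -f "$dir/bun.lock" ]; then
    echo bun
  else
    # package-lock.json, or no lockfile at all: fall back to npm.
    echo npm
  fi
}
```

The result still needs a check against package.json before use: `npm run` with no arguments lists the scripts that exist, and only then does something like `"$(detect_pm)" run dev` make sense.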

Browser move

Use the running local dev server as the proof surface. In a CMUX browser pane or normal local browser, open the first temporary route, then use the variant navigation controls or ArrowLeft/ArrowRight to cycle through all variants. If using an agent-browser-style workflow, collect an accessibility snapshot for each route, evaluate window.location.pathname after navigation, and capture screenshots or concise visual notes for the human review.

Implementation refs
  • workshop-followup/toolkit/codex-skills/throwaway-route-variants/SKILL.md
  • workshop-followup/oracle/04-references.json
  • workshop-followup/oracle/04-validation.json
  • workshop-followup/oracle/04-interactive.json
  • package.json scripts in the target repo
  • design.md in the target repo if present
  • app/page.tsx, pages/index.tsx, src/routes, or the repo's current homepage route discovered by inspection
Done when
  • All temporary routes render successfully on localhost using the repo's real dev command.
  • Browser proof exists for every variant: screenshots, accessibility snapshots, or explicit route-by-route notes, with no blocking runtime errors in the console.
  • The selected variant is promoted to the real homepage route.
  • All losing route variants and temporary navigation scaffolding are deleted.
  • The final diff does not leave behind a Storybook-like catalog, hidden demo routes, or unused variant components.
Failure modes
  • Mistake: the agent creates polished variants but leaves all 5 routes in the app. Mitigation: require a final cleanup pass and verify with git diff plus route search for variant-a, variant-b, variant-c, variant-d, and variant-e.
  • Mistake: the agent rewrites shared components so every variant leaks into production styling. Mitigation: keep variant-specific changes route-local until the winner is chosen, then promote deliberately.
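
The route search named in the first mitigation can be scripted so cleanup has a pass/fail result. This is a minimal sketch, assuming the variants were named variant-a through variant-e as in the brief and that the installed `grep` supports `-r` and `--exclude-dir` (GNU and BSD grep do); `variant_leftovers` is a hypothetical helper name.

```shell
# variant_leftovers [DIR]
# Hypothetical helper: list files or directories that still refer to a
# temporary variant route. Returns 0 when the tree is clean, 1 otherwise.
variant_leftovers() {
  dir=${1:-.}
  # Files whose contents still mention a variant route.
  by_content=$(grep -rlE \
    --exclude-dir=node_modules --exclude-dir=.git --exclude-dir=.next \
    --exclude-dir=.notes --exclude-dir=.goals \
    'variant-[a-e]' "$dir" || true)
  # Files or directories still named after a variant route.
  by_path=$(find "$dir" \
    \( -name node_modules -o -name .git -o -name .next \
       -o -name .notes -o -name .goals \) -prune \
    -o -name 'variant-[a-e]*' -print)
  leftovers=$(printf '%s\n%s\n' "$by_content" "$by_path" | sed '/^$/d' | sort -u)
  if [ -n "$leftovers" ]; then
    printf '%s\n' "$leftovers"
    return 1
  fi
  return 0
}
```

Local scratch folders are skipped on purpose, since the tracker and plan are expected to mention the variants by name. Pairing this with `git diff --stat` covers both leftover files and leftover references.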
02

Authenticated onboarding alternatives

As a full-stack web developer responsible for signup and first-run onboarding

A SaaS app has a working auth flow, but the post-signup onboarding page feels confusing. The team wants to compare several onboarding layouts in the real authenticated shell without risking the existing auth guard, database writes, or production onboarding state machine.

As a full-stack web developer, I want an agent to create disposable authenticated onboarding route variants that reuse the current app shell and safe fixture data, so that I can compare real UX options without breaking auth or business logic.
Agent brief

Use the throwaway-route-variants skill for UI exploration only. Inspect the current auth guard, onboarding route, layout shell, data-loading pattern, and package.json scripts before editing. Do not change authentication logic, database schema, real onboarding mutations, or production redirects. Create 3 to 5 temporary route-level onboarding variants under a clearly disposable dev-only path, such as /onboarding/variant-a through /onboarding/variant-e or the framework's equivalent. Each variant should reuse the current authenticated shell and existing components, but it may use safe fixture data or read-only mocks when live data would create side effects. Add simple Previous/Next navigation between variants. Run the dev server using the actual package.json script and verify the routes in a browser. If auth is required, document the exact login path or test-user setup needed; do not assume cookies are shared with a browser agent. After the human chooses a winner, promote only that UI to the real onboarding route and delete every temporary variant path. Expected output: route list, auth assumptions, browser verification notes, commands run, cleanup confirmation, and final diff summary.

Browser move

Use browser proof that respects how auth actually behaves. In a CMUX browser pane or agent-browser-style session, start from the documented login path, click/fill the test login only if the repo provides safe credentials or a local fixture path, then visit each onboarding variant. Capture an accessibility snapshot showing the page heading, primary CTA, progress indicator, and navigation controls. If login is not available, verify the variant routes with a safe mocked/read-only local route and state that authenticated production behavior was not mutated.

Implementation refs
  • workshop-followup/toolkit/codex-skills/throwaway-route-variants/SKILL.md
  • workshop-followup/oracle/04-references.json
  • workshop-followup/oracle/04-validation.json
  • Current auth guard and onboarding route discovered from source inspection
  • package.json scripts in the target repo
  • Existing app shell/layout components
  • Existing onboarding components, forms, and safe fixture or mock-data patterns
Done when
  • Temporary onboarding variants are reachable in the local app without changing auth guards or production redirects.
  • Each variant has browser proof showing the authenticated shell or clearly documented safe mock shell.
  • No database schema, mutation path, or real onboarding state transition is changed for the experiment.
  • The selected UI is promoted to the real onboarding route after human selection.
  • All temporary onboarding variant routes and navigation wrappers are removed.
Failure modes
  • Mistake: the agent makes the variant work by weakening the auth guard. Mitigation: explicitly forbid auth changes and require the final diff to show no auth-guard or session-validation edits unless the human requested them separately.
  • Mistake: the agent assumes the browser agent has the developer's cookies. Mitigation: require an explicit login path, safe test account instructions, or a read-only mock route for browser proof.
03

Responsive dashboard density bake-off

As a senior frontend engineer cleaning up a dense admin dashboard

A dashboard route has grown cluttered after several feature additions. Product wants to compare a compact data-heavy layout, a guided summary layout, and a card-based layout before committing to a redesign. The current code has real components and data boundaries that should be treated as source of truth.

As a senior frontend engineer, I want an agent to generate disposable dashboard route variants using the current code as source of truth, so that I can compare density, hierarchy, and responsive behavior in-browser before replacing the real dashboard route.
Agent brief

Use the throwaway-route-variants skill. Inspect the current dashboard route, layout components, data-fetching boundaries, responsive utilities, and package.json scripts. Create exactly 3 temporary dashboard route variants: one compact/data-dense, one guided/summary-first, and one card-based. Keep data access read-only and preserve existing component contracts. Add simple left/right navigation among the 3 variants. Run the local dev server with the actual package.json command, then verify desktop and narrow viewport behavior in a browser. Do not create Storybook stories, a permanent examples directory, or a dashboard variant gallery. After human selection, promote the winner to the real dashboard route and delete the 3 temporary routes. Expected output: variant intent summary, browser proof at desktop and mobile/narrow viewport, commands run, selected winner, and cleanup proof from git diff or route search.

Browser move

Use browser verification because layout density and responsiveness cannot be proven from source alone. In a CMUX browser pane, local browser, Playwright-backed check, or agent-browser-style workflow, open each temporary dashboard route at a desktop width and a narrow mobile-like width. Use evaluate to record viewport width and current pathname, take screenshots or visual notes, and verify that Previous/Next navigation reaches all three variants without a runtime error.
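
One way to keep this proof honest is to record one entry per variant and viewport, then check the matrix for gaps. The sketch below is illustrative only: the route names, the 768px narrow cutoff, and the record fields are assumptions, not part of the skill.

```python
from dataclasses import dataclass

# Hypothetical route names and cutoff; take the real ones from the repo.
VARIANTS = ("variant-a", "variant-b", "variant-c")
NARROW_MAX_WIDTH = 768  # assumed boundary between narrow and desktop


@dataclass(frozen=True)
class ProofRecord:
    pathname: str        # location.pathname recorded via evaluate
    viewport_width: int  # window.innerWidth recorded via evaluate
    runtime_error: bool  # True if the page showed or logged a runtime error


def missing_proof(records: list[ProofRecord]) -> list[str]:
    """Return the variant/viewport pairs that still lack error-free proof."""
    seen = set()
    for rec in records:
        if rec.runtime_error:
            continue  # a visit that errored does not count as proof
        size = "narrow" if rec.viewport_width <= NARROW_MAX_WIDTH else "desktop"
        for variant in VARIANTS:
            if rec.pathname.rstrip("/").endswith(variant):
                seen.add((variant, size))
    return [
        f"{variant} @ {size}"
        for variant in VARIANTS
        for size in ("desktop", "narrow")
        if (variant, size) not in seen
    ]
```

An empty list means every variant has both a desktop and a narrow visit without a runtime error; anything else names the proof that is still missing before selection.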

Implementation refs
  • workshop-followup/toolkit/codex-skills/throwaway-route-variants/SKILL.md
  • workshop-followup/oracle/04-references.json
  • workshop-followup/oracle/04-interactive.json
  • Current dashboard route discovered from source inspection
  • Current shared dashboard/card/table/chart components
  • package.json scripts in the target repo
Done when
  • Three route-level dashboard variants exist temporarily and match the requested density directions.
  • Desktop and narrow-viewport browser proof is captured for all variants.
  • Existing data contracts remain read-only and stable during the experiment.
  • Only the selected dashboard implementation remains after cleanup.
  • A final route search or git diff proves temporary variants were removed.
Failure modes
  • Mistake: the agent judges responsiveness from code without opening the browser. Mitigation: require viewport-based browser proof for every variant before selection.
  • Mistake: the agent turns route variants into a permanent demo gallery. Mitigation: define the routes as disposable from the start and require post-selection deletion as part of done.
Agent references

Throwaway route variants skill - Primary workflow contract for this topic: generate route-level variants in the running app, add simple navigation, let the human pick, then delete losers.
workshop-followup/toolkit/codex-skills/throwaway-route-variants/SKILL.md

Lesson 04 validated references - Grounding source for the taught concepts: code as source of truth, design.md from generated inspiration, and disposable route variants.
workshop-followup/oracle/04-references.json

Lesson 04 validation guardrails - Use this to avoid reintroducing tools and claims that were removed as only mentioned or not taught in this section.
workshop-followup/oracle/04-validation.json

Lesson 04 interactive checklist - Useful for the implementation shape: prompt the agent to inspect current files, generate route-level variants, review in browser, promote one, and delete temporary routes.
workshop-followup/oracle/04-interactive.json

Local dev command - The agent should inspect the target repo and run the real dev, lint, typecheck, and build scripts rather than guessing commands.
package.json

CMUX state inspection - When working in CMUX, the agent can inspect the current pane/workspace before opening or using browser/terminal panes for proof.
cmux --json tree

CMUX pane list - Use when coordinating a terminal pane running the dev server and a browser pane used for route review.
cmux list-panes

Implementation notes

  • Start by inspecting the live repo. Routing conventions, package scripts, app shell, auth boundaries, and component names must come from current code, not memory.
  • Make the temporary routes obvious and searchable. Names like variant-a, variant-b, and variant-c are boring on purpose because cleanup is easier.
  • Keep variant-specific implementation route-local until the winner is selected. Promote only the winning idea into the real route after review.
  • Browser proof matters for this technique. A successful typecheck is not enough when the decision is visual hierarchy, flow, density, responsive behavior, or auth-shell fit.
  • The final cleanup is not optional. The technique only works if the losing variants are deleted instead of becoming a second design system.
  • When an authenticated browser proof is needed, spell out the login/test-user path or use safe local fixture data. Do not assume a browser agent has access to the developer's existing cookies.
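
The "obvious and searchable" naming pays off at cleanup time. As a rough sketch of a route search, assuming the temporary routes really are named `variant-a`, `variant-b`, and `variant-c` (hypothetical names) and that build and dependency folders can be skipped:

```python
from pathlib import Path

# Hypothetical temporary route names; use whatever the agent actually created.
TEMP_ROUTE_NAMES = ("variant-a", "variant-b", "variant-c")
SKIP_DIRS = {"node_modules", ".git", ".next"}


def leftover_variant_paths(repo_root: str) -> list[str]:
    """List files or directories whose name still contains a temporary route name."""
    root = Path(repo_root)
    hits = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if any(part in SKIP_DIRS for part in rel.parts):
            continue  # ignore dependencies and build output
        if any(name in path.name for name in TEMP_ROUTE_NAMES):
            hits.append(rel.as_posix())
    return sorted(hits)
```

An empty result is one piece of cleanup proof. It only checks file and directory names, so it complements the git diff rather than replacing it; leftover links to the deleted routes inside other files would need a content search.
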
Lesson 05 - The 5 Monkeys & Perspective Prompting

5 Monkeys QA swarm

Toolkit artifact

Web developers already trust tests for known requirements, but fast-moving UI work also needs cheap exploratory passes that click through real app state. This technique gives local agents a practical browser-facing QA job: find weird UX and state bugs before users do.

01

Authenticated dashboard weird-path pass

As a frontend engineer shipping an authenticated Next.js dashboard

A Next.js app has a local dev server, a login flow, route-level navigation, filter panels, modal dialogs, and a dashboard state that often changes during product iteration.

As a frontend engineer, I want several perspective-driven agents to explore non-happy-path dashboard behavior, so that I can catch stuck state, broken back navigation, and confusing UI before merging.
Agent brief

Run a 5 Monkeys QA swarm against the authenticated dashboard. First inspect the repo for the dev command and any documented test user or seeded auth flow. Start the app locally using the repo’s normal command, such as npm run dev, pnpm dev, or the command documented in package.json/README. Create .notes/qa-dashboard-swarm.md as a discardable numbered QA file. Assign five lenses: expert power user, brand-new first-time user, grandparent/inexperienced user, impatient power user, and distracted returning user. For each lens, use a browser workflow against localhost and deliberately avoid only confirming the happy path. Try: login, navigate to the dashboard, open filters, toggle modes, click the same controls repeatedly, use browser back/forward mid-flow, abandon a modal halfway through, refresh after changing state, and return to the page. Record each finding with: perspective, exact path, expected behavior, observed behavior, severity, and whether it looks like stable business logic or fast-changing UI. Expected output: .notes/qa-dashboard-swarm.md plus a short final summary of the 3-7 useful findings only.
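
The finding fields named in the brief map naturally onto a small record. The following is a sketch, not the toolkit's format: the class name, the severity scale, and the Markdown layout are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    perspective: str
    path: str           # exact click/navigation path that triggered it
    expected: str
    observed: str
    severity: str       # assumed scale: "low", "medium", "high"
    stable_logic: bool  # True for stable business logic, False for fast-changing UI


def render_findings(findings: list[Finding]) -> str:
    """Render findings as a numbered Markdown list grouped by perspective."""
    lines = ["# QA dashboard swarm (discardable)", ""]
    number = 0
    for perspective in sorted({f.perspective for f in findings}):
        lines.append(f"## {perspective}")
        for f in findings:
            if f.perspective != perspective:
                continue
            number += 1  # numbering runs across the whole file
            kind = "stable business logic" if f.stable_logic else "fast-changing UI"
            lines.append(f"{number}. [{f.severity}] {f.path}")
            lines.append(f"   - Expected: {f.expected}")
            lines.append(f"   - Observed: {f.observed}")
            lines.append(f"   - Classification: {kind}")
        lines.append("")
    return "\n".join(lines)
```

Writing the returned string to `.notes/qa-dashboard-swarm.md` gives a file that can be regenerated on every pass, which fits the "discardable" intent.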

Browser move

Use an agent-browser-style pass or CMUX browser pane against the local URL. Fill the login form using explicit test credentials or signup instructions from the repo, take an accessibility snapshot after login, click through navigation and controls, use back/forward after reaching nested dashboard states, then evaluate whether visible state, URL, and focused controls still agree.

Implementation refs
  • workshop-followup/toolkit/prompts/05-five-monkeys-qa-swarm.md
  • workshop-followup/toolkit/codex-skills/five-monkeys-qa/SKILL.md
  • workshop-followup/oracle/05-references.json
  • package.json
  • README.md
  • cmux --json tree
  • cmux read-screen
  • cmux markdown open .notes/qa-dashboard-swarm.md --direction right
Done when
  • A discardable .notes/qa-dashboard-swarm.md file exists with numbered findings grouped by perspective.
  • The browser proof includes local URL, login path used, at least one snapshot or screenshot reference, and notes on back/forward, repeated-click, and toggle behavior.
  • The final triage separates useful findings from noise and marks which issues deserve durable tests because they touch stable business logic.
Failure modes
  • The agent only walks the happy path; mitigate by requiring each perspective to perform at least three weird-path moves before writing conclusions.
  • The agent assumes existing browser cookies; mitigate by scripting login/signup explicitly and recording the auth path used.
02

Checkout flow state-bug hunt before release

As a full-stack web developer preparing a checkout or billing flow for release

A local ecommerce or SaaS checkout flow has plan selection, quantity changes, discount code entry, payment-step navigation, validation messages, and browser history interactions.

As a full-stack developer, I want agents to attack the checkout flow from different user perspectives, so that I can find state mismatches and UX dead ends without overbuilding brittle tests for UI that may still change.
Agent brief

Run the 5 Monkeys technique on the checkout/billing flow. Treat stable pricing, totals, validation rules, and submitted payloads as serious business logic; treat copy, spacing, and visual layout notes as flexible UI findings unless they block the user. Create .notes/qa-checkout-five-monkeys.md. Start the local app and document the exact command. Assign lenses: expert buyer, brand-new customer, grandparent/inexperienced user, impatient power user, and skeptical returning customer. In the browser, test non-obvious paths: change plan then press back, enter then remove a coupon, rapidly change quantity, start checkout then abandon before payment, refresh on a later step, use keyboard navigation where controls are visible, and retry after validation errors. Capture console errors, visible state, URL, and any mismatch between displayed total and expected total. Expected output: a discardable QA file, a concise release-risk summary, and recommendations for which findings should become durable automated tests.

Browser move

Use Playwright, a CMUX browser pane, or an agent-browser-style snapshot/click/fill/evaluate loop. Fill checkout inputs, click plan and coupon controls, use browser back/forward between checkout steps, refresh after a partial flow, and evaluate the DOM text for totals, validation messages, and disabled/enabled button state.
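
A mismatch between displayed and expected totals is easiest to report when the expected value is computed independently of the UI. The sketch below assumes a single percent-off coupon and half-up rounding to cents; the real pricing rules must come from the repo, and the function names are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def expected_total(unit_price: str, quantity: int, percent_off: int = 0) -> Decimal:
    """Compute the expected total for a simple percent-off coupon (assumed rule)."""
    subtotal = Decimal(unit_price) * quantity
    discount = subtotal * Decimal(percent_off) / Decimal(100)
    return (subtotal - discount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def parse_displayed_total(dom_text: str) -> Decimal:
    """Pull the first money-like number out of DOM text such as 'Total: $1,078.20'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", dom_text)
    if match is None:
        raise ValueError(f"no amount found in {dom_text!r}")
    return Decimal(match.group().replace(",", ""))


def total_mismatch(
    dom_text: str, unit_price: str, quantity: int, percent_off: int = 0
) -> Optional[str]:
    """Return a short mismatch note, or None when the displayed total is correct."""
    shown = parse_displayed_total(dom_text)
    expected = expected_total(unit_price, quantity, percent_off)
    if shown == expected:
        return None
    return f"displayed {shown} but expected {expected}"
```

Using `Decimal` rather than floats avoids rounding noise being reported as a checkout bug. The DOM text would come from the evaluate step after each odd path, such as removing a coupon or pressing back.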

Implementation refs
  • workshop-followup/toolkit/prompts/05-five-monkeys-qa-swarm.md
  • workshop-followup/toolkit/codex-skills/five-monkeys-qa/SKILL.md
  • workshop-followup/oracle/05-validation.json
  • workshop-followup/chunks/05/gemini-analysis.v2.md
  • package.json
  • playwright.config.* if present
  • .notes/qa-checkout-five-monkeys.md
Done when
  • A .notes/qa-checkout-five-monkeys.md file lists exact reproduction paths for each useful checkout finding.
  • Browser evidence includes the local URL, console-log notes, visible totals or validation text, and the action sequence that triggered each issue.
  • The final review identifies which findings should become durable tests because they protect stable business logic.
Failure modes
  • The agent treats all UI observations as test-suite requirements; mitigate by explicitly classifying stable business logic separately from fast-changing UI.
  • The agent reports vague concerns without reproduction steps; mitigate by requiring every useful finding to include exact clicks, form inputs, URL/history moves, and observed result.
03

Stress-test a pricing-plan upgrade flow

As a growth engineer owning a SaaS pricing page and upgrade checkout

A pricing page has monthly/yearly toggles, plan comparison cards, a coupon field, and an upgrade checkout modal. Happy-path tests pass, but users keep finding weird state bugs after toggling billing periods, backing out of checkout, and switching plans.

As a trial user comparing plans, I want billing toggles, coupon validation, selected plan state, and checkout recovery to stay consistent, so that I can upgrade without losing trust in the pricing flow.
Agent brief

Run a 5 Monkeys QA pass against the local pricing and upgrade flow. Assign five perspectives: brand-new trial user, impatient mobile user, keyboard-only user, existing customer changing plans, and skeptical QA engineer. Each perspective should explore weird paths: toggle monthly/yearly repeatedly, apply invalid and valid coupons, switch plans after opening checkout, use back/forward, refresh mid-flow, and try keyboard navigation. Each worker writes one discardable QA file under `.notes/qa/pricing-<perspective>.md` with steps, expected result, observed result, screenshot or browser transcript path, severity, and whether the issue is reproducible. Collate findings into `.notes/qa/pricing-summary.md` and recommend only the highest-value fixes.

Browser move

Use a CMUX browser pane, Playwright if already present, or an agent-browser-style snapshot/click/fill/evaluate loop. Open the local pricing route, snapshot plan names and billing toggle state, click monthly/yearly, open checkout for each plan, fill coupon codes, use back/forward and refresh, and capture visible selected-plan text plus console errors after each odd path.

Implementation refs
  • workshop-followup/toolkit/prompts/05-five-monkeys-qa-swarm.md
  • workshop-followup/toolkit/codex-skills/five-monkeys-qa/SKILL.md
  • package.json scripts
  • local pricing route and checkout/modal source files
  • cmux browser snapshot/click/fill/evaluate commands or existing Playwright tests
Done when
  • Five perspective-specific QA markdown files exist under `.notes/qa/`.
  • The summary separates reproducible product bugs from preference notes and one-off observations.
  • At least one browser receipt exists per high-severity finding: screenshot, accessibility snapshot, console output, or Playwright trace.
  • Recommended fixes are ranked by user impact and implementation risk, not by which worker found them first.
Failure modes
  • The workers all follow the same happy path; mitigate by assigning explicit perspectives and weird-path instructions before they start.
  • The agent turns observations into permanent tests too early; mitigate by keeping findings discardable until the team decides which behavior should become stable.
  • The QA pass uses real payment/customer data; mitigate with local fixtures, test cards, or mocked checkout state only.
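
Ranking by user impact and implementation risk, rather than by discovery order, can be made explicit when collating the summary. This is a sketch under assumed 1-to-5 scores; the field names and scales are hypothetical and would be filled in by whoever triages the worker files.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    user_impact: int          # assumed 1 (minor) to 5 (blocks the upgrade)
    implementation_risk: int  # assumed 1 (safe) to 5 (risky)
    reproducible: bool
    found_by: str             # kept for traceability, never used for ordering


def rank_fixes(candidates: list[Candidate], limit: int = 3) -> list[Candidate]:
    """Keep reproducible findings, highest impact first, lowest risk as tiebreak."""
    reproducible = [c for c in candidates if c.reproducible]
    ranked = sorted(
        reproducible,
        key=lambda c: (-c.user_impact, c.implementation_risk, c.title),
    )
    return ranked[:limit]
```

Dropping non-reproducible items before sorting keeps preference notes and one-off observations out of the recommended fixes.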
Agent references

Existing 5 Monkeys prompt - Use it as the canonical copy-paste behavior contract for perspective lenses, weird-path exploration, discardable QA files, and browser-auth guardrails.
workshop-followup/toolkit/prompts/05-five-monkeys-qa-swarm.md

Existing five-monkeys Codex skill - Use it to keep the agent behavior aligned with the toolkit technique instead of inventing a new QA framework.
workshop-followup/toolkit/codex-skills/five-monkeys-qa/SKILL.md

Validated lesson 05 references - Treat this as ground truth for taught concepts: perspective prompting, discardable QA files, 5 Monkeys exploration, CMUX state, browser-driving tools, and video-driven intent extraction.
workshop-followup/oracle/05-references.json

Lesson 05 validation notes - Use it to avoid reintroducing details removed during validation, especially anything only mentioned in chat or not taught as a workflow.
workshop-followup/oracle/05-validation.json

CMUX state and pane commands - Use these commands when the agent needs to inspect the active workspace, read terminal output, or open a live Markdown QA tracker.
cmux --json tree; cmux read-screen; cmux markdown open .notes/qa-dashboard-swarm.md --direction right

Implementation notes

  • Keep the QA files explicitly discardable under .notes/ so agents can regenerate them as the UI changes.
  • For browser verification, tell the agent the login or signup path; do not assume cookies, private session state, or authenticated browser context.
  • The useful receipt is not a large pile of findings. It is a short triage of reproducible weird-path issues, with noisy observations removed.
  • Keep rigorous automated tests for stable business logic such as checkout totals, permissions, data persistence, and validation rules.
  • Use CMUX browser panes, agent-browser-style snapshots/click/fill/evaluate, or Playwright when live UI behavior matters more than source inspection.
Lesson 06 - Design With Images & Video

Image-to-design handoff

Toolkit artifact

Web developers already live in running apps, local dev servers, screenshots, and browser proof. This pack turns the taught image-to-design handoff into concrete flows where a selected reference image guides the code diff and the browser proves the result.

01

Dashboard card from selected reference

As a frontend engineer maintaining a Next.js product dashboard

A billing or usage card in a real dashboard route works correctly but looks generic and has drifted from the product direction. You have a screenshot of the current component and either a selected reference image already chosen by the human or permission to generate a sheet of variations and pause for selection.

As a frontend engineer, I want Codex to implement one selected visual direction for the dashboard card while preserving existing data and behavior, so that the UI improves without turning a working component into a risky rewrite.
Agent brief

Copy-paste to Codex: Use the image-to-design handoff workflow for the dashboard card at <ROUTE>, backed by these artifacts: current screenshot <CURRENT_SCREENSHOT_PATH>, selected reference image <SELECTED_REFERENCE_IMAGE_PATH>, and relevant source files <COMPONENT_OR_ROUTE_PATHS>.
  1. Inspect package.json and the current component/source before editing. Treat the live code as source of truth for props, data loading, auth, events, and tests.
  2. If <SELECTED_REFERENCE_IMAGE_PATH> does not exist yet, ask an image model for one sheet of 5 visual variations of this exact card based on <CURRENT_SCREENSHOT_PATH>, then stop and ask me which direction to use. Do not implement until I pick one.
  3. Once the selected reference exists, implement ONLY that direction. Match layout, spacing, color, type, density, border/radius, and visual hierarchy. Do not blend in other variations. Preserve existing copy, links, data-testid values, analytics attributes, and behavior unless a markup change is necessary for accessibility.
  4. Run `npm run lint`. Run the nearest existing test command from package.json if one is already present; do not invent a test framework.
  5. Run `npm run dev` and prove the result in a browser at `http://127.0.0.1:3000<ROUTE>` or the port printed by the dev server.
Expected output: a focused diff, the commands run, a browser screenshot saved as `artifacts/image-to-design-dashboard-card.png`, and a short review note listing inspected files plus any intentional differences from the reference.

Browser move

Use a CMUX browser pane or an agent-browser-style flow: open `http://127.0.0.1:3000<ROUTE>`, take an accessibility snapshot to confirm the card title, values, and CTA names are still present, then use evaluate to capture bounding boxes and computed styles for the card root, heading, primary value, and CTA. Save a screenshot after the DOM proof; a screenshot alone is not enough.
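
The DOM proof can be reduced to two small checks once the snapshot and evaluate results are in hand. This sketch assumes the accessibility snapshot has already been flattened into (role, name) pairs and that boxes carry the `x`, `y`, `width`, and `height` fields returned by `getBoundingClientRect()`; both helper names are hypothetical.

```python
def lost_names(before: list[tuple[str, str]], after: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (role, name) pairs present before the restyle but absent after it."""
    remaining = set(after)
    return [pair for pair in before if pair not in remaining]


def box_inside(inner: dict, outer: dict) -> bool:
    """True when the inner bounding box sits fully within the outer one."""
    return (
        inner["x"] >= outer["x"]
        and inner["y"] >= outer["y"]
        and inner["x"] + inner["width"] <= outer["x"] + outer["width"]
        and inner["y"] + inner["height"] <= outer["y"] + outer["height"]
    )
```

For example, `lost_names` should return an empty list for the card title, values, and CTA, and `box_inside(cta_box, card_box)` should hold if the CTA has not been pushed outside the card by the new layout.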

Implementation refs
  • workshop-followup/toolkit/prompts/06-image-to-design-handoff-prompt.md
  • workshop-followup/toolkit/bundles/image-to-design-handoff-prompt.bundle.txt
  • workshop-followup/oracle/06-references.json
  • package.json
  • <COMPONENT_OR_ROUTE_PATHS>
  • <CURRENT_SCREENSHOT_PATH>
  • <SELECTED_REFERENCE_IMAGE_PATH>
Done when
  • Git diff is limited to the target component/route and supporting style files.
  • `npm run lint` passes, and any pre-existing nearest test command either passes or the failure is documented with exact output.
  • `artifacts/image-to-design-dashboard-card.png` shows the local route matching the selected reference direction closely enough for human review.
  • Browser proof includes accessibility snapshot evidence that labels, values, links, and CTA names were not lost.
  • Review note explicitly says which reference image was used and confirms no other variation was blended in.
Failure modes
  • The agent averages several generated directions into a mushy UI; mitigate by removing non-selected references from context and repeating: implement ONLY <SELECTED_REFERENCE_IMAGE_PATH>.
  • The agent rewrites data loading or auth while chasing the visual target; mitigate by requiring source inspection first and rejecting diffs outside the target UI/styling surface.
  • The agent claims visual success from code only; mitigate by requiring browser screenshot plus accessibility snapshot/evaluate evidence from the running dev server.
02

Auth-gated onboarding visual refresh

As a full-stack product engineer responsible for a SaaS onboarding flow

The onboarding flow behind login is functional, but the organization setup step needs to match a selected generated reference image. The proof has to run through the real browser flow because the page depends on auth state, form validation, redirects, and route guards.

As a product engineer, I want the agent to apply the selected image direction to the onboarding step and verify it through login and form interaction, so that the visual refresh does not break the first-run user path.
Agent brief

Copy-paste to Codex: Apply image-to-design handoff to the onboarding screen at <ONBOARDING_ROUTE>. Artifacts: selected reference image <SELECTED_REFERENCE_IMAGE_PATH>, current screenshot <CURRENT_SCREENSHOT_PATH>, explicit login or seed instructions <AUTH_INSTRUCTIONS>, and source hints <AUTH_FILES_AND_ONBOARDING_FILES>.
  1. Inspect the route, auth guard, form component, validation messages, and existing tests before editing. Do not assume browser cookies are shared with the automation environment.
  2. If a selected reference is missing, generate one sheet of 5 visual variations for this onboarding step, then stop for human selection before editing code.
  3. Once the selected reference exists, implement ONLY the visual direction in <SELECTED_REFERENCE_IMAGE_PATH>. Preserve the onboarding state machine, validation behavior, submit handlers, redirects, copy, and analytics/data attributes.
  4. Run `npm run lint`. If package.json already defines a relevant test, run that exact script. Do not add a new browser framework just to prove this change.
  5. Run `npm run dev`. Verify in a browser from login through <ONBOARDING_ROUTE> using <AUTH_INSTRUCTIONS>.
Expected output: focused UI diff, command logs, `artifacts/onboarding-before-or-current.png` if available, `artifacts/onboarding-after-selected-reference.png`, and a short proof note covering login, validation, successful submit or expected blocked submit, and any console/dev-server errors.

Browser move

Use agent-browser-style actions: snapshot the login screen, fill the explicit test email/password or complete the provided seed-login path, click through to <ONBOARDING_ROUTE>, snapshot the form, fill a realistic organization name, click Continue, and evaluate that validation text, focus state, and redirect/blocked-submit behavior still match the original flow. This is the right proof because auth, route guards, and form behavior cannot be trusted from a static screenshot.

Implementation refs
  • workshop-followup/toolkit/prompts/06-image-to-design-handoff-prompt.md
  • workshop-followup/oracle/06-references.json
  • workshop-followup/oracle/06-validation.json
  • package.json
  • <AUTH_FILES_AND_ONBOARDING_FILES>
  • <AUTH_INSTRUCTIONS>
  • <CURRENT_SCREENSHOT_PATH>
  • <SELECTED_REFERENCE_IMAGE_PATH>
Done when
  • Browser path starts from an unauthenticated state and reaches the onboarding screen through the documented login or seed flow.
  • The refreshed onboarding screen visually follows the selected reference image while preserving form labels, validation, focus states, and submit behavior.
  • `npm run lint` passes and relevant existing tests pass or exact failure output is included.
  • Screenshot and browser snapshot artifacts are saved under `artifacts/` with names tied to the route.
  • The final note calls out that cookies were not assumed and names the auth path used.
Failure modes
  • The browser agent gets stuck because it expects the human's normal cookies; mitigate by giving explicit auth/seed instructions and starting proof from a clean browser state.
  • The visual refresh hides validation or keyboard focus states; mitigate by tabbing through the form and evaluating visible focus/validation output before declaring done.
  • The agent changes redirects or submit logic while editing the screen; mitigate by diff-reviewing handlers and requiring the browser path to prove submit or expected blocked submit.
03

Responsive marketing hero handoff

As a growth-focused web developer shipping a marketing route

A landing page hero needs to move from a vague direction like 'more premium' to a concrete selected image. The desktop composition is the design target, but the implementation also needs mobile proof because the route is public and responsive.

As a growth developer, I want the agent to translate one selected hero reference into the actual landing page while preserving CTA behavior and responsive layout, so that the shipped page has a clear visual direction and does not break conversion paths.
Agent brief

Copy-paste to Codex: Implement the selected image-to-design direction for the marketing hero at <MARKETING_ROUTE>. Artifacts: selected reference image <SELECTED_REFERENCE_IMAGE_PATH>, current route screenshot <CURRENT_SCREENSHOT_PATH>, and relevant source files <HERO_ROUTE_AND_COMPONENT_FILES>.
  1. Inspect package.json, the hero route/component files, and any shared button/link components before editing. Preserve copy, hrefs, tracking attributes, data-testid values, and form/newsletter behavior.
  2. If there is no selected reference image, ask an image model for a sheet of 5 hero variations from the current screenshot and product constraints, then stop for human selection.
  3. Implement only the selected direction. Match composition, rhythm, spacing, type scale, contrast, image/illustration treatment, and CTA prominence. Do not introduce a maintained design system or broad route-variant workflow for this task.
  4. Run `npm run lint` and any existing route/component test script listed in package.json.
  5. Run `npm run dev` and prove desktop and mobile in the browser.
Expected output: a focused diff, desktop screenshot `artifacts/hero-desktop-selected-reference.png`, mobile screenshot `artifacts/hero-mobile-selected-reference.png`, command output, and a concise implementation note describing how the selected image was translated into code.

Browser move

Use a local browser, CMUX browser pane, or Playwright if already present. Open `http://127.0.0.1:3000<MARKETING_ROUTE>` at 1440x900 and 390x844. Take screenshots at both sizes. Use evaluate to assert `document.documentElement.scrollWidth <= document.documentElement.clientWidth`, the primary CTA is visible without horizontal scroll, and the CTA href/data attributes are unchanged.
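
Once the evaluate results are collected at both sizes, the pass/fail decision can be written down explicitly. The sketch below assumes the browser step has already returned `scrollWidth`, `clientWidth`, CTA visibility, and the CTA href for each viewport; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ViewportMetrics:
    width: int         # viewport width used for the check, e.g. 1440 or 390
    scroll_width: int  # document.documentElement.scrollWidth
    client_width: int  # document.documentElement.clientWidth
    cta_visible: bool  # primary CTA visible without horizontal scroll
    cta_href: str      # href read from the rendered CTA


def hero_proof_failures(metrics: list[ViewportMetrics], original_href: str) -> list[str]:
    """Return human-readable failures; an empty list means the proof passes."""
    failures = []
    for m in metrics:
        if m.scroll_width > m.client_width:
            failures.append(
                f"{m.width}px: horizontal overflow "
                f"(scrollWidth {m.scroll_width} > clientWidth {m.client_width})"
            )
        if not m.cta_visible:
            failures.append(f"{m.width}px: primary CTA not visible")
        if m.cta_href != original_href:
            failures.append(f"{m.width}px: CTA href changed to {m.cta_href}")
    return failures
```

Recording `original_href` before the restyle is what makes the "unchanged" claim checkable; the same pattern extends to data attributes if they are captured as well.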

Implementation refs
  • workshop-followup/toolkit/prompts/06-image-to-design-handoff-prompt.md
  • workshop-followup/toolkit/catalog.def.mjs
  • workshop-followup/oracle/06-references.json
  • package.json
  • <HERO_ROUTE_AND_COMPONENT_FILES>
  • <CURRENT_SCREENSHOT_PATH>
  • <SELECTED_REFERENCE_IMAGE_PATH>
Done when
  • Desktop and mobile screenshots are both saved and reviewable.
  • No horizontal overflow at the mobile viewport.
  • Primary CTA remains visible, clickable, and wired to the original href or handler.
  • `npm run lint` and any relevant existing test script pass, or the exact failure logs are included.
  • Final review note maps the selected image's visible traits to concrete implementation choices: spacing, type, color, layout, and CTA treatment.
Failure modes
  • The agent overfits the desktop reference and breaks mobile; mitigate by making mobile viewport proof part of done, not a later nice-to-have.
  • The agent replaces tracked links/buttons while restyling; mitigate by preserving hrefs, handlers, data-testid values, and analytics attributes and checking them with browser evaluate.
  • The agent turns a one-screen handoff into a broad design-system rewrite; mitigate by limiting scope to <HERO_ROUTE_AND_COMPONENT_FILES> and the selected image direction.
Agent references

Image-to-design handoff prompt - Use as the base copy-paste workflow: generate a sheet, pause for human selection, then implement only the chosen reference image.
workshop-followup/toolkit/prompts/06-image-to-design-handoff-prompt.md

Lesson 06 validated references - Grounds the scenarios in the taught concepts: CMUX orchestration, model choice by task shape, generated images as design references, video context, and MCP-vs-CLI boundaries.
workshop-followup/oracle/06-references.json

Lesson 06 validation notes - Use this to avoid reintroducing items that were only mentioned or softened, especially around Impeccable, Google Stitch, Executor, Clicklight, and pi.dev.
workshop-followup/oracle/06-validation.json

Packed source bundle - Use as the local bundle snapshot when an agent needs the prompt, references, catalog context, and Gemini analysis in one place.
/mnt/data/scenario-image-design.txt

CMUX browser or pane state - Use when the agent needs to identify the current pane or coordinate browser proof from a CMUX workspace.
cmux --json tree

Browser verification primitive shape - Use for web proof after implementation: snapshot for accessible structure, click/fill for flows, evaluate for DOM/style assertions, and screenshot for human visual review.
agent-browser-style workflow: snapshot -> click/fill -> evaluate -> screenshot

Implementation notes

  • Keep the technique narrow: image sheet, human selection, implement only the selected reference, browser proof.
  • For web apps, the selected image is not enough. The agent must inspect current source first so it preserves data loading, auth, handlers, copy, tests, and route behavior.
  • Use Codex or a persistent coding agent for the implementation pass. Fast Flash-style models are a poor fit for a multi-step UI implementation that may require self-correction.
  • Use deterministic scripts and browser checks for proof. Do not wrap a direct CLI or test command in MCP just because MCP is available.
  • Do not claim pixel perfection. Ask for concrete evidence: screenshots, accessibility snapshots, computed style or bounding-box checks, command logs, and a short diff review note.
  • When auth is involved, do not assume cookies are available. Provide explicit login, seed, or test-account instructions and prove the flow from a clean browser state.
  • Generated reference images can express taste, but the human selection is the decision point. The agent should stop before implementation when no selected reference exists.

Part 3 - Afternoon

Lesson 07 - Hooks, Skills & the Ralph Loop

Ralph Loop stop-hook continuation

Toolkit artifact

Web developers already run agents against real repos, dev servers, browser panes, and local QA loops; the Ralph Loop helps those agents keep working toward a bounded goal instead of stopping after one polite assistant turn. It is especially useful for web tasks where success is observable only after code, tests, logs, and browser behavior line up.

01

Keep fixing a checkout bug until browser proof passes

As a senior frontend engineer responsible for a Next.js checkout flow

A local Next.js app has a flaky checkout path: the agent often fixes one validation error, reports done, and stops before proving the full browser flow from cart to confirmation. The team wants a Ralph Loop plugin that keeps the agent on the original checkout goal until tests and browser evidence prove the flow works.

As a senior frontend engineer, I want a Stop hook to re-point the agent at the original checkout goal until the checkout flow passes, so that the agent does not stop after only a partial code fix.
Agent brief

Build a minimal Ralph Loop Codex plugin for this repo. Inspect workshop-followup/toolkit/prompts/07-ralph-loop-stop-hook-prompt.md and workshop-followup/toolkit/codex-skills/ralph-loop-hook/SKILL.md first. Enable hooks and plugins in ~/.codex/config.toml with hooks=true, plugins=true, plugin_hooks=true, then tell me to restart Codex before testing. Scaffold ralph-loop-checkout/.codex-plugin/plugin.json, ralph-loop-checkout/hooks/hooks.json, and ralph-loop-checkout/hooks/hook.py. Wire only the Stop hook to python3 "${CODEX_PLUGIN_ROOT}/hooks/hook.py" with timeout 5. The hook must read stdin JSON and write exactly one control JSON line. Store tiny local state for the original goal and iteration count. Completion condition: stop when all of these receipts exist in .notes/ralph-loop-checkout/: passing-test.log, browser-proof.md, and final-summary.md, or when iteration_count reaches 5. Do not use suppressOutput in Stop. After scaffolding, run a local simulation command that pipes mock Stop events into hook.py and proves both paths: continue before receipts exist, allow stop after receipts exist or at the cap. Expected output: plugin files, a short README section showing install/test steps, and simulation logs saved under .notes/ralph-loop-checkout/.
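
The completion rule in this brief (all three receipts present, or five iterations) can be sketched as a small decision function. This is not the toolkit's hook.py. In particular, the control payload shape is an assumption: `decision`/`reason` for "continue" and an empty object for "allow stop" are placeholders, and the state keys are hypothetical. Confirm the real schema in the referenced prompt and skill.

```python
import json
from pathlib import Path

# Receipt names and the cap come from the brief; the state keys are hypothetical.
RECEIPTS = ("passing-test.log", "browser-proof.md", "final-summary.md")
MAX_ITERATIONS = 5


def handle_stop(raw_event: str, receipt_dir: Path, state: dict) -> str:
    """Return exactly one control JSON line for a Stop event.

    ASSUMPTION: "decision"/"reason" as the continue payload and "{}" as the
    allow-stop payload are placeholders, not a confirmed Codex schema.
    """
    json.loads(raw_event)  # fail loudly if stdin was not valid JSON
    missing = [name for name in RECEIPTS if not (receipt_dir / name).is_file()]
    count = state.get("iteration_count", 0)

    if not missing:
        state["status"] = "complete"
        payload = {}
    elif count >= MAX_ITERATIONS:
        # Cap reached without proof: allow the stop but record it as a failure.
        state["status"] = "cap-reached-without-proof"
        payload = {}
    else:
        state["iteration_count"] = count + 1
        state["status"] = "continuing"
        payload = {
            "decision": "block",
            "reason": (
                f"Original goal: {state.get('goal', '')}. "
                f"Missing receipts: {', '.join(missing)}. Keep working."
            ),
        }
    return json.dumps(payload)  # single line, never includes suppressOutput
```

In a real hook.py, the entry point would pass `sys.stdin.read()` as `raw_event`, load and save `state` from a small JSON file, and print the returned line once. Keeping the decision in a pure function also makes the required simulation easy: call it with an empty receipts directory, then again after creating the three files, and once more with the count at the cap.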

Browser move

Use the app's local dev server and a browser agent or Playwright-style flow to visit the checkout route, add a test item, fill the checkout form, submit, and capture the confirmation page state. Save the route, actions, visible confirmation text, and any console errors to .notes/ralph-loop-checkout/browser-proof.md. If running inside CMUX, open or use a browser pane for the local app and keep the terminal pane visible for test output.

Implementation refs
  • workshop-followup/toolkit/prompts/07-ralph-loop-stop-hook-prompt.md
  • workshop-followup/toolkit/codex-skills/ralph-loop-hook/SKILL.md
  • workshop-followup/toolkit/bundles/code-better-plugins.md
  • ~/.codex/config.toml
  • ralph-loop-checkout/.codex-plugin/plugin.json
  • ralph-loop-checkout/hooks/hooks.json
  • ralph-loop-checkout/hooks/hook.py
Done when
  • A valid Codex plugin layout exists with .codex-plugin/plugin.json, hooks/hooks.json, and hooks/hook.py.
  • hooks/hooks.json wires a Stop hook command and does not add unrelated lifecycle hooks.
  • hook.py emits exactly one control JSON line per event and never emits suppressOutput for Stop.
  • A hard completion condition is enforced by receipt files plus an iteration cap of 5.
  • Simulation logs show the hook continues before checkout proof exists and allows stopping after proof or cap.
  • .notes/ralph-loop-checkout/browser-proof.md records the browser flow, visible result, and console-error status.
Failure modes
  • Common mistake: the agent only writes a prompt that says 'keep going' instead of enforcing continuation in a Stop hook. Mitigation: require a real hooks/hooks.json Stop entry and a hook.py simulation log.
  • Common mistake: the loop is unbounded and can churn on tools forever. Mitigation: enforce the iteration cap in hook state and treat missing proof after the cap as a failed receipt, not silent success.
02

Stop-hook guard for visual regression cleanup

As a product-minded UI developer polishing a dashboard redesign

A dashboard redesign has several small layout regressions across authenticated pages. Agents keep fixing the first obvious CSS issue and stopping without checking the rest of the dashboard routes in the running app.

As a UI developer, I want a Ralph Loop to keep the agent cycling through the original dashboard acceptance criteria until browser receipts are written for each target route, so that layout cleanup is proven in the app instead of inferred from code.
Agent brief

Create a Ralph Loop plugin named ralph-loop-dashboard for dashboard visual cleanup. Start by reading workshop-followup/oracle/07-references.json, workshop-followup/toolkit/prompts/07-ralph-loop-stop-hook-prompt.md, and workshop-followup/toolkit/bundles/code-better-plugins.md. Use the taught lifecycle model: UserPromptSubmit, PreToolUse or Tool Use, and Stop; implement only the Stop hook for this scenario. Plugin layout must be ralph-loop-dashboard/.codex-plugin/plugin.json, ralph-loop-dashboard/hooks/hooks.json, and ralph-loop-dashboard/hooks/hook.py. Add a .notes/ralph-loop-dashboard/acceptance.md file that lists the exact dashboard routes to verify and a .notes/ralph-loop-dashboard/proofs/ directory for receipts. Completion condition: allow stop only when acceptance.md has every route marked verified with a matching proof markdown file, or when 4 Stop iterations have happened. Each continuation message should point back to the original dashboard goal and name the next missing route proof. Expected output: plugin scaffold, acceptance/proof file contract, hook simulation, and instructions for installing with codex plugin marketplace add "$PWD" then codex plugin add ralph-loop-dashboard@better-plugins only if this repo is being used as a marketplace-style plugin source.
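
One way the hook could name the next missing route proof is sketched below. The brief does not fix a format for acceptance.md, so this assumes a Markdown checklist with one route per line (`- [x] /dashboard/billing`) and a hypothetical proof filename derived from the route; both are assumptions to adapt to whatever file contract the scaffold actually writes.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed acceptance.md line format: "- [x] /dashboard/billing" or "- [ ] /dashboard/billing".
ROUTE_LINE = re.compile(r"^\s*-\s*\[(?P<mark>[ xX])\]\s+(?P<route>/\S*)")


def proof_name(route: str) -> str:
    """Map a route to a hypothetical proof filename, e.g. /dashboard/billing -> dashboard-billing.md."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", route).strip("-") or "root"
    return f"{slug}.md"


def next_missing_route(acceptance_text: str, proofs_dir: Path) -> Optional[str]:
    """Return the first route that is unchecked or lacks a proof file, else None."""
    for line in acceptance_text.splitlines():
        match = ROUTE_LINE.match(line)
        if not match:
            continue  # headings and notes are ignored
        verified = match["mark"].lower() == "x"
        has_proof = (proofs_dir / proof_name(match["route"])).is_file()
        # A checked box without a proof file still counts as missing.
        if not (verified and has_proof):
            return match["route"]
    return None
```

Requiring both the checkmark and the file means a route marked verified from source inspection alone is still reported as the next missing proof.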

Browser move

Use a CMUX browser pane, agent-browser-style accessibility snapshot/click/fill/evaluate workflow, or Playwright to visit each listed dashboard route on localhost. For each route, capture the route URL, viewport size, key visible labels, broken layout notes, and console errors. Write one proof file per route under .notes/ralph-loop-dashboard/proofs/ and mark the route verified in acceptance.md only after the browser check has actually run.

Implementation refs
  • workshop-followup/oracle/07-references.json
  • workshop-followup/toolkit/prompts/07-ralph-loop-stop-hook-prompt.md
  • workshop-followup/toolkit/bundles/code-better-plugins.md
  • .notes/ralph-loop-dashboard/acceptance.md
  • .notes/ralph-loop-dashboard/proofs/
  • codex plugin marketplace add "$PWD"
  • codex plugin add ralph-loop-dashboard@better-plugins
Done when
  • The Stop hook can identify the next missing dashboard route proof from acceptance.md.
  • The hook continues while route proofs are missing and allows stop only after all proofs exist or after 4 iterations.
  • Each route has a proof markdown file with URL, viewport, visible labels, and console-error status.
  • The plugin scaffold uses the real Codex plugin layout and keeps hook work fast and boring.
  • The final review artifact explains which routes were checked, which files changed, and whether any proof is still missing.
Failure modes
  • Common mistake: the agent marks routes verified from source inspection alone. Mitigation: require browser proof files and reject completion when a proof file is missing.
  • Common mistake: the hook tries to do browser automation itself and becomes slow or brittle. Mitigation: keep the hook limited to state checks and continuation control; put browser verification in the agent workflow and receipt files.
03

Prototype a safe Ralph Loop before installing it globally

As a developer-tools engineer maintaining local Codex plugin experiments

The team wants to experiment with stop-hook continuation, but the Better Plugins material is prototype-stage and personally configured. Before installing anything globally, they need a toy Ralph Loop with deterministic local simulation and clear guardrails.

As a developer-tools engineer, I want a toy Ralph Loop plugin with local simulations and an explicit completion cap, so that we can learn the hook mechanics without risking a runaway agent session in a production repo.
Agent brief

Build a toy plugin named ralph-loop-toy that demonstrates the Ralph Loop without touching application code. Use only the taught plugin/hook model from lesson 07 and the existing toolkit artifacts. Read workshop-followup/oracle/07-references.json, workshop-followup/oracle/07-validation.json, workshop-followup/toolkit/codex-skills/ralph-loop-hook/SKILL.md, and workshop-followup/toolkit/bundles/code-better-plugins.md. Scaffold ralph-loop-toy/.codex-plugin/plugin.json, ralph-loop-toy/hooks/hooks.json, ralph-loop-toy/hooks/hook.py, and ralph-loop-toy/tests/simulate-stop-events.py. The toy goal is: create .notes/ralph-loop-toy/done.txt containing DONE. The Stop hook should continue until done.txt exists or until iteration_count reaches 3. The simulator must run three cases: no done.txt means continue, done.txt means stop, cap reached means stop with a clear failure note. Expected output: the plugin files, simulator output, and a short README that tells students to review /plugins and /hooks after install and to restart Codex after config changes.
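
The three simulator cases can be illustrated without a live Codex session. The sketch below exercises only a hypothetical decision function in a throwaway directory; the real simulate-stop-events.py described in the brief would instead pipe mock Stop events into hook.py, and the result labels used here are illustrative, not Codex control values.

```python
import tempfile
from pathlib import Path

CAP = 3  # iteration cap from the brief


def decide(notes_dir: Path, iteration_count: int) -> str:
    """Return 'stop-success', 'stop-cap', or 'continue' for one simulated Stop event."""
    done = notes_dir / "done.txt"
    if done.is_file() and done.read_text().strip() == "DONE":
        return "stop-success"
    if iteration_count >= CAP:
        return "stop-cap"  # bounded failure: the goal was never proven
    return "continue"


def run_cases() -> dict[str, str]:
    """Run the three simulator cases against a temporary notes directory."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        notes = Path(tmp)
        results["no done.txt"] = decide(notes, iteration_count=1)
        results["cap reached"] = decide(notes, iteration_count=CAP)
        (notes / "done.txt").write_text("DONE\n")
        results["done.txt present"] = decide(notes, iteration_count=1)
    return results
```

Printing `run_cases()` and saving the output under .notes/ralph-loop-toy/ gives a deterministic receipt for all three paths before any install step is attempted.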

Browser move

Browser proof is not the right proof for this toy plugin because the goal is hook mechanics, not web-app behavior. The useful UI move is optional: if running in CMUX, open a live Markdown tracker with cmux markdown open .notes/ralph-loop-toy/tracker.md --direction right so the human can watch the simulation cases and receipts update.

Implementation refs
  • workshop-followup/oracle/07-references.json
  • workshop-followup/oracle/07-validation.json
  • workshop-followup/toolkit/codex-skills/ralph-loop-hook/SKILL.md
  • workshop-followup/toolkit/bundles/code-better-plugins.md
  • ralph-loop-toy/tests/simulate-stop-events.py
  • ~/.codex/config.toml
  • /plugins
  • /hooks
Done when
  • Toy plugin files exist in the real plugin directory layout.
  • hooks/hooks.json wires the Stop event to hook.py with a short timeout.
  • simulate-stop-events.py proves continue, success stop, and cap stop behavior without requiring a live Codex session.
  • The README includes the required ~/.codex/config.toml feature flags and restart note.
  • The implementation avoids production app edits, broad plugin behavior, and unbounded continuation.
Failure modes
  • Common mistake: installing a prototype globally before proving local behavior. Mitigation: require simulator output before any install instructions are followed.
  • Common mistake: mixing in #human, Toolsmith, Gemini Video, or dashboard behavior. Mitigation: keep this scenario to exactly one technique: Stop-hook continuation.
Agent references

Validated lesson 07 references - Use this as the source of truth for what was taught: compaction/forking, the big three hooks, Ralph Loop, skills versus plugins, and custom grammar triggers.
workshop-followup/oracle/07-references.json

Lesson 07 validation notes - Use this to avoid overclaiming or reintroducing details that were removed from the validated lesson scope.
workshop-followup/oracle/07-validation.json

Existing Ralph Loop prompt - Use this as the implementation-facing prompt baseline for scaffolding a minimal Stop-hook continuation plugin.
workshop-followup/toolkit/prompts/07-ralph-loop-stop-hook-prompt.md

Existing Ralph Loop Codex skill - Use this as the concise reusable skill contract: trigger conditions, prerequisite config, plugin layout, Stop hook wiring, and guardrails.
workshop-followup/toolkit/codex-skills/ralph-loop-hook/SKILL.md

Verified Better Plugins command and layout notes - Use this for the real plugin directory layout, hooks.json shape, control JSON rules, and install commands; do not invent plugin flags.
workshop-followup/toolkit/bundles/code-better-plugins.md

CMUX CLI notes - Use only for optional local workspace proof such as browser panes, live Markdown trackers, reading screens, or inspecting panes.
workshop-followup/toolkit/bundles/code-cmux-cli.md

Implementation notes

  • Keep the Ralph Loop to exactly one toolkit technique: Stop-hook continuation. Browser checks, Markdown trackers, and simulation scripts are proof surfaces, not extra toolkit topics.
  • The hook should be fast and boring: parse event input, inspect small local state or receipt files, emit one control JSON line, and exit.
  • Use a hard completion condition every time: iteration cap, explicit target state, or both. A Stop hook without a bound is not a workshop-quality scenario.
  • For web-app scenarios, require browser receipts because source inspection alone does not prove UI behavior.
  • Keep receipt files local under .notes/ so students can inspect proof without committing noisy agent scratch work.
  • Do not assume browser cookies are available to an agent-browser workflow. Authenticated flows need an explicit login/setup path or a local test fixture.
  • Treat Better Plugins examples as prototype reference material, not polished drop-in infrastructure.
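
The "tiny local state" these notes rely on can be as small as one JSON file holding the original goal and an iteration count. The following is a sketch under assumptions: the function names and the state file location are hypothetical, and a corrupt or missing file is treated as a fresh start so the hook never crashes on its own bookkeeping.

```python
import json
from pathlib import Path


def load_state(state_path: Path, default_goal: str) -> dict:
    """Load hook state, falling back to a fresh state if the file is missing or corrupt."""
    try:
        state = json.loads(state_path.read_text())
        if not isinstance(state, dict):
            raise ValueError("state file must contain a JSON object")
    except (OSError, ValueError):  # json.JSONDecodeError is a ValueError
        state = {}
    state.setdefault("goal", default_goal)
    state.setdefault("iteration_count", 0)
    return state


def record_iteration(state_path: Path, default_goal: str) -> dict:
    """Increment the iteration count, persist it, and return the updated state."""
    state = load_state(state_path, default_goal)
    state["iteration_count"] += 1
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state))
    return state
```

Because the goal is written once and only read afterwards, every continuation message can point back at the original goal even if later prompts drift.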

Lesson 08 - Self-Improving Loops & Deterministic Gates

Self-improving log-gated fix loop

Toolkit artifact

Web developers already run local dev servers, browser checks, test scripts, and agent panes; this technique turns those runtime artifacts into proof instead of vibes. It matters because UI bugs, auth regressions, and hook failures are often only obvious in logs, browser state, or terminal output, not in the agent's final message.

01

Auth redirect loop with browser evidence

As a full-stack Next.js developer maintaining an authenticated dashboard

A local Next.js app sometimes redirects logged-in users from /dashboard back to /login after a route middleware or session-cookie change. The agent claims the fix works, but the team needs proof from the running browser and structured logs.

As a full-stack developer, I want the agent to reproduce the auth redirect failure, capture exact runtime evidence, patch the workflow, and gate success on log comparison, so that I do not merge an auth fix based on a cheerful completion message.
Agent brief

You are my coding agent. Run a self-improving log-gated fix loop for the local auth redirect bug. First inspect the current code and existing verification scripts; do not rely on memory. Start the dev server using the repo's documented command from package.json. Reproduce the flow in a browser: visit /login, sign in with the local/dev credentials or seeded test user documented in the repo, navigate to /dashboard, refresh, and confirm whether the browser stays on /dashboard. Capture exact evidence: terminal output with `cmux read-screen`, browser console/network output if available, and a JSON log artifact at `.notes/log-gated/auth-redirect/run-<timestamp>.json`. The log must include goal, routeVisited, finalUrl, redirectCount, authCookiePresent as true/false when observable, serverStatusCodes, consoleErrors, terminalFailureText when present, and pass boolean. If the first run fails, use this repair prompt shape internally: `Goal: authenticated users should remain on /dashboard after login and refresh. Here is the exact failure: <paste terminal/browser/log evidence>. Explain WHY it failed in this scenario, then PATCH the workflow so this case is covered next time.` After patching, rerun the same browser flow and write a second JSON artifact. A supervisor pass must compare expected-vs-actual fields and only pass when finalUrl is /dashboard, redirectCount is 0 after the authenticated refresh, pass is true, and no new console/server errors appear. Finish with the files changed, commands run, log artifact paths, and a terse review note.
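
The supervisor pass described in this brief amounts to a field-by-field comparison of the two JSON artifacts. Here is a minimal sketch of that gate; the function name is hypothetical, `finalUrl` is accepted as either a path or a full URL, and treating status codes of 500 and above as "server errors" is an assumption, since the brief does not define the threshold.

```python
from urllib.parse import urlparse


def gate_auth_run(before: dict, after: dict) -> list[str]:
    """Return gate failures for the post-fix run; an empty list means the gate passes."""
    failures = []
    final_path = urlparse(str(after.get("finalUrl", ""))).path
    if final_path != "/dashboard":
        failures.append(f"finalUrl path is {final_path!r}, expected '/dashboard'")
    if after.get("redirectCount") != 0:
        failures.append(f"redirectCount is {after.get('redirectCount')!r}, expected 0")
    if after.get("pass") is not True:
        failures.append("worker did not record pass=true")
    # "New" console errors are ones absent from the first run.
    new_errors = set(after.get("consoleErrors", [])) - set(before.get("consoleErrors", []))
    if new_errors:
        failures.append(f"new console errors: {sorted(new_errors)}")
    # Assumption: any 5xx status in the post-fix run counts as a server error.
    server_errors = [code for code in after.get("serverStatusCodes", []) if code >= 500]
    if server_errors:
        failures.append(f"server errors: {server_errors}")
    return failures
```

The gate reads the worker's `pass` field but does not trust it alone: a run that claims `pass: true` while ending on /login still fails on the `finalUrl` check.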

Browser move

Use a CMUX browser pane or agent-browser-style workflow to take an accessibility snapshot of /login, fill the login form, click submit, wait for navigation, evaluate `window.location.pathname`, refresh, evaluate it again, and capture console/network errors. Browser proof is required here because the bug is about real navigation, cookies, and route behavior.

Implementation refs
  • workshop-followup/toolkit/prompts/08-log-gated-self-improving-loop.md
  • workshop-followup/toolkit/codex-skills/log-gated-verification/SKILL.md
  • workshop-followup/oracle/08-references.json
  • workshop-followup/oracle/08-validation.json
  • cmux read-screen
  • cmux markdown open .notes/log-gated/auth-redirect/tracker.md --direction right
  • package.json scripts for dev/test commands
  • .notes/log-gated/auth-redirect/*.json
Done when
  • At least two structured JSON run logs exist: one reproducing the original failure when possible and one post-fix verification run.
  • The supervisor comparison explicitly checks finalUrl, redirectCount, status codes, consoleErrors, and pass fields instead of trusting the worker's text.
  • The final browser state after login and refresh is /dashboard with no unexpected redirect to /login.
  • The final response lists exact commands run, files changed, and log paths under `.notes/log-gated/auth-redirect/`.
Failure modes
  • Common mistake: the agent only runs unit tests and declares success. Mitigation: require a browser flow plus JSON log fields proving finalUrl and redirectCount.
  • Common mistake: logs contain tokens, cookies, or full auth headers. Mitigation: redact secrets and store only booleans, route names, status codes, and sanitized error strings.
  • Common mistake: the agent tests a stale dev server. Mitigation: restart or verify the dev server command, capture terminal output with `cmux read-screen`, and include the server start timestamp in the log.
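
The redaction mitigation above can be applied mechanically before any log is written. The sketch below is illustrative only: the key patterns and the bearer-token regex are assumptions and will not catch every secret format, so it complements rather than replaces the rule of logging booleans and sanitized strings in the first place.

```python
import re

# Assumed patterns; extend them for the secrets your app actually handles.
SENSITIVE_KEYS = re.compile(r"(token|cookie|authorization|password|secret|session)", re.IGNORECASE)
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def redact(value, key: str = ""):
    """Recursively sanitize a log value before it is written to disk."""
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v, key) for v in value]
    if key and SENSITIVE_KEYS.search(key):
        # Booleans such as authCookiePresent carry no secret, so they are kept.
        return value if isinstance(value, bool) else "[REDACTED]"
    if isinstance(value, str):
        # Scrub bearer tokens that leak into otherwise harmless error strings.
        return BEARER.sub("Bearer [REDACTED]", value)
    return value
```

Running every run log through `redact` just before `json.dump` keeps fields like `authCookiePresent` useful to the supervisor while dropping raw cookie and header values.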
02

Checkout UI regression behind deterministic artifacts

As a frontend product engineer responsible for a checkout or upgrade flow

A pricing or checkout screen was refactored and now one browser path intermittently disables the primary CTA, misprices a plan, or shows the wrong confirmation state. The workflow needs browser validation plus structured artifacts a supervisor can compare.

As a frontend product engineer, I want an agent to verify the checkout path with browser actions and structured scenario logs, so that regressions are caught by observable artifacts instead of a subjective agent summary.
Agent brief

You are my coding agent. Build a log-gated verification loop for the checkout/pricing flow. Inspect the source files for the pricing/checkout route, the relevant components, and existing tests. Start the local app using the repo's package scripts. Create `.notes/log-gated/checkout/criteria.md` with the expected scenario fields: selectedPlan, displayedPrice, ctaEnabled, confirmationVisible, finalUrl, consoleErrors, networkFailures, and pass. Then run three browser scenarios: default plan selection, switching plans twice, and back/refresh during checkout. For each scenario, write `.notes/log-gated/checkout/<scenario-slug>.json` with timestamp, exact steps, expected fields, actual fields, terminal evidence from `cmux read-screen` when a failure occurs, and pass boolean. If any scenario fails, use the exact failure text and artifact contents to patch the code or workflow, then rerun only the failing scenario plus one happy-path scenario. Add or update a verifier script only if the repo already has a natural place for local scripts; otherwise have the supervisor agent compare the JSON files directly. The gate passes only when all expected-vs-actual fields match and the browser proof agrees with the logs.
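
When the repo has no natural home for a verifier script, the direct JSON comparison the supervisor performs might look like the sketch below. It assumes each scenario log stores its fields under `expected` and `actual` keys, which the brief implies but does not name; the function names are hypothetical.

```python
import json
from pathlib import Path

# Field names come from criteria.md in the brief.
CRITERIA_FIELDS = (
    "selectedPlan", "displayedPrice", "ctaEnabled", "confirmationVisible",
    "finalUrl", "consoleErrors", "networkFailures",
)


def mismatches(log: dict) -> dict:
    """Return {field: (expected, actual)} for each criteria field that is missing or differs."""
    expected, actual = log.get("expected", {}), log.get("actual", {})
    return {
        field: (expected.get(field), actual.get(field))
        for field in CRITERIA_FIELDS
        if field not in expected or expected[field] != actual.get(field)
    }


def gate(log_dir: Path) -> dict:
    """Compare every scenario log in log_dir and report per-file results."""
    report = {}
    for path in sorted(log_dir.glob("*.json")):
        log = json.loads(path.read_text())
        diff = mismatches(log)
        # The worker's own pass flag counts only when the comparison agrees with it.
        report[path.name] = {"mismatches": diff, "pass": not diff and log.get("pass") is True}
    return report
```

The overall gate passes only if every entry in the report has `pass` set to true; a log whose worker wrote `pass: true` despite a price mismatch is reported as failing, with the offending field and both values.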

Browser move

Use Playwright, a CMUX browser pane, or an agent-browser-style accessibility snapshot/click/evaluate sequence: open the pricing route, snapshot the buttons and plan labels, click each plan option, evaluate visible price and CTA disabled state, click the CTA, then evaluate the final confirmation URL/state. This is browser-worthy because visual state, disabled controls, and navigation are the product behavior.

Implementation refs
  • workshop-followup/toolkit/codex-skills/log-gated-verification/SKILL.md
  • workshop-followup/toolkit/prompts/08-log-gated-self-improving-loop.md
  • workshop-followup/oracle/08-interactive.json
  • cmux read-screen
  • cmux markdown open .notes/log-gated/checkout/tracker.md --direction right
  • .notes/log-gated/checkout/criteria.md
  • .notes/log-gated/checkout/*.json
  • package.json scripts for dev/test commands
Done when
  • Three scenario JSON logs exist with expected fields and actual fields.
  • A supervisor comparison marks pass only when selectedPlan, displayedPrice, ctaEnabled, confirmationVisible, finalUrl, consoleErrors, and networkFailures match the criteria.
  • The agent provides browser evidence for each scenario, either through Playwright output, accessibility snapshots, screenshots, console logs, or explicit evaluate results.
  • Any fix is tied back to the exact failing artifact and rerun evidence.
Failure modes
  • Common mistake: the agent overfits to one happy path. Mitigation: require default selection, repeated plan switching, and back/refresh behavior.
  • Common mistake: the agent writes brittle permanent UI tests for a screen still changing daily. Mitigation: keep these as `.notes/log-gated/checkout/` artifacts unless the repo already has a stable test home.
  • Common mistake: the browser log and JSON log disagree. Mitigation: make the supervisor compare both and fail the gate until the artifact mismatch is resolved.
03

Codex hook or local CLI failure hardening

As a developer maintaining a repo-local Codex hook, plugin, or CLI helper

A hook or local CLI command fails under a specific edge case, such as malformed input, missing config, unexpected event JSON, or a command usage error. The agent must capture the exact terminal failure, patch the hook/workflow, and prove the case is covered next time.

As a tooling-focused developer, I want failures from hooks or CLI helpers to become regression artifacts, so that each failure hardens the workflow instead of becoming another one-off terminal mystery.
Agent brief

You are my tooling agent. Run the self-improvement error loop on the failing hook/CLI workflow. First reproduce the failing command or hook scenario exactly. Capture the terminal state with `cmux read-screen` and save the raw text to `.notes/log-gated/tooling/failures/<timestamp>.txt`. Then apply this prompt shape to your own diagnosis: `Goal: <describe the hook or CLI contract>. Here is the exact failure: <paste captured terminal text>. Explain WHY it failed in this scenario, then PATCH the hook/workflow so this case is covered next time.` Patch the smallest relevant hook/workflow code. Add structured run artifacts under `.notes/log-gated/tooling/runs/` containing scenarioName, inputShape, commandRun, expectedOutcome, actualOutcome, errorText, timestamp, and pass. Run at least the original failing case and one nearby non-failing case. A supervisor pass must inspect the run artifacts and compare expectedOutcome vs actualOutcome. Do not claim success unless the artifacts show the original failure case now passes and the nearby case still passes.
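
The structured run artifacts named in this brief can be built by one small helper so that every run records the same fields. This is a sketch, not part of the toolkit: the helper names and the one-file-per-scenario naming are assumptions, and `pass` is derived from the comparison rather than supplied by the worker.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_run_artifact(scenario_name: str, input_shape: str, command_run: str,
                       expected_outcome: str, actual_outcome: str,
                       error_text: str = "") -> dict:
    """Build one structured run record with the fields listed in the brief."""
    return {
        "scenarioName": scenario_name,
        "inputShape": input_shape,
        "commandRun": command_run,
        "expectedOutcome": expected_outcome,
        "actualOutcome": actual_outcome,
        "errorText": error_text,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Derived, never hand-set: the supervisor can recompute it from the two fields.
        "pass": expected_outcome == actual_outcome,
    }


def write_run_artifact(artifact: dict, runs_dir: Path) -> Path:
    """Write the artifact as <scenarioName>.json (assumed naming) and return its path."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{artifact['scenarioName']}.json"
    path.write_text(json.dumps(artifact, indent=2) + "\n")
    return path
```

A typical use is one call for the original failing case and one for the nearby control case, both written under `.notes/log-gated/tooling/runs/`, so the supervisor has exactly two files to compare.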

Browser move

Browser proof is not the primary proof unless the hook controls browser-visible behavior. For a terminal hook or CLI helper, the right proof is exact terminal capture plus structured run logs. If the hook opens a CMUX browser dashboard or affects a web UI, add one browser snapshot/evaluate step to confirm the visible state matches the log.

Implementation refs
  • workshop-followup/toolkit/codex-skills/log-gated-verification/SKILL.md
  • workshop-followup/toolkit/bundles/log-gated-verification-skill.bundle.txt
  • workshop-followup/toolkit/bundles/log-gated-self-improving-loop.bundle.txt
  • workshop-followup/oracle/08-references.json
  • workshop-followup/chunks/08/gemini-analysis.v2.md
  • cmux read-screen
  • .notes/log-gated/tooling/failures/*.txt
  • .notes/log-gated/tooling/runs/*.json
Done when
  • The raw failure text is saved exactly, not paraphrased.
  • The patch is tied to a stated root cause from the captured failure.
  • At least two structured run logs exist: original failing scenario and nearby control scenario.
  • A supervisor comparison reads the artifacts and marks the gate pass only when expectedOutcome and actualOutcome match.
Failure modes
  • Common mistake: the agent patches the symptom without explaining the failure condition. Mitigation: require a root-cause note derived from the captured terminal text.
  • Common mistake: the agent creates an unbounded self-improvement loop. Mitigation: run a fixed set of scenarios and stop after the gate comparison.
  • Common mistake: the worker agent gets too much project context and wanders. Mitigation: keep the worker scoped to the hook/CLI files, logs, and the contract for this workflow.
Agent references

Validated lesson 08 references - Use as the ground-truth teaching source for terminal text extraction, self-improvement loops, log-based gates, natural-language boundaries, and context pruning.
workshop-followup/oracle/08-references.json

Existing log-gated prompt - Use as the copy-paste source pattern for capturing exact failure text, asking for root-cause hardening, and requiring structured log artifacts.
workshop-followup/toolkit/prompts/08-log-gated-self-improving-loop.md

Existing Codex skill - Use to keep scenario language aligned with the existing toolkit skill and avoid inventing untaught verification machinery.
workshop-followup/toolkit/codex-skills/log-gated-verification/SKILL.md

CMUX terminal text capture - Use to grab exact terminal output as machine-readable context before diagnosis or patching.
cmux read-screen

CMUX live tracker pane - Use when the agent should keep a visible, auto-refreshing progress view next to the running workflow.
cmux markdown open .notes/log-gated/<scenario>/tracker.md --direction right

Lesson 08 validation notes - Use to stay inside the validated taught boundary: logs/files/API calls/object shapes/timestamps/session outputs, not invented broad verifier infrastructure.
workshop-followup/oracle/08-validation.json

Interactive lesson checklist - Use for concrete practice steps and success criteria around failure capture, log comparison gates, and worker boundary contracts.
workshop-followup/oracle/08-interactive.json

Implementation notes

  • Keep the durable artifact small and boring: raw terminal capture, scenario criteria, JSON run logs, and a supervisor comparison.
  • Use `.notes/log-gated/<scenario>/` for local artifacts unless the project already has a committed test-artifact convention.
  • For web app scenarios, require browser proof when the behavior depends on cookies, navigation, visible UI state, console errors, network calls, or disabled controls.
  • For hook or CLI scenarios, terminal capture and structured logs are the primary proof; browser automation is only useful if the workflow changes browser-visible state.
  • The worker agent should patch narrowly; the orchestrator or supervisor should compare artifacts and decide pass/fail.
  • Do not trust the worker's final answer as the gate. The gate is expected-vs-actual comparison over files, logs, terminal output, browser output, or other inspectable artifacts.
  • Redact secrets from logs. Store booleans, shapes, route names, status codes, sanitized errors, and timestamps rather than raw tokens, cookies, or credentials.
  • Grounding note: content is derived from the uploaded scenario-log-gated bundle.

Lesson 09 - Background Daemons & Agent Swarms

Isolated Codex daemon profile

Toolkit artifact

Web developers constantly repeat narrow CLI work around local dev servers, deploy previews, browser QA, GitHub checks, and framework tooling. An isolated Codex daemon profile turns one repetitive tool workflow into a cheap, warm, low-context helper instead of dragging a full general agent through every small task.

01

Local browser QA daemon for a flaky signup flow

As a full-stack Next.js developer shipping an authenticated signup flow

A Next.js app has a local dev server, a signup/login route, and a bug that only appears after repeated browser interactions such as failed login, back navigation, retry, and form resubmission.

As a full-stack Next.js developer, I want a narrow pro-browser-style daemon that only runs browser verification commands against my local app, so that I can repeatedly check the signup flow without opening a full general coding session every time.
Agent brief

Create an isolated Codex daemon profile named pro-browser-qa for the repo's browser verification CLI or script. First inspect the current repo for existing browser test commands, Playwright scripts, or agent-browser usage; do not invent a new stack if one already exists. Write a single executable Bun TypeScript file at daemons/pro-browser-qa that calls runProfile() from lib/isolated.ts. Use the developerInstructions shape: Operating rule, Command map, Workflow, Command rules, Output. The operating rule must require running the browser QA command through exec_command before any final answer. The command map should include only the local commands this repo actually has, such as dev server health check, run signup flow, run login flow, capture console output, capture screenshot, and help. Keep the map literal; do not paste full --help output. Mutations are not allowed: this daemon verifies and reports only. After creating it, run chmod +x daemons/pro-browser-qa, then run bun daemons/pro-browser-qa --help. Only if the repo already uses that executable profile pattern, add the daemon to package.json bin and run bun link. Expected output: a terse report with command run, route tested, pass/fail, browser evidence path, console errors, and next action.

Browser move

Use a local browser verification move because this is a browser-visible bug: start the app with the repo's dev command, then have the daemon run the existing browser script or agent-browser-style flow to snapshot the signup page, fill the email/password fields, click submit, navigate back, retry, and collect screenshot plus console output. If CMUX is available, open the app in a browser pane and keep the daemon terminal visible beside it.

Implementation refs
  • workshop-followup/toolkit/prompts/09-isolated-codex-daemon-profile.md
  • workshop-followup/toolkit/codex-skills/codex-daemon-profile/SKILL.md
  • workshop-followup/toolkit/bundles/code-codex-daemons.md
  • daemons/pro-<TOOL>
  • lib/isolated.ts
  • package.json
  • chmod +x daemons/pro-browser-qa
  • bun daemons/pro-browser-qa --help
  • bun link
Done when
  • daemons/pro-browser-qa exists, is executable, and calls runProfile() from lib/isolated.ts
  • bun daemons/pro-browser-qa --help runs without crashing
  • The profile prompt has Operating rule, Command map, Workflow, Command rules, and Output sections
  • The daemon performs read-only browser verification before answering
  • A verification receipt includes tested URL, exact command, pass/fail, screenshot or trace path, console errors, and any failed selector or step
  • package.json bin and bun link are updated only if the repo uses that executable profile pattern
Failure modes
  • Mistake: the agent builds a broad browser-and-code-editing assistant. Mitigation: keep this daemon browser-QA-only and read-only, with no file edits and no web browsing beyond the local app.
  • Mistake: the daemon answers from memory or source inspection without running the browser flow. Mitigation: the operating rule must require exec_command first and the final answer must name the command and artifact path.
  • Mistake: the profile dumps full Playwright or tool help into developerInstructions. Mitigation: use a curated keyword-to-command map and fall back to --help only on syntax errors.
  • Mistake: authenticated flow assumes normal browser cookies exist. Mitigation: script login/setup explicitly or use a seeded test account path documented in the repo.
02

Preview deployment triage daemon

As a frontend lead reviewing preview builds before merge

A pull request deploys a preview URL, but every review requires the same narrow checks: list the current deployment, inspect status, open the preview route, confirm key pages render, and summarize only the blocking issues.

As a frontend lead, I want a pro-deploy-check daemon around my deployment CLI, so that preview triage is fast, repeatable, and limited to deployment status plus browser proof instead of turning into a broad coding session.
Agent brief

Create an isolated Codex daemon profile named pro-deploy-check for the repo's deployment CLI or existing preview-check script. Inspect package.json and repo scripts first. If the repo already uses a specific deploy CLI, wrap that one CLI only. Write daemons/pro-deploy-check as a single executable Bun TypeScript file using runProfile() from lib/isolated.ts. The daemon must start with the narrowest read-only command: deployment status or preview URL discovery. The command map should cover status, current preview, logs, inspect URL, and help, using exact repo commands. For mutations, require explicit target and action; default to read-only. Disable broad behavior in the prompt: no web search, no image generation, no unrelated file edits, no apply_patch unless explicitly asked. After generation, run chmod +x daemons/pro-deploy-check, bun daemons/pro-deploy-check --help, test one real prompt such as 'status for the current PR preview', then add package.json bin and bun link if appropriate. Expected output: current deployment state, preview URL, failing route if any, command evidence, and browser proof path.

Browser move

Use browser proof after the daemon discovers the preview URL: open the preview in a CMUX browser pane or run the repo's browser automation to snapshot the homepage and one changed route, then record status code, visible error text, console errors, and screenshot path. Browser proof is useful here because deployment status alone does not prove the UI renders.

Implementation refs
  • workshop-followup/toolkit/codex-skills/codex-daemon-profile/SKILL.md
  • workshop-followup/toolkit/bundles/isolated-codex-daemon-profile.bundle.txt
  • workshop-followup/toolkit/bundles/code-codex-daemons.md
  • package.json scripts
  • daemons/pro-deploy-check
  • lib/isolated.ts
  • chmod +x daemons/pro-deploy-check
  • bun daemons/pro-deploy-check --help
  • bun link
Done when
  • The profile wraps one deployment or preview-check CLI only
  • The first command is read-only status or preview discovery
  • bun daemons/pro-deploy-check --help succeeds
  • A real prompt returns deployment status, preview URL, and exact commands run
  • Browser evidence exists for at least the homepage and one changed route
  • The final report is terse and separates blocker, warning, and clean result
Failure modes
  • Mistake: the agent creates a general CI/CD bot with many tools and broad permissions. Mitigation: one CLI, one profile, read-only first, mutate only when target plus action are explicit.
  • Mistake: it trusts a green deployment status without checking the rendered app. Mitigation: require browser proof for at least one preview route before final answer.
  • Mistake: it hardcodes a vendor-specific command not present in the repo. Mitigation: inspect package.json and existing scripts first, then map only verified local commands.
  • Mistake: it bloats context with full docs or --help output. Mitigation: keep a small curated command map and use --help only as a retry path.
03

Create a focused accessibility audit daemon

As a frontend platform engineer supporting multiple product teams

Several teams ship React pages with recurring accessibility regressions: missing form labels, buttons without accessible names, skipped headings, and modals that trap focus inconsistently. The work is repetitive and command-driven, but each audit still needs a concise human-readable report.

As a frontend platform engineer, I want a narrow Codex daemon that runs the project accessibility checks and browser proof steps, so that teams get consistent reports without asking a general assistant to rediscover the workflow each time.
Agent brief

Build or use an isolated Codex daemon profile for accessibility triage only. Keep the command map narrow: inspect package scripts, run the existing accessibility or browser-test command if present, start the local dev server only when required, open the target route in a browser, capture an accessibility snapshot, and report missing labels/headings/focus problems with file or route pointers. The daemon should write one markdown report under `.notes/a11y/<route-slug>.md` with commands run, browser route, snapshot summary, issues found, false positives, and recommended next fixes. It must not edit files unless explicitly asked after the audit report is reviewed.
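
One way to keep the report location and section order consistent across teams is to derive both from the route. The sketch below is illustrative only: `routeSlug`, `reportPath`, `renderReport`, and the `A11yReport` shape are hypothetical names, and the choice of `root` for the `/` route is an assumption.

```typescript
// Hypothetical report shape mirroring the sections named in the brief.
interface A11yReport {
  route: string;
  commands: string[];
  snapshotSummary: string;
  issues: string[];
  falsePositives: string[];
  nextFixes: string[];
}

// Turns a route into a filename-safe slug; "/" becomes "root" (an assumption).
function routeSlug(route: string): string {
  const slug = route
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug === "" ? "root" : slug;
}

function reportPath(route: string): string {
  return `.notes/a11y/${routeSlug(route)}.md`;
}

// Renders the report as markdown; empty lists are written as "- none" so gaps stay visible.
function renderReport(report: A11yReport): string {
  const list = (items: string[]): string =>
    items.length > 0 ? items.map((item) => `- ${item}`).join("\n") : "- none";
  return [
    `# Accessibility audit: ${report.route}`,
    `## Commands run\n${list(report.commands)}`,
    `## Snapshot summary\n${report.snapshotSummary}`,
    `## Issues found\n${list(report.issues)}`,
    `## False positives\n${list(report.falsePositives)}`,
    `## Recommended next fixes\n${list(report.nextFixes)}`,
  ].join("\n\n");
}
```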

Browser move

Use browser proof because accessibility is runtime UI behavior. Open the target localhost route, take an accessibility snapshot, click or tab through the primary form/modal/menu path, evaluate focus state where useful, and save the route plus snapshot summary in the daemon report. If no browser surface is available, state that runtime accessibility proof is missing rather than claiming success from source inspection.

Implementation refs
  • workshop-followup/toolkit/codex-skills/codex-daemon-profile/SKILL.md
  • workshop-followup/toolkit/prompts/09-isolated-codex-daemon-profile.md
  • package.json scripts
  • target app route/component files
  • cmux browser snapshot or existing Playwright accessibility checks
Done when
  • The daemon profile has one narrow responsibility and a small command map.
  • A markdown report exists under `.notes/a11y/` with exact commands, route, browser proof, and prioritized issues.
  • The daemon did not edit source files during audit-only mode.
  • The report distinguishes confirmed runtime issues from possible source-level concerns.
Failure modes
  • The daemon becomes a broad product assistant; mitigate by keeping its profile limited to accessibility triage commands and reports.
  • The daemon claims accessibility from static source reading; mitigate by requiring browser snapshot/focus evidence for runtime findings.
  • The daemon mutates code during audit; mitigate by making audit-only mode the default and requiring explicit follow-up approval for fixes.
Agent references

Codex daemon profile skill - Use as the implementation contract for when to trigger this technique, profile shape, prerequisites, isolation behavior, and guardrails.
workshop-followup/toolkit/codex-skills/codex-daemon-profile/SKILL.md

Isolated profile prompt - Use as the copy-paste user-facing prompt pattern for generating a new pro-<TOOL> profile.
workshop-followup/toolkit/prompts/09-isolated-codex-daemon-profile.md

Verified codex-daemons command patterns - Use for the real executable file shape, runProfile() call, ProfileConfig fields, isolation keys, token-budget notes, and after-generation commands.
workshop-followup/toolkit/bundles/code-codex-daemons.md

Lesson 09 validated references - Use as the boundary of what the lesson actually taught: background daemons, low-reasoning iteration, small context, self-improving daemons, SDK graphs, and multimodal verification loops.
workshop-followup/oracle/09-references.json

Lesson 09 validation notes - Use to avoid overclaiming unsupported isolation details as transcript-taught concepts while still allowing verified repo patterns when implementing the toolkit artifact.
workshop-followup/oracle/09-validation.json

Build and verification target - Use after adding the playbook to confirm parse-valid JSON, rebuild the site, and inspect the scenario page.
rebuild site, then verify /scenarios.html

Implementation notes

  • Keep each daemon wrapped around exactly one CLI or one repo-local script family. The whole point is a narrow helper, not another giant assistant.
  • Use the runProfile() shape from lib/isolated.ts: one executable Bun TypeScript file, low reasoning by default, literal developerInstructions, and a curated command map.
  • The prompt order matters: Operating rule first, then Command map, then Workflow, Command rules, and Output.
  • Make the daemon run a command before answering. A daemon that summarizes from memory is just a small hallucination machine with a fancy name.
  • Default to read-only discovery. Require explicit target plus action for mutations, especially deploy, Docker, database, cloud, or file-editing tools.
  • Do not paste full --help output into instructions. Keep the command map tight and use --help only when syntax is uncertain or a usage error occurs.
  • For web app work, pair command receipts with browser receipts when the user-visible app matters: local URL, route, action sequence, screenshot or trace path, console errors, and pass/fail.
  • Treat self-improvement as experimental and narrow. It fits predictable CLI/API failure modes better than broad product implementation.
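
The prompt order from the notes above can be enforced by assembling the instruction text from named sections instead of writing it freehand. This is a sketch under assumptions: `buildInstructions` and `SectionName` are hypothetical helpers, and how `runProfile()` consumes the resulting `developerInstructions` string is not shown here.

```typescript
// Section order taken from the implementation notes; helper names are hypothetical.
const SECTION_ORDER = [
  "Operating rule",
  "Command map",
  "Workflow",
  "Command rules",
  "Output",
] as const;

type SectionName = (typeof SECTION_ORDER)[number];

// Requires every section and always emits them in the fixed order.
function buildInstructions(sections: Record<SectionName, string>): string {
  return SECTION_ORDER
    .map((name) => `${name}:\n${sections[name].trim()}`)
    .join("\n\n");
}
```

Because the parameter type requires all five keys, a missing section becomes a compile-time error rather than a silently shorter prompt.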
Lesson 10 - Codex Daemons & Overnight Crons

Overnight emulated-cron loop

Toolkit artifact

Web developers often need long-running investigation without a real scheduler: flaky auth, broken localhost flows, intermittent console errors, or UI drift that only appears after repeated checks. This pattern gives an agent a bounded timestamp loop, runtime evidence, and a hard stop so overnight work produces receipts instead of vague claims.

01

Overnight flaky auth-flow watchdog

As a product web developer maintaining a Next.js app with login-gated dashboards

A local Next.js dashboard sometimes loses authenticated state after refresh, route changes, or idle time. The team has a dev server, a test account, and a browser-capable agent, but no built-in Codex scheduler.

As a product web developer, I want an agent to re-check the login and dashboard flow every 10 minutes overnight, so that I wake up to concrete logs, screenshots, and a narrowed failure theory instead of a one-off reproduction attempt.
Agent brief

Create `.goals/overnight-auth-watch.md` and run it as a bounded /goal-style task. Goal: record a starting timestamp, then every 10 minutes until 7:00 AM local time or 24 iterations, whichever comes first, verify the local auth flow against the running app. Stay read-only unless I explicitly approve a mutation. First inspect the project scripts and start or reuse the dev server. For each iteration, capture the context triad: intent from this goal file, implementation from the smallest relevant auth/session/routes files, and output from browser console logs, network or terminal errors, screenshots, and a short iteration summary. Write artifacts under `.notes/overnight-auth-watch/` using zero-padded, iteration-numbered filenames such as `iteration-003.json`, `iteration-003-console.log`, and `iteration-003-screenshot.png`, and record the capture timestamp inside each iteration JSON. At the end, write `.notes/overnight-auth-watch/final-report.md` with pass/fail counts, first failing iteration, suspected mismatch between intent/implementation/output, and the exact evidence files to inspect. Do not declare success without inspecting the written artifacts.
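
The two stop conditions and the artifact naming in this brief reduce to a few pure functions. The goal file itself stays natural language; this sketch only shows the arithmetic an agent or helper script might apply. The function names are hypothetical.

```typescript
// Returns the next occurrence of the given local hour strictly after the start time.
function nextClockStop(start: Date, hour: number): Date {
  const stop = new Date(start);
  stop.setHours(hour, 0, 0, 0);
  if (stop.getTime() <= start.getTime()) {
    stop.setDate(stop.getDate() + 1); // the hour already passed today, so use tomorrow
  }
  return stop;
}

// Stops on whichever bound is reached first: clock time or iteration cap.
function shouldStop(now: Date, stopAt: Date, completed: number, maxIterations: number): boolean {
  return now.getTime() >= stopAt.getTime() || completed >= maxIterations;
}

// Zero-padded names keep the artifacts sorted in iteration order.
function artifactNames(iteration: number): { summary: string; consoleLog: string; screenshot: string } {
  const id = `iteration-${String(iteration).padStart(3, "0")}`;
  return {
    summary: `${id}.json`,
    consoleLog: `${id}-console.log`,
    screenshot: `${id}-screenshot.png`,
  };
}
```

With a 10-minute interval, 24 iterations cover about four hours, so a run started late in the evening will usually hit the iteration cap before 7:00 AM.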

Browser move

Use an agent-browser-style flow against localhost: take an accessibility snapshot of the login page, fill the test credentials, click submit, wait for the dashboard route, evaluate `document.title`, visible route markers, and any session/banner text, then refresh and navigate away/back. Capture console output and a screenshot per iteration. If using a CMUX browser pane, keep the browser pane visible while the terminal pane writes artifacts.

Implementation refs
  • workshop-followup/toolkit/prompts/10-overnight-codex-cron-loop.md
  • workshop-followup/oracle/10-references.json
  • workshop-followup/oracle/10-validation.json
  • .goals/overnight-auth-watch.md
  • .notes/overnight-auth-watch/
  • packx --preview -s "auth session dashboard runtime artifacts"
  • packx -s "auth session dashboard runtime artifacts" -f markdown --no-interactive -o .notes/overnight-auth-watch/debug-bundle.md
Done when
  • A bounded goal file exists with starting timestamp, 10-minute interval, and hard stop at 7:00 AM or 24 iterations.
  • Each iteration writes a JSON/log/screenshot receipt under `.notes/overnight-auth-watch/`.
  • `final-report.md` compares expected auth behavior, relevant implementation files, and actual runtime output, with links or paths to the strongest evidence.
  • No source files are modified unless a separate human-approved implementation step is started.
Failure modes
  • The agent only reports that the flow worked without saving browser evidence; mitigate by requiring timestamped console logs, screenshots, and iteration JSON before any final answer.
  • The loop keeps running because the stop condition is written vaguely; mitigate by including both a target clock time and an iteration cap in the goal text.
  • The browser agent assumes existing cookies are available; mitigate by scripting the login path explicitly with a test account and treating cookies as non-portable.
  • The agent bundles the whole repo instead of the relevant auth implementation plus output artifacts; mitigate by running `packx --preview` first and keeping the bundle focused.
02

Repeated visual-regression and console-smoke loop

As a frontend engineer responsible for a marketing site or app shell

A responsive landing page looks fine on the first pass, but layout drift, hydration warnings, or console errors appear after rebuilds, route changes, or viewport changes. The developer wants an overnight evidence loop, not a permanent test-suite expansion.

As a frontend engineer, I want an agent to periodically open the app in the browser, inspect key routes and viewports, and save proof artifacts, so that visual or console regressions are grounded in screenshots and logs.
Agent brief

Create `.goals/overnight-ui-smoke.md`. Record the starting timestamp. Every 15 minutes, until 6 hours have elapsed or 20 iterations are complete, run a read-only UI smoke routine against the local web app. Inspect `package.json` for the dev and build commands. Start or reuse the dev server. Visit the agreed routes: `/`, `/pricing`, and `/dashboard` if present; if a route does not exist, record that instead of inventing one. For each route, capture console errors/warnings, one screenshot at desktop width, one screenshot at mobile width if the browser tool supports viewport changes, and a short artifact summary. Write files under `.notes/overnight-ui-smoke/`. If an error appears, bundle intent, implementation, and output with PackX: goal file, smallest route/component files, console logs, screenshots, and terminal output. End with `.notes/overnight-ui-smoke/final-report.md` listing stable routes, failing routes, first failure time, and the likely mismatch.
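
The "record a missing route instead of inventing one" rule and the optional mobile viewport can be made explicit by planning each iteration before opening the browser. This is a sketch with hypothetical names (`planSmoke`, `SmokePlan`); how the set of existing routes is discovered is left to the repo.

```typescript
type Viewport = "desktop" | "mobile";

interface PlannedCheck {
  route: string;
  viewport: Viewport;
}

interface SmokePlan {
  checks: PlannedCheck[];
  missingRoutes: string[];      // agreed routes that do not exist in the app
  skippedViewports: Viewport[]; // viewports the browser tool could not emulate
}

// Builds the per-iteration plan; nothing is invented for routes that are absent.
function planSmoke(agreedRoutes: string[], existingRoutes: Set<string>, canResize: boolean): SmokePlan {
  const viewports: Viewport[] = canResize ? ["desktop", "mobile"] : ["desktop"];
  const checks: PlannedCheck[] = [];
  const missingRoutes: string[] = [];
  for (const route of agreedRoutes) {
    if (!existingRoutes.has(route)) {
      missingRoutes.push(route);
      continue;
    }
    for (const viewport of viewports) {
      checks.push({ route, viewport });
    }
  }
  return { checks, missingRoutes, skippedViewports: canResize ? [] : ["mobile"] };
}
```

Keeping `missingRoutes` and `skippedViewports` separate from `checks` lets the final report distinguish real browser failures from things that were never testable.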

Browser move

Use browser verification as the proof, not source inspection alone: open each localhost route, take an accessibility snapshot to confirm the main landmark/header/CTA exists, evaluate `window.location.pathname` and visible route text, collect console errors, and save screenshots. In CMUX, keep a browser pane beside the agent terminal or open a live Markdown tracker with the latest iteration summary.

Implementation refs
  • workshop-followup/toolkit/prompts/10-overnight-codex-cron-loop.md
  • workshop-followup/oracle/10-references.json
  • workshop-followup/chunks/10/gemini-analysis.v2.md
  • .goals/overnight-ui-smoke.md
  • .notes/overnight-ui-smoke/
  • package.json
  • packx --preview -s "ui smoke screenshots console logs routes components"
  • packx -s "ui smoke screenshots console logs routes components" -f markdown --no-interactive -o .notes/overnight-ui-smoke/debug-bundle.md
Done when
  • The loop is bounded by a 15-minute interval, 6-hour duration, and 20-iteration cap.
  • Every checked route has saved browser evidence, not just a textual claim.
  • Any failure has a focused context triad bundle: goal text, relevant route/component implementation, and runtime output.
  • The final report separates actual browser failures from missing routes or unsupported viewport operations.
Failure modes
  • The agent treats screenshots as optional and only reads source; mitigate by requiring browser artifacts as the primary receipt for this scenario.
  • The agent turns the smoke loop into broad redesign work; mitigate by stating read-only investigation only and deferring mutations to a separate Builder goal.
  • The agent over-tests unstable UI experiments as if they are permanent contracts; mitigate by framing this as disposable overnight evidence, not a maintained regression suite.
  • The bundle is too large to inspect usefully; mitigate by previewing PackX and including only goal, changed or relevant source, logs, screenshots, and traces.
03

Overnight dependency-upgrade investigation brief

As a tech lead preparing a risky dependency or framework upgrade

A repo has a pending upgrade, failing tests, or intermittent build warnings. The lead wants a Researcher/Validator-style overnight investigation that collects evidence and proposes next steps, but does not mutate the repo while unattended.

As a tech lead, I want an agent to periodically run read-only checks and package the strongest evidence, so that the next morning I can decide whether to hand a focused implementation goal to a Builder agent.
Agent brief

Create `.goals/overnight-upgrade-investigation.md`. This is a read-only overnight research and verification loop. Record the starting timestamp. Every 20 minutes until 4 hours have elapsed or 12 iterations are complete, inspect the current dependency-upgrade state. Run only read-only or normal verification commands discovered from the repo, such as package-manager install status checks, typecheck, lint, test, or build commands already present in scripts. Do not edit package files or source files. Save terminal output to `.notes/overnight-upgrade-investigation/iteration-###.log` and a structured summary to `iteration-###.json`. When failures appear, capture the context triad: the goal text, relevant package/config/source files, and actual command output. Use PackX only after previewing the candidate files. Finish with `final-report.md` containing a ranked list of blockers, commands run, evidence paths, and a recommended next goal file for implementation.

Browser move

Browser proof is only useful if the upgrade affects a running web surface. If the app starts successfully, open the local app once per iteration, capture console errors and a homepage screenshot, and save them beside the command logs. If the failure is purely install/typecheck/build-level, browser proof is the wrong kind of evidence; use terminal logs and PackX bundles instead.

Implementation refs
  • workshop-followup/oracle/10-references.json
  • workshop-followup/oracle/10-interactive.json
  • workshop-followup/toolkit/prompts/10-overnight-codex-cron-loop.md
  • .goals/overnight-upgrade-investigation.md
  • .notes/overnight-upgrade-investigation/
  • package.json
  • pnpm-lock.yaml, package-lock.json, yarn.lock, bun.lock, or bun.lockb if present
  • packx --preview -s "dependency upgrade failing build test lint output"
  • packx -s "dependency upgrade failing build test lint output" -f markdown --no-interactive -o .notes/overnight-upgrade-investigation/debug-bundle.md
Done when
  • The investigation uses a timestamped loop with a 20-minute interval, 4-hour duration, and 12-iteration cap.
  • All verification commands and outputs are saved under `.notes/overnight-upgrade-investigation/`.
  • The final report ranks blockers by evidence strength and points to exact log or bundle files.
  • The repo remains unmodified except for local `.notes/` and `.goals/` artifacts.
Failure modes
  • The agent starts fixing files during an unattended investigation; mitigate by stating the read-only rule explicitly in the goal file and requiring a separate implementation goal for any mutation.
  • The agent dumps full command help or the entire repo into context; mitigate by inspecting project scripts first and using PackX preview with a focused topic.
  • The agent reports a generic dependency recommendation without command evidence; mitigate by requiring saved logs and expected-versus-actual comparison.
  • The loop wastes credits rerunning the same failing command with no new information; mitigate by writing a per-iteration summary and stopping early if three consecutive iterations produce identical failure output.
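
The early-stop rule in the last failure mode can be sketched as a check over the most recent failure outputs. The function name is hypothetical, and the sketch assumes the caller records an empty string for a passing iteration and strips timestamps or other run-specific noise before comparing.

```typescript
// Returns true when the last `window` iterations all failed with identical output.
// Assumes "" means the iteration passed and that outputs are already normalized.
function shouldStopEarly(failureOutputs: string[], window: number = 3): boolean {
  if (failureOutputs.length < window) return false;
  const recent = failureOutputs.slice(-window).map((output) => output.trim());
  if (recent[0] === "") return false; // passing iterations never trigger an early stop
  return recent.every((output) => output === recent[0]);
}
```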
Agent references

Existing overnight cron prompt - Use as the exact taught prompt shape: starting timestamp, repeated interval, hard stop, context triad, and read-only guardrail.
workshop-followup/toolkit/prompts/10-overnight-codex-cron-loop.md

Lesson 10 validated references - Treat as ground truth for what was actually taught: context triad, QA artifacts, short goal files, Oracle-generated goal files, specialized agents, overnight timestamp loops, and loose coupling.
workshop-followup/oracle/10-references.json

Lesson 10 validation guardrails - Use to avoid reintroducing removed or over-specific claims such as warm daemon internals, profile isolation, or unrelated course redesign.
workshop-followup/oracle/10-validation.json

Lesson 10 interactive checklist - Use for implementation-ready receipts: intent.md, execution logs, screenshots, traces, PackX bundle, and fixed stop condition.
workshop-followup/oracle/10-interactive.json

PackX focused bundle flow - Use when a failure needs a model-readable bundle of intent, implementation, and runtime output; always preview before writing the bundle.
packx --preview -s "<topic>" && packx -s "<topic>" -f markdown --no-interactive -o .notes/<scenario>/debug-bundle.md

Agent Browser evidence pattern - Use for web-app runtime proof: console logs, screenshots, and live browser state rather than source-only guessing.
https://www.npmjs.com/package/agent-browser

Uploaded scenario bundle - Use as the local source snapshot for the scenario pipeline and lesson/toolkit context.
/mnt/data/scenario-overnight-cron.txt

Implementation notes

  • Keep the loop as a natural-language goal file, not a production scheduler. The point is bounded overnight agent work, not pretending Codex has cron.
  • Every scenario must include a hard stop: target clock time, elapsed duration, iteration cap, or budget cap. Use at least two bounds for unattended work.
  • For web app scenarios, runtime output means browser evidence: console logs, screenshots, route state, traces, or visible UI checks. Source code alone is not enough.
  • Prefer read-only discovery overnight. Save implementation for a separate morning Builder goal after a human reviews the evidence.
  • Write artifacts to local `.notes/` and goal files to `.goals/`; keep them out of committed source unless the repo intentionally tracks them.
  • Use specialized responsibilities without turning them into ceremony: Researcher/Validator overnight, Builder later if the evidence supports a change.
  • Stop early when repeated iterations produce identical evidence and no new signal; record that early stop in the final report.
  • Redact secrets from logs, screenshots, and browser captures before bundling or sharing.
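
For text artifacts, the redaction step can be a small pass over logs before they are bundled. The patterns below are illustrative assumptions and not an exhaustive secret detector; screenshots and browser captures still need a manual review. `SECRET_PATTERNS` and `redact` are hypothetical names.

```typescript
// Illustrative patterns only: bearer tokens and common key/value secrets.
const SECRET_PATTERNS: RegExp[] = [
  /(authorization:\s*bearer\s+)[^\s"']+/gi,
  /((?:api[_-]?key|token|secret|password)["']?\s*[=:]\s*["']?)[^\s"'&]+/gi,
];

// Keeps the key name for debugging context and replaces only the value.
function redact(text: string): string {
  return SECRET_PATTERNS.reduce(
    (current, pattern) => current.replace(pattern, "$1[REDACTED]"),
    text,
  );
}
```

Keeping the key name in place means a reviewer can still see that a token was present on a failing request without seeing its value.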
Lesson 11 - Code as a Throwaway Artifact

Video as requirements

Toolkit artifact

Web developers already debug behavior in running apps, not just in source files. A narrated screen recording gives an agent the missing layer: what the user clicked, typed, expected, noticed, and wanted changed, then turns that into a concrete QA story and implementation brief.

01

Convert a narrated checkout bug into a fixable QA story

As a frontend engineer maintaining a Next.js commerce app

A product manager records a two-minute walkthrough on localhost showing a cart drawer, promo-code input, shipping estimate panel, and checkout button. They narrate that the drawer jumps on mobile, the promo error disappears too fast, and the checkout button should stay disabled until shipping is selected.

As a mobile shopper, I want the cart drawer, promo-code error, shipping estimate, and checkout enabled state to behave predictably, so that I can recover from mistakes and complete checkout without guessing what changed.
Agent brief

You are implementing from a narrated video requirements artifact. First inspect the current app source and existing tests; do not trust stale notes. Use the video-derived QA story as the natural-language contract. Create or update `.goals/cart-video-requirements.md` with: user flow, acceptance criteria, responsive behavior, edge cases, and what evidence will prove the fix. Then implement only the behavior described: stable mobile drawer layout, persistent promo-code error state until corrected or dismissed, shipping selection as the gate for enabling checkout, and no unrelated redesign. Run the project’s normal lint/test/build commands discovered from package scripts. Start the local dev server and verify the flow in a browser. Produce a final review artifact containing the git diff summary, commands run, browser proof, and any remaining risks.
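
Two of the behaviors in this contract, the persistent promo error and the shipping gate, are state rules that can be written down before touching components. The reducer below is a sketch under assumptions: the state shape, event names, and `cartReducer` are hypothetical and would need to be mapped onto whatever state ownership the app already has.

```typescript
// Hypothetical state and events; map these onto the app's existing cart state.
interface CartState {
  promoError: string | null;
  shippingOptionId: string | null;
}

type CartEvent =
  | { type: "promoRejected"; message: string }
  | { type: "promoAccepted" }
  | { type: "promoErrorDismissed" }
  | { type: "shippingSelected"; optionId: string }
  | { type: "shippingCleared" };

// The promo error only clears on an accepted code or an explicit dismissal, never on a timer.
function cartReducer(state: CartState, event: CartEvent): CartState {
  switch (event.type) {
    case "promoRejected":
      return { ...state, promoError: event.message };
    case "promoAccepted":
    case "promoErrorDismissed":
      return { ...state, promoError: null };
    case "shippingSelected":
      return { ...state, shippingOptionId: event.optionId };
    case "shippingCleared":
      return { ...state, shippingOptionId: null };
  }
}

// Checkout stays disabled until a shipping option is selected.
function canCheckout(state: CartState): boolean {
  return state.shippingOptionId !== null;
}
```

The same two rules double as browser checks: the error text should still be visible after unrelated interactions, and the checkout button should flip to enabled only after shipping is chosen.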

Browser move

Use a local browser or CMUX browser pane against the dev server. Capture an accessibility snapshot of the cart drawer, click the cart trigger, fill the promo-code field with an invalid code, click apply, resize or emulate mobile width, select shipping, and evaluate whether the checkout button disabled/enabled state matches the video-derived contract. Save a screenshot or concise browser log as proof.

Implementation refs
  • workshop-followup/toolkit/prompts/11-video-as-requirements-prompt.md
  • workshop-followup/toolkit/codex-skills/video-as-requirements/SKILL.md
  • workshop-followup/oracle/11-references.json
  • workshop-followup/oracle/11-validation.json
  • workshop-followup/oracle/lessons.json
  • package.json scripts: test, lint, build, dev
  • git diff
Done when
  • A checked-in or local review artifact that restates the video-derived QA story and acceptance criteria
  • Passing lint/test/build output from the repo’s own scripts
  • Browser proof showing invalid promo behavior, mobile drawer stability, shipping selection, and checkout button state
  • A git diff review summary focused on the changed modules and any tests added or updated
Failure modes
  • The agent treats the video as a brittle click recorder and hard-codes selectors or exact timing; mitigate by extracting intent, user actions, expected behavior, responsive behavior, and edge cases before coding.
  • The agent redesigns the cart instead of fixing the contract; mitigate by requiring source inspection first and limiting changes to behavior described by the narrated walkthrough.
02

Turn a dashboard walkthrough into responsive UI requirements

As a full-stack web developer owning a SaaS analytics dashboard

A customer success lead records a narrated walkthrough of a dashboard where they filter by date range, expand a table row, collapse search results on narrow screens, and point out that the summary cards reorder badly on tablet widths.

As an account manager reviewing customer metrics, I want filters, expandable rows, collapsed search results, and responsive summary cards to preserve context across screen sizes, so that I can investigate an account without losing my place.
Agent brief

Use the narrated walkthrough as requirements, not as a polished spec. Extract a QA story with the exact user flow, expected visual/layout behavior, responsive breakpoints observed in the video, and edge cases. Inspect the current dashboard code, routing, state ownership, and any existing browser tests. Implement the smallest change that preserves filter state, keeps expanded row context stable, makes search results collapse predictably on narrow screens, and improves tablet card ordering without changing unrelated dashboard styling. Run the repo’s normal verification commands. Then run the dashboard in a browser and prove the flow works at desktop, tablet, and mobile widths. Final output must include the diff summary plus browser evidence, not a generic success claim.
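
Before implementation starts, it can help to check that the extracted QA story actually has every part the brief asks for. The `QaStory` shape and `missingParts` helper below are hypothetical; the field names simply mirror the four parts named in this scenario.

```typescript
// Hypothetical QA story shape mirroring the parts named in the brief.
interface QaStory {
  userActions: string[];
  expectedBehavior: string[];
  responsiveBehavior: string[];
  edgeCases: string[];
}

// Lists the story parts that are still empty, in a fixed order.
function missingParts(story: QaStory): (keyof QaStory)[] {
  const parts: (keyof QaStory)[] = [
    "userActions",
    "expectedBehavior",
    "responsiveBehavior",
    "edgeCases",
  ];
  return parts.filter((part) => story[part].length === 0);
}
```

An empty `responsiveBehavior` list, for example, is a signal to rewatch the tablet segment of the walkthrough before writing any layout code.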

Browser move

Use Playwright if the repo already has it, otherwise use a local browser or agent-browser-style workflow: snapshot the dashboard, click the date filter, fill or select the target range, expand a row, resize to tablet and mobile, and evaluate whether the visible labels, expanded state, and collapsed search results match the QA story. Do not add a new browser framework just for this scenario unless the repo already uses one.

Implementation refs
  • workshop-followup/toolkit/bundles/video-as-requirements-skill.bundle.txt
  • workshop-followup/toolkit/bundles/video-as-requirements-prompt.bundle.txt
  • workshop-followup/oracle/11-references.json
  • workshop-followup/oracle/11-validation.json
  • workshop-followup/chunks/11/gemini-analysis.v2.md
  • AGENTS.md or agents.md if present
  • package.json scripts
  • git diff
Done when
  • A video-derived QA story with user actions, expected behavior, responsive behavior, and edge cases
  • Passing project verification commands
  • Browser proof at desktop, tablet, and mobile widths
  • A diff-driven review note explaining which state/module boundaries changed and why
Failure modes
  • The agent over-specifies verification mechanics instead of outcomes; mitigate by keeping acceptance criteria outcome-based and using browser proof only as evidence.
  • The agent ignores current code structure and adds duplicate dashboard state; mitigate by requiring source-of-truth code inspection and preserving existing module/state boundaries where practical.
03

Use a video bug report to update guardrails after the fix

As a tech lead supervising agent-assisted feature work

A teammate records a narrated video of a localhost onboarding flow where invite acceptance, workspace creation, and the empty-state screen drift from the intended product contract. Previous agents have fixed similar flows but keep missing the same auth and empty-state constraints.

As a team admin accepting an invite, I want the onboarding flow to preserve auth context, create or join the right workspace, and show the correct empty state, so that first-run setup feels trustworthy.
Agent brief

Treat the video as evidence for a natural-language contract. Extract the intended onboarding states and transitions from the narration: invite opened, auth checked, workspace joined or created, empty state shown, and error or retry states if present. Inspect the current code and identify the modules that own auth, workspace creation, routing, and empty-state rendering. Implement the smallest fix that satisfies the contract. After verification, review the git diff as the primary artifact. Then update `agents.md` or `AGENTS.md` only if the mistake reveals a reusable project-specific guardrail, such as where onboarding state must live or which module owns invite acceptance. Keep the guardrail short and grounded in this actual mistake.
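
The onboarding states and transitions named in the brief can be captured as a small transition table and used to check the path observed in the browser. This is a sketch: the state names follow the brief, but the allowed transitions, especially retry returning to the invite, are assumptions to confirm against the narration.

```typescript
type OnboardingState =
  | "inviteOpened"
  | "authChecked"
  | "workspaceJoined"
  | "workspaceCreated"
  | "emptyStateShown"
  | "error";

// Assumed transitions; confirm each edge against the narrated walkthrough.
const transitions: Record<OnboardingState, OnboardingState[]> = {
  inviteOpened: ["authChecked", "error"],
  authChecked: ["workspaceJoined", "workspaceCreated", "error"],
  workspaceJoined: ["emptyStateShown", "error"],
  workspaceCreated: ["emptyStateShown", "error"],
  emptyStateShown: [],
  error: ["inviteOpened"], // assumption: retry restarts from the invite
};

// Returns the index of the first step that breaks the contract, or -1 if the path is valid.
function firstInvalidStep(path: OnboardingState[]): number {
  for (let i = 1; i < path.length; i++) {
    if (!transitions[path[i - 1]].includes(path[i])) return i;
  }
  return -1;
}
```

A path that jumps from `inviteOpened` straight to `emptyStateShown` fails at index 1, which is the kind of skipped auth check the earlier fixes kept missing.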

Browser move

Browser proof is useful because onboarding is flow/state behavior. In a local browser, follow the invite URL or mocked invite route, complete the auth path using safe local test credentials or seeded state, click through workspace creation or join, and capture the final empty-state screen. If authenticated browser state is required, use a live browser/CMUX browser pane carefully and do not expose real tokens, QR codes, customer data, or private notifications.

Implementation refs
  • workshop-followup/oracle/11-references.json
  • workshop-followup/oracle/11-validation.json
  • workshop-followup/toolkit/codex-skills/video-as-requirements/SKILL.md
  • workshop-followup/chunks/11/gemini-analysis.v2.md
  • AGENTS.md or agents.md if present
  • git diff
  • package.json scripts
Done when
  • A compact onboarding state contract extracted from the video
  • Passing verification commands and a browser receipt of the invite-to-empty-state flow
  • A git diff summary suitable for fresh-session critique
  • A minimal guardrail update only if it captures a real repeated project constraint
Failure modes
  • The agent dumps a broad rulebook into AGENTS.md; mitigate by allowing only short, project-specific guardrails learned from this actual video-backed failure.
  • The agent uses real authenticated data casually during browser verification; mitigate by using local seeded data or sanitized test accounts and stripping sensitive information from videos and logs.
Agent references

Video as requirements prompt - Use this as the user-facing extraction prompt: record a narrated walkthrough, extract acceptance criteria, user flow, clicks/typing/layout changes, responsive behavior, and edge cases.
workshop-followup/toolkit/prompts/11-video-as-requirements-prompt.md

Video as requirements Codex skill - Use this when packaging the workflow as a reusable local agent skill with guardrails around sensitive video data.
workshop-followup/toolkit/codex-skills/video-as-requirements/SKILL.md

Lesson 11 validated references - Treat this as ground truth for taught concepts: natural-language contracts, video-as-requirements, diff-driven review, modular/state-machine boundaries, living guardrails, and disposable generated code.
workshop-followup/oracle/11-references.json

Lesson 11 validation notes - Use this to avoid reintroducing tools and claims that were mentioned but not taught as workflows in the final lesson.
workshop-followup/oracle/11-validation.json

Lesson 05 browser QA context - Use this only for the supporting browser-work angle: perspective-based QA, weird-path exploration, browser-driving agents, CMUX state, and video-driven intent extraction.
workshop-followup/toolkit/codex-skills/five-monkeys-qa/SKILL.md

CMUX CLI surface - Use when the agent needs a visible local pane or browser-oriented workspace receipt; inspect state before acting and avoid inventing flags.
workshop-followup/toolkit/bundles/code-cmux-cli.md

Local verification commands - Discover and run the repo’s actual scripts instead of inventing a test command.
cat package.json

Diff-driven review - Use the final git diff as the primary review artifact, then hand goal plus diff to a fresh session for critique when the change is risky.
git diff

Implementation notes

  • Start by extracting a natural-language contract from the video before editing code. The contract should describe outcomes, user flow, responsive behavior, and edge cases; it should not become a brittle macro recording.
  • Inspect the live source before implementation. The course material is explicit that code is the source of truth when implementation details matter.
  • For web app work, browser proof matters when the requirement is visual, interactive, responsive, authenticated, or stateful. Use a local browser, CMUX browser pane, existing Playwright setup, or agent-browser-style snapshot/click/fill/evaluate workflow depending on what the repo already supports.
  • Keep generated code disposable but preserve durable assets: the video-derived contract, module/state boundaries, tests or verification artifacts, browser receipts, and the reviewed diff.
  • Strip sensitive data before uploading video. Screen recordings can leak tabs, tokens, paths, notifications, local QR codes, customer data, and auth state.
  • Do not add new permanent QA infrastructure for unstable UI unless the project already has that direction. A short QA story plus browser receipt can be enough for fast-moving product work.
  • Use `agents.md` or `AGENTS.md` as a living guardrail only when a real repeated mistake or project constraint emerges. Keep it small.