How we turned a film script into a graphic novel — and built an AI cold-reader 'reward function' that turns 'is this good?' into a number that points at the exact thing to fix. A playbook your team can steal.

A deconstruction for the team. Read top-to-bottom for the story; jump to Part V for the part you can steal.
We turned a film script into a 21-page graphic novel — and in doing so we cracked a harder, reusable problem: how do you give an AI a reward function for subjective work?
The trick: a thing is "correct" if a deliberately under-briefed cold reader reconstructs your intent from the artifact alone. A misread is a failing test. Encode intent as an answer key, run the cold reader, score the gap, fix the layer that leaked, repeat. We built it, it caught a real bug three times, and its own failures taught us how to calibrate it. The same recipe works for code review, design review, marketing copy, docs, dashboards — anything where "is this good?" felt unmeasurable.
Toil & Harvest is a satirical allegory: the same extraction bargain — "tribute demanded, protection denied" — recurring across 10,000 years (Neolithic 8000 BCE → Londinium 408 → Jack Cade's commons 1450 → present-day Austin → Ravenna 410). We blocked it out as a graphic novel:
erikbethke.com/toil-and-harvest, subscriber-gated (magic-link), with a mixed-media
Annotated mode (a prose rail of real history + the extraction lens, reader-toggled).
Open screenplay/pages/index.html to read it. But the artifact is not the point of this document; how we made it judgeable is.
Captured as the /storyboard skill. Five stages; you advance one at a time, checking in between.
Load-bearing principle: make taste decidable by generating concrete artifacts, let the human react, and only then commit — then decompose top-down and execute mechanically with the style locked.
The decomposition is fractal: Film → Act → Scene → Shot → Cell (one panel). IDs like
A1-S1-SH9-C1. The whole tree lives as a stub-ledger (screenplay/INDEX.md) + a data-model
(screenplay/SCHEMA.md) + per-scene cell records. You fill it top-down, one act per checkpoint —
never zero-to-hero in one pass.
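The ID scheme lends itself to a small parser. The sketch below assumes every cell ID follows the `A<act>-S<scene>-SH<shot>-C<cell>` pattern inferred from the single example above; `CellId` and `parse_cell_id` are hypothetical names, not taken from `SCHEMA.md`, and the real ledger may use other prefixes.

```python
import re
from typing import NamedTuple


class CellId(NamedTuple):
    """One leaf of the Film -> Act -> Scene -> Shot -> Cell tree (hypothetical type)."""
    act: int
    scene: int
    shot: int
    cell: int

    def ancestors(self) -> list[str]:
        """IDs from the act down to this cell, in top-down fill order."""
        parts = [f"A{self.act}", f"S{self.scene}", f"SH{self.shot}", f"C{self.cell}"]
        return ["-".join(parts[: i + 1]) for i in range(len(parts))]


# Assumed ID grammar, based only on the example "A1-S1-SH9-C1".
_CELL_ID = re.compile(r"A(\d+)-S(\d+)-SH(\d+)-C(\d+)")


def parse_cell_id(text: str) -> CellId:
    """Parse a cell ID string, rejecting anything that does not match the assumed grammar."""
    match = _CELL_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a cell ID: {text!r}")
    return CellId(*(int(group) for group in match.groups()))
```

Walking `ancestors()` gives the order in which stubs would need to exist before a cell record can be filled in.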
These are the lessons that make AI image-gen hold its chin up instead of screaming "AI slop." Most generalize to any generative work:
Here's the part to internalize. The whole experiment started from one belief:
AI can do almost anything provided it has a clear reward function — unit tests, a board evaluator, a loss to descend. The open problem is building that for subjective work.
A work is legible if a deliberately under-briefed cold reader reconstructs the intended meaning from the artifact alone. A misread is a failing test. The set of misreads is a stack trace.
We made this concrete on the graphic novel:
The cold reader. A small, context-starved model (Haiku), shown a single panel, image-only, forbidden from reading the script. The weaker and more under-briefed the reader, the more sensitive the test. If a lightly-attached reader can't misunderstand it, a skimming human won't either.
The answer key (intent block). Per cell we encode the semantic ground truth:
    intent:
      beat: "the bargain is struck: protection sold, then withheld"
      era: now
      feel: "fragile hope, about to be taken"
      readsAs: "people are quietly building something good together"
      thesis: "the commons keeps almost building the alternative"
      mustNotRead:               # ← the NEGATIVE tests / regression suite
        - "NOT an achieved utopia; this is YEARNED-FOR and fragile"
        - "NOT a literal monster; the horned figure is the State/extractor"
      carriedBy: [art, caption]  # which layer is responsible for this meaning
mustNotRead — the specific misreadings to prevent — is the most important field. It is your
regression suite.
The reward. pass(cell) ⇔ the cold reader recovers {readsAs} and trips zero
mustNotRead. Global: does the reader recover the thesis across the work?
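As a sketch, the per-cell predicate can be written down directly. The `Verdict` shape below is an assumption about what a judge would report; it is not the actual structure used by `grok_test.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Judge output for one cell (hypothetical shape)."""
    recovered_reads_as: bool                           # did the cold reader recover readsAs?
    tripped: list[str] = field(default_factory=list)   # mustNotRead entries the reader fell into


def cell_passes(verdict: Verdict) -> bool:
    """pass(cell): readsAs recovered AND zero mustNotRead entries tripped."""
    return verdict.recovered_reads_as and not verdict.tripped


def work_list(verdicts: dict[str, Verdict]) -> list[str]:
    """Failing cell IDs, ordered with the most tripped misreadings first."""
    failing = [cell_id for cell_id, v in verdicts.items() if not cell_passes(v)]
    return sorted(failing, key=lambda cell_id: len(verdicts[cell_id].tripped), reverse=True)
```

Note that a cell can fail without tripping any negative test: the reader simply did not recover `readsAs`. Keeping the two conditions separate is what lets the score point at a specific misreading.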
The harness (grok_test.py). Cold reader (Haiku) → judge (Sonnet) scores reconstruction vs the
answer key → scorecard. Runnable per-cell or whole-suite. CI-able (exit 1 on any fail).
The loop. Run the reader → diff reconstruction vs intent → add the minimum information to the
responsible layer → repeat until pass + zero mustNotRead.
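The reader → judge → scorecard pipeline can be sketched with the two models passed in as plain callables. Everything here is an illustrative assumption: `run_suite`, `exit_code`, the dict shapes, and the toy `fake_reader` / `fake_judge` stand-ins (which do naive substring matching in place of real model calls) are not the contents of `grok_test.py`.

```python
from typing import Callable

Reader = Callable[[str], str]         # cell ID -> free-text cold reading
Judge = Callable[[str, dict], dict]   # (reading, intent) -> {"readsAs": bool, "tripped": [...]}


def run_suite(cells: dict[str, dict], reader: Reader, judge: Judge) -> dict[str, dict]:
    """Run the cold reader on each cell, judge it against the answer key, collect a scorecard."""
    scorecard = {}
    for cell_id, intent in cells.items():
        reading = reader(cell_id)
        verdict = judge(reading, intent)
        scorecard[cell_id] = {
            **verdict,
            "pass": verdict["readsAs"] and not verdict["tripped"],
        }
    return scorecard


def exit_code(scorecard: dict[str, dict]) -> int:
    """0 if every cell passes, 1 otherwise, so a CI job can fail on any misread."""
    return 0 if all(row["pass"] for row in scorecard.values()) else 1


# Toy stand-ins for the model calls: canned readings and a substring-matching judge.
CANNED_READINGS = {
    "C1": "villagers building a barn together",
    "C2": "a horned monster attacks a village",
}


def fake_reader(cell_id: str) -> str:
    return CANNED_READINGS[cell_id]


def fake_judge(reading: str, intent: dict) -> dict:
    return {
        "readsAs": intent["readsAs"] in reading,
        "tripped": [phrase for phrase in intent["mustNotRead"] if phrase in reading],
    }
```

Passing the result of `exit_code` to `sys.exit` would make a CI step fail on any misread. The "fix the responsible layer, repeat" half of the loop stays with the human or agent editing the artifact between runs.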
The trap is "make the art carry everything." It can't, and shouldn't. Each layer owns part of the meaning and gets its own test:
| Layer | Owns | How we test it |
|---|---|---|
| **Art** | era-feel, emotional grammar (palette), the literal subject | cold reader, *image-only* |
| **Caption** | the literal proposition / who-said-what | cold reader, *image + caption* |
| **Prose / sequence** | the **thesis**, the cross-era argument, *era itself* | cold reader, *annotated* |
A meaning "leak" is only a bug if it leaks from the layer that was supposed to carry it. The art failing to convey "the 408 rescript" is not an art bug — that's the prose's job. The art letting a hero read as a monster is an art bug.
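One way to make that rule mechanical is to test each meaning only in the reading mode of the layer that owns it. The sketch below assumes mode names (`image_only`, `image_caption`, `annotated`) that mirror the table, and assumes that when `carriedBy` lists several layers, the meaning is tested at the richest of them; both are illustrative choices, not the documented behavior of the harness.

```python
# Layers ordered from least to most context given to the cold reader.
LAYERS = ["art", "caption", "prose"]

# Hypothetical mode names, one per row of the layer table.
MODE_FOR_LAYER = {
    "art": "image_only",
    "caption": "image_caption",
    "prose": "annotated",
}


def owning_mode(carried_by: list[str]) -> str:
    """Reading mode in which this meaning should be tested (richest listed layer)."""
    richest = max(carried_by, key=LAYERS.index)
    return MODE_FOR_LAYER[richest]


def is_layer_bug(carried_by: list[str], recovered_by_mode: dict[str, bool]) -> bool:
    """A leak counts as a bug only if the meaning is lost in its owner's own mode."""
    return not recovered_by_mode[owning_mode(carried_by)]
```

Under these assumptions, "the 408 rescript" (carried by prose) failing image-only is not a bug as long as the annotated reading recovers it, while a hero reading as a monster (carried by art) is a bug the moment the image-only reading gets it wrong.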
Scaling to 90 cells produced an alarming raw number (2/86). Reading the failures (not the count) revealed three calibration truths — each a thing the test was wrongly blaming the art for:
mustNotReads ("the dog IS the police") are prose-carried → can't be satisfied by a single panel; don't grade the art on them.
The meta-lesson: the reward function's job is partly to teach you what each layer is responsible for. Its right use is (a) catching egregious regressions and (b) the differential between reading modes, never an absolute pass-rate. A naive scoreboard would have lied to us; the differential told the truth.
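A minimal sketch of what "the differential between reading modes" could look like in code, assuming each mode yields a per-cell pass/fail map. The bucket names `rescued`, `stuck`, and `regressed` are made up for this example.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Absolute pass rate: the number the post warns against trusting on its own."""
    return sum(results.values()) / len(results) if results else 0.0


def differential(image_only: dict[str, bool], annotated: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two reading modes cell by cell instead of comparing headline rates."""
    shared = image_only.keys() & annotated.keys()
    return {
        # Fails on art alone, passes once the prose rail is shown: prose is doing its job.
        "rescued": sorted(c for c in shared if not image_only[c] and annotated[c]),
        # Fails in both modes: the candidates for a real, egregious bug.
        "stuck": sorted(c for c in shared if not image_only[c] and not annotated[c]),
        # Passes on art alone, fails with more context: the added layer is misleading.
        "regressed": sorted(c for c in shared if image_only[c] and not annotated[c]),
    }
```

A low image-only pass rate with a large `rescued` bucket is a calibration signal rather than an art failure; `stuck` and `regressed` are the lists worth reading first.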
This is the transferable recipe. Any time your team says "we can't really measure quality here, it's subjective" — you probably can. Here's how.
intent). Per unit (a PR, a screen, a paragraph, a chart, a ticket reply): the intended takeaway (readsAs), the desired response (feel), and, crucially, the mustNotRead: the specific wrong conclusions to prevent.
mustNotRead.
Track the differential across contexts, not just an absolute.
mustNotRead becomes a standing regression suite. Wire it into CI for the egregious cases.

| Domain | Artifact + unit | Cold reader | `mustNotRead` (the negative tests) |
|---|---|---|---|
| **Code review** | a PR / a function | a model given *only the diff*, no ticket | "must not read as adding auth when it removes it"; "must not look safe-to-merge if it drops a check" |
| **Design / UX** | a screen mockup | a model given the screenshot, no spec | "the primary CTA must not read as secondary"; "must not look like a destructive action is the default" |
| **Marketing copy** | a landing section | a reader given only the page | "must not read as a free product"; "must not imply a medical claim" |
| **Docs / API** | a how-to page | a model that must *complete the task* from the page alone | "must not lead the reader to call the deprecated endpoint" |
| **Dashboards** | a chart | a reader given only the figure | "must not read as growth when it's churn"; "axis must not imply zero-baseline" |
| **Support replies** | a drafted reply | a reader playing the upset customer | "must not read as blaming the user"; "must not promise a refund we won't give" |
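To show how little is needed to port the answer key to another row of this table, here is a sketch of a domain-neutral intent record. The `Intent` class, its field names, and `judge_questions` are hypothetical; the only grounded parts are the three fields (`readsAs`, `feel`, `mustNotRead`) and the code-review negative tests quoted from the table. The `reads_as` and `feel` values in the usage below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Answer key for one unit of work (hypothetical, domain-neutral shape)."""
    unit: str                        # e.g. "a PR", "a screen", "a chart"
    reads_as: str                    # the intended takeaway
    feel: str                        # the desired response
    must_not_read: tuple[str, ...]   # the negative tests

    def __post_init__(self) -> None:
        # An answer key with no negative tests gives no regression suite to run.
        if not self.must_not_read:
            raise ValueError("must_not_read needs at least one entry")

    def judge_questions(self) -> list[str]:
        """One yes/no question per negative test, for a judge to answer about a cold reading."""
        return [f"Violated: {rule}?" for rule in self.must_not_read]
```

For the code-review row this might be instantiated as `Intent(unit="a PR", reads_as="removes the legacy auth path", feel="safe to review in one sitting", must_not_read=("must not read as adding auth when it removes it", "must not look safe-to-merge if it drops a check"))`.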
mustNotRead is the asset. Positive intent is nice; the enumerated misreadings are what turn taste into a regression suite. Mine them from real past mistakes.
Wrap it like we wrapped /storyboard: a reusable skill that (1) elicits the intent answer key, (2) runs the cold reader at the right context level, (3) judges against the key, (4) returns a scored, layer-attributed work-list. Then every artifact your team ships gets unit-tested for meaning.
| Thing | Where |
|---|---|
| The graphic novel (read it) | `screenplay/pages/index.html` · gated: `erikbethke.com/toil-and-harvest` |
| The reward-function theory + every result | `NARRATIVE-REWARD-FUNCTION.md` (§§1–12) |
| The runnable harness | `grok_test.py` (`python3 grok_test.py [cell-id] [--annotated]`) |
| The data model + `intent` spec | `screenplay/SCHEMA.md` |
| The shot-tree ledger | `screenplay/INDEX.md` |
| Per-cell records + answer keys | `screenplay/{act1,act2,act3,climax,bonus}/*.md` |
| The pipeline skill | `.claude/skills/storyboard/SKILL.md` (+ `METHODOLOGY.md`) |
| The GRIN single-page variant | `.claude/skills/grin-dispatch/SKILL.md` |
| Image-gen scripts (429-safe, resumable) | `gen_*.py`, `dl_act1.py`, `regen_*.py` |
| Book bundler for the gated app | `pnpm bundle:toil` → `app/toil-and-harvest/_book.json` |
The deepest result isn't a graphic novel. It's that "is this good?" — for a story, a screen, a PR, a reply — can be turned into a number that points at the exact thing to fix, by asking a stranger what they understood and measuring the gap against what you meant.
Published: June 4, 2026 9:25 PM
Last updated: June 4, 2026 10:40 PM