The Making of Toil & Harvest — a reward function for subjective work

June 4, 2026

Erik Bethke

Toil & Harvest AI Game Design Philosophy

How we turned a film script into a graphic novel — and built an AI cold-reader 'reward function' that turns 'is this good?' into a number that points at the exact thing to fix. A playbook your team can steal.


…and a blueprint for building subjective reward functions in any domain

A deconstruction for the team. Read top-to-bottom for the story; jump to Part V for the part you can steal.

TL;DR for the busy reader

We turned a film script into a 21-page graphic novel — and in doing so we cracked a harder, reusable problem: how do you give an AI a reward function for subjective work?

The trick: a thing is "correct" if a deliberately under-briefed cold reader reconstructs your intent from the artifact alone. A misread is a failing test. Encode intent as an answer key, run the cold reader, score the gap, fix the layer that leaked, repeat. We built it. It caught a real bug three times, and its own failures taught us how to calibrate it. The same recipe works for code review, design review, marketing copy, docs, dashboards: anything where "is this good?" felt unmeasurable.

Part I — The Artifact

Toil & Harvest is a satirical allegory: the same extraction bargain — "tribute demanded, protection denied" — recurring across 10,000 years (Neolithic 8000 BCE → Londinium 408 → Jack Cade's commons 1450 → present-day Austin → Ravenna 410). We blocked it out as a graphic novel:

111 cells, ~21 pages, five acts/codas, in a disciplined Mike Mignola ink style.
A palette that means something: slate = the grind · crimson = blood/the bargain · green-gold = the fragile exhale · pulp = propaganda · black = violence-as-sound.
Hosted at erikbethke.com/toil-and-harvest, subscriber-gated (magic-link), with a mixed-media Annotated mode (a prose rail of real history + the extraction lens, reader-toggled).

Open screenplay/pages/index.html to read it. But the artifact is not the point of this document — how we made it judgeable is.

Part II — The Pipeline (words → ranged options → assembled pages)

Captured as the /storyboard skill. Five stages; you advance one at a time, checking in between.

Source. Locate the raw material (here: a 5,600-line ChatGPT export + a film bible). Know what exists and what stage it's at. (That export wasn't typed at a desk — it's a voice session, recorded while I walked the actual hill: Blackheath, in south London, where Act I's 1450 scene is set. Mid-conversation I bent down and pulled a chip of worked flint out of the dirt — a stranger's tool from the deep past, in my hand, on the very ground where the book's First Harvest begins. The source material was made on the hill it's about.)
Narrative. Distill to logline, theme, acts, scene list, motif tracker.
Screenplay / beats. For the target scene: slug, action, shot list, VO/dialogue, sound, color note.
Concept art — STYLE RANGING (a hard gate). Generate one scene in N styles and let the human react. You never ask "what style?" in words — the decisive taste is only sayable in front of real images. (Erik picked Mignola ink over noir/painterly/woodcut by seeing all four.)
Assemble pages. Per-panel art + HTML/CSS comic layout (black gutters, caption boxes, balloons, SFX, page-turn dread).

Load-bearing principle: make taste decidable by generating concrete artifacts, let the human react, and only then commit — then decompose top-down and execute mechanically with the style locked.

The decomposition is fractal: Film → Act → Scene → Shot → Cell (one panel). IDs like A1-S1-SH9-C1. The whole tree lives as a stub-ledger (screenplay/INDEX.md) + a data-model (screenplay/SCHEMA.md) + per-scene cell records. You fill it top-down, one act per checkpoint — never zero-to-hero in one pass.
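
A minimal sketch of how IDs of this shape could be parsed, assuming the pattern implied by the single example `A1-S1-SH9-C1` (act, scene, shot, cell, each with a numeric suffix). `CellId` and `parse_cell_id` are hypothetical names, not taken from `screenplay/SCHEMA.md`, and the real ledger may use other prefixes for codas or bonus material.

```python
import re
from typing import NamedTuple


class CellId(NamedTuple):
    act: int
    scene: int
    shot: int
    cell: int

    def parent_shot(self) -> str:
        """ID of the shot that contains this cell."""
        return f"A{self.act}-S{self.scene}-SH{self.shot}"


# Assumed pattern, inferred only from the example "A1-S1-SH9-C1".
_CELL_ID = re.compile(r"A(\d+)-S(\d+)-SH(\d+)-C(\d+)")


def parse_cell_id(text: str) -> CellId:
    """Split a cell ID into its four levels, or raise ValueError."""
    match = _CELL_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a cell ID: {text!r}")
    return CellId(*(int(part) for part in match.groups()))


cell = parse_cell_id("A1-S1-SH9-C1")
print(cell, cell.parent_shot())
```

Keeping the levels as separate integers makes it cheap to walk up the tree (cell to shot to scene) when a checkpoint covers one act at a time.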

Part III — The Craft Disciplines (hard-won, transferable)

These are the lessons that make AI image-gen hold its chin up instead of screaming "AI slop." Most generalize to any generative work:

Range, don't specify. Decisive preferences emerge in front of options, not from a prompt.
Kill the "AI-gen tell." Name an authored medium (flat screen-print color, bold ink, hand-inked 2D) and explicitly negate photorealistic / 3D render. The medium-claim + the negation is what reads as intentional.
Palette = emotion. Rationed spot color that means something is both aesthetic and story device.
Temperature = era. Stress-test the look on a tonally opposite scene (warm vs cold) before locking.
Name the technique, never the franchise. "Bold flat ink style of Mike Mignola," not "Hellboy" — naming the IP summons the red demon into your panels.
Negations backfire. "No horns" can summon horns. Assert the positive ("ordinary humans, smooth bare foreheads") and never write the banned noun.
Render signage textless, letter it in HTML — else the model invents gibberish or the franchise word.
When a prompt stalls or hits gore-friction, abstract toward silhouette — more reliable to generate and stronger on the page.
Respect the rate limit as a guardrail. One generator at a time, throttle, download-as-ready, resumable scripts, skip-on-disk. (B4M: 60/min · 1000/day per key — every poll counts.)
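
The rate-limit discipline in the last item can be sketched roughly as below. This is an illustration under assumptions, not the project's actual scripts: `render` is a hypothetical stand-in for whatever call produces the image bytes, the `.png` naming is invented, and the daily budget here only counts calls made in the current run.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def generate_missing(
    cell_ids: Iterable[str],
    out_dir: Path,
    render: Callable[[str], bytes],      # hypothetical image-generator call
    min_interval_s: float = 1.0,         # 60/min works out to one call per second
    daily_budget: int = 1000,            # counts this run's calls only
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Render only the cells with no file on disk, one at a time, throttled."""
    out_dir.mkdir(parents=True, exist_ok=True)
    made: list[str] = []
    for cell_id in cell_ids:
        target = out_dir / f"{cell_id}.png"
        if target.exists():              # skip-on-disk: a rerun resumes where it stopped
            continue
        if len(made) >= daily_budget:
            break                        # leave the rest for a later run
        if made:
            sleep(min_interval_s)        # throttle between consecutive calls
        target.write_bytes(render(cell_id))
        made.append(cell_id)
    return made
```

Because finished files are the only state, killing the script and starting it again is safe: nothing already on disk is regenerated or billed twice.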

Part IV — The Breakthrough: a Reward Function for Subjective Work

Here's the part to internalize. The whole experiment started from one belief:

AI can do almost anything provided it has a clear reward function — unit tests, a board evaluator, a loss to descend. The open problem is building that for subjective work.

The core idea

A work is legible if a deliberately under-briefed cold reader reconstructs the intended meaning from the artifact alone. A misread is a failing test. The set of misreads is a stack trace.

We made this concrete on the graphic novel:

The cold reader. A small, context-starved model (Haiku), shown a single panel, image-only, forbidden from reading the script. The weaker and more under-briefed the reader, the more sensitive the test. If a lightly-attached reader can't misunderstand it, a skimming human won't either.

The answer key (intent block). Per cell we encode the semantic ground truth:

intent:
  beat:        "the bargain is struck — protection sold, then withheld"
  era:         now
  feel:        "fragile hope, about to be taken"
  readsAs:     "people are quietly building something good together"
  thesis:      "the commons keeps almost building the alternative"
  mustNotRead:                       # ← the NEGATIVE tests / regression suite
    - "NOT an achieved utopia — this is YEARNED-FOR and fragile"
    - "NOT a literal monster — the horned figure is the State/extractor"
  carriedBy:   [art, caption]        # which layer is responsible for this meaning

mustNotRead — the specific misreadings to prevent — is the most important field. It is your regression suite.
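
One way to hold that answer key in code is sketched below. The `Intent` class, its snake_case field names, and the rule that a key with no `mustNotRead` entries is rejected are assumptions made for illustration; the layer name `prose` is also assumed, since the example only shows `art` and `caption`. The strings are shortened from the example above.

```python
from dataclasses import dataclass

# Assumed layer names: the two seen in carriedBy, plus "prose".
KNOWN_LAYERS = {"art", "caption", "prose"}


@dataclass(frozen=True)
class Intent:
    reads_as: str
    feel: str
    thesis: str
    must_not_read: tuple[str, ...]
    carried_by: tuple[str, ...]

    @classmethod
    def from_record(cls, record: dict) -> "Intent":
        """Build an Intent from a parsed `intent:` mapping, rejecting weak keys."""
        must_not_read = tuple(record.get("mustNotRead", ()))
        carried_by = tuple(record.get("carriedBy", ()))
        if not must_not_read:
            raise ValueError("answer key has no mustNotRead entries")
        unknown = set(carried_by) - KNOWN_LAYERS
        if unknown:
            raise ValueError(f"unknown layer(s): {sorted(unknown)}")
        return cls(record["readsAs"], record["feel"], record["thesis"],
                   must_not_read, carried_by)


example = Intent.from_record({
    "readsAs": "people are quietly building something good together",
    "feel": "fragile hope, about to be taken",
    "thesis": "the commons keeps almost building the alternative",
    "mustNotRead": ["NOT an achieved utopia", "NOT a literal monster"],
    "carriedBy": ["art", "caption"],
})
print(len(example.must_not_read), example.carried_by)
```

Failing fast on an empty `mustNotRead` is a design choice in this sketch: it forces every unit to name at least one misreading before it can be tested.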

The reward. pass(cell) ⇔ the cold reader recovers {readsAs} and trips zero mustNotRead. Global: does the reader recover the thesis across the work?
The harness (grok_test.py). Cold reader (Haiku) → judge (Sonnet) scores reconstruction vs the answer key → scorecard. Runnable per-cell or whole-suite. CI-able (exit 1 on any fail).
The loop. Run the reader → diff reconstruction vs intent → add the minimum information to the responsible layer → repeat until pass + zero mustNotRead.
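
The reward and the CI behaviour above can be sketched as follows. This is not the contents of `grok_test.py`: `Verdict`, `run_suite`, and `exit_code` are hypothetical names, and the reader and judge are toy offline stand-ins for the two model calls so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    cell_id: str
    recovered: bool                                    # did the reader recover readsAs?
    tripped: list[str] = field(default_factory=list)   # mustNotRead entries it hit

    @property
    def passed(self) -> bool:
        # pass(cell): readsAs recovered AND zero mustNotRead tripped
        return self.recovered and not self.tripped


Reader = Callable[[str], str]           # cell id -> cold reading (free text)
Judge = Callable[[str, str], Verdict]   # (cell id, reading) -> verdict


def run_suite(cell_ids: list[str], read: Reader, judge: Judge) -> list[Verdict]:
    return [judge(cell_id, read(cell_id)) for cell_id in cell_ids]


def exit_code(verdicts: list[Verdict]) -> int:
    """0 when every cell passes, 1 on any fail, so CI can gate on it."""
    return 0 if all(v.passed for v in verdicts) else 1


# Toy stand-ins; real ones would call the cold-reader and judge models.
def fake_read(cell_id: str) -> str:
    return {"C1": "neighbours building a barn", "C2": "a horned demon looms"}[cell_id]


def fake_judge(cell_id: str, reading: str) -> Verdict:
    tripped = ["literal monster"] if "demon" in reading else []
    return Verdict(cell_id, recovered="building" in reading, tripped=tripped)


verdicts = run_suite(["C1", "C2"], fake_read, fake_judge)
for v in verdicts:
    print(v.cell_id, "PASS" if v.passed else f"FAIL {v.tripped}")
print("exit code:", exit_code(verdicts))
```

Passing the reader and judge in as plain callables keeps the scoring rule testable on its own, separate from any model behaviour.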

The trap is "make the art carry everything." It can't, and shouldn't. Each layer owns part of the meaning and gets its own test:

| Layer | Owns | How we test it |
| --- | --- | --- |
| Art | era-feel, emotional grammar (palette), the literal subject | cold reader, image-only |
| Caption | the literal proposition / who-said-what | cold reader, image + caption |
| Prose / sequence | the thesis, the cross-era argument, era itself | cold reader, annotated |

A meaning "leak" is only a bug if it leaks from the layer that was supposed to carry it. The art failing to convey "the 408 rescript" is not an art bug — that's the prose's job. The art letting a hero read as a monster is an art bug.
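
That rule can be written down as a small check. The sketch assumes three reading contexts matching the table's test column and assumes a miss only counts when every layer listed as carrying the meaning was actually shown to the reader; `VISIBLE` and `is_bug` are hypothetical names.

```python
# Which layers the cold reader can see in each reading context (assumed mapping).
VISIBLE = {
    "image-only": {"art"},
    "image+caption": {"art", "caption"},
    "annotated": {"art", "caption", "prose"},
}


def is_bug(carried_by: set[str], context: str, recovered: bool) -> bool:
    """A miss is a bug only if all responsible layers were visible in this context."""
    if recovered:
        return False
    return carried_by <= VISIBLE[context]


# The rescript is the prose's job, so missing it image-only is not an art bug.
print(is_bug({"prose"}, "image-only", recovered=False))
# "Hero, not monster" is the art's job, so missing it image-only is a bug.
print(is_bug({"art"}, "image-only", recovered=False))
```

A meaning carried by `art` and `caption` together would therefore only be graded in the image-plus-caption or annotated contexts, never image-only.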

What it actually caught (the proof)

An IP bug, three times. Heavy-ink silhouettes read as Hellboy-style demons. The cold reader flagged it even with a disambiguating caption — objective, repeatable, not one person's eye. We redrew; it went green; the regression test now guards it forever.
The thesis A/B. Image-only, the cold reader read an "endurance vs cosmic forces" eco-fable. Add the prose rail → it reconstructed the exact extraction thesis: "surplus → taxation, coercion, the bargain… inescapable from the moment humans learned to keep what can be taken." That proved the prose layer is load-bearing — and justified shipping the mixed-media Annotated mode.

What its failures taught us (the honest part)

Scaling to 90 cells produced an alarming raw number (2/86). Reading the failures (not the count) revealed three calibration truths — each a thing the test was wrongly blaming the art for:

Era is context-carried, not a lone panel's job → make era informational, feed the chapter header in annotated mode.
Thesis-level mustNotReads ("the dog IS the police") are prose-carried → can't be satisfied by a single panel; don't grade the art on them.
"Reads supernatural in isolation" is partly intrinsic to a heavy-ink style viewed cold — not a fixed bug list; chasing it cell-by-cell is whack-a-mole against your own aesthetic. In context it doesn't happen.

The meta-lesson: the reward function's job is partly to teach you what each layer is responsible for. Its right use is (a) catching egregious regressions and (b) the differential between reading modes — never an absolute pass-rate. A naive scoreboard would have lied to us; the differential told the truth.
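
A rough sketch of reading the differential instead of the absolute: compare the same cells under two reading modes and report which ones changed. The function name and the toy results are invented for illustration and are not the project's data.

```python
def differential(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare two runs of the same cells under different reading modes."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs share no cells")
    gained = [c for c in shared if after[c] and not before[c]]
    lost = [c for c in shared if before[c] and not after[c]]
    delta = (sum(after[c] for c in shared) - sum(before[c] for c in shared)) / len(shared)
    return {"gained": gained, "lost": lost, "delta": delta}


# Invented toy results: True means the cell passed in that mode.
image_only = {"C1": False, "C2": False, "C3": True, "C4": False}
annotated = {"C1": True, "C2": False, "C3": True, "C4": True}

report = differential(image_only, annotated)
print(report)
```

Cells in `gained` show what the added context is carrying; anything in `lost` is a regression worth reading first; cells that fail in both modes are the candidates for a real fix.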

Part V — For Your Team: Build Your Own Subjective-Reward-Function Review Skill

This is the transferable recipe. Any time your team says "we can't really measure quality here, it's subjective" — you probably can. Here's how.

The 6-step recipe

Name the artifact and its intended effect. What should the recipient understand, feel, or do?
Write the answer key (intent). Per unit (a PR, a screen, a paragraph, a chart, a ticket reply): the intended takeaway (readsAs), the desired response (feel), and — crucially — the mustNotRead: the specific wrong conclusions to prevent.
Pick a cold reader. A deliberately under-briefed model (or a fresh human) given only the artifact. Under-briefing is a feature: it surfaces what the artifact actually transmits.
Identify the layers and what each is responsible for. Test each layer with the context that layer's real audience has — no more.
Score: reconstruction vs intent. Pass ⇔ recovers the takeaway AND trips zero mustNotRead. Track the differential across contexts, not just an absolute.
Close the loop. Fix the layer that leaked, re-run. mustNotRead becomes a standing regression suite. Wire it into CI for the egregious cases.

What this looks like in other domains

| Domain | Artifact + unit | Cold reader | `mustNotRead` (the negative tests) |
| --- | --- | --- | --- |
| Code review | a PR / a function | a model given only the diff, no ticket | "must not read as adding auth when it removes it"; "must not look safe-to-merge if it drops a check" |
| Design / UX | a screen mockup | a model given the screenshot, no spec | "the primary CTA must not read as secondary"; "must not look like a destructive action is the default" |
| Marketing copy | a landing section | a reader given only the page | "must not read as a free product"; "must not imply a medical claim" |
| Docs / API | a how-to page | a model that must complete the task from the page alone | "must not lead the reader to call the deprecated endpoint" |
| Dashboards | a chart | a reader given only the figure | "must not read as growth when it's churn"; "axis must not imply zero-baseline" |
| Support replies | a drafted reply | a reader playing the upset customer | "must not read as blaming the user"; "must not promise a refund we won't give" |

The principles that make it work (steal these)

Under-brief the reader on purpose. Sensitivity comes from starvation of context.
mustNotRead is the asset. Positive intent is nice; the enumerated misreadings are what turn taste into a regression suite. Mine them from real past mistakes.
Be layer-aware. Don't blame the diagram for what the caption should say. Assign each meaning to the cheapest layer that can carry it, and test that layer with that layer's real context.
Trust the differential, distrust the absolute. "Score went from X→Y when we added the spec" is signal. "Score is 2/86" out of context, like our first full run, is usually a calibration artifact.
The loop is the point. Author intent → cold-read → score → fix the leaking layer → re-run. Taste becomes a test; the test becomes a work-list; the work-list gets done; the test confirms it.

Make it a skill

Wrap it like we wrapped /storyboard: a reusable skill that (1) elicits the intent answer key, (2) runs the cold reader at the right context level, (3) judges against the key, (4) returns a scored, layer-attributed work-list. Then every artifact your team ships gets unit-tested for meaning.
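
The last output of such a skill, the layer-attributed work-list, might look something like this. `Finding` and `work_list` are hypothetical names and the sample findings are invented; the real skill's output format is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    unit: str      # e.g. a cell id, a PR, a screen name
    layer: str     # the layer responsible for the missed meaning
    misread: str   # what the cold reader concluded instead


def work_list(findings: list[Finding]) -> dict[str, list[str]]:
    """Group failed readings by the layer that has to be fixed."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for f in sorted(findings, key=lambda f: (f.layer, f.unit)):
        grouped[f.layer].append(f"{f.unit}: {f.misread}")
    return dict(grouped)


# Invented sample findings.
todo = work_list([
    Finding("screen-2", "caption", "reads the delete button as the default"),
    Finding("screen-1", "art", "reads the primary CTA as secondary"),
    Finding("screen-3", "art", "reads the banner as an error"),
])
for layer, items in todo.items():
    print(layer, items)
```

Grouping by layer rather than by unit hands each fix to whoever owns that layer, which is the point of attributing the leak in the first place.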

Part VI — Artifact Map

| Thing | Where |
| --- | --- |
| The graphic novel (read it) | `screenplay/pages/index.html` · gated: `erikbethke.com/toil-and-harvest` |
| The reward-function theory + every result | `NARRATIVE-REWARD-FUNCTION.md` (§§1–12) |
| The runnable harness | `grok_test.py` (`python3 grok_test.py [cell-id] [--annotated]`) |
| The data model + `intent` spec | `screenplay/SCHEMA.md` |
| The shot-tree ledger | `screenplay/INDEX.md` |
| Per-cell records + answer keys | `screenplay/{act1,act2,act3,climax,bonus}/*.md` |
| The pipeline skill | `.claude/skills/storyboard/SKILL.md` (+ `METHODOLOGY.md`) |
| The GRIN single-page variant | `.claude/skills/grin-dispatch/SKILL.md` |
| Image-gen scripts (429-safe, resumable) | `gen_.py`, `dl_act1.py`, `regen_.py` |
| Book bundler for the gated app | `pnpm bundle:toil` → `app/toil-and-harvest/_book.json` |

The deepest result isn't a graphic novel. It's that "is this good?" — for a story, a screen, a PR, a reply — can be turned into a number that points at the exact thing to fix, by asking a stranger what they understood and measuring the gap against what you meant.
