Give the Model Eyeballs

June 28, 2026

Erik Bethke

If an agent can clearly see, measure, or hear a task, it's 80% solved. Here's the mechanism, where it betrays you, and why it's now the first thing I screen for when I hire.

Ask a frontier model how many R's are in "strawberry" and it will often get it wrong. The internet treats this as proof the emperor has no clothes: see, it can't even count.

It's proof of the opposite. The model never sees letters. "Strawberry" arrives as a couple of tokens, not a string of characters, so asking it to count the R's is like asking a person to count the rod cells in their own retina by introspection. Hand it one tool, s.count('r'), and it's instantly, perfectly correct. You asked a blind man to name a color, and concluded he was stupid.
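A minimal sketch of what that one tool could look like in Python. The wrapper name count_letter is an illustration, not something the post defines; the only part taken from the text is the call to str.count.

```python
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of one letter, ignoring case.

    This operates on characters, which is exactly the view of the
    string that a token-based model does not get.
    """
    return text.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
```

The model does not need to get better at counting. It only needs to decide that counting is the job and hand the string to something that can see the characters.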

That single misunderstanding is the most expensive mistake people make about these systems — and I mean "expensive" personally.

😖 A personal confession. Every time someone whips out the strawberry "gotcha" as proof the machines are dumb, it hits me like nails on a chalkboard — the technical-literacy equivalent of dunking on a calculator for being bad at poetry. I love these people. I also have to leave the room.

So let me state the thing the misunderstanding hides, because once you see it you can't unsee it:

If you give the model eyeballs — something that can clearly see, measure, or listen to the task — the task is 80% solved.

A genius with no afferent nerves

Here is the mental model that explains everything else.

A large language model is a strong function and a very weak sensor. Give it an accurate observation of the current state, and mapping that observation to the right next action is exactly what it's great at: compressed world knowledge applied to a concrete input. But its only native sense organ is the context window. The weights aren't perception; they're frozen memory, a brilliant prior with no live feed. At inference the model is a brain in a vat that gets exactly one note slipped under the door, and must act on the entire world from that note alone. No clock. No filesystem. No eyes. No persistence between calls.

So "eyeballs" don't add intelligence. They convert an open-loop guess — generate blind, hope it's right — into a closed-loop search — observe, correct, repeat. The 80% you feel is the whole gap between those two regimes. Nearly every "the model is dumb" moment is really "the model is flying blind."
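The difference between the two regimes can be shown with a toy. In the sketch below every name is hypothetical: BisectingProposer stands in for the model, make_sensor stands in for the eyeball, and the "task" is just finding a hidden number. It is an illustration of the loop shape under those assumptions, not a description of any real agent framework.

```python
from typing import Callable, Optional


class BisectingProposer:
    """Hypothetical stand-in for the model: narrows a range using feedback."""

    def __init__(self, low: int, high: int) -> None:
        self.low, self.high = low, high
        self.last: Optional[int] = None

    def __call__(self, feedback: Optional[str]) -> int:
        # Use the previous observation, if any, to shrink the search range.
        if feedback == "too low" and self.last is not None:
            self.low = self.last + 1
        elif feedback == "too high" and self.last is not None:
            self.high = self.last - 1
        self.last = (self.low + self.high) // 2
        return self.last


def make_sensor(target: int) -> Callable[[int], str]:
    """Hypothetical eyeball: reports how a guess compares to the truth."""
    def observe(guess: int) -> str:
        if guess < target:
            return "too low"
        if guess > target:
            return "too high"
        return "ok"
    return observe


def open_loop(propose: Callable[[Optional[str]], int]) -> int:
    """Generate once with no feedback and return whatever comes out."""
    return propose(None)


def closed_loop(propose, observe, max_steps: int = 20) -> Optional[int]:
    """Observe, correct, repeat until the sensor reports success."""
    feedback = None
    for _ in range(max_steps):
        guess = propose(feedback)
        feedback = observe(guess)
        if feedback == "ok":
            return guess
    return None  # ran out of steps without converging


print(open_loop(BisectingProposer(0, 100)))                     # 50
print(closed_loop(BisectingProposer(0, 100), make_sensor(73)))  # 73
```

The proposer is identical in both calls. The only thing that changes is whether it gets to see the result of its own guess.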

I relearned this last night on a bug that had taxed me, on and off, for six months. A terminal pane would start spraying garbage like 24;30M65;24;30M65, with no usable prompt. Typing reset came back with zsh: command not found: 24;30M65. For six months my fix was to rage-quit the whole window. Last night I pasted those exact bytes to my agent. Seconds later it answered: those are mouse-report escape codes. A crashed terminal app had left the pane in "report every mouse movement" mode, so the bytes weren't from a process; they were generated by my mouse moving over the pane, and they shredded every reset I typed. Park the mouse off the window, then reset. Fixed forever. The reasoning was never hard. The agent just needed to see the raw bytes. The moment it could, a six-month mystery became a two-second fix.
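For readers who want to recognize the same garbage, here is a hedged sketch. It assumes the pane was emitting xterm SGR-style mouse reports, which have the form ESC [ < button ; column ; row M, and that the leading escape bytes were consumed before display, leaving fragments like 65;24;30M. The function names are illustrative, and the mode list (1000, 1002, 1003, 1006) is the set of xterm mouse-tracking modes a crashed app most plausibly left on, not something the post specifies.

```python
import re

# What survives on screen when the "ESC [ <" prefix of an SGR mouse
# report is swallowed: three numeric fields ending in M (press) or m (release).
MOUSE_FRAGMENT = re.compile(r"\d+;\d+;\d+[Mm]")

# xterm private modes for mouse tracking: normal, button-event,
# any-event, and the SGR extended encoding.
MOUSE_MODES = (1000, 1002, 1003, 1006)


def looks_like_mouse_reports(text: str) -> bool:
    """Return True if the text contains a mouse-report-shaped fragment."""
    return MOUSE_FRAGMENT.search(text) is not None


def disable_mouse_reporting() -> str:
    """Build the escape sequence that switches each tracking mode off."""
    return "".join(f"\x1b[?{mode}l" for mode in MOUSE_MODES)


print(looks_like_mouse_reports("24;30M65;24;30M65"))  # True
print(looks_like_mouse_reports("ls -la"))             # False
```

Writing the string from disable_mouse_reporting() to the affected terminal asks it to stop reporting mouse events. The fix in the story, parking the mouse and running reset, gets to the same place without any code.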

Two rungs on the ladder

There are exactly two ways to give a system sight, and order matters.

Rung	The move	Example
1: Give the model eyeballs	Let it observe and close its own loop.	A compiler error. A failing test. The raw mouse-codes I pasted in.
2: Give the system ground truth	When the model can't observe reliably, a human or a tool supplies the truth and the model consumes it.	A purpose-built editor that records the facts the model would otherwise guess.

Both rungs are the same refusal: never let the model fabricate the sensor reading.

The seam is the tell

I built a game in an evening recently — Strait Sweeper, Minesweeper played over satellite imagery of the Strait of Hormuz, where mines only spawn in water and the coastline becomes your safe ground. Eighteen commits, ~2,000 lines, one night. Almost all of it flew — audio synthesis, a leaderboard, the glue — because none of it required knowing an unobservable truth about this specific build.

But two places made me stop and build something, and those two places are the entire point.

First, telling land from water. I did not let the model guess coastlines from a grid. I built an interactive terrain editor and classified the cells myself against the satellite image. That's Rung 2 in its purest form — the truth was knowable, just not by the model, from that input — so I moved the sensor to where the truth actually lived.

Second, mine distribution. Random placement reasoned fine but measured badly: mines clustered in the open water. The fix wasn't cleverer logic; it was counting. Divide the grid into strips, count the water cells in each, allocate proportionally. I gave the distribution an eyeball instead of trusting that "uniform random" looks even.
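The counting fix can be sketched in a few lines. This is an illustration under assumptions, not the game's actual code: the grid is taken to be a list of rows of booleans (True means water), strips are horizontal bands of rows, and leftover mines are handed out by largest remainder so the totals add up exactly. The function names are hypothetical.

```python
from typing import List


def water_per_strip(grid: List[List[bool]], strip_height: int) -> List[int]:
    """Count water cells in each horizontal band of strip_height rows."""
    return [
        sum(cell for row in grid[top:top + strip_height] for cell in row)
        for top in range(0, len(grid), strip_height)
    ]


def allocate_mines(water_counts: List[int], total_mines: int) -> List[int]:
    """Split total_mines across strips in proportion to their water cells."""
    total_water = sum(water_counts)
    if total_mines > total_water:
        raise ValueError("more mines than water cells")
    if total_water == 0:
        return [0] * len(water_counts)

    # Integer share for each strip, plus the remainder left over.
    shares = [total_mines * w // total_water for w in water_counts]
    remainders = [total_mines * w % total_water for w in water_counts]

    # Give the unassigned mines to the strips with the largest remainders.
    leftover = total_mines - sum(shares)
    by_remainder = sorted(range(len(shares)), key=lambda i: -remainders[i])
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


grid = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, True],
]
counts = water_per_strip(grid, strip_height=2)
print(counts)                     # [3, 6]
print(allocate_mines(counts, 3))  # [1, 2]
```

Within each strip the mines can still be placed at random among the water cells. The measurement only decides how many each strip gets.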

The friction points were a perfect map of where the model was blind. In any agentic build, the seams tell you where to add eyeballs.

Why coding agents won first

Not because code is easy — because code has the best eyeballs that exist. Compilers, type checkers, tests, git diffs: instant, honest, high-signal, and very hard to fool. A coding agent lives inside a dense field of sensors, so it spends its time in closed-loop search instead of open-loop guessing.
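As a small illustration of how cheap such a sensor can be, the sketch below wraps Python's built-in compile() so that a candidate snippet either passes silently or comes back with a short error report. The name syntax_sensor and the report format are assumptions made for this example; a real agent would typically lean on a full compiler, type checker, or test runner instead.

```python
from typing import Optional


def syntax_sensor(source: str) -> Optional[str]:
    """Return None if the source compiles, otherwise a short error report.

    The report is the observation an agent would feed back into its
    next attempt instead of guessing whether the code was valid.
    """
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


print(syntax_sensor("x = 1\n"))                # None
print(syntax_sensor("def f(:\n    pass\n"))    # an error report for line 1
```

The sensor is honest in the sense the paragraph above describes: the model cannot talk it into accepting broken code.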

Finance, project management, operations, legal — those domains don't have that instrumentation yet, and that is the whole opportunity. Building the eyeballs for a domain that lacks them is the product. It's most of what we do at Bike4Mind: deterministic engines that hand the model true measurements so it never has to hallucinate a number, and change-sets with previews so it can see the consequences of an action before it commits. The model resolves intent; the tools compute truth. I am not asking the model how many R's are in "strawberry."
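To make "change-sets with previews" concrete, here is a hypothetical sketch. It is not Bike4Mind's implementation; the state is a plain dictionary, a change-set is a list of (field, new value) pairs, and the names preview and commit are invented for the example. The point is only that the consequences are computed deterministically and shown before anything is written.

```python
import copy
from typing import Any, Dict, List, Tuple

Change = Tuple[str, Any]  # (field name, new value)
_MISSING = object()       # marks a field that did not exist before


def preview(state: Dict[str, Any], changes: List[Change]) -> List[str]:
    """Apply changes to a copy and describe what would differ."""
    draft = copy.deepcopy(state)
    for key, value in changes:
        draft[key] = value

    lines = []
    for key in sorted(draft):
        before = state.get(key, _MISSING)
        after = draft[key]
        if before is _MISSING:
            lines.append(f"+ {key} = {after!r}")
        elif before != after:
            lines.append(f"~ {key}: {before!r} -> {after!r}")
    return lines


def commit(state: Dict[str, Any], changes: List[Change]) -> None:
    """Apply the same changes to the real state."""
    for key, value in changes:
        state[key] = value


project = {"budget": 100, "owner": "ana"}
proposed = [("budget", 120), ("deadline", "Q3")]

print(preview(project, proposed))  # ['~ budget: 100 -> 120', "+ deadline = 'Q3'"]
print(project)                     # {'budget': 100, 'owner': 'ana'}
```

Only after the preview has been inspected, by the model or by a person, would commit(project, proposed) be called.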

This is practical AGI

Let me say the quiet part plainly: we have practical AGI right now, if you supply a state-of-the-art model with eyeballs and tools.

The "general" was never the bottleneck — generality of reasoning has been here a while. What was missing was the loop: perception in, action out, correction, repeat. So the AGI is the system, not the weights. People spent years staring at the model waiting for a spark, watching the wrong object. The spark was always going to be the architecture wrapped around it.

Two things keep that claim honest. First, "simply" supplying the eyeballs is the frontier — honest, discriminating sensors are most of the engineering. That doesn't weaken the claim; it is the moat. The model is the commodity; the instrumentation is the product. Second, practical AGI is error-correcting, not error-free — which is the exact rebuttal to "make me a million dollars and make no mistakes." Eyeballs don't prevent errors; they make errors cheap and recoverable, because the agent can see and fix its own. Open-loop perfection is impossible for any agent, human included. The achievement isn't a system that's never wrong — it's a system that's wrong constantly and converges anyway.

I have built this before. We called it an NPC.

I've spent 35 years making artificial minds and worlds feel alive — convincingly enough to entertain tens of millions of people. And a believable game character was never "intelligence." It was sensors + actuators + a loop + memory: raycasts and triggers and nav-mesh queries (eyeballs), animation and pathfinding (hands), a behavior tree or state machine (the loop), a blackboard (memory). You wire those together until something feels alive.

Which means the mystical-sounding ingredients aren't mystical at all. Memory is a write tool plus a read sensor over persistent state. The loop is the orchestrator that re-invokes the function on fresh observations. Agency is a goal in context, tools to act, sensors to perceive results, and a loop to persist toward it — point a closed loop at a goal and agency is simply what it looks like from outside. There is no secret sauce left to discover. Every ingredient people are still waiting for is a sensor or a tool you already know how to build.
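The claim that memory is "a write tool plus a read sensor over persistent state" fits in a dozen lines. The sketch below assumes a single JSON file as the persistent state; the class name FileMemory and its two methods are hypothetical, chosen only to mirror the sentence above.

```python
import json
from pathlib import Path
from typing import Any, Dict


class FileMemory:
    """Persistent state exposed as one read sensor and one write tool."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def read(self) -> Dict[str, Any]:
        """Read sensor: return everything remembered so far."""
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, key: str, value: Any) -> None:
        """Write tool: record one fact so a later call can observe it."""
        state = self.read()
        state[key] = value
        self.path.write_text(json.dumps(state), encoding="utf-8")
```

Because the state lives outside the model, a fresh invocation with no recollection of the previous call can construct FileMemory on the same path, call read(), and pick up where the last one stopped.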

The only thing that changed in 2025 is that the decision function in the middle went from hand-authored to general. I'm not learning a new paradigm. I'm recognizing my own, with a vastly better brain dropped into a socket I've been building my whole career.

The honest boundary: what's here is general execution. What's not here yet is general imagination — the system will execute anything you can decompose and instrument, but it does not yet decide, on its own, what is worth doing. You are still the imagination in the loop. That's not a gap in the thesis; it's the proof of it — and it relocates the scarce resource from compute to imagination + instrumentation.

How I hire for it now

For years the classic screen was a Fermi brain-teaser: how many windows are there in Seattle? Useful once, for testing whether someone could decompose and estimate under uncertainty.

I don't ask that anymore. Now I ask: "How would you architect an agentic solution to this problem?" — and I listen for one specific instinct.

Weak answer	The hire
Reaches for a better model, a longer prompt, more context, a cleverer chain. Tries to make the model smarter.	Asks "what does the agent need to be able to see or measure here — and how do I give it that?" Reaches for a sensor or a tool. Tries to make the problem observable.

The people who already think in eyeballs don't treat the model as the thing to optimize. They treat it as a strong function waiting for an honest signal, and they go build the signal. That instinct is rare, slow to teach, and it predicts whether someone ships working agents or a pile of demos that fall over on contact with reality. They understand, in their bones, that the AGI was never going to be the model.

The operative question

When an agent is thrashing, the highest-leverage move is almost never a smarter prompt. It's a better sensor. The question that beats nearly every other debugging instinct is simply:

"What can the model not see right now?"

Six months of closing windows. One pasted line of garbage. The whole difference was sight.

Give the model eyeballs. And once you have — once execution is cheap and the bottleneck is your own imagination — the next question gets strange and wonderful: what is imagination, that a machine could be given eyeballs for it too? I think it's a search across a space with hidden ridges we don't yet know how to name. That's the next essay.

A four-part series on intelligence:

1. Give the Model Eyeballs (you are here)
2. Leylines
3. Distillation Attacks on the Universe
4. Beauty Is the Reward
