The anti-hallucination architecture: use LLMs for intent, deterministic engines for truth, and stop asking either one to be the other.

There's a disease spreading through the software industry right now, and it's killing products before they ship.
The symptom: developers building AI-powered tools that ask the language model to do everything. Parse the intent. Query the database. Compute the answer. Format the output. Validate the result. All one model, all one prompt, all one prayer to the machine learning gods that today's inference doesn't hallucinate your customer's financial data.
I've been building AI-native enterprise software across multiple product lines simultaneously, and I've arrived at a principle I want to share. It's not subtle. It's not nuanced. It's this:
Use the LLM for what it's good at. Use deterministic tools for what they're good at. And stop asking either one to be the other.
We're not fucking around asking how many R's are in strawberry. We're building real systems.
Language models are extraordinary at understanding intent. "Move the heavy stuff to someone with capacity." "What if we dropped everything below critical priority?" "Set the market projection to forty-five billion." A human said something fuzzy, and the model understood what they meant. That's the magic. That's the miracle of the technology.
But then builders make the mistake: they ask that same model to compute the answer.
"Based on the current schedule, recalculate the makespan after reassignment." No. Stop. That's what a scheduler does. That's what a DAG engine does. That's what a solver does. You have real tools — topological sort, constraint satisfaction, optimization algorithms — that will give you a correct answer every single time. Why are you asking a stochastic text predictor to do arithmetic?
The architecture I've converged on looks like this:
Natural Language
→ LLM: resolve intent
→ Engine: compute truth
→ Answer
The LLM translates "move the heavy stuff to someone with capacity" into structured operations: which items, which person, what constraints. Then a deterministic engine — a real computation engine with real algorithms — computes the actual result. Every number traceable. Every source cited. Zero hallucination.
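What "structured operations" means in practice is that the model's only output is data an engine can check and execute. A minimal sketch, with a stubbed-out intent resolver standing in for the LLM call; the schema and field names are illustrative, not a spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReassignOp:
    item_ids: tuple[str, ...]   # "the heavy stuff", resolved to concrete items
    assignee: str               # "someone with capacity", resolved to a person
    max_load_hours: float       # constraint the scheduler must respect

def resolve_intent(utterance: str) -> ReassignOp:
    """In production this is an LLM call constrained to the ReassignOp schema;
    stubbed here so the example runs without a model."""
    return ReassignOp(("task-17", "task-23"), "dana", 32.0)

op = resolve_intent("move the heavy stuff to someone with capacity")
# op is data, not an answer: a deterministic scheduler now recomputes the plan
# under op.max_load_hours and reports the new makespan with full traceability.
print(op)
```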
Here's a concrete example of the difference.
The wrong way:
User: "What's the projected market size for 2026?"
LLM: "Based on my training data, approximately $580 billion."
Where did that number come from? Training data from when? What methodology? What sources? You have no idea. Your customer has no idea. And if they make a business decision on that number, you're liable for a hallucination.
The right way:
User: "What's the projected market size for 2026?"
LLM: calls query tool with pattern matching
Engine: reads cells, applies computation rules, returns result with provenance
LLM: "The projected market size for 2026 is $612.4B, based on third-party data from [source] with an 8.2% growth assumption by [analyst] on [date]."
Every number is computed from ground truth. Every source is tracked with provenance — did this come from external data, an analyst's assumption, or a formula? The LLM never produced a number. It produced understanding, and then called tools that produced truth.
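Here's what that answer looks like as computation rather than recall. A toy sketch with made-up cell values, chosen only so the arithmetic lands on the $612.4B figure above; the point is that the number and its sources come out of the same deterministic function:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    value: float
    source: str   # external data, an analyst's assumption, or a formula

# Ground truth the engine reads -- illustrative values, not real market data.
cells = {
    "market_2025": Cell(566.0, "third-party dataset, 2025 baseline"),
    "growth_2026": Cell(0.082, "analyst growth assumption, dated"),
}

def project(base_key: str, growth_key: str) -> tuple[float, list[str]]:
    """Deterministic projection: base * (1 + growth), with every input's provenance."""
    base, growth = cells[base_key], cells[growth_key]
    return round(base.value * (1 + growth.value), 1), [base.source, growth.source]

value, sources = project("market_2025", "growth_2026")
print(value, sources)  # 612.4 plus the sources the answer is built from
```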
I wrote previously about the Pull-Work-Push paradigm — the idea that the cloud inverts from being where work happens to being where finished work is published. The same principle applies at the architectural level inside AI-native applications: the agent pulls current truth, does its work as a structured proposal, and pushes the result back only after review.
The atomic artifact that flows through this pipeline is what I call a changeset — a structured description of what should change, computed from what's true, triggered by what the human meant. Every interface (chat, CLI, API, autonomous agent) produces changesets. The same backend consumes them. Same preview. Same provenance. Same undo.
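As a data structure, a changeset doesn't need to be exotic. A minimal sketch (field names are mine, not a spec): it captures the intent verbatim, the before/after of every touched value, and where it came from, which is enough to give every interface the same preview, provenance, and undo.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    path: str        # what changes, e.g. "tasks/task-17/assignee"
    before: object   # current truth -- enables undo and stale-state detection
    after: object    # proposed value, computed by an engine

@dataclass(frozen=True)
class Changeset:
    intent: str                  # what the human meant, verbatim
    changes: tuple[Change, ...]  # what should change
    provenance: str              # which interface or agent produced it

    def inverse(self) -> "Changeset":
        """Undo is the same changeset with before and after swapped."""
        flipped = tuple(Change(c.path, c.after, c.before) for c in self.changes)
        return Changeset(f"undo: {self.intent}", flipped, self.provenance)

cs = Changeset(
    intent="move the heavy stuff to someone with capacity",
    changes=(Change("tasks/task-17/assignee", "sam", "dana"),),
    provenance="chat",
)
print(cs.inverse().changes[0].after)  # "sam" -- back to the original state
```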
After building this across multiple product lines, I've found that the ideal MCP (Model Context Protocol) surface for any AI-native domain is exactly three tools:
query → read the current state of truth, computed by the engines, with provenance
propose → simulate a change and return a previewable changeset, without mutating anything
apply → commit an approved changeset, recording its inverse for undo
That's it. Not fifteen CRUD operations. Not a tool per database table. Three intent-level tools that let the LLM do what it does (understand language) and the engines do what they do (compute reality).
The "propose" step is the critical insight. It means every mutation is a simulation first. The user sees what would change before it changes. The AI never modifies state without permission. And the preview itself is computed by deterministic engines, so it's trustworthy.
Here's something I haven't seen anyone else articulate, and I think it matters.
AI coding assistants are the most successful agentic AI products on the planet right now. Not chatbots. Not image generators. Not search engines. Coding agents. Why?
The conventional answer is "code is well-structured, so it's easy for AI." That's wrong. Code is one of the hardest things to get right — a single misplaced character can crash a system. The real answer is simpler and more profound:
Software development already had the Pull-Work-Push infrastructure.
Think about what git actually is. It's a changeset system. Every commit is a changeset — a structured diff of what changed, authored by whom, with a message explaining why. Every branch is a proposal workspace. Every pull request is a propose step — "here's what I want to change, review it before it goes live." Every code review is the human-in-the-loop approval. Every CI pipeline is automated validation of the proposed changeset.
git pull → PULL (get current truth)
write code → WORK (create changes)
git diff → PROPOSE (preview the diff)
code review → APPROVE (human validates)
git push → PUSH (apply to shared state)
Developers have been doing Query-Propose-Apply for twenty years. They just called it "version control" and thought it was specific to code. It's not. It's the universal pattern for how intelligence — human or artificial — should interact with any structured system.
When an AI coding agent reads your codebase, proposes an edit, shows you the diff, and waits for approval before writing — that's not a feature of the AI. That's the AI plugging into infrastructure that already existed. The discipline was already there. The changeset pipeline was already there. The review process was already there.
And this is why most "AI for X" products outside of software development feel clunky. They're trying to build agentic AI for domains that don't have git. Finance doesn't have version-controlled changesets. Project management doesn't have pull requests. Document editing doesn't have diffs and approvals.
So what do you do? You build it. You give every domain the same disciplined changeset pipeline that made software development ready for AI agents. Structured proposals. Preview before apply. Provenance tracking. Undo via inverse changeset. Human-in-the-loop approval at every mutation.
That's what I'm building. Not AI wrappers around existing tools. The infrastructure that makes AI agents trustworthy in domains that never had version control.
Most "AI-powered" enterprise products I see are making the same mistake: they're wrapping a language model around a CRUD database and calling it intelligence. The LLM generates SQL. The LLM formats the chart. The LLM summarizes the data. And when it hallucinates — and it will — the entire system is compromised because there's no layer of computational truth between the model and the user.
The companies that win in AI-native enterprise software will be the ones that understand the split: language models for intent, engines for truth. The LLM is the interface. The engine is the brain. And the changeset is the contract between them.
Every number auditable. Every source provenance-tagged. Every mutation previewed before committed.
That's not an AI wrapper. That's an architecture.
Build systems where the AI understands what you mean, and the math proves what's true. And give every domain the infrastructure that made software ready for AI: the disciplined changeset pipeline. Stop asking your AI to count.