Your LLM is a generator, not a calculator

You asked the model something simple. Total these few numbers. Work out which of two dates is later. Check whether this value is inside the limit. It answered instantly, with total confidence — and it was wrong.

The instinct is to think the model isn't smart enough yet, and wait for the next version. That instinct is the bug. A smarter model will not fix this. You handed a generator a calculator's job.

An LLM does exactly one thing

An LLM generates. That is the whole tool. Every output it produces, whether a sentence, a decision, a function, or a yes-or-no, is generated: assembled, plausibly, from patterns. It is not computed. Ask the same question twice and you can get two different answers, because each output is sampled from a probability distribution rather than calculated. That isn't a defect waiting on a patch. It is the definition of the thing.

So "the LLM makes a decision" and "the LLM writes code" are not two different talents. They're the same act — generation — aimed at different outputs. There is one job inside an LLM, and it is a generative one.
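The difference between generating and computing is easy to see in miniature. The sketch below is a toy, not a model of any real LLM: the candidate answers and their probabilities are invented for illustration, and generate and compute are hypothetical names.

```python
import random

# Toy answer distribution. The probabilities are invented for
# illustration; they are not measurements of any real model.
ANSWERS = ["3780", "3770", "3880"]
WEIGHTS = [0.90, 0.07, 0.03]


def generate(rng: random.Random) -> str:
    """Sample one plausible answer from the distribution."""
    return rng.choices(ANSWERS, weights=WEIGHTS, k=1)[0]


def compute() -> str:
    """Run the operation itself; there is nothing to sample."""
    return str(847 + 2933)


rng = random.Random(0)
generated = [generate(rng) for _ in range(1000)]
computed = [compute() for _ in range(1000)]

print(sorted(set(generated)))  # the distinct answers the sampler produced
print(sorted(set(computed)))   # the single answer the computation produced
```

The sampler is right most of the time, which is exactly what makes the occasional wrong draw hard to notice.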

Which means it cannot compute

Here's the consequence people step right past. A generator cannot do deterministic work. Not "does it poorly" — cannot, the way a poet cannot be a calculator.

Addition has exactly one correct answer; an LLM produces a likely one. Comparing two dates, reading a clock, checking a number against a bound — every one of those is a deterministic operation, and every one is something an LLM will answer fluently and sometimes wrongly, because it is generating a plausible response, not running the operation.
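For contrast, here is roughly what those same operations look like when they are actually run. This is a minimal sketch using only the Python standard library; the function names and sample values are made up for illustration.

```python
from datetime import date
from decimal import Decimal


def total(amounts: list[str]) -> Decimal:
    """Sum decimal strings exactly, with no floating-point drift."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


def later_of(a: date, b: date) -> date:
    """Return whichever date is later; equal dates return the first."""
    return a if a >= b else b


def within_limit(value: Decimal, limit: Decimal) -> bool:
    """True when value is at or below the limit."""
    return value <= limit


print(total(["19.99", "5.01", "100.00"]))                  # 125.00
print(later_of(date(2024, 3, 1), date(2024, 2, 29)))       # 2024-03-01
print(within_limit(Decimal("500.00"), Decimal("499.99")))  # False
```

Each call returns the same result on every run, and a malformed input raises an exception instead of producing a fluent guess.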

The trap is that plausible looks exactly like correct. The model is right often enough that you ship it. Then one ordinary day it isn't — and there was no error, no exception, no signal. Just a confidently generated wrong number, moving downstream.

When all you have is a hammer

None of this would matter if we didn't keep doing it. We do, for a plain reason: the LLM is the most exciting tool most of us have ever been handed, and a thrilling new tool makes every job look like its job. When all you have is a hammer, everything looks like a nail.

But you don't only have a hammer. You have the other tool — deterministic code. Seventy years of it. It is exact, instant, effectively free, and it returns the same answer every single time. There are two kinds of work in any real system, generative and deterministic. There are two tools. The entire skill is matching them.

The inverse case shows how obvious that normally is. You could produce English with plain code: a seeded pseudo-random character generator, deterministic down to the last keystroke. Run it long enough and, in principle, it eventually types a coherent sentence, and eventually the complete works of Shakespeare. Nobody does that. It's absurd on its face: a deterministic tool grinding away at a generative job. We would never. Yet the mirror image, a generative tool doing deterministic work, we ship every day and call it an AI feature.

Just because it can

The reason we make the mistake is seductive. The LLM can answer "what's 847 plus 2,933." It will often be right. It can probably tell you whether the tests passed by reading the log. It can usually remember the rule you wrote into the prompt.

Every "can" in that paragraph is the trap. That the tool can produce an answer doesn't mean it should be the thing producing it. You could ask a poet to total your invoice line items. They might even get it right. There is a calculator three inches away. Using an LLM for deterministic work isn't reaching for power — it's choosing a nondeterministic process for a question that already had a correct, instant, free answer waiting in the other tool.

The deterministic job we hand over without noticing

Arithmetic is the obvious case. Here is the pervasive one.

Look at a real agent's system prompt, past the persona and the task. It is a rulebook. "Always check the ticket exists before deploying. Never refund above the limit without approval. Retry a failed step twice, then stop."

Enforcing a rule is deterministic work. A rule is a guarantee — it holds every time, or it was never a rule. Writing it into a prompt hands that guarantee to a nondeterministic generator, and a generator cannot deliver a guarantee. It can only generate behavior that probably matches the rule. A guardrail that holds "probably" is not a guardrail. It's a hope with good formatting.
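Two of the rules quoted above can be sketched as code instead of prose. This is an illustration under stated assumptions, not anyone's production guardrail: REFUND_LIMIT, RuleViolation, refund, and run_with_retries are hypothetical names, and the limit value is invented.

```python
from decimal import Decimal
from typing import Callable

REFUND_LIMIT = Decimal("100.00")  # hypothetical limit, for illustration only
MAX_RETRIES = 2                   # "retry a failed step twice, then stop"


class RuleViolation(Exception):
    """Raised when a rule would be broken; the action never runs."""


def refund(amount: Decimal, approved: bool) -> str:
    # The rule is a branch in code, so it holds on every call.
    if amount > REFUND_LIMIT and not approved:
        raise RuleViolation(f"refund of {amount} needs approval")
    return f"refunded {amount}"


def run_with_retries(step: Callable[[], bool]) -> int:
    """Run step once, retry up to MAX_RETRIES times, return attempts used."""
    for attempt in range(1, MAX_RETRIES + 2):
        if step():
            return attempt
    raise RuleViolation(f"step failed after {MAX_RETRIES} retries")


print(refund(Decimal("40.00"), approved=False))  # refunded 40.00
```

An over-limit refund without approval cannot get through here: the exception is raised before anything is refunded, every time.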

That is the same category error as the arithmetic — a deterministic job given to a generator — scaled up to the part of the system whose entire purpose was to keep you safe.

Give each job to its tool

The fix is not a better model. It's assigning each piece of work to the tool built for it — and mcp-flowgate makes that assignment explicit, one step at a time.

Every transition in a workflow is tagged with an actor. actor: deterministic means this step is computation: the runtime performs it, no model involved. actor: agent means this step is generative — a real judgment, handed to the LLM. The tag forces the question to be answered out loud, for every step in the system: is this work generative, or deterministic?

ship.yaml
states:
  test:
    transitions:
      run_tests:
        target: ready_to_ship
        actor: deterministic     # computation — the runtime runs it
        executor: { kind: cli, command: test-runner }
  ready_to_ship:
    transitions:
      ship:
        target: shipped
        actor: agent             # judgment — the LLM decides

And the deterministic machinery runs everywhere else the model would otherwise be quietly drafted in:

  • A guard checks a precondition — an expr comparison, or evidence that a step happened — by computing it. You never ask the model "do you think the tests passed?" A guard checks that the tests_passed evidence exists.
  • branches route on a real outcome. The CLI executor captures an exit code as data (treatNonZeroAsFailure: false) and the runtime picks the path — the model doesn't read a log and form an opinion.
  • inputSchema validates input. Deterministically. The model is never asked whether the arguments look acceptable.
  • The state machine tracks where the workflow is, with a version counter. The model doesn't remember its place — the runtime knows it.
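To make the four mechanisms above concrete, here is a toy runtime in Python. It is a sketch of the general idea only, not mcp-flowgate's implementation or API: Workflow, GuardError, TRANSITIONS, and fire are hypothetical names, and the state names are borrowed from ship.yaml.

```python
from dataclasses import dataclass, field
from typing import Optional


class GuardError(Exception):
    """A precondition failed, so the transition did not happen."""


@dataclass
class Workflow:
    state: str = "test"
    version: int = 0  # bumped on every transition
    evidence: set = field(default_factory=set)


# transition -> (source state, target state, evidence required by the guard)
TRANSITIONS = {
    "run_tests": ("test", "ready_to_ship", None),
    "ship": ("ready_to_ship", "shipped", "tests_passed"),
}


def fire(wf: Workflow, name: str, exit_code: Optional[int] = None) -> str:
    source, target, required = TRANSITIONS[name]
    # State tracking: the runtime knows where the workflow is.
    if wf.state != source:
        raise GuardError(f"{name} is not reachable from {wf.state}")
    # Guard: check that the evidence exists; nobody is asked for an opinion.
    if required is not None and required not in wf.evidence:
        raise GuardError(f"missing evidence: {required}")
    if name == "run_tests":
        # Input validation: a type check on the argument.
        if not isinstance(exit_code, int) or isinstance(exit_code, bool):
            raise GuardError("exit_code must be an integer")
        if exit_code != 0:
            target = source  # branch: a failed run stays in the test state
        else:
            wf.evidence.add("tests_passed")
    wf.state = target
    wf.version += 1
    return wf.state


wf = Workflow()
fire(wf, "run_tests", exit_code=1)  # stays in "test"
fire(wf, "run_tests", exit_code=0)  # moves to "ready_to_ship"
fire(wf, "ship")                    # moves to "shipped"
print(wf.state, wf.version)         # shipped 3
```

Calling fire(Workflow(), "ship") raises GuardError, because the ship transition does not exist in the test state.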

Every one of those is deterministic work, done by code, returning the same answer every time. What's left for the LLM is the generative work: which capability to reach for, what to write, and — given results it can now actually trust — whether to proceed.

Then both shine

Match the tools to the work and each does what it is great at. The code is exact. The LLM is extraordinary. The system is reliable because the parts that are supposed to be reliable actually are.

Mismatch them and you get the worst of both. Hand the LLM deterministic work and it looks unreliable — so you blame the model, and wait for a smarter one, and the smarter one is still a generator and still gets it wrong. The model was never the problem. The job assignment was.

A deploy workflow makes the split concrete. Running the tests is deterministic: a cli executor, the exit code captured as data. Deciding whether to ship, given a clean run and a diff that looks riskier than usual, is generative: a genuine judgment, actor: agent, the LLM's actual job. And "tests must pass before deploy" is enforced by the shape of the workflow itself: the deploy transition simply isn't reachable until the workflow has moved past the test state. It is not a sentence in a prompt the model has to remember to honor.
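"The exit code captured as data" might look something like this in plain Python. It is a sketch of the idea using the standard subprocess module, not a description of how mcp-flowgate's cli executor is built; run_tests and next_state are hypothetical names, and two tiny Python processes stand in for a real test runner.

```python
import subprocess
import sys


def run_tests(command: list[str]) -> int:
    """Run the test command and return its exit code as plain data."""
    # check=False: a non-zero exit is a result to route on, not an exception.
    completed = subprocess.run(command, check=False, capture_output=True)
    return completed.returncode


def next_state(exit_code: int) -> str:
    """Route on the exit code; no log is read and no opinion is formed."""
    return "ready_to_ship" if exit_code == 0 else "test"


# Stand-ins for a real test runner: processes that exit with 0 or 3.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]

print(next_state(run_tests(passing)))  # ready_to_ship
print(next_state(run_tests(failing)))  # test
```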

Contrast the version we reach for by default: "run the tests, tell me if they passed, and deploy if so." One model turn, asked to execute a deterministic operation, evaluate a deterministic result, honor a deterministic rule, and make one generative decision — all at once, all by the single tool that's only good at the last of the four.

The line isn't always clean

Real work doesn't always sort itself into two tidy bins. A task usually has a generative core wrapped in deterministic structure: deciding what to deploy is generative; checking the tests passed first is deterministic; and they live inside the same story. The skill isn't sorting whole tasks. It's decomposing a task into its pieces and routing each piece to its tool — rather than handing the whole blob to whatever tool you happened to pick up first.
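One way to picture that decomposition is a deterministic shell around a single generative call. The sketch below assumes a hypothetical ask_model callable that takes a prompt and returns text; decide_deploy and the "ship"/"hold" vocabulary are likewise invented for illustration.

```python
from typing import Callable


def decide_deploy(
    tests_exit_code: int,
    diff_summary: str,
    ask_model: Callable[[str], str],
) -> str:
    """Deterministic shell around one generative call."""
    # Deterministic piece: the precondition is computed, never asked.
    if tests_exit_code != 0:
        return "blocked: tests failed"
    # Generative piece: the judgment about risk goes to the model.
    verdict = ask_model(f"Tests passed. Is this diff safe to ship?\n{diff_summary}")
    # Deterministic piece again: only two exact replies are accepted.
    normalized = verdict.strip().lower()
    if normalized not in {"ship", "hold"}:
        return "blocked: unrecognized verdict"
    return normalized


# A stub stands in for the model so the sketch runs on its own.
print(decide_deploy(0, "renames one config key", lambda prompt: "Ship"))  # ship
print(decide_deploy(1, "renames one config key", lambda prompt: "Ship"))  # blocked: tests failed
```

The model is only consulted for the one question that is genuinely a judgment, and it is never consulted at all when the tests failed.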

And yes, there's a cost. Today, it is faster to write one more sentence in the prompt than to model a step properly. But the sentence was never enforcing anything. You were borrowing reliability you did not actually have.

Let the calculator calculate

When your agent flubs something deterministic, don't file a ticket for a smarter model. A smarter generator is still a generator, and it will still, eventually, generate the wrong number with total confidence.

Reach for the other tool — the boring, exact one you've had all along — and give the LLM only the work it is, genuinely, the best tool in the world for. Let the calculator calculate. Let the generator generate.

For the mechanism that runs the deterministic steps without ever waking the model, read Deterministic chaining; the workflows guide shows how the actor split is wired in practice.