June 25, 2026

Eval as an Input, Not a Dashboard: Building Self-Healing LLM Systems

There's a quiet assumption baked into how most teams evaluate LLM applications: that an eval is something you look at. You build a test set, run it, get a score, put the score on a dashboard, and feel reassured or alarmed. The dashboard tells you that something regressed. A human still has to read it, work out why, and go fix it by hand. The eval measures; the human repairs.

The most interesting shift happening in agent engineering right now is to break that assumption entirely — to stop treating the eval as a report you read and start treating it as an input that rewrites the system. The eval's output doesn't go on a chart for a person to interpret. It goes straight back into the loop as the signal that drives automatic optimization.

Figure: "The eval becomes an input, not a dashboard." Run traces flow to a judge (verdicts → fix suggestions → a human gate that feeds an updated setup back to the agent) and, in parallel, to "dreamer" agents that write to memory the agent draws on. The single human gate is the only step in the loop that isn't a model. (Redrawn from a talk on evaluating LLM agents.)

From scoreboard to control signal

A static test suite is honest but passive. It can tell you the pass rate dropped from 84% to 79%; it cannot tell you which prompt to edit, and it certainly can't edit it. So the value of a dashboard is capped by the human attached to it — someone has to notice, diagnose, and act, and that someone is the bottleneck.

Making the eval an input removes that ceiling. Instead of a number a person reads, the evaluation becomes structured feedback the system consumes: this run failed, here, for this reason, and here is the change that would fix it. The eval stops being the end of a measurement and becomes the start of an optimization.

The loop: trace, judge, fix, gate

The mechanism starts with what gets evaluated. You don't judge the agent's final answer alone — you capture its entire execution trace, the whole DAG of steps it took to get there: every retrieval, every tool call, every intermediate decision. That trace, not the output, is the unit of evaluation, because that's where the failure actually lives.
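A minimal sketch of what such a trace could look like as a data structure, in Python. The `Step` and `Trace` classes, their field names, and the `ancestors` helper are all hypothetical; the approach described here does not prescribe a schema. The one property the sketch enforces is that a step can only point at steps already recorded, which keeps the graph acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node in the execution trace (hypothetical schema)."""
    step_id: str
    kind: str          # e.g. "retrieval", "tool_call", "decision"
    inputs: dict
    output: object
    parents: list[str] = field(default_factory=list)  # ids of upstream steps


@dataclass
class Trace:
    """A whole run: every step keyed by id, plus the final answer."""
    run_id: str
    steps: dict[str, Step] = field(default_factory=dict)
    final_answer: str = ""

    def record(self, step: Step) -> None:
        # Ids are unique and parents must already exist, so no cycle can form.
        if step.step_id in self.steps:
            raise ValueError(f"duplicate step id: {step.step_id}")
        missing = [p for p in step.parents if p not in self.steps]
        if missing:
            raise ValueError(f"unknown parent steps: {missing}")
        self.steps[step.step_id] = step

    def ancestors(self, step_id: str) -> list[str]:
        """Every step that fed into `step_id`, nearest first."""
        seen: list[str] = []
        queue = list(self.steps[step_id].parents)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.steps[current].parents)
        return seen
```

Walking `ancestors` from a suspect step gives a judge the upstream retrievals and tool calls that fed it, which is exactly the information an eval of the final answer alone throws away.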

The trace flows to independent judge models that produce verdicts: not just "this run was bad," but "the error is at this node." Then a fix-suggester goes one step further than any dashboard ever could — it proposes a concrete change. A localized prompt edit, a tweaked instruction on the offending node, an adjustment aimed exactly where the verdict found the fault. The eval has gone from "something's wrong" to "here's the patch."
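To make the verdict-to-patch step concrete, here is a hedged sketch in Python. `Verdict`, `FixSuggestion`, and `suggest_fix` are illustrative names, not an established API, and the `rewrite` callable stands in for whatever fix-suggester model you would actually call. The point it shows is the scoping: the patch touches only the instruction of the node the verdict blamed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """What a judge returns for one run (hypothetical shape)."""
    run_id: str
    passed: bool
    faulty_step: Optional[str]  # node where the judge located the error
    reason: str


@dataclass
class FixSuggestion:
    """A proposed edit to one node's instruction; nothing is applied yet."""
    target_step: str
    old_instruction: str
    new_instruction: str
    rationale: str


def suggest_fix(
    verdict: Verdict,
    instructions: dict[str, str],
    rewrite: Callable[[str, str], str],
) -> Optional[FixSuggestion]:
    """Turn a failing verdict into a patch scoped to the faulty node only."""
    if verdict.passed or verdict.faulty_step is None:
        return None
    old = instructions[verdict.faulty_step]
    return FixSuggestion(
        target_step=verdict.faulty_step,
        old_instruction=old,
        new_instruction=rewrite(old, verdict.reason),  # a model call in practice
        rationale=verdict.reason,
    )


# Stand-in for the fix-suggester model: appends the judge's reason as a constraint.
def stub_rewrite(old: str, reason: str) -> str:
    return f"{old} Constraint: {reason}"
```

`suggest_fix` returns a proposal and leaves `instructions` untouched, so nothing changes in the running system until something downstream decides to apply it.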

And then, the one deliberate piece of friction: a human gate. A person reviews the proposed fix and approves or rejects it. Approved, it becomes the updated setup and feeds back into the agent. The loop closes — run, trace, judge, suggest, approve, update, run again — and the system gets better without anyone hand-writing the fix.
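The gate itself can be very small. The sketch below is one possible shape, assuming the agent's setup is a plain mapping from node name to instruction; `ProposedFix`, `gate`, and the `approve` callback are hypothetical names, and in practice `approve` would be a review UI or ticket rather than a function call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedFix:
    """A model-written patch waiting for review (hypothetical shape)."""
    target_step: str
    new_instruction: str
    rationale: str


def gate(
    setup: dict[str, str],
    fix: ProposedFix,
    approve: Callable[[ProposedFix], bool],
) -> tuple[dict[str, str], bool]:
    """Return (setup, approved). The setup changes only if the reviewer says yes."""
    if not approve(fix):
        return setup, False
    updated = dict(setup)  # copy, so the previous setup stays available for rollback
    updated[fix.target_step] = fix.new_instruction
    return updated, True
```

Returning a new mapping instead of mutating the old one is a design choice of this sketch: a rejected or later-regretted fix leaves the prior setup intact.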

Why it can't grade its own homework

The tempting shortcut is to skip the independent judges and have the agent reflect on its own run. It sounds elegant and it doesn't work. An agent reviewing its own trace shares the exact blind spots that produced the error in the first place. Its flawed reasoning looks correct from the inside — that's what a logical hallucination is — so self-reflection sails right past the very mistakes you need it to catch.

This is why the traces have to go to external models with fresh context. The doer and the critic must be separate minds, because the critic's whole job is to see what the doer couldn't. It's the same reason a query bot needs a dedicated judge rather than trusting its own first answer — taken all the way up to the level of the entire execution trace, and pushed past judging into proposing the fix.

"It's AI all the way down" — except for one gate

Look closely at this loop and something striking emerges: almost none of it is human. The agent is a model. The judge is a model. The judge's criteria are model-written. The fix suggestions are model-written. The single human approval gate is the only thing in the entire loop that isn't a model.

That one gate is not a formality; it's the load-bearing safety mechanism. An all-AI optimization loop with no human in it can drift, overfit to its own judge's quirks, or poison itself: a plausible-but-wrong "fix" gets applied, changes behavior, and the next round of model-written evaluation now grades against a corrupted baseline. The human gate is the circuit breaker against that compounding failure, and it's where accountability lives: a person signs off on every change that reaches production.

Notice what this does to the human's role. They're no longer writing test cases or staring at dashboards. They're reviewing proposed fixes and holding the gate. The bottleneck moves from authoring to approving — a far higher-leverage place for human judgment to stand, and one where a single person can oversee an optimization loop that would have taken a team to run by hand.

The dreamer branch: offline analysis into memory

The judge-and-fix loop is only the top half. Running in parallel, on a slower offline cadence, is a second branch: "dreamer" agents that comb back through past interactions. They don't sit in the live request path; they work in the background, distilling what really happened, synthesizing new test scenarios out of real traffic, and writing their findings into a memory store. The main agent then reaches into that memory for context on future requests.

This is the offline half of the system — the part that runs when no one is waiting, turning yesterday's real conversations into tomorrow's test cases and tomorrow's context. It's the same separation we keep coming back to: an ahead-of-time, background module that prepares knowledge, feeding an online agent that consumes it. Here the background module isn't just precomputing schema — it's dreaming up evals and growing the agent's memory from lived experience.
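As a rough illustration of the dreamer branch, the sketch below runs one offline pass over logged interactions and writes two things into memory: a distilled note and a replayable test scenario. Everything here is an assumption made for the example, including the `MemoryStore` class, the log fields (`run_id`, `topic`, `user_input`), and the `summarize` callable, which stands in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryStore:
    """In-memory stand-in for whatever store the agent reads from."""
    notes: list[dict] = field(default_factory=list)
    test_cases: list[dict] = field(default_factory=list)

    def recall(self, topic: str) -> list[str]:
        # Called by the online agent when it needs context for a request.
        return [n["text"] for n in self.notes if n["topic"] == topic]


def dream(
    interactions: list[dict],
    memory: MemoryStore,
    summarize: Callable[[dict], str],
) -> None:
    """Offline pass over past interactions; never called from the request path."""
    for item in interactions:
        # Distill what happened into a note the agent can look up later.
        memory.notes.append({"topic": item["topic"], "text": summarize(item)})
        # Turn the real request into a scenario that future evals can replay.
        memory.test_cases.append(
            {"input": item["user_input"], "source_run": item["run_id"]}
        )
```

The online side only ever calls `recall`; the write path belongs entirely to the background job, which is the separation this section describes.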

Shadow deployments: real data, no consequences

One problem remains: to evaluate an agent against reality, you need it running on real production data — but you can't let a half-finished agent actually act on that data. The answer is a shadow deployment. The agent runs in production, on live data, strictly read-only. When it decides to take an action — to write, to send, to change something — middleware intercepts the call and fakes success, handing back a 200 OK for an operation it never really performed. The agent proceeds as though it acted; nothing actually happened.

What the engineers study, then, isn't the output — it's the agent's decision trajectory. You get the full realism of production traffic with none of the risk, evaluating how the agent reasons in real situations long before it's trusted to touch anything. It's the read-only safety principle turned into a development methodology: let it think against the real world, just don't let it act yet.
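The interception described above can be sketched as a thin wrapper around the agent's transport. This is an assumption-laden example: it classifies writes by HTTP verb, which only fits tools that are plain HTTP calls, and `ShadowMiddleware`, `WRITE_METHODS`, and the `send` signature are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Assumption: any of these verbs counts as a side effect.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass
class ShadowMiddleware:
    """Reads pass through to the real transport; writes are logged and faked."""
    send: Callable[[str, str, Any], dict]  # the real transport (hypothetical)
    intercepted: list[dict] = field(default_factory=list)

    def request(self, method: str, url: str, body: Any = None) -> dict:
        verb = method.upper()
        if verb in WRITE_METHODS:
            # Record what the agent tried to do, then report success
            # without ever calling the real system.
            self.intercepted.append({"method": verb, "url": url, "body": body})
            return {"status": 200, "body": {}}
        return self.send(verb, url, body)
```

The `intercepted` list is the part engineers would study: it is the sequence of actions the agent decided to take, captured without any of them having happened. The faked response deliberately carries no marker that it is fake, so the agent's subsequent decisions are the ones it would make after a real success.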

What changes when the eval is an input

Put it together and the picture inverts. The eval is no longer the thing you check at the end; it's the engine that improves the system. Judges read traces and pass verdicts. Fix-suggesters write patches. Dreamers mine the past into memory. Shadow deployments supply real-world pressure without real-world damage. And a single human stands at the one gate that keeps an otherwise fully automated loop honest.

"Self-healing" is a fair name for the result, as long as you remember where the healing actually comes from. It is not an agent magically debugging itself — we saw why that fails. It's a loop of independent models proposing changes, a memory that compounds, and a human who approves. The dashboard was never going to fix anything; it could only ever tell you something was broken. The leap is to make the evaluation the input that does the fixing — and to keep one human hand on the gate while it does.
