June 13, 2026

Lazy Resolution and Right-Sized Models: Spend Intelligence Where It Counts

Watch where a data agent actually spends its time and money, and a pattern jumps out: the overwhelming majority of its work is not hard. It's looking things up. Matching "female" to a code value. Classifying whether a message is a real question or just hello. Pulling "Prague" out of a sentence. These are not feats of reasoning — they're clerical. And yet the naive design routes every one of them through the same large, slow, expensive model it uses for the genuinely difficult work.

That's the waste this article is about, and it has two fixes that work beautifully together: defer the lookups, and right-size the model for each task.

The context-flooding trap

Start with a concrete failure. The agent needs to write a query filtering on a coded column — say, an industry classification with over a thousand possible values. The instinct is to give the generation model everything it might need, so the system dutifully loads all thousand codes into the prompt and asks it to write the query.

Now the prompt is almost entirely classification codes. The one fact that matters, which code means what the user asked for, is buried in noise, and the model, asked to find a needle in a haystack it was handed, loses the thread. We've named this before: hallucination via distractors. The irony is sharp. You flooded the context to be helpful, and the flood is exactly what broke the answer.

The deeper problem is that you've conflated two different jobs — writing the query and resolving the value — and forced one model to do both at once, badly.

Lazy resolution: write the query, defer the value

Split the jobs. Let the generation model write the query structure and, wherever it needs a coded value, emit a placeholder instead of guessing an ID. Something legible like @ENUM_PERSON_GENDER_FEMALE — "here goes the gender code that means female, fill it in later."

The generation model never sees the thousand codes. It does the thing it's good at — structure — and punts the lookup downstream. Then a separate, tiny resolution step takes over:

1. It spots the placeholder and reads off the table and column.
2. It fetches the complete list of values for just that one column: all of them, because now precision matters and the list is small.
3. It hands a small model one focused question: "Here is the concept 'female' and here are the valid values: 1 = male, 2 = female. Which ID?"
4. It gets back 2, substitutes it into the query, and moves on.
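
The resolution steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The catalog dictionary, the function names, and the assumption that table and column names are single words without underscores are all hypothetical. The small model is replaced by a plain label-matching function so the example runs on its own.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical catalog: (table, column) -> {code: label}. A real system would
# fetch this from the database for just the one column being resolved.
Catalog = Dict[Tuple[str, str], Dict[int, str]]

CATALOG: Catalog = {
    ("PERSON", "GENDER"): {1: "male", 2: "female"},
}

# Assumed placeholder shape: @ENUM_<TABLE>_<COLUMN>_<CONCEPT>, where table and
# column are single words and the concept may contain underscores.
PLACEHOLDER = re.compile(r"@ENUM_([A-Z0-9]+)_([A-Z0-9]+)_([A-Z0-9_]+)")


def exact_label_matcher(concept: str, values: Dict[int, str]) -> int:
    """Stand-in for the small model: return the code whose label equals the concept."""
    wanted = concept.replace("_", " ").lower()
    for code, label in values.items():
        if label.lower() == wanted:
            return code
    raise LookupError(f"no value matches concept {concept!r}")


def resolve_placeholders(
    sql: str,
    catalog: Catalog,
    matcher: Callable[[str, Dict[int, str]], int] = exact_label_matcher,
) -> str:
    """Replace every placeholder in the query with the code the matcher picks."""

    def substitute(match: "re.Match[str]") -> str:
        table, column, concept = match.groups()
        values = catalog[(table, column)]  # full value list for one column only
        return str(matcher(concept, values))

    return PLACEHOLDER.sub(substitute, sql)


draft = "SELECT COUNT(*) FROM person WHERE gender = @ENUM_PERSON_GENDER_FEMALE"
resolved = resolve_placeholders(draft, CATALOG)
print(resolved)
```

Because the matcher works from the label rather than a stored ID, passing a catalog in which female has been renumbered to 5 yields a query that filters on 5, with no change to the draft query.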

Three things make this better than the flooded approach. First, there's no context flooding: the big model never drowns in codes. Second, there's precision: the resolver sees the full list for one column and concentrates on exactly one mapping, a task narrow enough that even a small model rarely gets it wrong. Third, there's robustness you get almost for free. Because the resolver matches on meaning, not a hardcoded ID, the day someone renumbers the database so female becomes code 5, the system just finds it. It was never relying on the number. It was relying on the word.

Right-sized models: stop hiring a professor to add 1 + 1

Lazy resolution sets up the second idea, because that little resolution step is the perfect example of work that doesn't need a powerful model. Matching a concept to a short list is low-intelligence work. So don't spend high-intelligence money on it.

The principle: route each subtask to the cheapest model that can do it reliably. Intent classification, entity extraction, value matching, simple yes/no checks — these go to fast, inexpensive models. The heavy reasoning — planning a multi-step query, diagnosing a subtle failure — stays with the expensive one. You wouldn't call a structural engineer to change a lightbulb; you'd send an apprentice with clear instructions. Same logic, applied to a token budget.
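
One way to picture the routing rule is a small lookup that sends known clerical tasks to a cheap tier and everything else to the expensive one. The task names and the two-tier split below are assumptions made for illustration. A real pipeline might have more tiers and would base the list on measured reliability.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # fast, inexpensive model
    LARGE = "large"  # slow, expensive model for heavy reasoning


# Hypothetical task names for the clerical work named in the text.
CHEAP_TASKS = {
    "intent_classification",
    "entity_extraction",
    "value_matching",
    "yes_no_check",
}


def pick_tier(task: str) -> Tier:
    """Send known clerical tasks to the small tier; default to the large one."""
    return Tier.SMALL if task in CHEAP_TASKS else Tier.LARGE


print(pick_tier("value_matching"))  # Tier.SMALL
print(pick_tier("query_planning"))  # Tier.LARGE
```

Defaulting unknown tasks to the large tier is a deliberate choice in this sketch: a task only moves to the cheap model once someone has decided it belongs there.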

The thing that makes this safe — and it's the part teams skip — is structured outputs. A cheap model left to write prose will ramble and err. The same cheap model constrained to return a single value from a fixed set, or a small JSON object with named fields, is suddenly reliable, because you've removed its room to wander. It's not composing an answer; it's filling in a form. Constrain the shape and a small model will match values, classify intents, and extract entities at a quality that holds up — at a fraction of the latency and cost.
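
A tight output contract can be as simple as "return a JSON object with one field, drawn from a fixed set" plus a check that rejects anything else. The sketch below assumes a hypothetical small_model callable that takes a prompt string and returns the raw model text. The intent labels and the prompt wording are also illustrative.

```python
import json
from typing import Callable

# Illustrative fixed set of answers the small model is allowed to return.
ALLOWED_INTENTS = ("question", "greeting", "other")


def classify_intent(message: str, small_model: Callable[[str], str]) -> str:
    """Ask the small model for one intent label and reject anything off-contract."""
    prompt = (
        'Return only a JSON object of the form {"intent": "<label>"}, '
        f"where <label> is one of {list(ALLOWED_INTENTS)}.\n"
        f"Message: {message}"
    )
    raw = small_model(prompt)
    try:
        intent = json.loads(raw)["intent"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"output is not the expected JSON object: {raw!r}") from exc
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent {intent!r} is outside the allowed set")
    return intent


# A canned stand-in for the model, so the example runs without any API.
def fake_model(prompt: str) -> str:
    return '{"intent": "greeting"}'


print(classify_intent("hello there", fake_model))
```

If the model rambles instead of filling in the form, the call raises ValueError, which gives the caller a clean point to retry or escalate to a larger model.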

This is offloading in the truest sense: take the large volume of trivial decisions, push them to cheap models with tight output contracts, and free your best model to think about the one or two things that genuinely require thought.

Fork, delegate, join

The same instinct scales up from "which model" to "how many at once." When an agent hits a stage with several independent sub-questions, it doesn't have to work through them in a line. It can fork — much like a process forks on Linux — into a set of sub-agents, hand each a small task, and run them all at the same time.

One sub-agent verifies that a particular join is valid. Another resolves a value against a lookup. A third counts rows to sanity-check a table. They work in parallel; the main agent waits at a checkpoint — a join in the threading sense — until they all report back, then integrates the results and continues. If a sub-task is itself complex, it can fork its own sub-agents, to a sensible depth.
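
The fork and join described above map naturally onto a thread pool, since sub-agent calls mostly wait on a model or a database. In this sketch the three sub-agents are hypothetical stubs that return fixed values. In a real agent each one would make its own model or database call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


# Hypothetical sub-agents, stubbed with fixed answers so the example runs alone.
def verify_join() -> bool:
    return True  # e.g. "the join key exists on both tables"


def resolve_value() -> int:
    return 2  # e.g. the code that means "female"


def count_rows() -> int:
    return 1000  # e.g. a sanity-check row count


def run_stage(tasks: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Fork one worker per sub-task, then join by waiting for every result."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # Fork: all sub-tasks are submitted before any result is awaited.
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        # Join: result() blocks until that sub-task has finished.
        return {name: future.result() for name, future in futures.items()}


results = run_stage(
    {"join_ok": verify_join, "gender_code": resolve_value, "row_count": count_rows}
)
print(results)
```

The main agent only continues once the dictionary is complete, which is the checkpoint the text calls a join.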

Fork/join buys you two things at once. Speed, because independent work happens concurrently instead of in sequence. And cost, because each small, well-scoped sub-task is exactly the kind of low-intelligence job you can hand to a cheap model with a structured output. Parallelism and right-sizing aren't two separate optimizations — they're the same idea, applied to time and to money.

A word on "how long does it take?"

This architecture reframes a question clients always ask: how fast is it? The honest answer is that processing time isn't a fixed number — it's a function of two things: the size of the schema and the complexity of the question. A trivial question against a thousand-table database is still fast, because precomputation means the agent isn't trawling the whole schema, and parallelism means independent work collapses into the time of its slowest branch. A genuinely complex, multi-part question takes longer because there's genuinely more to do.
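
The claim that independent work collapses into the time of its slowest branch is simple arithmetic. The branch durations below are made-up numbers chosen only to show the calculation. They are not measurements from any system.

```python
# Illustrative durations in milliseconds for three independent branches.
branch_ms = {"verify_join": 400, "resolve_value": 200, "count_rows": 900}

sequential_ms = sum(branch_ms.values())  # one after another
parallel_ms = max(branch_ms.values())    # all at once: the slowest branch wins

print(sequential_ms, parallel_ms)
```

With these numbers the stage takes 1500 ms in sequence and 900 ms in parallel, ignoring any overhead from starting and joining the branches.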

That's not a dodge; it's the truth about the system, and it's worth stating plainly. What you can promise is that the time is spent where it should be — on the hard part — and that the easy ninety percent has been made fast and cheap by deferring what can wait, shrinking what can be shrunk, and parallelizing what can run side by side.

Paying frontier-model prices for clerical work, or watching latency balloon on big schemas? Right-sizing the pipeline is usually the fastest win we find. Let's look at where your agent is overspending its intelligence.