Notes on production AI — context engineering, agentic orchestration, and what it takes to ship bots that do not hallucinate their way into trouble. Written by someone who has done it at scale.
Most teams treat evaluation as a scoreboard — a number you glance at and feel good or bad about. The frontier idea is to wire the eval back into the system as an input that rewrites it: traces go to independent judges, fixes get proposed automatically, and one human approves. It's AI all the way down, with a single gate that isn't.
Before a bot can answer well, it has to know what kind of question it's facing. Skip that step and chitchat gets a database query, 'last month' goes to SQL untranslated, and every request runs the same dumb path. The fix is a cheap, decisive first move: classify the intent, extract the entities, resolve the time.
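A minimal sketch of that first move, covering only intent classification and time resolution. The keyword list, the function names, and the two intent labels are illustrative assumptions; in practice a small classifier model would likely replace the keyword rule.

```python
from datetime import date, timedelta

# Hypothetical keyword list standing in for a cheap classifier model.
DATA_WORDS = ("how many", "list", "total", "average")


def resolve_last_month(today: date) -> tuple[date, date]:
    """Return the first and last day of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), last_of_previous


def first_pass(question: str, today: date) -> dict:
    """Classify the intent and resolve relative time before any SQL is written."""
    text = question.lower()
    intent = "data_query" if any(w in text for w in DATA_WORDS) else "chitchat"
    time_range = resolve_last_month(today) if "last month" in text else None
    return {"intent": intent, "time_range": time_range}
```

With this in place, 'last month' reaches the query layer as two concrete dates, and a greeting never reaches it at all.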
Your knowledge base is enormous and the model's context window is tiny. The trick isn't a bigger window — it's an hourglass: compress everything ahead of time, then expand only what one question actually needs.
You compressed your knowledge into clean indexes. Now comes the moment everything hinges on: given a question, can the bot actually find the handful of tables it needs among a thousand? Miss the right one and nothing downstream can save you — so at the retrieval step, recall beats precision.
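To make the recall argument concrete, here is a small sketch that ranks tables by word overlap and measures recall at different cutoffs. The scoring rule, the table descriptions, and every table name other than H_OSOBA are made up for illustration; a real system would more likely use embeddings or a hybrid search.

```python
def score(question: str, description: str) -> int:
    """Count the words shared by the question and a table description."""
    return len(set(question.lower().split()) & set(description.lower().split()))


def retrieve(question: str, index: dict[str, str], k: int) -> list[str]:
    """Return the k best-scoring table names (ties keep index order)."""
    ranked = sorted(index, key=lambda name: score(question, index[name]), reverse=True)
    return ranked[:k]


def recall(retrieved: list[str], needed: set[str]) -> float:
    """Fraction of the needed tables that made it into the retrieved set."""
    return len(set(retrieved) & needed) / len(needed)


# Hypothetical index: table name -> compressed description.
INDEX = {
    "H_OSOBA": "employee person staff records",
    "H_DZIAL": "department unit",
    "F_FAKTURA": "invoice billing amount",
    "H_UMOWA": "employment contract salary",
}
QUESTION = "salary of each employee by department"
NEEDED = {"H_OSOBA", "H_DZIAL", "H_UMOWA"}
```

Cutting off at k=1 drops two of the three needed tables, while k=3 keeps them all. An extra table in the context costs some tokens, but a missing one cannot be recovered later.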
Hand-writing the association layer is the right move for your first dozen tables. It does not scale to a thousand — or to a new client schema every month. The answer is to let the model draft the glossary from real data, then have a human curate it.
A user asks for 'employees.' Your database calls them H_OSOBA. No model bridges that gap on its own. The fix is the highest-leverage, least glamorous artifact in the whole system: a living glossary that maps human language to your schema.
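In its simplest form, such a glossary could be a plain lookup from everyday words to schema names. The sketch below assumes that shape. Only the 'employees' to H_OSOBA pairing comes from the text above; the other entries and the function name are hypothetical.

```python
# Hypothetical glossary: human term -> table name.
GLOSSARY = {
    "employee": "H_OSOBA",
    "employees": "H_OSOBA",
    "staff": "H_OSOBA",  # assumed synonym, for illustration only
}


def map_terms(question: str) -> dict[str, str]:
    """Return every glossary term found in the question with its table."""
    cleaned = question.lower()
    for mark in "?.,!":
        cleaned = cleaned.replace(mark, " ")
    return {word: GLOSSARY[word] for word in cleaned.split() if word in GLOSSARY}
```

A real glossary would also need multi-word terms, column-level mappings, and an owner who keeps it current, which is what makes it a living artifact rather than a one-off file.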
An association layer teaches a bot your vocabulary — 'employee' means this table. But some questions aren't about words; they're about relationships and business concepts no single table spells out. That's where a flat dictionary ends and a knowledge graph begins.
Hand an agent total freedom and it wanders off and wrecks the task. Lock it in rigid if-else and it can't handle anything real. The production sweet spot is a fixed pipeline with autonomous pockets — rails the agent can't leave, and real decisions inside them.
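One way to picture rails with pockets is a fixed list of stages where the agent picks freely, but only from each stage's allowed actions. This is a sketch under that assumption; the stage names, action names, and the `choose` callback are all hypothetical.

```python
from typing import Callable

# Fixed rails: the stage order never changes. The pocket is the allowed set.
PIPELINE = [
    ("classify", {"data_query", "chitchat"}),
    ("retrieve", {"search_tables", "search_glossary"}),
    ("answer", {"run_sql", "ask_user"}),
]


def run(choose: Callable[[str, set[str]], str]) -> list[tuple[str, str]]:
    """Walk the stages in order, letting `choose` decide within each stage."""
    trace = []
    for stage, allowed in PIPELINE:
        choice = choose(stage, allowed)
        if choice not in allowed:
            # The agent tried to leave the rails: stop instead of improvising.
            raise ValueError(f"{choice!r} is not allowed in stage {stage!r}")
        trace.append((stage, choice))
    return trace
```

The agent (here, whatever sits behind `choose`) makes a real decision at every stage, yet it can neither skip a stage nor invent an action.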
A bot that doesn't learn from use is frozen at launch quality — and launch quality is the worst it should ever be. The difference between a bot that goes stale and one that sharpens every week is a feedback loop that turns each interaction into signal.
A query that runs cleanly can still be wrong. The fix isn't a smarter model — it's a loop: generate, probe the data, let a judge decide if the answer makes sense, and feed concrete failures back until it does.
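The loop itself is short. The sketch below assumes three caller-supplied callables (`generate`, `probe`, `judge`) and a round limit; these names and signatures are illustrative, not a fixed interface.

```python
def judge_loop(generate, probe, judge, max_rounds=3):
    """Generate a query, probe the data, and retry with the judge's feedback.

    Returns (query, rounds_used), or (None, max_rounds) if no attempt passed.
    """
    feedback = []  # concrete failure reasons from earlier rounds
    for round_no in range(1, max_rounds + 1):
        query = generate(feedback)
        result = probe(query)
        ok, reason = judge(query, result)
        if ok:
            return query, round_no
        feedback.append(reason)
    return None, max_rounds


# Stub components for illustration: the first draft returns no rows.
def demo_generate(feedback):
    return "query_v2" if feedback else "query_v1"


def demo_probe(query):
    return {"query_v1": 0, "query_v2": 12}[query]  # pretend row counts


def demo_judge(query, row_count):
    if row_count > 0:
        return True, ""
    return False, "query ran but returned zero rows"
```

Here the first draft executes without error yet returns nothing. The judge rejects it with a concrete reason, and the second draft passes in round two.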
The judge loop is about being right. This is about the prior question: what does a bot do when it isn't sure? The worst ones barrel ahead and answer anyway. A good one has a repertoire — clarify, default, hedge, or hand off — and knows which to reach for.
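As a rough illustration of choosing among those moves, the sketch below maps a few signals to a response. The signals, the thresholds (0.8 and 0.5), and the order of the checks are assumptions made up for this example. A real policy would be tuned on observed traffic.

```python
def choose_move(confidence: float, has_safe_default: bool, ambiguous: bool) -> str:
    """Pick one response strategy for a question the bot may not be sure about."""
    if ambiguous:
        # An ambiguous question either gets a documented default or a question back.
        return "default" if has_safe_default else "clarify"
    if confidence >= 0.8:  # assumed threshold
        return "answer"
    if confidence >= 0.5:  # assumed threshold
        return "hedge"
    return "hand_off"
```

The point of the sketch is the shape, not the numbers: barreling ahead is only one of five outcomes, and it is reached only when nothing else applies.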
Most of what an agent does is trivial. Resolving a code, matching a value, extracting an entity — none of it needs your most expensive model. Defer the lookups, route the grunt work to cheap fast models, and save the genius for the genuinely hard part.
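Routing can start as a lookup table from task type to model tier. The task names and the tier labels "small" and "large" below are placeholders, not real model identifiers. Falling back to the large tier for unknown tasks is a conservative assumption.

```python
# Hypothetical routing table: task type -> model tier.
ROUTES = {
    "resolve_code": "small",
    "match_value": "small",
    "extract_entity": "small",
    "write_sql": "large",
}


def pick_model(task: str) -> str:
    """Return the tier for a task, defaulting to the large model when unsure."""
    return ROUTES.get(task, "large")
```

Defaulting upward trades some cost for safety: a task nobody has classified yet gets the capable model until someone decides it is trivial.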
The best customers for a data bot — hospitals, banks — are the ones who legally can't let data leave the building. That constraint shapes everything: where the model runs, what the query layer is allowed to do, and how you prove the whole thing behaves.