June 23, 2026
The Knowledge Hourglass: Deflate Everything, Then Inflate on Demand
The single hardest problem in building a knowledge bot has nothing to do with the model. It's a mismatch of scale. On one side you have a knowledge base that is effectively unbounded — a production database with a thousand tables, a decade of support tickets, a wiki nobody has fully read. On the other side you have a context window that, however large the vendors make it, is always finite and always expensive to fill.
The naive instinct is to wait for bigger windows. That's the wrong bet. Even when everything fits, stuffing it all in makes answers worse, not better — a model asked to find one fact inside ten thousand irrelevant ones reliably gets distracted. We call this failure mode hallucination via distractors, and it's the reason "just paste the whole schema" never works in production.
The pattern that does work is shaped like an hourglass.
The shape of the solution
Picture three layers stacked vertically.
At the top, the wide bulb: all of your raw knowledge. Huge, messy, authoritative, far too big to reason over directly.
At the neck, the narrowest point: a single user question. A few words of natural language — "how many engineers do we have in the Prague office?" That tiny signal is the only thing the system actually starts with.
At the bottom, the wide bulb again: the precise, reconstructed context the model needs to answer this question — and nothing else.
The whole craft of a knowledge bot is moving information through that neck without losing what matters. It happens in two motions, and they run at completely different times.
Deflate: compress the world before anyone asks
The first motion happens ahead of time, as a background job, long before a user shows up. We take the enormous top bulb and squeeze it into something compact and queryable. Crucially, we are not summarizing the data — we are building an index of its essence.
For a database agent, deflation precomputes:
- Embeddings of every table and column name, plus human-written descriptions, so meaning can be searched by similarity rather than exact string match.
- The relationship map — foreign keys, join paths, which tables hang off which — so the bot never has to rediscover structure at runtime.
- Lookup contents — the code tables that translate 1 into "Engineer" — kept ready but not yet injected.
- Cheap statistics — row counts, fill rates, distinct-value counts — the kind of thing that lets a bot sanity-check itself later.
Why pay this cost up front? Because the alternative is querying the system catalog live, on every request. That's slow when the schema is large, and it's surprisingly error-prone. Precomputing turns a recurring runtime expense into a one-time preprocessing step. The model stops getting handed the whole library and starts getting handed a cheat sheet.
Deflation is ahead-of-time compression. You trade a nightly background job for fast, clean, low-noise context at the moment it actually matters — when a human is waiting for an answer.
The output of this phase is two stores: a vector index for semantic search, and a metadata store holding the precomputed relationships and lookups. Refresh them on a schedule — nightly is plenty for most systems — and the freshness problem mostly disappears. Knowledge changed? Re-index. No retraining, no fine-tuning, no model touched at all.
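The two stores can be sketched in a few lines. This is a toy under stated assumptions, not a recommendation for a specific library: `embed` is a bag-of-words stand-in for a real embedding model (it only matches shared words, not meaning), and `VectorIndex`, `deflate`, and the sample descriptions are hypothetical names and data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorIndex:
    """In-memory semantic index over table names and descriptions."""

    def __init__(self) -> None:
        self._vectors: dict[str, Counter] = {}

    def rebuild(self, descriptions: dict[str, str]) -> None:
        """Full re-index; a scheduled refresh would call this."""
        self._vectors = {name: embed(f"{name} {desc}") for name, desc in descriptions.items()}

    def search(self, question: str, k: int = 3) -> list[str]:
        """Return up to k table names with non-zero similarity, best first."""
        q = embed(question)
        scored = sorted(((cosine(q, v), name) for name, v in self._vectors.items()), reverse=True)
        return [name for score, name in scored[:k] if score > 0]


def deflate(descriptions: dict[str, str], relationships: dict[str, list[str]]):
    """Produce both stores in one pass: the vector index and the metadata store."""
    index = VectorIndex()
    index.rebuild(descriptions)
    metadata = {"relationships": relationships}
    return index, metadata


index, metadata = deflate(
    {
        "employees": "engineers and other staff with their office",
        "offices": "office locations such as Prague",
        "invoices": "billing records per customer",
    },
    {"employees": ["offices"]},
)
hits = index.search("how many engineers do we have in the Prague office?")
```

Refreshing is just another call to `deflate` (or `rebuild`) with the current descriptions; nothing about the model changes.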
Inflate: reconstruct just enough, layer by layer
The second motion happens at runtime, and it runs in the opposite direction. The bot receives the tiny question at the neck and progressively inflates it back into rich context — but only along the path the question actually requires.
This is the part teams get wrong. They try to assemble the full context in one shot. The hourglass says: expand in stages, and let each stage decide what the next one needs.
- Find the candidates. Use the question to search the vector index for relevant tables. You're casting for recall here — better to surface a few too many than to miss the one that mattered.
- Pull the structure. For those candidates, retrieve columns, types, and — following the precomputed map — the related tables and lookups they depend on.
- Explore the values. Only now, and only for the columns in play, fetch the specific lookup values the question touches. Not all thousand industry codes — just the handful this query needs.
- Assemble and answer. Hand the model a tight, purpose-built packet and let it do the one thing models are genuinely great at: reasoning over well-chosen context.
Each layer is an act of progressive disclosure. The model never sees the whole schema. It sees a small, growing, relevant slice — and because the slice is small, the reasoning is sharp.
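The four stages can be sketched as a small pipeline in which each stage consumes only what the previous one produced. This is an illustration under assumptions: the in-memory dictionaries stand in for the precomputed stores, keyword overlap stands in for vector search, and all function names and sample data are hypothetical.

```python
import re

# Toy deflated stores (invented contents; normally produced by the background job).
DESCRIPTIONS = {
    "employees": "employees engineers staff people",
    "offices": "office city location prague berlin",
    "invoices": "billing invoice customer payment",
}
COLUMNS = {
    "employees": ["id", "role_id", "office_id"],
    "offices": ["id", "city"],
    "roles": ["id", "label"],
    "invoices": ["id", "customer_id", "amount"],
}
RELATED = {"employees": ["roles", "offices"]}
LOOKUPS = {"roles": {1: "Engineer", 2: "Designer", 3: "Accountant"}}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def find_candidates(question: str) -> list[str]:
    """Stage 1: recall-oriented search (keyword overlap stands in for vector search)."""
    q = words(question)
    return [table for table, desc in DESCRIPTIONS.items() if q & words(desc)]


def pull_structure(candidates: list[str]) -> dict[str, list[str]]:
    """Stage 2: columns for the candidates plus the tables they depend on."""
    tables = list(candidates)
    for table in candidates:
        for dep in RELATED.get(table, []):
            if dep not in tables:
                tables.append(dep)
    return {table: COLUMNS[table] for table in tables}


def explore_values(question: str, structure: dict[str, list[str]]) -> dict[str, dict[int, str]]:
    """Stage 3: keep only the lookup values the question actually mentions."""
    q = words(question)
    picked: dict[str, dict[int, str]] = {}
    for table in structure:
        for code, label in LOOKUPS.get(table, {}).items():
            # Crude plural handling so "engineers" matches "Engineer".
            if label.lower() in q or label.lower() + "s" in q:
                picked.setdefault(table, {})[code] = label
    return picked


def assemble(question: str, structure, values) -> str:
    """Stage 4: build the packet that would be handed to the model."""
    lines = [f"Question: {question}"]
    lines += [f"Table {t}({', '.join(cols)})" for t, cols in structure.items()]
    for table, codes in values.items():
        lines += [f"Lookup {table}: {code} = {label}" for code, label in codes.items()]
    return "\n".join(lines)


def inflate(question: str) -> str:
    candidates = find_candidates(question)
    structure = pull_structure(candidates)
    values = explore_values(question, structure)
    return assemble(question, structure, values)


packet = inflate("how many engineers do we have in the Prague office?")
```

For the Prague question, the packet names `employees`, `offices`, and `roles`, carries a single lookup row for "Engineer", and never mentions `invoices` or the other role codes.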
The neck is the whole point
It's worth dwelling on the narrowest part of the glass, because it contains the deepest idea here. At the neck, the entire system is operating on almost nothing — a few words. Everything below the neck is reconstructed from that small signal, the way a compressed file is expanded back into a full document.
That reframes what a knowledge bot actually is. It is not a search engine that returns documents. It is a decompression engine that takes a sparse human intention and rebuilds, on demand, the exact slice of your world needed to satisfy it. Design the neck well — clean precomputed indexes feeding disciplined, staged expansion — and everything downstream gets cheaper, faster, and more accurate at the same time.
Keep the packet lean
The most common regression we see is a packet that quietly bloats. A column references a lookup table, so the bot dutifully attaches all of that table's values — and if that lookup happens to be a list of ten thousand classification codes, your carefully built context is now 95% noise and the one fact about office location is buried. The model loses the thread, and you get a confident wrong answer.
The fix is a discipline, not a model upgrade: cap how much any single source can contribute, and attach lookups only when the question actually reaches for them. A lean packet isn't a nice-to-have. It is the difference between an agent that's right and one that's plausible.
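Both halves of that discipline fit in a few lines. The sketch below is one possible shape, not a prescribed implementation: the function names, the line-based cap (a real system might budget by tokens instead), and the substring match used to decide which lookup values are "in play" are all assumptions.

```python
def cap_sources(sections: dict[str, list[str]], max_lines: int = 5) -> list[str]:
    """Build a packet where no single source contributes more than max_lines lines."""
    packet: list[str] = []
    for source, lines in sections.items():
        packet.append(f"[{source}]")
        packet.extend(lines[:max_lines])
        omitted = len(lines) - max_lines
        if omitted > 0:
            # Tell the model something was left out instead of silently truncating.
            packet.append(f"({omitted} more lines omitted)")
    return packet


def lookup_values_in_play(question: str, lookup: dict[int, str], cap: int = 20) -> dict[int, str]:
    """Attach only lookup values whose label appears in the question, at most `cap`."""
    q = question.lower()
    hits = [(code, label) for code, label in lookup.items() if label.lower() in q]
    return dict(hits[:cap])


# A ten-thousand-row code table (invented labels) with one relevant entry.
codes = {i: f"classification-{i}" for i in range(10000)}
codes[1] = "Engineer"

picked = lookup_values_in_play("how many engineers do we have in the Prague office?", codes)
packet = cap_sources(
    {
        "offices": ["id int", "city text"],
        "codes": [f"{code}={label}" for code, label in codes.items()],
    },
    max_lines=3,
)
```

Here `picked` keeps one code out of ten thousand, and even if the whole code table were attached by mistake, `cap_sources` would limit it to three lines plus an omission note.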
Why this beats a bigger window
Bigger context windows are real and useful, but they don't dissolve this problem — they just change where it bites. Cost scales with tokens. Latency scales with tokens. And accuracy, past a point, falls with tokens as distractors pile up. The hourglass attacks all three at once by never putting more than the question needs in front of the model.
It also makes the system honest about freshness in a way fine-tuning never can. Your knowledge lives outside the model, in indexes you control and refresh on your own schedule. When the world changes, you re-deflate. The model stays exactly as it was — and stays current anyway.
Designing a bot that has to reason over a large, messy knowledge base? The hourglass is where we start every build. A 30-minute scoping call will tell you what to precompute, what to expand, and where your current setup is leaking accuracy.
