Retrieval Is Not Enough

Why Organizational AI Needs Epistemic Infrastructure

Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

arXiv preprint (cs.AI), 2026

Abstract

Retrieval-Augmented Generation has become the default way of plugging an LLM into a company's documents. Inside real organizations, retrieval alone is not enough. An organization needs to know what it holds to be true, what is contested, what it is committed to, and — above all — what it doesn't know. We present OIDA (Organizational Intelligence and Decision Architecture), a framework that represents organizational knowledge as typed objects with epistemic properties: commitment strength, contradiction status, provenance. The central piece is QUESTION-as-modeled-ignorance: a primitive that turns 'we don't know' into a first-class object the system can plan around, instead of filling the gap with a plausible-sounding sentence. We also propose an Epistemic Quality Score (EQS) to evaluate not how fluent an answer sounds, but how honest it is about what it doesn't know. In our experiments OIDA reaches an EQS of 0.530 with 3,868 tokens, against 0.848 for a full-context baseline that uses 108,687 tokens — about 28× more. The QUESTION mechanism shows a statistically significant effect on outcome quality (Fisher exact, p = 0.0325).

Figure 1. K-score dynamics over 28 days under stationary inputs, for five epistemic classes (paper Theorem 1 / Figure 2). With identical initial conditions, QUESTION KOs gain urgency from the first update and rise from 0.5 to 0.556, while OBSERVATION decays from 0.5 to 0.435; DECISION stays near-stable, with EVIDENCE and HYPOTHESIS in between. What is unknown ends up outranking what is stale.

Figure 2. EQS sub-score breakdown (paper Table 5, n=10 response pairs): Minerva (OIDA RAG, 3,868 tokens) vs Cowork (full-context baseline, 108,687 tokens) on ECA, CP, CR, EC, DE and composite EQS. Composite EQS is 0.530 vs 0.848 at a 28.1× token-budget gap. The unconfounded architectural signal is the QUESTION declaration rate: 10/10 for OIDA vs 5/10 for the baseline (Fisher exact, p = 0.0325).
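The reported p-value can be checked by hand from the declaration counts alone. The sketch below assumes the test is a two-sided Fisher exact test on the 2×2 table [[10, 0], [5, 5]] (declared vs not declared, OIDA vs baseline); the function name is mine and it is not taken from the paper's code.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probability of every table at least as unlikely as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Assumed layout: rows = system (OIDA, baseline), columns = (declared, not declared)
p_value = fisher_exact_two_sided(10, 0, 5, 5)
print(round(p_value, 4))  # 0.0325
```

Under that assumption the two extreme tables each carry probability 252/15504, which sums to 504/15504 ≈ 0.0325 and matches the figure quoted above.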

Figure 3. OIDA Bench v2 (forthcoming). Log-log plot of the token budget needed to reach context sufficiency CS@8 ≥ 0.8, by corpus size (30 to 3M documents), for BM25, Dense, Hybrid, OIDA and Whitelist. OIDA stays flat near 1,000–1,500 tokens regardless of how big the corpus gets; BM25, Dense and Hybrid scale poorly and hit the never-reached zone (≥ 16K tokens) by 3K–30K documents.

Figure 4. OIDA Bench v2 (forthcoming). Token budget needed to reach CS ≥ 0.5 on the clean_big corpus (30K docs), interpolated from the multi-budget sweep: BM25 ~1,684 tokens, Hybrid ~1,684, OIDA ~842, Whitelist ~900. OIDA is about 2× more efficient than BM25 or Hybrid. Dense never reaches the 0.5 threshold at any tested budget.

I work with companies that try to put AI on top of their internal knowledge. Funds, R&D teams, public bodies. The model is almost never the part that breaks. The part that breaks is how the knowledge around the model is represented. Standard RAG returns paragraphs that look right and quietly hides the things the organization actually depends on: open contradictions, weak commitments, the questions nobody has answered yet. That is the gap this paper is about.

Standard RAG treats every chunk of text as equally true. OIDA splits this into three things that should never have been the same thing: what the organization knows, what it is committed to, and what it doesn't know. Each one becomes a typed object the system can reason about. The QUESTION-as-modeled-ignorance primitive is the part I care about most. It lets the system say 'I don't know, and here is the shape of what I don't know', instead of producing a confident answer.
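To make the idea of typed objects concrete, here is a minimal sketch of what such a store could look like. It is an illustration under my own assumptions, not the paper's schema: the five class names come from the Figure 1 caption, while `KnowledgeObject`, `brief`, the field names, the 0-to-1 commitment scale and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    # Class names as listed in the Figure 1 caption
    QUESTION = "QUESTION"
    OBSERVATION = "OBSERVATION"
    DECISION = "DECISION"
    EVIDENCE = "EVIDENCE"
    HYPOTHESIS = "HYPOTHESIS"


@dataclass
class KnowledgeObject:
    kind: Kind
    topic: str
    text: str
    provenance: str                # where the statement came from
    commitment: float = 0.0        # assumed scale: 0 = none, 1 = binding
    contradicts: list[str] = field(default_factory=list)  # statements it conflicts with


def brief(topic: str, store: list[KnowledgeObject]) -> dict[str, list[str]]:
    """Split what is held on a topic into known / committed / contested / unknown."""
    out: dict[str, list[str]] = {"known": [], "committed": [], "contested": [], "unknown": []}
    for ko in store:
        if ko.topic != topic:
            continue
        if ko.kind is Kind.QUESTION:
            out["unknown"].append(ko.text)      # modeled ignorance, kept explicit
        elif ko.contradicts:
            out["contested"].append(ko.text)
        elif ko.kind is Kind.DECISION:
            out["committed"].append(ko.text)
        else:
            out["known"].append(ko.text)
    return out


# Made-up sample data, for illustration only
store = [
    KnowledgeObject(Kind.DECISION, "pricing", "Keep the current price list", "board minutes", 0.9),
    KnowledgeObject(Kind.OBSERVATION, "pricing", "Churn rose after the last increase", "CRM export"),
    KnowledgeObject(Kind.HYPOTHESIS, "pricing", "Discounts drive renewals", "sales memo",
                    contradicts=["Churn rose after the last increase"]),
    KnowledgeObject(Kind.QUESTION, "pricing", "What is the price elasticity?", "analyst note"),
]
result = brief("pricing", store)
print(result["unknown"])  # ['What is the price elasticity?']
```

The point of the design is that the open question survives retrieval as its own entry, so a downstream prompt can report it as unknown instead of papering over it.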

I co-designed the framework, worked on the QUESTION primitive, and built the EQS evaluation. EQS is a strict metric on purpose: it does not reward fluent confidence. The headline number is that OIDA reaches about 62% of the full-context baseline's EQS (0.530 vs 0.848) at less than 4% of the token cost (3,868 vs 108,687 tokens), and it surfaces the ignorance the baseline was hiding.
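The ratios behind that headline follow directly from the four figures in the abstract. A quick check, with variable names of my own choosing:

```python
# Figures quoted in the abstract
oida_eqs, baseline_eqs = 0.530, 0.848
oida_tokens, baseline_tokens = 3_868, 108_687

eqs_ratio = oida_eqs / baseline_eqs          # share of the baseline's EQS that OIDA reaches
token_share = oida_tokens / baseline_tokens  # OIDA's token cost as a share of the baseline's
token_gap = baseline_tokens / oida_tokens    # how many times more tokens the baseline uses

print(f"{eqs_ratio:.3f} {token_share:.4f} {token_gap:.1f}")  # 0.625 0.0356 28.1
```

So OIDA lands at 62.5% of the baseline's composite score with roughly 3.6% of its tokens, which is the 28.1× gap reported in Figure 2.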

Inside an institution you usually don't want a system that always answers. You want one that knows when not to, and exposes the gap so a person can close it. That's the direction I'm pushing on with this work.