Universal Document Intelligence Platform

Overview

Most “chat with your documents” demos are thin wrappers around an LLM and a single vector search, and nobody measures whether the answers are actually right. This project is the opposite. The Universal Document Intelligence Platform is a configurable retrieval-and-reasoning system that answers hard questions over dense professional documents (10-Ks, contracts, clinical guidelines) and treats evaluation as a first-class part of the system, not an afterthought.

Every design choice is judged the same way: does it beat the baseline on a fixed set of eval questions, measured, not asserted?
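
A minimal sketch of what that comparison can look like, assuming retrieval is scored by hit rate (whether the passage holding the known answer appears in the top k results). The metric, the function names, and the toy data below are illustrative assumptions, not this platform's actual harness or results.

```python
def hit_rate(retrieve, eval_set, k=5):
    """Fraction of eval questions whose gold passage is in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for item in eval_set:
        top_k = retrieve(item["question"])[:k]
        if item["gold_id"] in top_k:
            hits += 1
    return hits / len(eval_set)


# Toy eval set: every question has one known-correct passage ID (made up).
EVAL_SET = [
    {"question": "q1", "gold_id": "a"},
    {"question": "q2", "gold_id": "b"},
    {"question": "q3", "gold_id": "c"},
    {"question": "q4", "gold_id": "d"},
]

# Canned rankings standing in for two real retrieval pipelines (hypothetical).
BASELINE_RESULTS = {"q1": ["a", "x"], "q2": ["y", "z"], "q3": ["c"], "q4": ["x", "y"]}
CANDIDATE_RESULTS = {"q1": ["a", "x"], "q2": ["b", "z"], "q3": ["x", "c"], "q4": ["x", "y"]}


def baseline(question):
    return BASELINE_RESULTS[question]


def candidate(question):
    return CANDIDATE_RESULTS[question]


baseline_score = hit_rate(baseline, EVAL_SET, k=2)
candidate_score = hit_rate(candidate, EVAL_SET, k=2)
# Difference expressed in percentage points, not relative percent.
delta_points = (candidate_score - baseline_score) * 100

print(f"baseline={baseline_score:.0%} candidate={candidate_score:.0%} "
      f"delta={delta_points:+.0f} points")
```

Because the eval set and k stay fixed, the only thing that varies between the two scores is the pipeline under test, which is what makes the delta meaningful.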

The problem

Professional documents are exactly where naive RAG breaks. The answer to “what were total net sales in fiscal 2023?” might live in a footnote, sit in a table, or be split across two pages. A single dense-embedding lookup misses it constantly, and a raw LLM hallucinates a confident-sounding number. In finance, legal, and healthcare, a plausible-but-wrong answer is worse than no answer.

Approach

Hybrid retrieval: dense embeddings (semantic) combined with BM25 (exact keyword and number matching), so figures and defined terms aren’t lost to fuzzy vector similarity.
Cross-encoder reranking: a Cohere reranker re-scores the candidate passages, pushing the most relevant context to the top before it reaches the model.
Eval-first methodology: a held-out set of eval questions with known answers scores retrieval and generation on every change. Across 120 eval questions, the hybrid + rerank pipeline reached 92% retrieval accuracy, a gain of 22 percentage points over the vanilla baseline.
Observable end-to-end: LangSmith traces every retrieval and generation step, so a wrong answer can be traced back to the exact passage that failed.
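
To make the hybrid retrieval step in the list above concrete: one common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The text above does not say how this platform merges the two result lists, so RRF itself, the function name, the constant k=60, and the passage IDs below are all assumptions made for the sake of a sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of passage IDs into one ranking.

    Each passage earns 1 / (k + rank) from every list it appears in,
    so passages ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [passage_id for passage_id, _ in fused[:top_n]]


# Hypothetical outputs of the two retrievers for one question.
dense_hits = ["p7", "p2", "p9", "p4"]  # semantic (embedding) ranking
bm25_hits = ["p4", "p7", "p1", "p2"]   # keyword/number ranking

# The fused candidates would then go to the reranker before generation.
candidates = reciprocal_rank_fusion([dense_hits, bm25_hits], top_n=3)
print(candidates)
```

RRF only needs rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A weighted sum of normalized scores is a reasonable alternative when one retriever should count for more.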

Stack

Python, Claude Sonnet for generation, OpenAI embeddings, a Qdrant vector store, FastAPI for the service layer, and a Streamlit interface — with LangSmith wired through for observability.

What I’m learning

This is my flagship project for a reason: it’s where I’m proving out the parts of AI engineering that wrappers hide, namely retrieval quality, evaluation rigor, and observability. The biggest lesson so far is that “reliable AI” is mostly an evaluation problem. Once you can measure whether a change helped, the engineering gets honest.

What’s next

Expanding the eval set across all three domains, tightening the chunking strategy against the benchmark, and publishing the eval methodology so the quality story is fully reproducible.