result Jun 18, 2026 legalize-pe

Measuring whether our legal search actually works

We built a hybrid-RAG assistant over 21,000 Peruvian legal norms. Then we measured it, and the conclusion moved three times as the evaluation got more rigorous.

Anyone can build a RAG system. Few measure whether it works. This is the log of measuring ours, and of watching the answer change as the measurement got better.

The system

legalize-pe is an open, git-versioned corpus of Peruvian legal norms: ~21,000 markdown documents, national plus 26 regional jurisdictions. On top of it we built amicus, a legal research assistant that answers questions in plain Spanish and cites the norm.

Retrieval is the hard part. The pipeline stacks three pieces:

query → [expand] → [hybrid search: FTS + embeddings, fused by RRF] → [rerank] → answer

The question this log answers: which of these pieces actually carries the result?

The trap of the small evaluation

The first time we measured, we had a gold set of 19 query→norm pairs, annotated by one person. The ablation table said something striking: the full production pipeline (best, MRR 0.657) was not the top configuration. Hybrid search with query expansion and no reranker (rrf+expand, MRR 0.755) beat it.

That is a great tweet. It is also, it turned out, wrong.

The gold set was too small and annotated by a single judge. So we scaled it, and changed the method to defend against our own bias.

Two annotators, blind

We built a sheet of 50 query→norm candidates across six strata (colloquial, technical, multi-norm, core-vs-regulation, subnational, out-of-scope). Then two models annotated it independently and blind. Neither saw the expected answer or the other’s annotation.

Where they agreed, the pair became gold automatically. Where they diverged, a human arbitrated by reading the norm text (not by picking a favorite). The agreement between the two annotators is itself a measurement of how hard each stratum is.
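The agree-or-arbitrate rule above can be sketched in a few lines. This is a minimal illustration, not the project's harness: it assumes each annotation is a single norm identifier compared by exact match, and the function name, row layout, and sample labels are all hypothetical.

```python
from collections import defaultdict

def split_by_agreement(rows):
    """Split blind double annotations into auto-gold and human-arbitration piles.

    rows: iterable of (stratum, query, label_a, label_b), where each label is
    the norm id one annotator picked (hypothetical layout, exact-match only).
    """
    gold, to_arbitrate = [], []
    counts = defaultdict(lambda: [0, 0])  # stratum -> [agreed, total]
    for stratum, query, label_a, label_b in rows:
        counts[stratum][1] += 1
        if label_a == label_b:
            counts[stratum][0] += 1
            gold.append((stratum, query, label_a))
        else:
            # Disagreements are queued for a human who reads the norm text.
            to_arbitrate.append((stratum, query, label_a, label_b))
    agreement = {s: agreed / total for s, (agreed, total) in counts.items()}
    return gold, to_arbitrate, agreement

# Made-up rows, only to show the shape of the data.
rows = [
    ("technical-legal", "q1", "norm-A", "norm-A"),
    ("technical-legal", "q2", "norm-B", "norm-B"),
    ("subnational", "q3", "ord-1", "ord-2"),
    ("subnational", "q4", "ord-3", "ord-3"),
]
gold, to_arbitrate, agreement = split_by_agreement(rows)
```

Per-stratum agreement falls out of the same pass that builds the gold set, so the difficulty measurement costs nothing extra.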

Stratum               Inter-annotator agreement
technical-legal       100%
out-of-scope          100%
colloquial            92%
multi-norm            86%
core-vs-regulation    86%
subnational           22%

That 22% is a finding, not a failure. Two competent annotators, reading the same corpus, agree on the right subnational norm only twice in nine tries. The regional corpus is intrinsically ambiguous: many ordinances cover generic matters (“declaration of public interest”), reuse numbers across years and bodies, and have no single correct answer. We measured the ambiguity instead of assuming it.

The conclusion moved

Here is the honest part. We ran the same ablation three times, on progressively better gold sets:

Config (MRR)             N=19, 1 judge    N=28, 2 judges    N=35, +subnational
fts (keywords only)      0.092            0.090             0.089
vec (embeddings only)    0.495            0.538             0.511
rrf (hybrid)             0.367            0.400             0.401
rrf+expand               0.755            0.656             0.605
rrf+rerank               0.657            0.819             0.792
best (full pipeline)     0.657            0.862             0.761

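Read the rrf+expand and best rows left to right. rrf+expand falls from 0.755 to 0.656 to 0.605, while best climbs from 0.657 to 0.862 and then settles at 0.761. The story rewrote itself.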

The viral finding from the first run (your shipped pipeline isn’t optimal) was an artifact of a small, single-annotator gold set. It died on scaling. If we had published it, we would have published noise.
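For readers who want to check how much a score can wobble at N=19, here is a minimal sketch of MRR plus a percentile bootstrap interval. MRR is the standard mean of 1/rank of the first correct result. The bootstrap is an illustrative addition, not something the runs above are stated to have used, and the helper names, the toy values, and the handling of multi-norm queries (any gold norm counts) are assumptions.

```python
import random

def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold norm in the result list, 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def mrr(rr_values):
    """Mean reciprocal rank over all queries."""
    return sum(rr_values) / len(rr_values)

def bootstrap_interval(rr_values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for MRR, resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(rr_values)
    means = sorted(
        sum(rng.choices(rr_values, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Toy per-query reciprocal ranks (made up, not from the gold set).
toy_rr = [1.0, 0.5, 0.0, 1.0, 0.25]
point = mrr(toy_rr)
low, high = bootstrap_interval(toy_rr)
```

If the intervals for two configs overlap heavily, the ordering between them is not yet a result, which is the situation the first run was in.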

What survived every run

Two results held across N=19, 28, and 35. Those are the ones worth trusting:

  1. FTS alone is nearly useless on natural language (MRR ~0.09). A full-question query forces every keyword to match at once; recall collapses. FTS only earns its keep on out-of-scope queries, where it correctly abstains 60% of the time while every other config returns something.

  2. Embeddings alone beat the naive hybrid (vec 0.51 > rrf 0.40). Fusing a strong semantic retriever with a broken keyword one degrades the result. RRF without weights assumes the two retrievers are comparable; when one is far weaker, it drags the good one down. This is the counter-intuitive one, and it’s the most robust.
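The second result is easier to see with the fusion written out. Below is a minimal sketch of reciprocal rank fusion with an optional per-retriever weight. The constant k=60 is the conventional default from the RRF literature; the weights, the document ids, and the function name are hypothetical, and nothing here is claimed to match what amicus runs in production.

```python
from collections import defaultdict

def rrf(rankings, weights=None, k=60):
    """Fuse ranked lists of doc ids with (optionally weighted) reciprocal rank fusion.

    rankings: dict of retriever name -> list of doc ids, best first.
    weights:  dict of retriever name -> weight; missing names default to 1.0.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy case: the embedding retriever puts the gold norm second,
# while the keyword retriever returns only unrelated documents.
rankings = {
    "vec": ["a", "gold", "b"],
    "fts": ["x", "y", "z"],
}
fused_plain = rrf(rankings)                           # a, x, gold, y, b, z
fused_weighted = rrf(rankings, weights={"fts": 0.5})  # a, gold, b, x, y, z
```

In the unweighted fusion the keyword retriever's junk interleaves with the good list and pushes the gold norm from rank 2 to rank 3. Down-weighting the weak retriever restores it, which is one way to act on the observation that the two retrievers are not comparable.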

The lesson

Tune your baselines until it hurts. Ablate until you know which component carries the result. It’s usually one, and it’s usually not the one you’d guess.

We ran the experiment three times. The conclusion only stopped moving when the baseline stopped being noisy. The first number felt like a result. It was a measurement of our gold set, not of our system.

Honest limitations

Next milestone: scale the gold past 100, get legal validation on the needs_lawyer set, and re-run. The corpus, the eval harness, and the gold set are all open.

Railly Hugo, Crafter Research