Most RAG tutorials use Wikipedia or PDF whitepapers as their corpus. Regulatory documents are a different problem. They are dense with cross-references, defined terms that override common meanings, and section hierarchies that carry legal weight. A chunk boundary in the wrong place can produce an answer that is technically incorrect under the regulation.
Why Regulatory Text Is Hard
- **Defined terms.** "Occupancy" in IBC Chapter 3 means something specific that differs from plain English. Splitting chunks across the definitions section severs terms from their controlling definitions.
- **Cross-references are load-bearing.** "As required by Section 1604.3.1" is the actual rule. A chunk capturing the obligation without the referenced section returns an incomplete answer.
- **Hierarchical structure.** IBC sections run Chapter to Section to Subsection to Exception, and exceptions frequently override the parent rule. Ignore this hierarchy and you miss the exceptions.
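Because the cross-references are themselves the rule, a retrieval pass can extract cited section IDs from each chunk and pull the referenced sections alongside it. Here is a minimal sketch of that extraction step; the regex and function name are illustrative, not taken from the actual pipeline:

```python
import re

# Matches citations of the form "Section 1604.3.1" (or "Sections ...").
# Simplified: real regulatory text also cites chapters, tables, and ranges.
SECTION_REF = re.compile(r"Sections?\s+(\d+(?:\.\d+)*)")

def extract_cross_references(chunk_text: str) -> list[str]:
    """Return the section IDs a chunk cites, so retrieval can fetch them too."""
    return SECTION_REF.findall(chunk_text)

refs = extract_cross_references(
    "Live loads shall be determined as required by Section 1604.3.1."
)
```

At retrieval time, each extracted ID can be resolved against the `section_id` metadata tag to append the referenced chunks to the context window.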
Embedding Model Selection
| Model | Recall@5 | Latency (ms) |
|---|---|---|
| text-embedding-3-small | 0.71 | 45 |
| bge-large-en-v1.5 | 0.84 | 38 |
| e5-mistral-7b | 0.87 | 890 |
BGE-Large was the clear choice: within three points of e5-mistral's recall at over 20x lower latency, fast enough not to bottleneck a synchronous API.
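For reference, Recall@5 here is the fraction of relevant chunks that appear in the top five retrieved results, averaged over queries. A sketch of the per-query metric (the toy IDs are made up):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Two of three relevant chunks retrieved in the top 5.
score = recall_at_k(["a", "b", "x", "c", "y"], relevant={"a", "c", "z"})
```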
Chunking Strategy
Structural chunking first: parse the document heading hierarchy and chunk at section boundaries. Then add one paragraph of semantic overlap at boundaries. Every chunk gets tagged with corpus, chapter, section_id, effective_date, and jurisdiction for filtered retrieval.
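The structural pass can be sketched as follows. This is a simplified model, not the production parser: it assumes headings look like "1604.3.1 Deflections" so the hierarchy can be read off the numbering, it takes the chapter as an argument rather than deriving it, and it omits the boundary-overlap step:

```python
import re
from dataclasses import dataclass, field

# Simplified heading pattern: a dotted section number followed by a title.
# A body line that happens to start with a number would false-match.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")

@dataclass
class Chunk:
    section_id: str
    title: str
    text: str
    metadata: dict = field(default_factory=dict)

def structural_chunks(lines: list[str], corpus: str, chapter: str,
                      jurisdiction: str, effective_date: str) -> list[Chunk]:
    """Chunk at section boundaries, tagging each chunk for filtered retrieval."""
    chunks: list[Chunk] = []
    current: Chunk | None = None
    for line in lines:
        m = HEADING.match(line)
        if m:  # new section boundary starts a new chunk
            current = Chunk(
                section_id=m.group(1), title=m.group(2), text="",
                metadata={
                    "corpus": corpus,
                    "chapter": chapter,
                    "section_id": m.group(1),
                    "effective_date": effective_date,
                    "jurisdiction": jurisdiction,
                },
            )
            chunks.append(current)
        elif current is not None:
            current.text += line + "\n"
    return chunks
```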
Vector Store: Qdrant
Qdrant over Milvus for two reasons: on-premises deployment is first-class, and payload filtering is applied during the ANN search itself, so metadata conditions compose without a post-retrieval filtering pass. The corpus currently holds approximately 2.1M vectors across all regulatory sources.
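To make the filtering behavior concrete, here is a toy brute-force model of filter-during-search. Qdrant evaluates payload conditions inside the ANN traversal via its Filter/FieldCondition API; the plain dict-based `must` filter and the example points below are stand-ins for illustration:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def filtered_search(query: list[float], points: list[dict],
                    must: dict, k: int = 2) -> list[dict]:
    """Score only points whose payload matches every condition in `must`."""
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in must.items())
    ]
    candidates.sort(key=lambda p: cosine(query, p["vector"]), reverse=True)
    return candidates[:k]

points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"jurisdiction": "CA"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"jurisdiction": "NY"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"jurisdiction": "CA"}},
]
hits = filtered_search([1.0, 0.0], points, must={"jurisdiction": "CA"})
# Point 2 is excluded before ranking, not discarded afterwards.
```

The point of doing this inside the index rather than as a post-pass: with a post-filter, a top-k of mostly out-of-jurisdiction results can leave you with fewer than k usable hits.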
What We Are Still Solving
Amendment tracking. Regulations get amended. Our current pipeline treats the corpus as a snapshot. Version-aware retrieval with effective dates is the next major infrastructure piece.