January 20, 2026

Citation-grounded answers • 50+ golden eval set

Building RAG for German Legal Documents: When Hallucination Isn't an Option

RAG system for German social law that helps the Balkan diaspora navigate Jobcenter bureaucracy — with forced citations, bilingual retrieval, and an evaluation harness that actually catches failures.

AI Engineering • RAG • LLM • Evaluation • Production AI
TL;DR
  • Problem with stakes: German legal text (SGB II/III/X) is dense, cross-referenced, and unforgiving. Generic LLMs hallucinate paragraph numbers that don't exist. Wrong answers can cost users their benefits.
  • Approach: RAG with bilingual embeddings (BCS questions → German corpus), paragraph-aware chunking that preserves §§ cross-references, and citation-forced generation that refuses to answer without a grounded source.
  • What makes it work: An evaluation harness with 50+ golden questions and explicit failure-mode tracking. Without the harness, the system is just vibes.

Scope & constraints

Role: Founder & Engineer
Scope: RAG architecture, corpus ingestion, evaluation harness, production system
Constraints: wrong answers can cost users their benefits — no tolerance for hallucination

Role & Scope

  • Project: personal side project serving Balkan diaspora in Germany
  • Responsibility: end-to-end system design, corpus engineering, evaluation, production deployment
  • Stakeholders: real users (family, friends, community) with real stakes
  • Constraints: wrong answers can cost users their benefits — no tolerance for hallucination

Outcome metrics

  • Retrieval: precision@5 meaningfully above dense-only baseline via hybrid + re-ranking (representative)
  • Faithfulness: every generated answer includes at least one verifiable SGB citation or explicit 'I don't know'
  • Coverage: ingested targeted paragraphs of SGB II, SGB III, and SGB X from gesetze-im-internet.de

Context

Roughly 1.3 million people of Balkan origin live in Germany, many of them navigating the social system in a language they don't fully read. A typical scenario: a letter arrives from the Jobcenter with a seven-day deadline, written in formal Amt-Deutsch, referencing paragraphs of SGB II or SGB III that even native speakers struggle to interpret. Getting it wrong has real consequences — benefits cut, appointments missed, appeals lost.

KlarAmt is a tool built for this gap. Users ask questions in BCS (Bosnian/Croatian/Serbian). The system retrieves relevant paragraphs from German social law and returns an answer with verifiable citations — in formal Amt-Deutsch.

The technical challenge isn't "build a chatbot." It's "build a retrieval system over legal text where every answer must be verifiable, the user's language differs from the corpus language, and a wrong answer has real-world cost."

Problem

Generic LLMs hallucinate paragraph numbers. Ask ChatGPT "what does §7 SGB II say about job-seeker obligations?" and you get plausible-sounding but often incorrect text. Paragraph numbers are invented. References point to paragraphs that don't exist. In a domain where citations matter, this is unusable.

Legal text resists naive chunking. German social law is a graph. §7 references §11. §11 references §20. Chunk it by tokens and you lose the reference chain. Chunk it by paragraph and you miss context that lives in §§ cross-references.

The user's language isn't the corpus language. Questions come in BCS. The corpus is German — specifically, formal legal German. A monolingual retriever fails the moment the question uses a word like "pomoć" (help) and the corpus uses "Leistung" (benefit).

Failure is invisible without evaluation. Without an eval harness, you have no idea whether the system works. You have vibes. In a domain where wrong answers have stakes, vibes are malpractice.

Approach

Corpus ingestion with paragraph-aware chunking. The ingestion pipeline parses SGB II, SGB III, and SGB X from gesetze-im-internet.de, preserving paragraph boundaries and §§ cross-references as metadata. Each chunk carries its paragraph number, its parent statute, and a list of referenced paragraphs — so retrieval can expand context when a chunk references another.
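
The chunking idea can be sketched in a few lines. This is a hypothetical illustration, not the project's actual ingestion code: the `Chunk` fields, the reference regex, and the `expand` helper are all assumptions about what "paragraph-aware with §§ metadata" might look like.

```python
import re
from dataclasses import dataclass, field

# One chunk per statutory paragraph; outgoing §§ references are kept as
# metadata so retrieval can pull in referenced paragraphs later.
REF_PATTERN = re.compile(r"§§?\s*(\d+[a-z]?)")

@dataclass
class Chunk:
    statute: str                 # parent statute, e.g. "SGB II"
    paragraph: str               # paragraph number, e.g. "7"
    text: str
    references: list = field(default_factory=list)

def chunk_paragraph(statute, paragraph, text):
    """Build one chunk per paragraph, extracting referenced paragraph numbers."""
    refs = [m for m in REF_PATTERN.findall(text) if m != paragraph]
    return Chunk(statute, paragraph, text, refs)

def expand(chunk, index):
    """Follow the reference metadata: return the chunk plus any referenced
    chunks present in the index, so generation sees the full context."""
    found = [index[(chunk.statute, p)] for p in chunk.references
             if (chunk.statute, p) in index]
    return [chunk] + found
```

The payoff is in `expand`: when §7 references §11, retrieval can hand the generator both paragraphs instead of a fragment with a dangling reference.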

Bilingual retrieval. Multilingual Sentence Transformers models embed both BCS questions and German paragraphs into a shared vector space (several models were tested; the final choice was driven by eval results). Hybrid retrieval combines dense vectors with keyword matching on legal terminology (§§ numbers and specific statutory terms that translate predictably).
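
The dense-plus-keyword combination reduces to a weighted score. A minimal sketch, with toy vectors standing in for the multilingual embeddings; `alpha`, `legal_overlap`, and the legal-token regex are illustrative assumptions:

```python
import math
import re

# Exact-match tokens that survive translation: § numbers and statute names.
LEGAL_TOKEN = re.compile(r"§\s*\d+[a-z]?|SGB\s+[IVX]+")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def legal_overlap(query, doc):
    """Fraction of the query's legal tokens found verbatim in the document."""
    q = {t.replace(" ", "") for t in LEGAL_TOKEN.findall(query)}
    d = {t.replace(" ", "") for t in LEGAL_TOKEN.findall(doc)}
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(q_vec, d_vec, query, doc, alpha=0.7):
    """Blend dense similarity with keyword overlap on legal terminology."""
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * legal_overlap(query, doc)
```

The keyword term is what rescues citation-heavy queries: "§ 7 SGB II" matches exactly even when the rest of the question is in BCS and the embedding similarity alone is middling.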

Citation-forced generation. The prompt template is strict: the model receives retrieved paragraphs as context and is instructed to cite at least one source paragraph for every claim. If no retrieved paragraph supports an answer, the model must say "I don't know — consult a Rechtsanwalt." No hedging. No plausible-sounding prose without a source.
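
Prompt instructions alone are not a guarantee, so the same rule can be enforced mechanically after generation. A hedged sketch of such a guard; the regex, function name, and refusal string are assumptions for illustration:

```python
import re

# Matches citations like "§ 7 SGB II"; group 1 is the paragraph number,
# group 2 the statute name.
CITATION = re.compile(r"§\s*(\d+[a-z]?)\s*(SGB\s*[IVX]+)")

REFUSAL = "I don't know. Consult a Rechtsanwalt."

def enforce_citations(answer, retrieved):
    """Keep the answer only if it cites at least one paragraph AND every
    cited paragraph was actually retrieved; otherwise refuse."""
    cited = {(f"§{n}", s.replace(" ", "")) for n, s in CITATION.findall(answer)}
    if cited and cited <= retrieved:
        return answer
    return REFUSAL
```

The subset check matters as much as the non-empty check: it also catches the model citing a plausible-sounding paragraph that was never in the retrieved context.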

Evaluation harness as first-class infrastructure. 50+ golden questions with expected source paragraphs and acceptable answer shapes. Metrics: precision@5 and recall@5 on retrieval, faithfulness (does every answer claim have a citation?), and "no-source" compliance (does the system correctly refuse when it shouldn't answer?). The harness runs on every corpus change, every prompt change, every model swap.
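
The two core metrics are simple to state precisely. A minimal sketch, assuming golden items carry expected source paragraphs or an expected-refusal flag; the field names are illustrative, not the harness's actual schema:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved paragraphs that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top) if top else 0.0

def refusal_compliance(cases):
    """Share of no-source golden questions the system correctly refused."""
    refusals = [c for c in cases if c["expect_refusal"]]
    correct = sum(1 for c in refusals if c["refused"])
    return correct / len(refusals) if refusals else 1.0
```

Running both on every corpus, prompt, or model change is what turns "did the change help?" into a number instead of a hunch.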

Production architecture. FastAPI backend, Next.js frontend, PostgreSQL for session state, ChromaDB for vectors. Kept deliberately boring — this is a system where reliability matters more than novelty.

Outcomes

  • Citation grounding works. Every answer traces back to a verifiable SGB paragraph on gesetze-im-internet.de — or the system refuses. Users can click a citation and land on the exact statutory paragraph.
  • Retrieval quality measurable and improving. The eval harness turns "did the change help?" from a hunch into a number. Precision@5 is meaningfully above a dense-only baseline via hybrid retrieval + re-ranking (representative).
  • Known failure modes documented. Irony in questions, recently-amended paragraphs, questions that require common sense rather than statute — these failure modes are catalogued and users see explicit warnings.

Stack / Constraints

Stack: Python, FastAPI, ChromaDB (vectors), Sentence Transformers for bilingual embeddings, OpenAI and Anthropic models (evaluated both; domain dictates choice), Next.js frontend, PostgreSQL for session state.

Constraints: Single-developer project, zero tolerance for confident hallucination, bilingual retrieval requirement, and real users with real stakes.

Decisions & Tradeoffs

  • Hybrid retrieval over pure dense. Dense alone missed citation-heavy queries where exact paragraph numbers matter. Hybrid is more code but dramatically better eval numbers.
  • Boring production stack. No LangChain orchestration soup. FastAPI + a few well-understood components. Easier to debug at 3am when something breaks.
  • Refusal over hedging. Prompt forces "I don't know" when retrieval fails, rather than letting the model produce a plausible-sounding guess. Users lose a little convenience, gain a lot of trust.
  • Eval harness before frontend polish. Built the harness before investing in UI. The reverse order would have meant shipping a product I couldn't measure.

What I'd Do Differently

Start the eval harness on day one. I built retrieval and generation first and retrofitted the harness. I should have started with the golden questions — they shape every downstream decision.

More aggressive negative examples. The harness has "expected correct answers" but initially was light on "expected refusals." Adding explicit no-source questions caught several over-confidence bugs.

Invest earlier in a legal review loop. Every production answer should ideally be reviewable by someone who reads German law for a living. This is still a gap — a structured feedback channel with a lawyer is the next milestone.

The deeper lesson: reliability engineering doesn't stop at the platform. Evaluation harnesses, failure-mode analysis, explicit refusal patterns — these are the same disciplines I've applied to distributed systems for 15 years, now applied to LLMs. The substrate is new; the rigor isn't.