January 20, 2026

Citation-grounded answers • 50+ golden eval set

Building RAG for German Legal Documents: When Hallucination Isn't an Option

RAG system for German social law that helps the Balkan diaspora navigate Jobcenter bureaucracy — with forced citations, bilingual retrieval, and an evaluation harness that actually catches failures.

AI Engineering • RAG • LLM • Evaluation • Production AI
TL;DR
  • Problem with stakes: German legal text (SGB II/III/X) is dense, cross-referenced, and unforgiving. Generic LLMs hallucinate paragraph numbers that don't exist. Wrong answers can cost users their benefits.
  • Approach: RAG with bilingual embeddings (BCS questions → German corpus), paragraph-aware chunking that preserves §§ cross-references, and citation-forced generation that refuses to answer without a grounded source.
  • What makes it work: An evaluation harness with 50+ golden questions and explicit failure-mode tracking. Without the harness, the system is just vibes.

Scope & constraints

Role: Founder & Engineer
Scope: RAG architecture, corpus ingestion, evaluation harness, production system
Constraints: wrong answers can cost users their benefits — no tolerance for hallucination

Role & Scope

  • Project: personal side project serving Balkan diaspora in Germany
  • Responsibility: end-to-end system design, corpus engineering, evaluation, production deployment
  • Stakeholders: real users (family, friends, community) with real stakes
  • Constraints: wrong answers can cost users their benefits — no tolerance for hallucination

Outcome metrics

  • Retrieval: precision@5 meaningfully above dense-only baseline via hybrid + re-ranking (representative)
  • Faithfulness: every generated answer includes at least one verifiable SGB citation or explicit 'I don't know'
  • Coverage: ingested targeted paragraphs of SGB II, SGB III, and SGB X from gesetze-im-internet.de

Context

Roughly 1.3 million people of Balkan origin live in Germany, many of them navigating the social system in a language they don't fully read. A typical scenario: a letter arrives from the Jobcenter with a seven-day deadline, written in formal Amt-Deutsch, referencing paragraphs of SGB II or SGB III that even native speakers struggle to interpret. Getting it wrong has real consequences — benefits cut, appointments missed, appeals lost.

KlarAmt is a tool built for this gap. Users ask questions in BCS (Bosnian/Croatian/Serbian). The system retrieves relevant paragraphs from German social law and returns an answer with verifiable citations — in formal Amt-Deutsch.

The technical challenge isn't "build a chatbot." It's "build a retrieval system over legal text where every answer must be verifiable, the user's language differs from the corpus language, and a wrong answer has real-world cost."

Problem

Generic LLMs hallucinate paragraph numbers. Ask ChatGPT "what does §7 SGB II say about job-seeker obligations?" and you get plausible-sounding but often incorrect text. Paragraph numbers are invented. References point to paragraphs that don't exist. In a domain where citations matter, this is unusable.

Legal text resists naive chunking. German social law is a graph. §7 references §11. §11 references §20. Chunk it by tokens and you lose the reference chain. Chunk it by paragraph and you miss context that lives in §§ cross-references.

The user's language isn't the corpus language. Questions come in BCS. The corpus is German — specifically, formal legal German. A monolingual retriever fails the moment the question uses a word like "pomoć" (help) and the corpus uses "Leistung" (benefit).

Failure is invisible without evaluation. Without an eval harness, you have no idea whether the system works. You have vibes. In a domain where wrong answers have stakes, vibes are malpractice.

Approach

Corpus ingestion with paragraph-aware chunking. The ingestion pipeline parses SGB II, SGB III, and SGB X from gesetze-im-internet.de, preserving paragraph boundaries and §§ cross-references as metadata. Each chunk carries its paragraph number, its parent statute, and a list of referenced paragraphs — so retrieval can expand context when a chunk references another.
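
The chunking idea can be sketched in a few lines. This is a hypothetical illustration, not the project's actual ingestion code: the `Chunk` fields, the reference regex, and the `expand` helper are all assumptions about what "paragraph-aware with §§ metadata" might look like.

```python
import re
from dataclasses import dataclass, field

# One chunk per statutory paragraph; outgoing §§ references are kept as
# metadata so retrieval can pull in referenced paragraphs later.
REF_PATTERN = re.compile(r"§§?\s*(\d+[a-z]?)")

@dataclass
class Chunk:
    statute: str                 # parent statute, e.g. "SGB II"
    paragraph: str               # paragraph number, e.g. "7"
    text: str
    references: list = field(default_factory=list)

def chunk_paragraph(statute, paragraph, text):
    """Build one chunk per paragraph, extracting referenced paragraph numbers."""
    refs = [m for m in REF_PATTERN.findall(text) if m != paragraph]
    return Chunk(statute, paragraph, text, refs)

def expand(chunk, index):
    """Follow the reference metadata: return the chunk plus any referenced
    chunks present in the index, so generation sees the full context."""
    found = [index[(chunk.statute, p)] for p in chunk.references
             if (chunk.statute, p) in index]
    return [chunk] + found
```

The payoff is in `expand`: when §7 references §11, retrieval can hand the generator both paragraphs instead of a fragment with a dangling reference.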

Bilingual retrieval. Multilingual Sentence Transformers models embed both BCS questions and German paragraphs into a shared vector space (several models were tested; the final choice was driven by eval results). Hybrid retrieval combines dense vectors with keyword matching on legal terminology (§§ numbers and specific statutory terms that translate predictably).
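
The dense-plus-keyword combination reduces to a weighted score. A minimal sketch, with toy vectors standing in for the multilingual embeddings; `alpha`, `legal_overlap`, and the legal-token regex are illustrative assumptions:

```python
import math
import re

# Exact-match tokens that survive translation: § numbers and statute names.
LEGAL_TOKEN = re.compile(r"§\s*\d+[a-z]?|SGB\s+[IVX]+")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def legal_overlap(query, doc):
    """Fraction of the query's legal tokens found verbatim in the document."""
    q = {t.replace(" ", "") for t in LEGAL_TOKEN.findall(query)}
    d = {t.replace(" ", "") for t in LEGAL_TOKEN.findall(doc)}
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(q_vec, d_vec, query, doc, alpha=0.7):
    """Blend dense similarity with keyword overlap on legal terminology."""
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * legal_overlap(query, doc)
```

The keyword term is what rescues citation-heavy queries: "§ 7 SGB II" matches exactly even when the rest of the question is in BCS and the embedding similarity alone is middling.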

Citation-forced generation. The prompt template is strict: the model receives retrieved paragraphs as context and is instructed to cite at least one source paragraph for every claim. If no retrieved paragraph supports an answer, the model must say "I don't know — consult a Rechtsanwalt." No hedging. No plausible-sounding prose without a source.
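
Prompt instructions alone are not a guarantee, so the same rule can be enforced mechanically after generation. A hedged sketch of such a guard; the regex, function name, and refusal string are assumptions for illustration:

```python
import re

# Matches citations like "§ 7 SGB II"; group 1 is the paragraph number,
# group 2 the statute name.
CITATION = re.compile(r"§\s*(\d+[a-z]?)\s*(SGB\s*[IVX]+)")

REFUSAL = "I don't know. Consult a Rechtsanwalt."

def enforce_citations(answer, retrieved):
    """Keep the answer only if it cites at least one paragraph AND every
    cited paragraph was actually retrieved; otherwise refuse."""
    cited = {(f"§{n}", s.replace(" ", "")) for n, s in CITATION.findall(answer)}
    if cited and cited <= retrieved:
        return answer
    return REFUSAL
```

The subset check matters as much as the non-empty check: it also catches the model citing a plausible-sounding paragraph that was never in the retrieved context.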

Evaluation harness as first-class infrastructure. 50+ golden questions with expected source paragraphs and acceptable answer shapes. Metrics: precision@5 and recall@5 on retrieval, faithfulness (does every answer claim have a citation?), and "no-source" compliance (does the system correctly refuse when it shouldn't answer?). The harness runs on every corpus change, every prompt change, every model swap.
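
The two core metrics are simple to state precisely. A minimal sketch, assuming golden items carry expected source paragraphs or an expected-refusal flag; the field names are illustrative, not the harness's actual schema:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved paragraphs that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top) if top else 0.0

def refusal_compliance(cases):
    """Share of no-source golden questions the system correctly refused."""
    refusals = [c for c in cases if c["expect_refusal"]]
    correct = sum(1 for c in refusals if c["refused"])
    return correct / len(refusals) if refusals else 1.0
```

Running both on every corpus, prompt, or model change is what turns "did the change help?" into a number instead of a hunch.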

Production architecture. FastAPI backend, Next.js frontend, PostgreSQL for session state, ChromaDB for vectors. Kept deliberately boring — this is a system where reliability matters more than novelty.

Outcomes

  • Citation grounding works. Every answer traces back to a verifiable SGB paragraph on gesetze-im-internet.de — or the system refuses. Users can click a citation and land on the exact statutory paragraph.
  • Retrieval quality measurable and improving. The eval harness turns "did the change help?" from a hunch into a number. Precision@5 is meaningfully above a dense-only baseline via hybrid retrieval + re-ranking (representative).
  • Known failure modes documented. Irony in questions, recently-amended paragraphs, questions that require common sense rather than statute — these failure modes are catalogued and users see explicit warnings.

Stack / Constraints

Stack: Python, FastAPI, ChromaDB (vectors), Sentence Transformers for bilingual embeddings, OpenAI and Anthropic models (evaluated both; domain dictates choice), Next.js frontend, PostgreSQL for session state.

Constraints: Single-developer project, zero tolerance for confident hallucination, bilingual retrieval requirement, and real users with real stakes.

Decisions & Tradeoffs

  • Hybrid retrieval over pure dense. Dense alone missed citation-heavy queries where exact paragraph numbers matter. Hybrid is more code but dramatically better eval numbers.
  • Boring production stack. No LangChain orchestration soup. FastAPI + a few well-understood components. Easier to debug at 3am when something breaks.
  • Refusal over hedging. Prompt forces "I don't know" when retrieval fails, rather than letting the model produce a plausible-sounding guess. Users lose a little convenience, gain a lot of trust.
  • Eval harness before frontend polish. Built the harness before investing in UI. The reverse order would have meant shipping a product I couldn't measure.

What I'd Do Differently

Start the eval harness on day one. I built retrieval and generation first and retrofitted the harness. I should have started with the golden questions — they shape every downstream decision.

More aggressive negative examples. The harness has "expected correct answers" but initially was light on "expected refusals." Adding explicit no-source questions caught several over-confidence bugs.

Invest earlier in a legal review loop. Every production answer should ideally be reviewable by someone who reads German law for a living. This is still a gap — a structured feedback channel with a lawyer is the next milestone.

The deeper lesson: reliability engineering doesn't stop at the platform. Evaluation harnesses, failure-mode analysis, explicit refusal patterns — these are the same disciplines I've applied to distributed systems for 15 years, now applied to LLMs. The substrate is new; the rigor isn't.