Quality Measurement and Benchmarks

How we measure legal AI quality, and where that measurement reaches its limits. No marketing numbers without methodology.

Our Measurement Principles

No number without methodology: Every accuracy figure is documented with test set, clause type and evaluation criterion.
External validation: Internal measurements are supplemented by external benchmark references — no self-congratulation.
Edge case disclosure: We indicate areas where the system is uncertain — so users can review with full information.
Update cycle: Benchmarks are renewed with every major model update. The measurement date is always shown in the header.
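
One way to make the "no number without methodology" principle mechanical is to store every figure in a record that cannot be created without its context. The sketch below is illustrative only: the class name ReportedFigure and all field names are assumptions made for this example, not Clausa's actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReportedFigure:
    """Hypothetical record: a figure plus the context it must ship with."""
    value: float        # a rate between 0 and 1, e.g. precision or recall
    metric: str         # evaluation criterion
    test_set: str       # which test set the figure was measured on
    clause_type: str    # which clause type the figure applies to
    measured_on: date   # date shown alongside the figure

    def __post_init__(self):
        # Refuse to create a figure whose methodology fields are blank.
        for field_name in ("metric", "test_set", "clause_type"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must be documented")
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("value must lie between 0 and 1")
```

With a record like this, a figure that lacks a test set, clause type or criterion fails at construction time and never reaches a report.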

External Benchmark Reference

The DACL Benchmark (arXiv:2601.06181) is an external scientific dataset for evaluating AI systems on German-language court rulings. The study reports that specialised models significantly outperform generic LLMs in judicial classification accuracy. Clausa draws on the DACL methodology for internal evaluations; the published study figures are third-party data, not Clausa's own measurements.

Internal Quality Methodology

Tenancy law test set: Curated collection of lease agreement clauses, annotated by legal experts with assessments validated against BGH rulings.
Employment law test set: Annotated employment contract clauses with KSchG/TzBfG compliance labels.
Precision / Recall per clause type: Separate measurements for all relevant clause types from tenancy and employment law.
Human-in-the-loop validation: Sample review by legal advisors at each model update.
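
Measuring precision and recall separately per clause type can be sketched as follows. This is a minimal illustration that assumes a binary task (a clause is flagged as problematic or not); the Example record, the function name and the clause-type labels are hypothetical, and the sample data is invented purely to show the calculation. It is not a measurement result.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    clause_type: str   # hypothetical label, e.g. "cosmetic_repairs"
    expected: bool     # expert annotation: clause is problematic
    predicted: bool    # system output for the same clause


def precision_recall_by_clause_type(examples):
    """Return {clause_type: (precision, recall)}; None where undefined."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ex in examples:
        c = counts[ex.clause_type]  # registers the type even if only negatives occur
        if ex.predicted and ex.expected:
            c["tp"] += 1
        elif ex.predicted:
            c["fp"] += 1
        elif ex.expected:
            c["fn"] += 1
        # True negatives do not enter precision or recall.

    results = {}
    for clause_type, c in counts.items():
        flagged = c["tp"] + c["fp"]    # everything the system flagged
        relevant = c["tp"] + c["fn"]   # everything the experts flagged
        precision = c["tp"] / flagged if flagged else None
        recall = c["tp"] / relevant if relevant else None
        results[clause_type] = (precision, recall)
    return results


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        Example("cosmetic_repairs", True, True),
        Example("cosmetic_repairs", True, False),
        Example("cosmetic_repairs", False, True),
        Example("fixed_term", False, False),
    ]
    for clause_type, (p, r) in precision_recall_by_clause_type(sample).items():
        print(clause_type, p, r)
```

Returning None instead of 0 for a clause type with no flagged or no relevant examples keeps "not measurable" distinct from "measured and poor", which matches the limit on clause types without test data noted further down.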

Known Measurement Limits

New clause types without test data cannot be assessed reliably.
Measurements apply to standard contract types — heavily individualised contracts may deviate.
Internal benchmark results are not yet published. Publication will follow with the launch of the first pilot phase (Q3 2026).

Frequently Asked Questions

Why are no specific percentages given?

We publish numbers only after external validation and methodological completeness. Numbers without context are marketing — that is not what we want.

When will public benchmark results be available?

With the launch of the first pilot phase (Q3 2026), we will publish initial validated results, including a test set description and methodology documentation.

Can I contribute my own test data?

Yes. Pilot partners can contribute anonymised clause samples. This broadens the test set and makes the benchmarks more meaningful.

Receive Benchmark Updates

Pilot partners receive benchmark reports before the public launch.
