Large-Scale Data Extraction from a National Clinical Registry

In this project, our team collaborates with the Kompetenznetz für Angeborene Herzfehler (Competence Network for Congenital Heart Defects), one of Germany's largest clinical registries for congenital heart disease.

The registry contains tens of thousands of detailed medical reports collected over many years, each written in free text by different clinical teams across the country. Our goal was to transform this vast collection of unstructured clinical documents into a structured, analysis-ready database — without compromising patient privacy or the original richness of the data.

From PDFs to reliable tables — without hallucinations.

We developed a hybrid AI pipeline that combines large language models (LLMs) with rule-based postprocessing and strict schema validation. This approach allows us to automatically extract key clinical parameters, diagnoses, procedures, and time points from narrative reports while maintaining high precision and traceability.
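The interplay of schema validation and rule-based postprocessing can be sketched as follows. This is a minimal, hypothetical illustration (the field names, the ISO-date rule, and the verbatim grounding check are assumptions, not our production schema): the LLM's JSON output is accepted only if it matches the expected schema, passes formatting rules, and is literally grounded in the source report, which is one simple way to reject hallucinated values.

```python
import json
import re

# Hypothetical minimal schema: expected fields and their types.
SCHEMA = {
    "diagnosis": str,
    "procedure": str,
    "exam_date": str,  # expected as ISO 8601, checked below
}

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def validate_extraction(raw_json: str, source_text: str) -> dict:
    """Parse an LLM's JSON output, enforce the schema, and reject
    values that are not literally grounded in the source report."""
    record = json.loads(raw_json)

    # 1. Strict schema validation: no missing or extra keys, correct types.
    if set(record) != set(SCHEMA):
        raise ValueError(f"schema mismatch: {sorted(record)}")
    for key, expected in SCHEMA.items():
        if not isinstance(record[key], expected):
            raise ValueError(f"{key}: expected {expected.__name__}")

    # 2. Rule-based postprocessing: dates must be ISO 8601 formatted.
    if not DATE_RE.match(record["exam_date"]):
        raise ValueError("exam_date is not ISO 8601")

    # 3. Grounding check against hallucination: extracted terms must
    #    occur verbatim (case-insensitively) in the narrative report.
    for key in ("diagnosis", "procedure"):
        if record[key].lower() not in source_text.lower():
            raise ValueError(f"{key} not found in source text")

    return record

# Toy usage with an invented report and model output:
report = ("Echocardiography on 2021-03-14 confirms tetralogy of Fallot "
          "after corrective surgery.")
llm_output = ('{"diagnosis": "tetralogy of Fallot", '
              '"procedure": "corrective surgery", '
              '"exam_date": "2021-03-14"}')
record = validate_extraction(llm_output, report)
```

A record that fails any of the three stages is flagged for review rather than silently written to the database, which is what keeps the resulting tables traceable back to the original text.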

All privacy-sensitive processing steps run locally on our on-site hardware, keeping the pipeline fully compliant with the EU General Data Protection Regulation (GDPR); where required, the entire analysis can be performed on premises.

We built the system in a modular way, so that model components can be replaced or updated easily — supporting both cloud-based and local LLMs. The resulting database now enables large-scale scientific analyses, registry-based studies, and the integration of structured data into clinical dashboards and AI pipelines.
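The modular design amounts to programming against a small model interface rather than a concrete provider. The sketch below is illustrative only (class and function names are invented, and the backends are stubs, not real API clients): because the pipeline depends solely on the interface, a cloud model can be swapped for a local one without touching the rest of the code.

```python
from typing import Protocol

class ExtractionModel(Protocol):
    """Minimal interface every model backend must implement."""
    def extract(self, report: str) -> str: ...

class CloudBackend:
    """Stub standing in for a hosted LLM API (non-sensitive text only)."""
    def extract(self, report: str) -> str:
        # A real backend would send the report to a remote API here.
        return '{"backend": "cloud"}'

class LocalBackend:
    """Stub standing in for an on-site model serving sensitive reports."""
    def extract(self, report: str) -> str:
        # A real backend would run a locally hosted model here.
        return '{"backend": "local"}'

def select_backend(contains_phi: bool) -> ExtractionModel:
    """Route reports containing personal health information to local hardware."""
    return LocalBackend() if contains_phi else CloudBackend()
```

Replacing or updating a model then means adding one new class that satisfies `ExtractionModel`; downstream extraction, validation, and database code remains unchanged.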

“From narrative text to trustworthy data — turning documentation into knowledge.”

We believe this approach represents a blueprint for similar registries and research networks seeking to unlock the value of their clinical data securely and transparently.