Vitalis 2025

Ensuring Ethical and Trustworthy Secondary Use of Health Data: Insights from the Dorieh Platform

Wednesday May 21, 2025 13:00 - 13:15 A3

Lecturers: Michael Bouzinier, Francesco Pontiggia

Track: Health data

The rapid growth of health data presents an enormous opportunity for research and policy-making, but it also poses significant challenges in trust, efficiency, and regulatory compliance. Secondary use of health data requires robust infrastructure so that data can be used accurately for insights while meeting ethical and legal standards. This paper explores how advanced data management tools, such as descriptive workflow languages and domain-specific languages (DSLs), can make health data infrastructure more trustworthy and efficient.


The primary focus of the research was the Dorieh Data Platform, developed by Harvard University Research Computing in partnership with the Harvard T.H. Chan School of Public Health. Dorieh takes a data management approach built on descriptive dataflow operators, which enable granular tracing of data transformations. This addresses a critical aspect of data provenance: the ability to trace and validate the lineage of every data element. Dorieh is deployed in the Harvard University FISMA-compliant Trusted Research Environment (TRE), which leverages Open OnDemand infrastructure. It is being used to prepare and document research datasets for National Studies of Air Pollution and Health.


Central to this work is the use of a domain-specific language for data modeling. The DSL makes every transformation an explicit definition, which enhances reproducibility and accountability in the secondary use of health data. By integrating it with descriptive workflow languages, we create comprehensive frameworks that better adapt to the demands of modern data science, particularly in healthcare, where regulatory compliance requirements are rigorous.
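
The idea of explicit transformation definitions can be illustrated with a minimal Python sketch. It is not Dorieh's actual DSL: the `Column` class, the `MODEL` list, the `apply_model` and `trace` helpers, and the example columns (`birth_year`, `record_year`, `age`, `is_senior`) are all hypothetical names chosen for illustration. Each derived column declares its sources and its rule, so a lineage record can be produced alongside every computed value.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical declarative model (not Dorieh's real syntax): each derived
# column names its source columns, a human-readable rule, and the function
# that implements the rule.
@dataclass(frozen=True)
class Column:
    name: str
    sources: tuple
    rule: str
    transform: Callable[..., object]


MODEL = [
    Column("age", ("birth_year", "record_year"),
           "record_year - birth_year",
           lambda birth_year, record_year: record_year - birth_year),
    Column("is_senior", ("age",),
           "age >= 65",
           lambda age: age >= 65),
]


def apply_model(record, model):
    """Apply each column definition in order and log one lineage entry per column."""
    out = dict(record)
    lineage = []
    for col in model:
        inputs = {s: out[s] for s in col.sources}
        out[col.name] = col.transform(**inputs)
        lineage.append({
            "column": col.name,
            "sources": list(col.sources),
            "rule": col.rule,
            "inputs": inputs,
        })
    return out, lineage


def trace(column, model):
    """Return the set of raw input columns that a column ultimately depends on."""
    by_name = {c.name: c for c in model}
    if column not in by_name:
        return {column}  # not derived, so it is a raw input
    raw = set()
    for source in by_name[column].sources:
        raw |= trace(source, model)
    return raw


row, lineage = apply_model({"birth_year": 1950, "record_year": 2020}, MODEL)
raw_inputs = trace("is_senior", MODEL)
```

Because the model is data rather than ad hoc code, the same definitions drive both the computation and the provenance report, which is the property the abstract attributes to a DSL-based approach.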


Applying these methodologies to Medicare data revealed data inconsistencies and underscored the effectiveness of Dorieh's approach in maintaining data quality. By providing detailed data lineage and error logging, Dorieh strengthens the trustworthiness and regulatory compliance of data-driven projects. We advocate adopting similar DSL tools across diverse health-related domains so that data lineage is meticulously documented, reinforcing the reliability and validity of research outcomes.
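
As a rough illustration of error logging for data inconsistencies, the sketch below checks records against named consistency rules and records every failure instead of silently dropping the row. The rules, field names (`id`, `birth_year`, `death_year`), and sample records are invented for this example and do not describe Medicare data or Dorieh's actual checks.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    record_id: str
    rule: str
    detail: str


# Hypothetical consistency rules; each returns True when the record is valid.
RULES = {
    "birth_not_after_death": lambda r: (
        r["death_year"] is None or r["birth_year"] <= r["death_year"]
    ),
    "plausible_birth_year": lambda r: 1880 <= r["birth_year"] <= 2025,
}


def validate(records):
    """Split records into clean rows and a log of every rule violation."""
    clean, issues = [], []
    for r in records:
        failed = [name for name, check in RULES.items() if not check(r)]
        if failed:
            # Keep one log entry per failed rule so each problem is auditable.
            issues.extend(Issue(r["id"], name, repr(r)) for name in failed)
        else:
            clean.append(r)
    return clean, issues


sample = [
    {"id": "a", "birth_year": 1940, "death_year": 2010},
    {"id": "b", "birth_year": 1950, "death_year": 1945},  # dies before birth
    {"id": "c", "birth_year": 1700, "death_year": None},  # implausible year
]
clean, issues = validate(sample)
```

Keeping the rule name and the offending record in each log entry means an excluded row can always be explained later, which supports the auditability that regulatory review tends to require.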


While the methodologies discussed were developed within a tightly controlled environment, they are positioned to scale to more complex ecosystems such as the European Health Data Space (EHDS). By addressing multimodal regulatory requirements, including FISMA and EMA-HMA Data Quality stipulations, this approach can become a pivotal element of modern data governance, helping ensure that AI models and health policy decisions rest on transparent and scientifically sound data processing methods.

Language

English

Topic

Data and Information

Seminar type

Live + On site

Lecture type

Presentation

Objective of lecture

Tools for implementation

Level of knowledge

Intermediate

Target audience

Management/decision makers
Technicians/IT/Developers
Researchers

Keyword

Actual examples (good/bad)
Innovation/research
Apps
Government information
Informatics/Interoperability

Lecturers

Michael Bouzinier Lecturer

Harvard University

Francesco Pontiggia Lecturer

Francesco Pontiggia is a Senior Director at Harvard University Research Computing