Abstract
Background: Integrated health data are foundational for secondary use, research, and policymaking. However, data quality issues, such as missing values and inconsistencies, are common due to the heterogeneity of health data sources. Existing frameworks often use static, 1-time assessments, which limit their ability to address quality issues across evolving data pipelines.

Objective: This study evaluates the AIDAVA (artificial intelligence-powered data curation and validation) data quality framework, which introduces dynamic, life cycle-based validation of health data using knowledge graph technologies and SHACL (Shapes Constraint Language)-based rules. The framework is assessed for its ability to detect and manage data quality issues, specifically completeness and consistency, during integration.

Methods: Using the MIMIC-III (Medical Information Mart for Intensive Care-III) dataset, we simulated real-world data quality challenges by introducing structured noise, including missing values and logical inconsistencies. The data were transformed into source knowledge graphs and integrated into a unified personal health knowledge graph. SHACL validation rules were applied iteratively during the integration process, and data quality was assessed under varying noise levels and integration orders.

Results: The AIDAVA framework effectively detected completeness and consistency issues across all scenarios. Completeness was shown to influence the interpretability of consistency scores, and domain-specific attributes (eg, diagnoses and procedures)

Conclusions: AIDAVA supports dynamic, rule-based validation throughout the data life cycle. By addressing both dimension-specific vulnerabilities and cross-dimensional effects, it lays the groundwork for scalable, high-quality health data integration. Future work should explore deployment in live clinical settings and expand to additional quality dimensions.
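The two quality dimensions named in the abstract can be illustrated with a minimal sketch. This is not the AIDAVA implementation: plain Python checks stand in for SHACL shapes, and the record layout, the field names, and the `discharge_not_before_admit` rule are hypothetical choices made only for illustration. The sketch shows one way completeness can affect how a consistency score should be read: a rule can only be evaluated on records where the fields it needs are present.

```python
from typing import Callable, Optional

# Hypothetical flat record layout (NOT the MIMIC-III schema or an AIDAVA structure).
Record = dict[str, Optional[str]]

REQUIRED_FIELDS = ("patient_id", "admit_date", "discharge_date", "diagnosis")


def completeness(records: list[Record], required=REQUIRED_FIELDS) -> float:
    """Share of required fields that are actually filled in."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0
    filled = sum(
        1 for r in records for f in required if r.get(f) not in (None, "")
    )
    return filled / total


def discharge_not_before_admit(r: Record) -> Optional[bool]:
    """Illustrative consistency rule; returns None when it cannot be evaluated."""
    admit, discharge = r.get("admit_date"), r.get("discharge_date")
    if not admit or not discharge:
        return None
    # ISO 8601 date strings (YYYY-MM-DD) compare chronologically as text.
    return discharge >= admit


def consistency(
    records: list[Record], rule: Callable[[Record], Optional[bool]]
) -> tuple[Optional[float], int]:
    """Pass rate over evaluable records, plus how many records were evaluable."""
    evaluable = [x for x in (rule(r) for r in records) if x is not None]
    score = sum(evaluable) / len(evaluable) if evaluable else None
    return score, len(evaluable)


# Made-up example data: one clean record, one inconsistent, one incomplete.
records = [
    {"patient_id": "p1", "admit_date": "2020-01-01",
     "discharge_date": "2020-01-05", "diagnosis": "I10"},
    {"patient_id": "p2", "admit_date": "2020-02-10",
     "discharge_date": "2020-02-01", "diagnosis": "E11"},
    {"patient_id": "p3", "admit_date": "2020-03-01",
     "discharge_date": None, "diagnosis": None},
]

print(completeness(records))
print(consistency(records, discharge_not_before_admit))
```

Reporting the number of evaluable records next to the consistency score is a design choice of this sketch: a high score computed over few evaluable records carries less information than the same score computed over a complete dataset.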
| Original language | English |
|---|---|
| Article number | e75275 |
| Number of pages | 15 |
| Journal | JMIR Medical Informatics |
| Volume | 13 |
| Publication status | Published - 2025 |
Keywords
- data quality
- knowledge graph
- ontology
- health data
- data quality dimensions
- data quality assessment
- secondary use
- data quality framework
- fit for purpose
- CLINICAL-RESEARCH
- RECORDS