Healthcare data has economic value and is evaluated as such. Therefore, it has attracted global attention from observational and clinical studies alike. Recently, data quality has emerged as an important topic in healthcare data research, and various studies are being conducted on it. In this study, we propose DQ4HEALTH, a data quality model for healthcare derived from a review of the existing data quality literature. The model includes 5 dimensions and 415 validation rules. Four evaluation indicators, the net pass rate (NPR), weighted pass rate (WPR), net dimensional pass rate (NDPR), and weighted dimensional pass rate (WDPR), were used to evaluate Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) data at three medical institutions. These indicators identified differences in data quality between the institutions. The NPRs of the three institutions (A, B, and C) were 96.58%, 90.08%, and 90.87%, respectively, and the WPRs were 98.52%, 94.26%, and 94.81%, respectively. In the evaluation by dimension, the consistency dimension accounted for 70.06% of the total error data. The WDPRs for the consistency dimension were 98.22%, 94.74%, and 95.05% for institutions A, B, and C, respectively. This study presents a quality evaluation model and indices for comparing data quality in the healthcare field. Using these indices, medical institutions can evaluate the quality of their data and identify practical directions for decreasing errors.
Keywords: healthcare data; OMOP CDM; multisite study; data quality assessment
Healthcare data is regarded as data with economic value; consequently, it attracts global attention from observational and clinical studies alike [[
Data quality studies continue to use tools and assessment approaches to improve the quality of EHR data [[
Data quality assessment is vital in multicenter studies. The Observational Health Data Sciences and Informatics (OHDSI), a representative multicenter research network, is a network in which multiple stakeholders collaborate by analyzing large-scale medical data converted into the Common Data Model (CDM) format [[
This study presents a new conceptual model that can be applied to the quality of healthcare data through the review of existing data quality literature. In addition, this study selected medical institutions that established a large-scale cohort based on OHDSI's CDM. A data quality evaluation was then performed. This study contributes to the improvement of healthcare data quality by checking the difference in quality for each medical institution after a quality evaluation.
The development and evaluation of the quality model for clinical epidemiological information linked to human materials involves the following four steps:
- Develop a model for healthcare data quality evaluation;
- Define the quality evaluation rules to be applied to quality evaluation;
- Define the evaluation method using the quality evaluation model;
- Verify the model by using it to evaluate the CDM data of hospitals.
This study used Google Scholar to compare and review eight data quality papers from 1990 to 2020. This allowed us to build a conceptual model of healthcare data quality. Based on the literature on information system data quality dimensions, we selected five dimensions as the evaluation criteria [[
We conducted a systematic review to examine 17,800 pieces of related literature and then reviewed eight studies that suggested an original data quality measurement method for research purposes. These studies derived their methods by collecting expert opinions and conducting in-depth reviews. From them, we derived a total of five DQ4HEALTH dimensions.
We excluded 5251 articles that either dealt with data quality for unstructured data or concerned a database built to operate an information system rather than a database for research purposes. Additionally, some data quality measurement methods were applicable only to a specific clinical domain or were difficult to use as generalized evaluation items; this led us to exclude an additional 9033 articles. Finally, 3506 documents were excluded because a data quality measurement method was difficult to derive from them, as they applied survey and analysis methods.
Completeness is a criterion for evaluating whether data are missing in the process of expressing real-world data as a system. For example, if a patient visits a hospital and undergoes an examination, the patient, examination, and visit information should all be present in the data.
Validity is a criterion for assessing whether the data included in the system comply with the acceptable range and format. For example, only values from 1 (January) to 12 (December) can be accepted for the month of birth, and the format must be an integer.
Accuracy evaluates whether real-world data are accurately represented by the system. For example, (i) the end date of a patient's prescription medication must be equal to one day subtracted from the start date of the medication prescription, and (ii) the medication cannot be prescribed after the patient's date of death.
Uniqueness is a dimension that examines whether data values that must be unique, given the characteristics of the data, are in fact unique. For example, a unique prescription number should be assigned to each prescription of medication, measurement, or procedure.
Consistency assesses the relationship between the inside and outside of the system by evaluating whether the data are consistent according to the system structure. For example, if a patient visits a hospital as an outpatient and receives a medication prescription, the value of their assigned patient number should exist in both the visit information and the medication information. This rule examines whether different data are linked through foreign keys in the database.
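The five dimensions can be illustrated with one toy rule each. The following is a minimal sketch, not the DQ4HEALTH implementation: the record layout, the function names, and all data values are assumptions made for illustration, and each function simply returns the indices of the rows that fail its rule.

```python
from datetime import date

# Toy records loosely modeled on OMOP CDM tables; every value is invented.
persons = [
    {"person_id": 1, "month_of_birth": 5, "death_date": None},
    {"person_id": 2, "month_of_birth": 13, "death_date": date(2020, 3, 1)},
    {"person_id": 3, "month_of_birth": 7, "death_date": None},
    {"person_id": 3, "month_of_birth": 7, "death_date": None},    # duplicate key
    {"person_id": None, "month_of_birth": 1, "death_date": None}, # missing key
]
drug_exposures = [
    {"person_id": 1, "start_date": date(2021, 1, 10)},
    {"person_id": 2, "start_date": date(2020, 4, 1)},   # after person 2 died
    {"person_id": 99, "start_date": date(2021, 2, 1)},  # no such person
]


def completeness(rows, column):
    """Completeness: a required column must not be null."""
    return [i for i, r in enumerate(rows) if r[column] is None]


def validity_range(rows, column, low, high):
    """Validity: a non-null value must be an integer within [low, high]."""
    return [i for i, r in enumerate(rows)
            if r[column] is not None
            and not (isinstance(r[column], int) and low <= r[column] <= high)]


def uniqueness(rows, column):
    """Uniqueness: a non-null key value must not repeat."""
    seen, failures = set(), []
    for i, r in enumerate(rows):
        value = r[column]
        if value is None:
            continue
        if value in seen:
            failures.append(i)
        seen.add(value)
    return failures


def consistency_fk(child_rows, column, parent_rows, parent_column):
    """Consistency: every child key must exist in the parent table."""
    parent_keys = {r[parent_column] for r in parent_rows}
    return [i for i, r in enumerate(child_rows) if r[column] not in parent_keys]


def accuracy_after_death(exposures, person_rows):
    """Accuracy: medication cannot be prescribed after the date of death."""
    death = {p["person_id"]: p["death_date"]
             for p in person_rows if p["death_date"] is not None}
    return [i for i, e in enumerate(exposures)
            if e["person_id"] in death and e["start_date"] > death[e["person_id"]]]


print("completeness:", completeness(persons, "person_id"))
print("validity:", validity_range(persons, "month_of_birth", 1, 12))
print("uniqueness:", uniqueness(persons, "person_id"))
print("consistency:", consistency_fk(drug_exposures, "person_id", persons, "person_id"))
print("accuracy:", accuracy_after_death(drug_exposures, persons))
```

In practice such rules would more likely run as queries against the CDM database; plain Python lists are used here only to keep the sketch self-contained.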
Seven clinical experts, data managers, and data quality experts, each with more than three years of experience, evaluated the importance of the validation rules in the above five dimensions. The weights derived from these evaluations are reflected in the evaluation index (Table 1).
The five dimensions were applied to OMOP CDM V5.3.1 to develop a total of 415 evaluation rules for assessing the quality of medical data (Appendix B Table A2). The evaluation rules were divided into errors and warnings based on their importance (Table 2). An error must be cleansed and corrected when it is confirmed by the validation rule. A warning is an error that does not need to be cleansed, even if it is confirmed by the validation rule. The importance of each developed rule was evaluated with expert advice, and the resulting weight is reflected in the evaluation index.
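The text does not state how the importance scores were converted into weights. The weights in Tables 1 and 2 are, however, consistent with dividing each importance score by the sum of the scores, so the sketch below assumes that normalization; the function name `normalize_weights` is hypothetical.

```python
def normalize_weights(importance):
    """Divide each importance score by the total so the weights sum to 1.

    Assumption: this normalization is inferred from Tables 1 and 2,
    not described in the text.
    """
    total = sum(importance.values())
    return {name: score / total for name, score in importance.items()}


# Importance scores as reported in Table 1 (dimensions) and Table 2 (types).
dimension_importance = {
    "Completeness": 9.6,
    "Validity": 7.5,
    "Accuracy": 7.7,
    "Uniqueness": 9.0,
    "Consistency": 7.9,
}
type_importance = {"Error": 9.3, "Warning": 5.3}

dimension_weights = normalize_weights(dimension_importance)
type_weights = normalize_weights(type_importance)

for name, weight in {**dimension_weights, **type_weights}.items():
    print(f"{name}: {weight:.2f}")
```

Rounded to two decimals, the computed values agree with the weights listed in the two tables.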
The evaluation results were quantified when the data quality was evaluated. For this purpose, the results are expressed both as the ratio of data that did not pass a rule to the data evaluated against it (the error rate) and as an error rate that reflects the weight assigned to each rule by experts. Four evaluation indices were developed: NPR, WPR, NDPR, and WDPR.
- NPR: this is a data quality evaluation index that does not reflect the weights for data errors. The NPR is calculated by subtracting the total error rate from 100, where the total error rate is the sum of the error rate and the warning rate obtained from the data quality evaluation of each institution;
- WPR: this is a data quality evaluation index in which weights are applied to the errors and warnings among the data errors;
- NDPR: this is a data quality evaluation index that does not reflect the weight of each dimension after quality is evaluated by the five dimensions. The NDPR is calculated by subtracting the total dimension error rate, which is the data quality evaluation result for each dimension, from 100;
- WDPR: this is a data quality evaluation index that reflects the weights for data errors in each of the five dimensions. Experts evaluate the importance of each dimension, and the weight is set from the average value. The WDPR is then calculated from the total dimension error rate of each dimension multiplied by the weight of that dimension.
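The four indices can be sketched as small functions. NPR and NDPR follow directly from the definitions above. For WPR and WDPR the text gives no explicit formula, so the sketch assumes "100 minus the weighted error rate"; this reading reproduces the NPR and WPR values in Table 4 and is consistent with, for example, institution A's consistency row in Table 5. The default weights are the error and warning weights from Table 2.

```python
def npr(error_rate, warning_rate):
    """Net pass rate: 100 minus the unweighted total error rate (in %)."""
    return 100.0 - (error_rate + warning_rate)


def wpr(error_rate, warning_rate, w_error=0.64, w_warning=0.36):
    """Weighted pass rate (assumed form): 100 minus the weighted error rate."""
    return 100.0 - (w_error * error_rate + w_warning * warning_rate)


def ndpr(dimension_error_rate):
    """Net dimensional pass rate: 100 minus the dimension error rate."""
    return 100.0 - dimension_error_rate


def wdpr(dimension_error_rate, dimension_weight):
    """Weighted dimensional pass rate (assumed form)."""
    return 100.0 - dimension_weight * dimension_error_rate


# (error rate %, warning rate %) per institution, as reported in Table 4.
table4 = {"A": (0.89, 2.53), "B": (7.73, 2.19), "C": (6.79, 2.34)}

for center, (error_rate, warning_rate) in table4.items():
    print(center,
          f"NPR={npr(error_rate, warning_rate):.2f}",
          f"WPR={wpr(error_rate, warning_rate):.2f}")

# Consistency dimension of institution A (Table 5), weight 0.19 (Table 1).
print(f"NDPR={ndpr(9.34):.2f}", f"WDPR={wdpr(9.34, 0.19):.2f}")
```

Because the published weights are rounded to two decimals, values computed this way may differ from the tables in the last digit.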
To study data quality using the OMOP CDM, we selected three medical institutions that had built OMOP CDM V5.3.1 databases and allowed access to their data (Table 3). The institutions that agreed to the evaluation were medical institutions located in the metropolitan areas of Korea, each with more than two million persons in its CDM. In addition, a chi-square test was performed to determine whether the differences between the data quality evaluation results of the three medical institutions were significant.
DQ4HEALTH was applied to, and evaluated on, the data of three medical institutions built on OMOP CDM V5.3.1.
Without weighting the data errors, the NPR was 96.58%, 90.08%, and 90.87% for institutions A, B, and C, respectively (Figure 1 and Table 4). The WPR, which applies the expert-assigned weights to errors and warnings, was 98.52%, 94.26%, and 94.81% for institutions A, B, and C, respectively. The WPR scores were higher than the NPR scores; however, the pattern of differences between the institutions was similar.
The difference in the quality scores is due to the influence of the weights, which reflect the expert evaluation of the verification rules classified as "error" or "warning." For example, regarding the quality of patient information, the rule that the Patient ID must not be null is classified as an error. As this rule has an important influence on quality, experts evaluated it with a weight of 0.64. However, the rule on the patient's racial classification was evaluated with a weight of 0.36, as it was identified as a rule that did not affect data quality. This confirms that importance can differ even within tables that collect the same kind of information.
It was possible to verify the overall errors of healthcare data quality, and the following five types of errors were identified:
- One type of error was that the result value in the inspection information table was a negative number rather than a value greater than 0. A test result value cannot be negative. Tracing the source data confirmed that unmeasurable values had been recorded as −9999 owing to an error in the inspection machine;
- Another type of error is caused by problems in the source data value (source_value), such as misspellings, for example "Test Name ("88888_Drug Name", mi(misssing spelling) minor salivary gland")". This type of error suggests that meaningless data can be loaded and that the reliability of the data can be reduced;
- Errors were found in which values deviated from the standard terms owing to input errors between concept_id and code data values. Values mapped to nonstandard concepts were also found. The importance of mapping to international standard terms is often mentioned in existing healthcare studies, which suggests that this may be a problem in multicenter studies that use the OMOP CDM;
- Errors in chronological relationships were also identified. These violate the precedence relationship between the patient's dates of birth and death and the observation period of each clinical record. This type of error shows the need for cleansing, because it can arise both in the ETL process and in the actual EHR systems;
- Errors that violate referential integrity were revealed. This type accounted for most of the errors in this study. It arises from the reference relationship between the patient information table and the treatment/diagnosis information tables in the structure of the OMOP CDM. In other words, most of these data were loaded abnormally even though the data had been updated, or the patient ID was present but could not be used for actual research because there was no examination history.
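Several of the error types listed above lend themselves to simple automated checks. The sketch below is illustrative only: the function names are hypothetical, the sentinel set contains just the −9999 value mentioned in the text, and the data passed in would come from the institution's own CDM tables.

```python
from datetime import date

# From the text: unmeasurable results were stored as -9999 by the machine.
SENTINEL_VALUES = {-9999}


def classify_measurement(value):
    """Label a measurement result as ok, missing, sentinel, or negative."""
    if value is None:
        return "missing"
    if value in SENTINEL_VALUES:
        return "sentinel"
    if value < 0:
        return "negative"
    return "ok"


def within_lifetime(event_date, birth_date, death_date=None):
    """Chronology check: an event must not precede birth or follow death."""
    if event_date < birth_date:
        return False
    if death_date is not None and event_date > death_date:
        return False
    return True


def persons_without_events(person_ids, event_person_ids):
    """Patients that exist in the person table but have no clinical history."""
    return sorted(set(person_ids) - set(event_person_ids))


print(classify_measurement(-9999))
print(within_lifetime(date(2021, 1, 1), date(1980, 1, 1), date(2020, 1, 1)))
print(persons_without_events([1, 2, 3], [1, 1, 3]))
```

Separating the sentinel label from the generic negative label makes it possible to trace machine-generated placeholders back to the source system rather than treating them as ordinary out-of-range values.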
The data quality was examined based on five dimensions and the error rate was evaluated for each dimension. For each error type, the evaluation results were compared using the NDPR, which assessed only the number of errors, and the WDPR, which provided the weight of that type of error.
In the DQ4HEALTH results by dimension, the quality level of four of the dimensions was close to 99% or higher. The consistency dimension had the highest error rate, accounting for 70.06% (1,338,817,961 records) of all the error data. For the consistency dimension, the NDPR, which does not reflect the dimension weight, was 90.66%, 76.52%, and 78.64% for institutions A, B, and C, respectively. When the expert-assigned dimension weights were applied, the WDPR was 98.22%, 94.74%, and 95.05% for institutions A, B, and C, respectively (Figure 2 and Table 5).
The difference between the results with and without the experts' weights was due to the following factor: within the consistency dimension, experts gave low weight to tables that did not affect analysis and to medical concepts that are hard to map using standard medical terms.
We used the chi-square test to verify whether the quality results differed among the medical institutions and then conducted a subsequent analysis. The result was p < 0.001, which confirmed that there was a difference in the quality of data from each hospital.
Additionally, we performed a chi-square analysis to determine whether quality differed between the medical institutions for each OMOP CDM table. This allowed us to check the factors that affected the overall results. Of the 195 variables, we excluded those for which all three, or two, of the medical institutions had no errors, and performed the chi-square analysis on the remaining 96 variables. The analysis confirmed that there was a difference in the quality of healthcare data between institutions.
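A chi-square test of independence on pass and fail counts per institution can be sketched as follows. The text does not say which counts entered the test, so using the consistency counts from Table 5 is an assumption made for illustration. For a table of three institutions by two outcomes the test has 2 degrees of freedom, for which the chi-square survival function is exactly exp(−x/2), so no statistics library is needed.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts. Every row and column
    total must be positive, which is one reason variables with no errors
    cannot be tested this way.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_dof2(statistic):
    """Exact p-value for 2 degrees of freedom: exp(-x / 2)."""
    return math.exp(-statistic / 2.0)


# (total records, error records) for the consistency dimension, Table 5.
consistency = {
    "A": (5_005_238_125, 467_936_657),
    "B": (2_835_935_266, 816_059_524),
    "C": (2_003_506_197, 522_758_437),
}
observed = [[total - errors, errors] for total, errors in consistency.values()]

statistic, dof = chi_square_independence(observed)
p_value = p_value_dof2(statistic)
print(f"chi2={statistic:.1f}, dof={dof}, p<0.001: {p_value < 0.001}")
```

With record counts in the billions, even small differences in error rates yield a very large statistic, so the p-value mainly confirms that the rates are not identical; the size of the difference is better read from the pass rates themselves.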
Consequently, it was confirmed that institution A had the highest-quality data of the three medical institutions. By comparison, institution B had low-quality data. The error rate in the consistency dimension was highest for institution B among the three institutions, and the consistency dimension was confirmed to be the factor that lowered quality.
This study differs from previous studies on data quality because it developed an index that can evaluate the quality of multiple institutions using a large cohort.
Existing healthcare data quality studies suggest a conceptual model that can be applied to healthcare data through a literature review; however, few studies verify the proposed model using actual healthcare data [[
In addition, an evaluation method was developed to compare the impact of errors on healthcare data quality results. The existing literature on data quality evaluation presents the net error rate and the error distribution by quality dimension that result from applying a data quality conceptual model. In this study, we extend the net error results and propose a data quality evaluation method that, through multicenter quality comparisons designed according to the researcher's quality study, allows the causes of errors affecting healthcare data to be reviewed. Specifically, the method uses four evaluation indices (NPR, WPR, NDPR, and WDPR) that make it easier for experts to review healthcare data quality.
Finally, when utilizing the opinions of experts, we can properly weight errors according to the degree of influence on the quality of medical institutions. Existing literature on data quality assessment emphasizes the importance of documentation and methods by which experts can review data quality results reports [[
Our study has several limitations. Since the DQ4HEALTH model proposed in this study confirms and verifies the overall quality of OMOP CDM, more detailed and specific quality verification rules should be expanded when conducting research on specific diseases and medications. For example, Veronica Muthee conducted a healthcare data study centered on the HIV care data-based routine data quality assessment (RDQA) model [[
Despite these limitations, this study analyzes the types of errors by presenting a new model that can be applied to the OMOP CDM after considering and integrating healthcare data quality studies and applying it to multiple institutions. This can be utilized in future studies.
In this study, we developed validation rules that can be applied to the OMOP CDM by selecting frequently used dimensions through a review of previous studies on information system quality and healthcare quality dimensions. Additionally, we proposed a new DQ4HEALTH model for OMOP CDM data quality management, incorporating expert advice on the developed validation rules. The DQ4HEALTH model was applied to three institutions, each with CDM data on more than two million persons, to conduct an empirical healthcare data quality evaluation study.
By analyzing the multicenter data quality results, from cohorts of more than 2 million persons each, with the chi-square test, we confirmed that the quality of CDM data differs between hospitals. This means that even though the same OMOP CDM was applied, quality differed for each hospital. There was also a significant difference for each table. The types of errors presented in this study suggest that analysis results may be affected when conducting joint research using a common data model.
In the future, it will be necessary to expand research to intuitively confirm the degree of data quality improvement through comparison before and after cleansing the error data derived from the data quality result. It is also necessary to expand the study on the effects of analysis results before and after comparison [[
Figure 1. Comparison of NPR and WPR according to error and warning weights.
Figure 2. Comparison of NDPR and WDPR by consistency weights.
Table 1 DQ4HEALTH dimensions of specialist group review.
| Dimension | Sub-Dimension | Definition | Importance | Weight |
| --- | --- | --- | --- | --- |
| Completeness | - | This rule verifies that there are no missing required columns. | 9.6 | 0.23 |
| Validity | Range | This rule checks whether a data value is within a given range. | 7.5 | 0.18 |
| | Format | This rule checks whether a data value conforms to the data type. | | |
| Accuracy | Calculation | This rule verifies whether the values of different columns are correct. | 7.7 | 0.18 |
| | Timeline | This rule verifies the precedence of time. | | |
| | Business Rule | This rule verifies the hospital business rules. | | |
| Uniqueness | - | This rule verifies the value corresponding to the primary key. | 9 | 0.22 |
| Consistency | Standard | If an international standard code is used, this rule verifies the standard code. | 7.9 | 0.19 |
| | Relationship | If there is a referential relationship between tables, this rule verifies the referential integrity. | | |
Table 2 Type definition of specialist group review.
| Type | Definition | Importance | Weight |
| --- | --- | --- | --- |
| Error | This error is one that must be cleansed and corrected once it is identified in the validation rule. | 9.3 | 0.64 |
| Warning | This error is one that was identified in the validation rule but does not need to be corrected. | 5.3 | 0.36 |
Table 3 Characteristics of hospitals A, B, and C.
| Center | Provider Type | Region | Number of Hospital Beds | Number of OMOP CDM Persons |
| --- | --- | --- | --- | --- |
| A | Tertiary Hospital | Seoul | approximately 1400 | 3,598,955 |
| B | General Hospital | Gyeonggido | approximately 900 | 2,279,292 |
| C | General Hospital | Seoul | approximately 400 | 2,077,837 |
Table 4 Multicenter OMOP CDM data quality summary results.
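| Center | Total Error Rate | Error Rate | Warning Rate | NPR | WPR |
| --- | --- | --- | --- | --- | --- |
| A | 3.42% | 0.89% | 2.53% | 96.58% | 98.52% |
| B | 9.92% | 7.73% | 2.19% | 90.08% | 94.26% |
| C | 9.13% | 6.79% | 2.34% | 90.87% | 94.81% |
| p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |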
Table 5 Multicenter OMOP CDM data quality assessment specific results.
| Center | DQ4HEALTH Dimension | Total Dimension Data Count | Total Dimension Error Count | Total Dimension Error Rate | NDPR | WDPR |
| --- | --- | --- | --- | --- | --- | --- |
| A | Completeness | 5,460,723,980 | 8276 | 0.01% | 99.99% | 99.99% |
| | Validity | 1,360,559,053 | 22,801,212 | 1.67% | 98.33% | 99.70% |
| | Accuracy | 3,570,299,098 | 59,288,628 | 1.66% | 98.34% | 99.69% |
| | Uniqueness | 840,625,891 | 239,985 | 0.03% | 99.97% | 99.99% |
| | Consistency | 5,005,238,125 | 467,936,657 | 9.34% | 90.66% | 98.22% |
| B | Completeness | 2,619,120,230 | 1,399,297 | 0.05% | 99.95% | 99.99% |
| | Validity | 644,669,318 | 11,173,281 | 1.73% | 98.27% | 99.69% |
| | Accuracy | 1,847,001,586 | 333,479 | 0.02% | 99.98% | 99.99% |
| | Uniqueness | 412,280,539 | 0 | 0% | 100% | 100% |
| | Consistency | 2,835,935,266 | 816,059,524 | 28.77% | 71.23% | 94.74% |
| C | Completeness | 1,826,576,516 | 1,545,055 | 0.08% | 99.92% | 99.98% |
| | Validity | 430,638,422 | 7,014,267 | 1.62% | 98.38% | 99.71% |
| | Accuracy | 1,270,385,522 | 302,273 | 0.00% | 99.99% | 99.99% |
| | Uniqueness | 291,598,022 | 0 | 0% | 100% | 100% |
| | Consistency | 2,003,506,197 | 522,758,437 | 26.09% | 73.91% | 95.05% |
Conceptualization, K.-H.K. and I.-Y.C.; methodology, K.-H.K. and I.-Y.C.; software, S.-H.C., K.-H.K. and S.-J.K.; validation, S.-J.K. and K.-H.K.; formal analysis, S.-H.C.; investigation, D.-J.K. and I.-Y.C.; resources, I.-Y.C. and D.-J.C.; data curation, I.-Y.C., D.-J.C. and Y.-W.C.; writing—original draft preparation, W.C. and K.-H.K.; writing—review and editing, I.-Y.C., J.-K.K. and W.C.; visualization, W.C. and K.-H.K.; supervision, I.-Y.C.; project administration, D.-J.K. and I.-Y.C. All authors have read and agreed to the published version of the manuscript.
This research was funded by the Technology Innovation Program (20004927, Upgrade of CDM-based Distributed Biohealth Data Platform and Development of Verification Technology) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
The study was conducted in accordance with the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the Catholic Medical Center (protocol code XC20RNDI0161; date of approval 6 July 2021).
The requirement for written informed consent was waived by the Research Ethics Committee of the Catholic Medical Centre, and this study was conducted in accordance with relevant guidelines and regulations.
Data sharing was not applicable to this study. Data supporting the findings of this study are available from each hospital.
The authors declare no conflict of interest.
Table A1 The Literature Review Result of Information System Dimension.
| DQ4HEALTH Dimension | Sub-Dimension | Definition | DQ Terminology [Authors] |
| --- | --- | --- | --- |
| Completeness | - | Evaluate missing data in the process of representing data in the real world as a system. | Completeness [ Null Values [ Incompleteness [ |
| Validity | Range | Evaluate whether it allows the scope of the data in the system. | Scope [ Value out of range [ etc. [ |
| | Format | Evaluate whether the format specified in the system is correctly expressed. | Format [ Correctness [ etc. [ |
| Accuracy | Calculation | Evaluate whether the calculation formula for items that are composed of multiple items is correct. | Accuracy [ Computation Conformance [ |
| | Timeliness | Evaluate time among data values expressed in the real world. | Timeliness [ Currency [ etc. [ |
| | Business Rule | Evaluate whether business relevance (knowledge) among data values expressed in the real world is correctly expressed. | Accuracy [ (Atemporal) Plausibility [ etc. [ |
| Uniqueness | - | Evaluate whether duplicate values are allowed in the system. | Uniqueness (Plausibility) [ (Non)duplication [ |
| Consistency | Standard | It does not evaluate the value of structural data within the system but evaluates the value of data outside the institution. | Value Conformance [ Incompatibility [ etc. [ |
| | Relational | Evaluates whether data in the system complies with specified relational constraints. | Consistency [ Relationship Conformance [ Etc. [ |
Table A2 DQ4HEALTH (Data Quality for Healthcare) Model Development Result.
| Dimension | Sub-Dimension | Definition | OMOP CDM Rules Example | Type | Rule count |
| --- | --- | --- | --- | --- | --- |
| Completeness | - | This rule verifies that there is no omission in a required column. | a. The patient number (person_id) column in the Person Table must not have a null value. | E | 85 |
| | | | b. The Specimen Concept ID column in the Specimen table must not have a null value. | | |
| Validity | Range | This rule verifies that a data value is within a given range. | a. The Measurement Result Value of measurement table should have a value greater than 0. | W | 10 |
| | | | b. The month of the patient's date of birth must have a value between 1 and 12. | E | |
| | Format | This rule verifies that a data value conforms to the data type. | a. The year of birth in Person table should have a value in the format of a 4-digit number. | E | 9 |
| | | | b. The column of Measurement Time in the Measurement table should have a value in the format of 24H:MM:SS. | | |
| Accuracy | Calculation | This rule verifies that multi-column values are the same. | a. Drug_exposure_end_date must be equal to Drug_exposure_start_date minus a value of −1. | E | 1 |
| | Timeline | This rule verifies the precedence of time. | a. The value of the year of birth (YYYY) in the date of birth (Birth_Datetime) of the patient information and the value of the year of birth (year_of_birth) must have the same value. | W | 58 |
| | | | b. The Procedure_date in the Procedure table must occur after the date of birth and before the date of death. | E | |
| | Business Rule | This rule verifies the hospital business rules. | a. If one's gender is female, they cannot have a diagnosis code for male disease. | E | 145 |
| | | | b. The visit concept id should have a value of type of inpatient, outpatient, emergency, clinical trial, and medical examination. | W | |
| Uniqueness | - | This rule verifies the value corresponding to the primary key. | a. The person id in the person table must have a unique value. | E | 14 |
| Consistency | Standard | If an international standard code is used, verify the standard code. | a. The Condition concept id of the Condition table must comply with the standard mapping of Domain = Condition, Standard concept = S, of Voca table | W | 34 |
| | Relationship | If there is a referential relationship between tables, referential integrity is verified. | a. Location id of Person table should have the value of Location id of Location table. | E | 44 |
By Ki-Hoon Kim; Wona Choi; Soo-Jeong Ko; Dong-Jin Chang; Yeon-Woog Chung; Se-Hyun Chang; Jae-Kwon Kim; Dai-Jin Kim and In-Young Choi