Objective: Early diagnosis and treatment of nasopharyngeal carcinoma (NPC) are vital for a better prognosis. Still, because of the obscure anatomical site and insidious symptoms, nearly 80% of patients with NPC are diagnosed at a late stage. This study aimed to validate a machine learning (ML) model utilizing symptom-related diagnoses and procedures in medical records to predict NPC occurrence and shorten the prediagnostic period. Materials and Methods: Data from a population-based health insurance database (2001–2008) were analyzed, comparing adults with and without newly diagnosed NPC. Medical records from 90 to 360 days before diagnosis were examined. Five ML algorithms (Light Gradient Boosting Machine [LGB], eXtreme Gradient Boosting [XGB], Multivariate Adaptive Regression Splines [MARS], Random Forest [RF], and Logistic Regression [LG]) were evaluated for optimal early NPC detection. We further used real-world data from 1 million randomly selected individuals to test the final model. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC). Shapley values identified the most influential variables. Results: LGB showed maximum predictive power using 14 features and medical records from 90 days before diagnosis. On the test dataset, the LGB model achieved an AUROC, specificity, and sensitivity of 0.83, 0.81, and 0.64, respectively. The LGB-driven NPC predictive tool effectively differentiated patients into high-risk and low-risk groups (hazard ratio: 5.85; 95% CI: 4.75–7.21), confirming a valid risk-stratification effect. Conclusions: ML approaches using electronic medical records accurately predicted NPC occurrence. The risk prediction model serves as a low-cost digital screening tool, offering rapid medical decision support to shorten prediagnostic periods. Timely referral is crucial for high-risk patients identified by the model.
Keywords: head and neck cancer; machine learning; nasopharyngeal carcinoma; prediagnostic
The LGB‐driven NPC predictive model showed optimal predictive power (AUROC: 0.83) and effectively differentiated patients into high‐risk and low‐risk groups (HR: 5.85). It serves as a low‐cost digital screening tool, offering rapid medical decision support to shorten prediagnostic periods. Timely referral is crucial for high‐risk patients identified by the model.
Nasopharyngeal carcinoma (NPC) arises from the epithelial lining of the nasopharynx,[[
Early diagnosis and treatment of NPC are vital for better prognosis,[[
Accredited population‐based screening tools in NPC‐endemic regions remain lacking.[
We hypothesized that electronic medical records (EMRs) of symptom‐related diagnoses and reimbursement information in a population‐based database could help detect NPC. However, symptom‐related diagnoses and procedures tend to cluster in groups,[
We obtained data from Taiwan's National Health Insurance Research Databases (NHIRD)[[
The NPC diagnosis was identified using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD‐9‐CM) code 147 and Tenth Revision (ICD‐10‐CM) code C11 during the study period. Patients who had cancer before the diagnosis of NPC were excluded. We defined the index date as 14 days before the first diagnosis of NPC to exclude those diagnoses and procedures occurring immediately before confirmation of NPC.
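The windowing logic above (index date = first NPC diagnosis minus 14 days; claims drawn from a fixed lookback before the index date) can be sketched as follows. This is a minimal illustration, not the study's SAS code; the record fields (`claim_date`, `code`) are hypothetical, not the NHIRD schema.

```python
from datetime import date, timedelta

def index_date(first_npc_diagnosis: date, washout_days: int = 14) -> date:
    """Index date: 14 days before the first NPC diagnosis, so that claims
    made immediately before NPC confirmation are excluded."""
    return first_npc_diagnosis - timedelta(days=washout_days)

def claims_in_window(claims, idx: date, lookback_days: int):
    """Keep claims dated within [idx - lookback_days, idx)."""
    start = idx - timedelta(days=lookback_days)
    return [c for c in claims if start <= c["claim_date"] < idx]

# Hypothetical example: diagnosis on 2008-06-15 gives index date 2008-06-01.
idx = index_date(date(2008, 6, 15))
claims = [{"code": "784.2", "claim_date": date(2008, 4, 1)},
          {"code": "C11", "claim_date": date(2008, 6, 10)}]
window = claims_in_window(claims, idx, 90)  # keeps only the April claim
```

The same helper would be called with lookbacks of 90, 120, 150, 180, and 360 days to build the alternative datasets compared later in the study.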
All claim data within 90, 120, 150, 180, and 360 days before the index date were collected for analysis. Predictor variables were grouped into the following categories (Table S1): (I) participant demographics including sex and age (2 variables); (II) the potential ICD‐9‐CM diagnostic codes of pre‐NPC symptoms (28 variables); (III) the potential medical claim data of pre‐NPC procedures, treatments, and laboratory tests (24 variables); (IV) the combined features of diagnostic codes (CFD) (7 variables); and (V) the combined features of procedures, treatments, and laboratory tests (CFPTLT) (5 variables).
Figure 1 depicts the study flowchart. In Taiwan, NPC has an incidence rate of approximately 6.8 per 100,000 population, resulting in a marked class imbalance between NPC and non‐NPC cases. To mitigate prediction bias, we employed undersampling, achieving a 1:1 ratio between NPC and non‐NPC cohorts by randomly selecting samples from the latter. The data were then divided into training (80%) and validation (20%) sets using the holdout method. Feature selection was performed, and various ML algorithms were compared using the training set, including LGB, XGB, RF, MARS, and LG models. Evaluation metrics included sensitivity, specificity, balanced accuracy, and AUROC. The top‐performing model was applied to the validation set for further assessment. Shapley values were utilized to interpret the model with the highest AUROC.[[
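The balancing and splitting steps described above can be sketched in a few lines. This is an illustration under simplified assumptions (records as plain dictionaries, a fixed random seed), not the study's actual pipeline.

```python
import random

def undersample(cases, controls, seed=42):
    """Randomly draw controls to reach a 1:1 case:control ratio."""
    rng = random.Random(seed)
    return cases, rng.sample(controls, k=len(cases))

def holdout_split(records, train_frac=0.8, seed=42):
    """Shuffle and split into training (80%) and validation (20%) sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical cohort: NPC is rare, so controls vastly outnumber cases.
cases = [{"npc": 1, "id": i} for i in range(100)]
controls = [{"npc": 0, "id": i} for i in range(1500)]
cases, controls = undersample(cases, controls)
train, valid = holdout_split(cases + controls)
# 200 balanced records -> 160 for training, 40 for validation
```

Undersampling before the split, as here, keeps both sets near the 50% case proportion reported in Table 1.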
We employed a two‐step approach for feature selection. First, to mitigate the risk of model overfitting resulting from the inclusion of multiple variables, six sets of feature selection combinations were constructed, derived from categories I–V of Table S1. Table S2 in the Supplement provides detailed information of each feature selection combination used to train the ML algorithms. Subsequently, the models' predictive performance was assessed sequentially to determine the optimal set of feature variables to be included in the final model.
Next, feature combinations from 90, 120, 150, 180, and 360 days before the index date were collected to determine the optimal data length required to achieve the optimal predictive performance. Thus, this study also incorporated the time interval of symptom‐related diagnoses and claims data required for model development.
Data preparation and variable construction were conducted using SAS version 9.4 (SAS Institute). Subsequent ML analyses were performed using R version 3.2.3 (R Foundation for Statistical Computing).
Table 1 presents the distribution of characteristics between the training and validation sets. A total of 22,186 patients' medical records were included in the analysis. Both sets were balanced for key characteristics except for the "removal of nasal packing" procedure. Eventually, 8874 of 17,748 (50%) patients in the training set and 2219 of 4438 (50%) in the validation set had newly diagnosed NPC, with the 14‐day window applied before NPC diagnosis to set the index date. The 50% proportion of patients with NPC in the training and validation sets resulted from adjustment for imbalanced data. We randomly matched an equivalent number of patients without NPC to achieve balance.
TABLE 1. Characteristics of the training and validation sets.
| Characteristic | Training set, No. (%) (n = 17,748) | Validation set, No. (%) (n = 4438) | p value |
| --- | --- | --- | --- |
| Age, mean (SD), y | 42.62 (18.28) | 42.88 (18.50) | 0.400 |
| Sex: Male | 11,063 (62.33%) | 2754 (62.05%) | 0.732 |
| Sex: Female | 6685 (37.67%) | 1684 (37.95%) | |
| Potential pre-NPC symptom-related diagnostic codes | | | |
| Benign neoplasm of other and unspecified sites | 291 (1.64%) | 70 (1.58%) | 0.769 |
| Neoplasm of uncertain behavior of other specified sites | 5 (0.03%) | 0 (0%) | 0.263 |
| Neoplasm of uncertain behavior, site unspecified | 13 (0.07%) | 1 (0.02%) | 0.229 |
| Neoplasm of unspecified nature of other specified sites | 92 (0.52%) | 13 (0.29%) | 0.050 |
| Neoplasm of unspecified nature, site unspecified | 27 (0.15%) | 3 (0.07%) | 0.171 |
| Other specified disorders of nervous system | 10 (0.06%) | 2 (0.05%) | 0.773 |
| Unspecified disorders of nervous system | 43 (0.24%) | 12 (0.27%) | 0.736 |
| Trigeminal nerve disorders | 93 (0.52%) | 21 (0.47%) | 0.672 |
| Disorders of other cranial nerves | 34 (0.19%) | 7 (0.16%) | 0.639 |
| Diplopia | 28 (0.16%) | 4 (0.09%) | 0.288 |
| Other disorders of binocular vision | 1 (0.01%) | 0 (0%) | 0.617 |
| Chronic serous otitis media | 226 (1.27%) | 62 (1.4%) | 0.515 |
| Suppurative and unspecified otitis media | 591 (3.33%) | 141 (3.18%) | 0.610 |
| Mastoiditis and related conditions | 3 (0.02%) | 0 (0%) | 0.386 |
| Other disorders of ear | 545 (3.07%) | 137 (3.09%) | 0.955 |
| Hearing loss | 306 (1.72%) | 71 (1.6%) | 0.567 |
| Chronic pharyngitis and nasopharyngitis | 786 (4.43%) | 216 (4.87%) | 0.208 |
| Chronic sinusitis | 526 (2.96%) | 135 (3.04%) | 0.784 |
| Chronic disease of tonsils and adenoids | 79 (0.45%) | 16 (0.36%) | 0.440 |
| Other diseases of upper respiratory tract | 269 (1.52%) | 71 (1.6%) | 0.683 |
| Swelling, mass, or lump in head and neck | 930 (5.24%) | 214 (4.82%) | 0.260 |
| Epistaxis | 654 (3.68%) | 178 (4.01%) | 0.307 |
| Hemorrhage from throat | 5 (0.03%) | 0 (0%) | 0.263 |
| Other symptoms involving head and neck | 5 (0.03%) | 2 (0.05%) | 0.571 |
| Hemoptysis | 142 (0.8%) | 43 (0.97%) | 0.269 |
| Benign neoplasm of connective and other soft tissue of head, face, and neck | 192 (1.08%) | 52 (1.17%) | 0.608 |
| Paralytic strabismus | 83 (0.47%) | 19 (0.43%) | 0.728 |
| Headache | 1171 (6.6%) | 283 (6.38%) | 0.594 |
| Potential NHI claim codes of pre-NPC symptom-related procedures, treatments, or laboratory tests | | | |
| EBV VCA IgG, IgM, IgA, IFA method, each | 840 (4.73%) | 211 (4.75%) | 0.952 |
| EBV capsid Ab | 3 (0.02%) | 3 (0.07%) | 0.066 |
| EBNA Ab | 185 (1.04%) | 40 (0.9%) | 0.402 |
| Impedance audiometry | 228 (1.28%) | 63 (1.42%) | 0.480 |
| Tympanometry | 568 (3.2%) | 152 (3.42%) | 0.450 |
| Eustachian tube function test | 6 (0.03%) | 1 (0.02%) | 0.705 |
| Nasopharyngolaryngoscopy | 2145 (12.09%) | 511 (11.51%) | 0.294 |
| Sinoscopy | 352 (1.98%) | 87 (1.96%) | 0.922 |
| Laryngoscopy | 129 (0.73%) | 30 (0.68%) | 0.719 |
| Tympanic aspiration | 146 (0.82%) | 36 (0.81%) | 0.940 |
| Myringeal puncture, unilateral | 12 (0.07%) | 2 (0.05%) | 0.593 |
| Middle ear cavity puncture | 7 (0.04%) | 1 (0.02%) | 0.596 |
| Eustachian tube inflation, unilateral | 56 (0.32%) | 17 (0.38%) | 0.482 |
| Eustachian tube inflation, bilateral | 23 (0.13%) | 5 (0.11%) | 0.776 |
| Simple epistaxis, anterior | 243 (1.37%) | 70 (1.58%) | 0.293 |
| Complicated epistaxis, posterior | 27 (0.15%) | 7 (0.16%) | 0.932 |
| Intranasal cauterization | 1 (0.01%) | 0 (0%) | 0.617 |
| Anterior nasal packing | 14 (0.08%) | 6 (0.14%) | 0.264 |
| Posterior nasal packing | 4 (0.02%) | 1 (0.02%) | 1.000 |
| Removal of nasal packing | 52 (0.29%) | 22 (0.5%) | 0.036 |
| Tympanocentesis | 62 (0.35%) | 20 (0.45%) | 0.320 |
| Myringotomy under microscope or telescope | 91 (0.51%) | 13 (0.29%) | 0.055 |
| Myringotomy with ventilation tube insertion under microscope | 55 (0.31%) | 11 (0.25%) | 0.497 |
| Head and neck soft tissue echo | 101 (0.57%) | 17 (0.38%) | 0.128 |
Figure S1 illustrates the heatmap of the predictive performance (AUROC) of different feature selection combinations and ML algorithms using data from 90, 120, 150, 180, and 360 days before the index date. The vertical axis represents the different types of algorithms used, and the horizontal axis represents the various combinations of feature selection. Regardless of the period of medical records used to establish the predictive model, Fea_comb3 and Fea_comb6 consistently exhibited superior predictive power (Figure S1). However, given the practical and clinical data construction costs, achieving the same level of predictive power with fewer variables is a more compelling solution. Therefore, we selected Fea_comb3 (consisting of 14 variables) for modeling.
We further analyzed the modeling performance of the Fea_comb3 variable combination in various algorithms (Table S4). Regardless of the duration of medical records (90–360 days), the average predictive power of the model remained stable at 0.89, indicating that extending the record window beyond 90 days contributed little additional predictive power. On the basis of these findings, our model was constructed using the Fea_comb3 variable combination and medical records from 90 days before the index date as the basis for data analysis in modeling.
Table 2 presents the performance metrics of the ML algorithms based on the selected features of the Fea_comb3 model collected 90 days before the index date. All models performed well on the test set. LGB and XGB had the same AUROC, which was slightly higher than those of MARS, RF, and LG. LGB also had slightly higher specificity than XGB.
TABLE 2. Performance metrics of machine learning algorithms using 90 days of records.
| Algorithm | Sensitivity | Specificity | Balanced accuracy | AUROC |
| --- | --- | --- | --- | --- |
| A. Training set | | | | |
| MARS | 0.79 | 0.79 | 0.79 | 0.88 |
| LGB | 0.81 | 0.81 | 0.81 | 0.90 |
| XGB | 0.81 | 0.80 | 0.80 | 0.90 |
| RF | 0.79 | 0.80 | 0.80 | 0.90 |
| LG | 0.82 | 0.75 | 0.78 | 0.88 |
| B. Validation set | | | | |
| MARS | 0.79 | 0.79 | 0.79 | 0.88 |
| LGB | 0.79 | 0.80 | 0.79 | 0.89 |
| XGB | 0.80 | 0.80 | 0.80 | 0.90 |
| RF | 0.79 | 0.80 | 0.79 | 0.89 |
| LG | 0.81 | 0.75 | 0.78 | 0.87 |
| C. Test set | | | | |
| MARS | 0.63 | 0.80 | 0.72 | 0.82 |
| LGB | 0.64 | 0.81 | 0.72 | 0.83 |
| XGB | 0.64 | 0.80 | 0.72 | 0.83 |
| RF | 0.64 | 0.81 | 0.72 | 0.82 |
| LG | 0.66 | 0.76 | 0.71 | 0.80 |
Note: These performance metrics are based on the selected features of the Fea_comb3 model collected 90 days before the index date.
Abbreviations: LG, Logistic Regression; LGB, Light Gradient Boosting Machine; MARS, Multivariate Adaptive Regression Splines; RF, Random Forest; XGB, eXtreme Gradient Boosting.
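The four metrics reported in Table 2 can be computed from predicted scores and true labels as sketched below; a minimal reimplementation for illustration, assuming a 0.5 decision threshold, not the study's actual evaluation code.

```python
def confusion(y_true, y_score, threshold=0.5):
    """Counts at a decision threshold: (TP, FN, TN, FP)."""
    tp = fn = tn = fp = 0
    for t, s in zip(y_true, y_score):
        if t == 1:
            tp, fn = (tp + 1, fn) if s >= threshold else (tp, fn + 1)
        else:
            fp, tn = (fp + 1, tn) if s >= threshold else (fp, tn + 1)
    return tp, fn, tn, fp

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2

def auroc(y_true, y_score):
    """Rank-based (Mann-Whitney) AUROC: probability that a random
    positive is scored above a random negative, ties counting half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike sensitivity and specificity, AUROC is threshold-free, which is why it serves as the primary comparison metric across algorithms in Table 2.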
Our LGB model achieved an AUROC of 0.83 using only 14 predictive variables and outperformed the predictive model incorporating all 66 variables (Table S5) in terms of sensitivity, specificity, and AUROC. Consequently, we selected the LGB‐driven model as the final predictive model.
Figure 2 displays the SHapley Additive exPlanations (SHAP) summary plot,[[
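Shapley attributions like those summarized in Figure 2 average a feature's marginal contribution to the model output over all coalitions of the remaining features. The exact computation is shown below for a toy additive risk score over three hypothetical pre-NPC indicators; the study's SHAP values over the LGB model are computed by an efficient tree-specific algorithm, not this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: for each feature i, average the marginal
    contribution value(S | {i}) - value(S) over all subsets S of the
    other features, weighted by |S|!(n-|S|-1)!/n!."""
    n = len(features)
    phi = {}
    for i in features:
        rest = [f for f in features if f != i]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Hypothetical additive risk score; weights are illustrative only.
weights = {"epistaxis": 0.3, "neck_mass": 0.5, "otitis_media": 0.2}
score = lambda S: sum(weights[f] for f in S)
phi = shapley_values(list(weights), score)
# For an additive score, each Shapley value equals the feature's weight,
# and the values sum to the full-coalition score (efficiency property).
```

In a SHAP summary plot, these per-feature attributions are computed per patient and then aggregated, ranking features by their overall influence on predicted NPC risk.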
To further validate the accuracy of the final model in predicting NPC by using real‐world data, we tested it on the data of 1 million individuals randomly selected from the NHIRD as of January 1, 2009. The high‐risk and low‐risk groups were distinguished using the 75th percentile of the risk prediction value (descriptive statistics of different risks are shown in Table S6). The Kaplan–Meier method was used to determine whether individuals predicted to be at high risk by the model had a higher cumulative incidence of NPC in the subsequent 5 years.
Figure 3 displays the actual 5‐year NPC incidence under risk prediction based on the LGB algorithm. The incidence rate in the high‐risk group was 21.45 per 100,000, approximately 5.85 times (95% CI, 4.75–7.21) that in the low‐risk group (3.67 per 100,000), confirming the validity of the model's risk stratification.
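The stratification and rate comparison above amount to thresholding predicted risks at the 75th percentile and taking the ratio of group incidence rates. A sketch with hypothetical scores follows; the percentile uses a simple nearest-rank convention, and the confidence interval (which needs the underlying event counts) is omitted.

```python
def percentile(values, q):
    """Nearest-rank percentile, q in [0, 100]."""
    s = sorted(values)
    idx = max(0, round(q / 100 * len(s)) - 1)
    return s[idx]

def stratify(risks, q=75):
    """Split predicted risks into high (> cutoff) and low (<= cutoff)."""
    cut = percentile(risks, q)
    high = [r for r in risks if r > cut]
    low = [r for r in risks if r <= cut]
    return high, low, cut

# Hypothetical uniformly spread risk scores: top quartile is "high risk".
risks = [i / 100 for i in range(1, 101)]
high, low, cut = stratify(risks)

# Incidence rate ratio from the reported per-100,000 rates:
irr = 21.45 / 3.67  # ~5.84, consistent with the reported 5.85-fold difference
```

The crude incidence rate ratio closely matches the reported hazard ratio here because follow-up time was comparable across groups; the paper's 5.85 (95% CI, 4.75-7.21) comes from the time-to-event analysis.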
To our knowledge, this population‐based cohort study is the first to develop and validate an ML‐based NPC prediction model by using a population‐based medical claims database in an Asian population. With minimal curation of the collected data, our model builds on an otolaryngologist's knowledge in terms of feature selection, including demographic characteristics and NPC symptom‐related diagnoses, procedures, treatments, and laboratory tests. By using routinely obtained information available in a population‐based claims database and only 14 variables, our algorithm exhibited good accuracy in predicting NPC occurrence in the general population. The Shapley summary plot confirmed the explanatory power of our model. Application of this risk‐stratifying tool to personal health‐care records, such as My Health Bank,[[
NPC is multifactorial. Although its exact cause remains unknown, three major risk factors have been identified: genetic susceptibility, environmental factors, and EBV infection.[
Several risk assessment models have been developed for the early detection of NPC. In a large case–control study of Cantonese‐origin participants, Ruan et al.[
Numerous studies have addressed ML applications in health care,[[
This study has several strengths. First, while the treatment prognosis of NPC has improved in the past decade in Taiwan, patients with NPC who presented with clinical stage III and stage IV increased from 70.4% (961/1365) in 2009 to 72.1% (998/1385) in 2021, indicating a need for a universal screening method to expedite early diagnosis. This study proposed a machine‐learning model using the healthcare‐seeking records available for individual patients (My Health Bank) and nationwide (the National Health Insurance Research Database). Applying the model in screening may alter the presenting stage of patients with NPC in the long run, thereby further improving prognosis. Second, our results indicate that ML can be a promising method for NPC risk stratification. The developed model used a 14‐day time window before NPC diagnosis, a minimum of 90 days of EMR data, and 14 clinically explainable variables. Third, this tool was tested on an independent data set, achieving high performance. By using accessible, personalized data, namely that from My Health Bank, our model could perform an automated NPC risk assessment within seconds, thus facilitating the large‐scale, nationwide implementation of screening programs for early detection of NPC. The decision support system may also assist a physician's clinical decision‐making for NPC diagnostic interventions, especially for nonotolaryngologists. Finally, this approach can be scaled across specialties or customized for risk stratification of other severe diseases, especially those with subtle initial symptoms.
This study has several limitations. First, the case–control study‐based predictive model included only sex, age, and symptom‐related management data but no other identified risk factors. A well‐designed cohort study including more established risk factors, such as family history, cigarette smoking, diet, environmental exposure, EBV status, and genetic predisposition, might improve the current model. Second, our model was built using data from a Taiwanese population in an endemic region, precluding easy generalizability to other areas without NPC endemicity and with different ethnicities. Moreover, differences in medical insurance systems and health‐seeking behavior in different countries may cause variations in health‐care data structure and availability. Therefore, the models should be adjusted before they can be used for NPC screening and risk assessment in other countries. Finally, although this NPC predictive model was tested on a 5‐year unbalanced real‐world data set, an extended cohort with a larger sample size and a longer follow‐up are necessary to evaluate its validity.
Individual EMRs represent an inexpensive and accessible data source. Our study used ML models built on such data to predict the occurrence of NPC more reliably. Applying our predictive model can help shorten the prediagnostic period in patients with NPC, and patients identified by the model as high‐risk should be promptly referred for confirmatory tests.
Jeng‐Wen Chen: Conceptualization (equal); funding acquisition (equal); investigation (equal); methodology (equal); project administration (equal); resources (equal); visualization (equal); writing – original draft (equal); writing – review and editing (equal). Shih‐Tsang Lin: Conceptualization (equal); data curation (equal); funding acquisition (equal); project administration (equal); resources (equal). Yi‐Chun Lin: Conceptualization (equal); data curation (equal); formal analysis (equal); methodology (equal); software (equal); visualization (equal). Bo‐Sian Wang: Data curation (equal); methodology (equal). Yu‐Ning Chien: Conceptualization (equal); formal analysis (equal); investigation (equal); methodology (equal); project administration (equal); resources (supporting); validation (equal); visualization (equal); writing – original draft (equal); writing – review and editing (equal). Hung‐Yi Chiou: Conceptualization (equal); funding acquisition (equal); project administration (equal); supervision (equal); validation (equal).
We are grateful for the administrative assistance on this project provided by Chiu‐Ping Wang, Shu‐Hwei Fan, Wei‐Chun Chen, Uan‐Shr Jan, and Wan‐Ning Luo; they received no additional compensation for their contributions. We also thank the Health and Welfare Data Science Center, Ministry of Health and Welfare, Taiwan, for providing the National Health Insurance Research Database.
This study was supported by three resources: (
None declared. All authors disclose that they have no relevant relationships.
Research data are not shared. Data sharing is not applicable to this article because the NHIRD data are managed by the government; researchers may analyze the data but cannot obtain them.
Supporting Information: Appendix S1.
By Jeng‐Wen Chen; Shih‐Tsang Lin; Yi‐Chun Lin; Bo‐Sian Wang; Yu‐Ning Chien and Hung‐Yi Chiou