Simple Summary: Current AI algorithms show breast cancer detection rates that are comparable to human readers, but it is not clear whether AI and humans detect cancers with similar characteristics. As these factors will influence survival, we aimed to compare the invasiveness status, histological grade, lymph node stage, and tumour size of cancers. Women diagnosed with breast cancer between 2009 and 2019 from three UK double-reading sites were included in this retrospective cohort evaluation. Of 1718 screen-detected cancers (SDCs) and 293 interval cancers (ICs), AI indicated 85.9% and 31.7%, respectively, as suspicious. The first human reader detected 90.8% of SDCs and 7.2% of ICs. There were no differences in the detected proportion for any of the investigated prognostic factors. The AI algorithm detected more ICs. These findings imply that using AI has limited or no downstream effects on screening programmes, supporting its potential role in the double-reading workflow.
Abstract: Invasiveness status, histological grade, lymph node stage, and tumour size are important prognostic factors for breast cancer survival. This evaluation aims to compare these features for cancers detected by AI and human readers using digital mammography. Women diagnosed with breast cancer between 2009 and 2019 from three UK double-reading sites were included in this retrospective cohort evaluation. Differences in prognostic features of cancers detected by AI and the first human reader (R1) were assessed using chi-square tests, with significance at p < 0.05. Of 1718 screen-detected cancers (SDCs) and 293 interval cancers (ICs), AI flagged 85.9% and 31.7%, respectively. R1 detected 90.8% of SDCs and 7.2% of ICs. Of the screen-detected cancers detected by the AI, 82.5% had an invasive component, compared to 81.1% for R1 (p = 0.374). For the ICs, this was 91.5% and 93.8% for AI and R1, respectively (p = 0.829). For the invasive tumours, no differences were found for histological grade, tumour size, or lymph node stage. The AI detected more ICs. In summary, no differences in prognostic factors were found between cancers identified by AI and those identified by human readers, for either SDCs or ICs. These findings support a potential role for AI in the double-reading workflow.
Keywords: breast cancer; AI; population screening; cancer prognosis; cancer survival
The positive effect of breast cancer screening on survival rates is mainly due to the early detection of invasive cancers [[
Studies have reported that AI performance is comparable to or might even outperform humans in the interpretation of screening studies [[
While the topic is considered important, original research on it is scarce. Lee et al. reported that the AI gave higher malignancy scores to invasive cancers compared to non-invasive cancers but did not compare the proportion found by the AI to that of human readers [[
The American Joint Committee on Cancer now incorporates biomarkers such as hormone receptor status and HER2 status into the 8th edition of its cancer staging system [[
A retrospective study was carried out at four centres: three in the UK and one in Hungary. The trial was registered at ISRCTN (https://
Mammograms were acquired on equipment from a single vendor at each screening site (Hologic at LTHT, GE Healthcare at NUH, and Siemens at ULH). In the UK, women are routinely invited between the ages of 50 and 70, with women over 70 years of age able to self-refer. The screening programmes used double human reading as the standard of care throughout the study period. Discrepancies between readers in the double-reading workflow were resolved by arbitration using an independent third reader or consensus, depending on local screening centre protocols. When both readers agreed to recall a case, either a "recall" decision was reached directly or arbitration by a single radiologist or a group of radiologists made the definitive "recall" or "no recall" decision, again depending on local screening centre protocols. For the current analysis, human reader performance was based on the first reader's opinion (R1). At all sites, the historical first reader's opinion was made in isolation, whereas the second reader had access, at their discretion, to the opinion of the first. Therefore, the first human reader's opinion is a more reliable indicator of individual human performance than the second reader's.
All study cases were analysed by the Mia
The AI software version was fixed prior to the study. All study data came from participants whose data were never used in any aspect of algorithm development.
All positive cases, both screen-detected and interval cancers, were pathology-proven malignancies. The presence of an invasive component, histological grade, tumour size, and lymph node status were retrieved from the NHS National Breast Screening Service (NBSS) database, including cancer registry information. The human readers' recall decisions were also retrieved from the NBSS database.
The screen-detected cancers and interval cancers (IC) were analysed separately. For all tumour characteristics, numbers, percentages, and 95% Wilson score confidence intervals were reported [[
The sensitivity of AI and R1 for each prognostic subgroup was compared using McNemar's test [[
Commonly used outcome metrics for breast screening programmes have recently been reported for the study population [[
Data from a previously performed study was used [[
The AI system detected 85.9% (1475) of the screen-detected cancers and indicated 31.7% (
The screen-detected cancers were invasive in 82.5% (95% CI 80.4%–84.3%) of the cases for the AI, while this was 81.1% (95% CI 79.1%–83.0%) for R1, resulting in a non-significant p-value of 0.374 (Table 1, Figure 3). For the ICs detected by the AI, a proportion of 91.5% (95% CI 82.8%–96.1%) was invasive, while this was 93.8% (95% CI 71.7%–98.9%) for R1 (p = 0.999; Table 1, Figure 3).
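The Wilson score intervals quoted above can be recomputed from the counts in Table 1. The following is a minimal sketch, not the authors' analysis code: the function name `wilson_interval` is hypothetical, and it assumes that cases with missing invasiveness status are excluded from the denominator, which is consistent with the reported percentages.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Screen-detected cancers from Table 1: invasive / (invasive + non-invasive).
# Assumption: cases with missing invasiveness status are left out.
ai_low, ai_high = wilson_interval(1213, 1213 + 258)
r1_low, r1_high = wilson_interval(1261, 1261 + 293)
print(f"AI: {ai_low:.1%} to {ai_high:.1%}")
print(f"R1: {r1_low:.1%} to {r1_high:.1%}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 to 100% even for small groups such as the 16 interval cancers detected by R1 with known invasiveness status, which is presumably why it suits the sparse interval-cancer rows.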
The distribution of the tumour characteristics for the invasive cancers is shown in Table 2 and Table 3 for the screen-detected cancers and ICs, respectively, as well as in Figure 3a–d. A refined categorisation for whole tumour size and invasive tumour size can be found in Table S1. The 95% confidence intervals for the histological grade distribution overlapped substantially for both screen-detected cancers and ICs. Grade 3 tumours made up 17.2% and 17.3% of the screen-detected invasive cancers for the AI and R1, respectively (p = 0.981). For the invasive ICs, the percentage of grade 3 cancers was 35.9% for the AI and 33.3% for R1 (p = 0.851). The absolute number of detected ICs was much lower than the number of screen-detected invasive cancers, resulting in wider confidence intervals (Table 3).
There were no differences in invasive tumour size for screen-detected cancers identified by AI or R1: 83.1% (907/1091) and 83.1% (941/1133) were ≤20 mm, respectively (p = 0.995). A comparison of the whole tumour size, which includes the size of invasive and non-invasive disease, also showed no difference between AI and R1: 66.9% (723/1080) and 66.1% (742/1122) were ≤20 mm, respectively (p = 0.720; Table 2). For the invasive tumour size of the ICs, the proportion of small cancers for the AI (56.7%) was lower than for R1 (64.3%), but the chi-square test resulted in a non-significant p-value of 0.826 (Table 3). For invasive as well as whole tumour size, the 95% confidence intervals largely overlapped. Results for a refined categorisation into five categories (≤5 mm, 5–10 mm, 10–20 mm, 20–50 mm, and >50 mm) can be found in Supplementary Table S2.
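As an illustration of the chi-square comparison, the sketch below applies a 2×2 Pearson chi-square test to the whole tumour size counts in Table 2. The source does not state which variant of the test was used; the Yates continuity correction is an assumption made here, chosen because it gives a p-value close to the 0.720 listed in Table 2. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int, yates: bool = True) -> tuple[float, float]:
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    The Yates continuity correction is an assumption, not confirmed by the source.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected == 0:
                raise ValueError("expected count of zero; test undefined")
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff**2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Whole tumour size, screen-detected invasive cancers (Table 2):
# R1: 742 at <=20 mm, 380 at >20 mm; AI: 723 at <=20 mm, 357 at >20 mm.
stat, p = chi_square_2x2(742, 380, 723, 357)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

Note that most cancers were detected by both R1 and the AI, so the two groups are not independent samples; the sketch only mirrors the comparison as described and does not address that overlap.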
The proportion of screen-detected cancers with a positive lymph node status was 20.0% and 20.1% for the AI and R1, respectively (p = 0.959), while these proportions were 36.4% and 40.0% for the ICs (p = 0.999). For this tumour characteristic too, the 95% confidence intervals overlapped substantially.
The sensitivity of R1 and AI for cancers with specific characteristics is shown in Table 4. While R1 has a significantly higher sensitivity for non-invasive cancers, the AI shows a somewhat higher sensitivity for cancers with a large invasive tumour size and for high-grade tumours. This finding is mainly due to the higher proportion of ICs that the AI detects, as shown in Table S3a,b for screen-detected and interval cancers, respectively. The 95% confidence intervals again show a large overlap for most characteristics, except for non-invasive cancers, where R1's sensitivity is higher than that of the AI. Table S4 displays the relative sensitivity per prognostic subgroup. For most subgroups, the relative sensitivities of the AI and R1 are in line with each other, indicating that the number of cancers detected by R1 from the group of cancers detected by the AI is comparable to the number of cancers detected by the AI from the group of cancers detected by R1.
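A paired comparison of sensitivities such as the one in Table 4 depends only on the discordant cases, that is, cancers flagged by one reader but not the other. The sketch below shows an exact (binomial) McNemar test under that framing. The exact variant and the function name `mcnemar_exact` are assumptions, and the discordant counts in the example are purely hypothetical: Table 4 gives only the totals for non-invasive cancers (264 detected by the AI and 294 by R1 out of 341), which fixes the difference between the discordant counts at 30 but not the counts themselves.

```python
from math import comb


def mcnemar_exact(ai_only: int, r1_only: int) -> float:
    """Two-sided exact McNemar test from the two discordant counts.

    ai_only: cancers detected by the AI but not by R1.
    r1_only: cancers detected by R1 but not by the AI.
    Under the null hypothesis each discordant case is equally likely
    to fall in either group, so the smaller count follows Binomial(n, 0.5).
    """
    n = ai_only + r1_only
    if n == 0:
        return 1.0  # no discordant cases, no evidence of a difference
    k = min(ai_only, r1_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for non-invasive cancers (not from the source):
# any pair with r1_only - ai_only = 30 matches the totals in Table 4.
p_value = mcnemar_exact(ai_only=10, r1_only=40)
print(f"p = {p_value:.2g}")
```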
The aim of this large-scale evaluation was to compare the prognostic features of cancers detected by AI and human readers. The proportion of invasive cancers detected by AI was comparable to the proportion detected by R1. Additionally, for the tumour characteristics of the invasive cancers, no differences were found between the AI and R1. Results for screen-detected cancers and ICs were similar, although the number of ICs in the dataset was lower, which resulted in broader confidence intervals. In addition, the AI detected a higher number of ICs than the human readers (31.7% versus 7.2%, respectively), indicating the complementary value of the AI to humans. Our results indicate that using AI in breast screening will not have downstream effects on screening programmes. Moreover, the findings suggest that cancers detected by AI and human readers are likely to have a similar clinical course and outcome, supporting the potential role of AI as a reader in the double-reading workflow. As such, these results pave the way for prospective assessment of AI, either in clinical studies or in service evaluations, as a safe next step.
Very few previous studies have assessed the prognostic features of cancers detected by AI. McKinney et al. compared the histological features of cancers detected by the first human reader and the AI system and found no difference in the presence of invasive disease, histological grade, or size, but their analysis only included 414 cancers [[
The study of Leibig et al. only included screen-detected cancers, while we analysed screen-detected as well as interval cancers [[
It is well known that interval cancers have less favourable prognostic features compared to screen-detected cancers [[
In the last decades, there has been an ongoing debate about the benefits and harms of breast cancer screening in terms of overdiagnosis, whether of ductal carcinoma in situ (DCIS) or invasive cancer [[
The strength of our study is that the data has been acquired from large real-world screening populations with mammograms acquired from multiple mammography equipment vendors. However, the missing data for some cancer cases might be seen as a limitation. In addition, cancers were detected several months to three years after the initial screening. This delay makes it difficult to interpret the tumour characteristics, as it is unknown whether a biopsy and subsequent surgery performed at the time of the initial screening would have resulted in a smaller or non-invasive cancer. In some cases, cancer might not have been present at all on the initial screen. Additionally, interval cancer data is incomplete for the later part of the ten-year screening period as not enough time has elapsed for all interval cancer cases to have presented or been notified. As the AI system has been shown to be more sensitive than R1 on interval cancers, it is expected that with a more complete collection of interval cancers, the relative sensitivity of the AI system to R1 would be revealed to be greater. For the same cohort and AI system as in this study, this was reported to be the case where the relative difference in sensitivity of the AI system to R1 ranged from −1.4 to 0.8% across UK sites for the whole study time period, including the later years with missing interval cancer data, but was revealed to be greater, ranging from 2.5% to 7.6% for a study time period with more complete interval cancer data [[
Despite these limitations, our study, based on a large number of screen-detected breast cancers and interval cancers, provides valuable insights into the cancer spectrum detected by AI compared to human readers. The ability of AI to detect breast cancers with similar prognostic features to human readers suggests that cancers detected by AI and human readers will have a comparable clinical course and outcome, including survival. In addition, AI also has the potential to identify interval cancers earlier, which is likely to be of benefit to individual women and the wider screening programme. Prospective studies combining an AI opinion and a human reader are currently lacking and will be needed to better understand the effect on cancer detection, interval cancer rates, and prognostic features.
The results of this analysis are based on a large number of screen-detected and interval cancers. The AI showed that it can potentially detect interval cancers earlier, which could be beneficial for individual women and the wider screening programme. The cancers detected by the AI and by human readers were comparable in terms of invasiveness, histological grade, tumour size, and lymph node status. Therefore, cancers detected by AI are expected to have a similar clinical course and outcome, including survival. This implies that using AI in a double-reading workflow will have no or limited effects on screening programmes.
These findings indicate that it is safe to integrate AI into breast cancer screening programmes. As such, they pave the way for initiating prospective studies and service evaluations.
Figure 1 Flow chart of participants. The cases are shown per screening centre/vendor. CN: confirmed negatives; UN: unconfirmed; CP: screen-detected cancer; IC: interval cancer.
Figure 2 Venn diagram showing cancer detection for the first human reader (R1, purple circle) and AI (green circle). The overlap indicates the cancers being detected by both R1 and the AI. The yellow circle indicates the cancers that were not detected by R1 or the AI. Cancer detection is shown for (a) all cancers, (b) CP (screen-detected cancers), and (c) IC (interval cancers).
Figure 3 Cancers with an invasive component detected by AI versus Reader 1: (a) number of screen-detected cancers; (b) number of interval cancers; (c) percentage of screen-detected cancers; and (d) percentage of interval cancers.
Table 1 Presence of an invasive component in the detected cancers.
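| Cancer type | Invasive component | R1: N | R1: % | R1: 95% CI | AI: N | AI: % | AI: 95% CI | All cancers: N | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Screen-detected | Present | 1261 | 81.1% | 79.1–83.0% | 1213 | 82.5% | 80.4–84.3% | 1385 | 0.374 |
| Screen-detected | Not present | 293 | 18.9% | 17.0–20.9% | 258 | 17.5% | 15.7–19.6% | 326 | |
| Screen-detected | Missing | 6 | | | 4 | | | 7 | |
| Interval | Present | 15 | 93.8% | 71.7–98.9% | 65 | 91.5% | 82.8–96.1% | 222 | 0.999 |
| Interval | Not present | 1 | 6.3% | 1.1–28.3% | 6 | 8.5% | 3.9–17.2% | 15 | |
| Interval | Missing | 5 | | | 22 | | | 56 | |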
Table 2 Characteristics of invasive cancers detected on screen by a human reader or AI.
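| Characteristic | Category | R1 (N = 1261): N | R1: % | R1: 95% CI | AI (N = 1213): N | AI: % | AI: 95% CI | p-value |
|---|---|---|---|---|---|---|---|---|
| Histological grade | 1 | 329 | 27.1% | 24.7–29.7% | 313 | 26.9% | 24.4–29.5% | 0.981 |
| Histological grade | 2 | 673 | 55.5% | 52.7–58.3% | 651 | 55.9% | 53.1–58.8% | |
| Histological grade | 3 | 210 | 17.3% | 15.3–19.6% | 200 | 17.2% | 15.1–19.5% | |
| Histological grade | Missing | 49 | | | 97 | | | |
| Whole tumour size | ≤20 mm | 742 | 66.1% | 63.3–68.8% | 723 | 66.9% | 64.1–69.7% | 0.720 |
| Whole tumour size | >20 mm | 380 | 33.9% | 31.2–36.7% | 357 | 33.1% | 30.3–35.9% | |
| Whole tumour size | Missing | 139 | | | 181 | | | |
| Invasive tumour size | ≤20 mm | 941 | 83.1% | 80.8–85.1% | 907 | 83.1% | 80.8–85.2% | 0.995 |
| Invasive tumour size | >20 mm | 192 | 16.9% | 14.9–19.2% | 184 | 16.9% | 14.8–19.2% | |
| Invasive tumour size | Missing | 128 | | | 170 | | | |
| Lymph node status | Negative | 912 | 79.9% | 77.4–82.1% | 878 | 80.0% | 77.6–82.3% | 0.959 |
| Lymph node status | Positive | 230 | 20.1% | 17.9–22.6% | 219 | 20.0% | 17.7–22.4% | |
| Lymph node status | Missing | 119 | | | 164 | | | |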
Table 3 Characteristics of invasive interval cancers detected by human reader or AI.
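| Characteristic | Category | R1 (N = 15): N | R1: % | R1: 95% CI | AI (N = 65): N | AI: % | AI: 95% CI | p-value |
|---|---|---|---|---|---|---|---|---|
| Histological grade | 1 | 3 | 20.0% | 7.0–45.2% | 16 | 25.0% | 16.0–36.8% | 0.999 |
| Histological grade | 2 | 7 | 46.7% | 24.8–69.9% | 25 | 39.1% | 28.1–51.3% | |
| Histological grade | 3 | 5 | 33.3% | 15.2–58.3% | 23 | 35.9% | 25.3–48.2% | |
| Histological grade | Missing | 0 | | | 1 | | | |
| Whole tumour size | ≤20 mm | 3 | 37.5% | 13.7–69.4% | 19 | 44.2% | 30.4–58.9% | 0.999 |
| Whole tumour size | >20 mm | 5 | 62.5% | 30.6–86.3% | 24 | 55.8% | 41.1–69.6% | |
| Whole tumour size | Missing | 7 | | | 22 | | | |
| Invasive tumour size | ≤20 mm | 9 | 64.3% | 38.8–83.7% | 34 | 56.7% | 44.1–68.4% | 0.766 |
| Invasive tumour size | >20 mm | 5 | 35.7% | 16.3–61.2% | 26 | 43.3% | 31.6–55.9% | |
| Invasive tumour size | Missing | 1 | | | 5 | | | |
| Lymph node status | Negative | 6 | 60.0% | 31.3–83.2% | 28 | 63.6% | 48.9–76.2% | 0.999 |
| Lymph node status | Positive | 4 | 40.0% | 16.8–68.7% | 16 | 36.4% | 23.8–51.1% | |
| Lymph node status | Missing | 5 | | | 21 | | | |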
Table 4 Sensitivity per prognostic subgroup for screen-detected and interval cancers combined.
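| Variable | Category | N | Detected AI | SEN AI | 95% CI | Detected R1 | SEN R1 | 95% CI | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Invasive component | Not present | 341 | 264 | 77.4% | 72.9–81.8% | 294 | 86.2% | 82.5–89.9% | <0.001 |
| Invasive component | Present | 1607 | 1278 | 79.5% | 77.6–81.5% | 1276 | 79.4% | 77.4–81.4% | 0.954 |
| Invasive component | Missing | 63 | | | | | | | |
| Tumour grade | Grade 1 | 401 | 329 | 82.0% | 78.3–85.6% | 332 | 82.8% | 78.9–86.5% | 0.836 |
| Tumour grade | Grade 2 | 845 | 676 | 80.0% | 77.2–82.6% | 680 | 80.5% | 77.7–83.0% | 0.805 |
| Tumour grade | Grade 3 | 307 | 223 | 72.6% | 67.6–77.5% | 215 | 70.0% | 64.9–75.3% | 0.322 |
| Tumour grade | Missing | 54 | | | | | | | |
| Whole tumour size | ≤20 mm | 911 | 742 | 81.4% | 78.9–83.9% | 745 | 81.8% | 79.2–84.2% | 0.880 |
| Whole tumour size | >20 mm | 492 | 381 | 77.4% | 73.7–81.1% | 385 | 78.3% | 74.6–81.9% | 0.752 |
| Whole tumour size | Missing | 204 | | | | | | | |
| Invasive tumour size | ≤20 mm | 1168 | 941 | 80.6% | 78.3–82.7% | 950 | 81.3% | 79.1–83.5% | 0.594 |
| Invasive tumour size | >20 mm | 289 | 210 | 72.7% | 67.5–77.8% | 197 | 68.2% | 62.8–73.5% | 0.092 |
| Invasive tumour size | Missing | 150 | | | | | | | |
| Lymph node status | Negative | 1110 | 906 | 81.6% | 79.3–83.8% | 918 | 82.7% | 80.4–84.9% | 0.460 |
| Lymph node status | Positive | 298 | 235 | 78.9% | 74.2–83.6% | 234 | 78.5% | 73.7–83.1% | 0.999 |
| Lymph node status | Missing | 199 | | | | | | | |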
Conceptualization, N.S., J.N., J.J.J., A.Y.N. and P.D.K.; methodology, C.J.G.O., A.Y.N., P.D.K. and J.N.; formal analysis, C.J.G.O.; writing—original draft preparation, A.Y.N., P.D.K. and C.J.G.O.; writing—review and editing, C.J.G.O., A.Y.N., N.S., J.J.J., J.N. and P.D.K.; supervision, P.D.K. All authors have read and agreed to the published version of the manuscript.
The study had UK National Health Service (NHS) Health Research Authority (HRA) (Reference: 19/HRA/0376) and ETT-TUKEB (Medical Research Council, Scientific and Research Ethics Committee, Hungary) approval (Reference: OGYÉI/46651-4/2020). The study was performed in accordance with the principles outlined in the Declaration of Helsinki.
The need for consent to participate was reviewed by HRA and ETT-TUKEB and confirmed to not be required as the study involved secondary use of retrospective and pseudonymised data.
Restrictions apply to the availability of these data. Data generated or analysed during the study are available from the corresponding author by request. The imaging datasets are obtained under licences for this study and are not publicly available.
All authors have completed the MDPI disclosure form. The following authors are paid employees of Kheiron Medical Technologies: Cary J.G. Oberije, Annie Y. Ng, Jonathan Nash, and Peter D. Kecskemethy. All other authors received no payment for this work. There are no other relationships or activities that could appear to have influenced the submitted work.
The following supporting information can be downloaded at: https://
By Cary J. G. Oberije; Nisha Sharma; Jonathan J. James; Annie Y. Ng; Jonathan Nash and Peter D. Kecskemethy