Featured Application: The use of highly robust radiomic features is fundamental to reduce intrinsic dependencies and to provide reliable predictive models. This work presents a study on breast tumor DCE-MRI considering the radiomic feature robustness against the quantization settings and segmentation methods.
Machine learning models based on radiomic features allow us to obtain biomarkers that are capable of modeling the disease and that are able to support the clinical routine. Recent studies have shown that it is fundamental that the computed features are robust and reproducible. Although several initiatives to standardize the definition and extraction process of biomarkers are ongoing, there is a lack of comprehensive guidelines. Therefore, no standardized procedures are available for ROI selection, feature extraction, and processing, with the risk of undermining the effective use of radiomic models in clinical routine. In this study, we aim to assess the impact that the different segmentation methods and the quantization level (defined by means of the number of bins used in the feature-extraction phase) may have on the robustness of the radiomic features. In particular, the robustness of texture features extracted by PyRadiomics, and belonging to five categories (GLCM, GLRLM, GLSZM, GLDM, and NGTDM), was evaluated using the intra-class correlation coefficient (ICC) and mean differences between segmentation raters. In addition to the robustness of each single feature, an overall index for each feature category was quantified. The analysis showed that the level of quantization (i.e., the 'binCount' parameter) plays a key role in defining robust features: in fact, in our study focused on a dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) dataset of 111 breast masses, sets with cardinality varying between 34 and 43 robust features were obtained with 'binCount' values equal to 256 and 32, respectively.
Moreover, both manual segmentation methods demonstrated good reliability and agreement, while automated segmentation achieved lower ICC values. Considering the dependence on the quantization level, taking into account only the intersection subset among all the values of 'binCount' could be the best selection strategy. Among radiomic feature categories, GLCM, GLRLM, and GLDM showed the best overall robustness with varying segmentation methods.
Keywords: robustness analysis; radiomic features; quantization levels; segmentation method agreement; DCE-MRI; breast tumors
Radiomics techniques, aimed at analyzing a large amount of minable features extracted from medical images, have shown great potential in different clinical areas [[
It is well known that radiomic features might be affected by several mathematical definitions, and the proliferation of toolboxes does not help this aspect. Thus, standardization initiatives have been carried out by the scientific community in order to deal with the lack of reproducibility and validation of radiomics studies. In particular, the Image Biomarker Standardization Initiative (IBSI) [[
Nevertheless, there are still no comprehensive and clear guidelines to obtain radiomic features that are not only reproducible but also robust. For that reason, an accurate and careful analysis of the robustness of the radiomic features is mandatory to define robust and clinically relevant biomarkers. Due to the wider diffusion and use of radiomics in recent years, methodologies aimed at improving the reproducibility and robustness of these tools have become a research topic that the scientific community has addressed from different points of view and in several clinical application domains.
There are many 'sources of variability' that should be considered: many literature works have analyzed and assessed the impact on the robustness of radiomic features related to intrinsic factors, such as (i) imaging protocol [[
The main goal of this work is to analyze the dependence of robustness on the quantization level used during feature extraction and the reliability of the segmentation of breast masses on dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) by relying upon two segmentation methods: an automated approach and manual delineation, the latter performed independently by two raters. The proposed analysis exploits hand-crafted features and traditional machine learning approaches, which are still popular in radiomics studies today, aimed at establishing predictive models that are reliable for clinical applications.
The main contributions of this study are:
- an evaluation of the robustness of radiomic features, in terms of ICC, as a function of quantization level in PyRadiomics;
- an assessment of segmentation by comparing three different raters (an automated approach and the manual segmentations of two human operators) to evaluate feature reproducibility;
- a definition of a robustness scale for individual features as a function of ICC and mean standard deviation of the feature differences;
- a definition and quantification of an overall robustness metric for each radiomic category (i.e., GLCM, GLRLM, GLSZM, GLDM, and NGTDM).
The remainder of this work is organized as follows. Section 2 describes the conducted study aimed at dealing with the robustness evaluation. Section 3 illustrates the experimental results in terms of dependence on the quantization level, as well as the segmentation assessment. Finally, the discussion and conclusions are provided in Section 4.
This section describes the characteristics of the DCE-MRI dataset used and the analysis carried out to assess the robustness of the features, relative to their dependence on the quantization level, as well as on their reproducibility concerning the segmentation method.
The radiomic feature robustness analysis and quantification method were implemented entirely in the MATLAB R2019b (64-bit version) environment (MathWorks, Natick, MA, USA). Radiomic features were extracted with PyRadiomics, an open-source Python package developed for the standardization of radiomic feature extraction [[
A total of 111 breast masses from DCE-MRI exams were considered in this study, for a total of 1231 MR slices. Table 1 provides some relevant characteristics concerning the MR imaging protocol.
ROIs containing breast masses were segmented using two different segmentation methods:
- Automated delineation. This is a computer-assisted method based on the spatial fuzzy c-means (sFCM) clustering algorithm [[19 ]]. The sFCM algorithm, compared to the traditional FCM, takes into account the spatial relationship among neighboring pixels, making it less sensitive to noise and other imaging artifacts. This approach has been previously implemented and validated in [[21 ]];
- Manual free-hand delineation. In order to quantify the inter-operator dependence of the robustness of the features, two delineations were performed by two radiologists with more than 5 years of experience with breast MRI, in consensus with a consultant breast radiologist (with more than 30 years of experience with breast imaging).
Segmentations—in terms of original MR slices containing the breast mass and the corresponding masks—were converted from DICOM to the Neuroimaging Informatics Technology Initiative (NIfTI) format [[
The DCE-MRI images analyzed represent a homogeneous dataset in terms of spatial resolution along the
- Gray-level co-occurrence matrix (GLCM) [[23 ]]: spatial relationship between pixels in a specific direction, highlighting the properties of uniformity, homogeneity, randomness, and linear dependencies;
- Gray-level run length matrix (GLRLM) [[25 ]]: texture in a specific direction, where fine texture has shorter runs while coarse texture presents longer runs with different intensity values;
- Gray-level size zone matrix (GLSZM) [[26 ]]: regional intensity variations or the distribution of homogeneous regions;
- Gray-level dependence matrix (GLDM) [[27 ]]: quantifies gray-level dependencies;
- Neighboring gray tone difference matrix (NGTDM) [[28 ]]: spatial relationship among three or more pixels, closely approaching the human perception of the image.
Conversely, shape-based and first-order features were not considered in the study because they are independent of the quantization level.
It is worth noting that texture features, such as the GLCM features (also known as Haralick's features [[
Radiomic features were extracted considering different quantization levels (i.e., number of bins equal to
(
where
The quantization of images, in terms of the rebinning of the gray levels prior to feature computation, has a two-fold goal: (i) noise reduction and (ii) avoidance of sparse matrices (possibly resulting in unsuitable and poorly robust features for predictive modeling). The Image Biomarker Standardization Initiative (IBSI) [[
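The rebinning step described above can be illustrated with a minimal sketch of fixed-bin-number quantization, in which the intensity range of the ROI is divided into a chosen number of equal-width bins. The function name `quantize_fixed_bin_count` and the sample intensities are hypothetical, and the edge handling shown here is an assumption: the exact behavior of the 'binCount' setting inside PyRadiomics may differ in detail.

```python
import math
from typing import List, Sequence


def quantize_fixed_bin_count(values: Sequence[float], bin_count: int) -> List[int]:
    """Map intensities to integer gray levels 1..bin_count using equal-width bins.

    Illustrative only: this is a generic fixed-bin-number scheme, not a copy
    of the PyRadiomics implementation.
    """
    if bin_count < 1:
        raise ValueError("bin_count must be a positive integer")
    lo, hi = min(values), max(values)
    if hi == lo:
        # Flat region: every pixel falls into the first bin.
        return [1] * len(values)
    width = (hi - lo) / bin_count
    levels = []
    for v in values:
        level = math.floor((v - lo) / width) + 1
        # The maximum intensity would land in bin_count + 1, so clamp it.
        levels.append(min(level, bin_count))
    return levels


# Hypothetical ROI intensities, used only to show the effect of the bin number.
roi = [12.0, 15.5, 20.0, 47.3, 80.0, 99.9, 100.0, 140.0]
for n_bins in (8, 32, 256):
    q = quantize_fixed_bin_count(roi, n_bins)
    print(n_bins, "bins ->", len(set(q)), "distinct gray levels")
```

With few bins, nearby intensities collapse into the same gray level (less noise, denser texture matrices); with many bins, almost every intensity keeps its own level, which tends to produce sparser matrices.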
The 'binCount' parameter is used by PyRadiomics to determine the image quantization settings (i.e., the number of bins) in the radiomic feature-extraction phase. The optimal value for 'binCount' was chosen so as to maximize the number of robust features (in terms of ICC). This choice allowed us to carefully assess the quantization settings, thereby avoiding an arbitrary selection of the number of bins. In more detail, to evaluate the robustness as the quantization level changes, the
(
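As a rough sketch of how the extraction could be repeated over several quantization levels, the snippet below builds one PyRadiomics settings dictionary per 'binCount' value and enables only the five texture classes considered here. The helper names (`build_settings`, `extract_texture_features`) are hypothetical, the bin counts are those listed in the figure captions, and any other extraction setting used in the study is not known from the text and is therefore left at the PyRadiomics defaults.

```python
from typing import Dict, Iterable, List

# Texture classes analyzed in the study, using PyRadiomics class names.
TEXTURE_CLASSES = ("glcm", "glrlm", "glszm", "gldm", "ngtdm")
# Quantization levels reported in the figure captions.
BIN_COUNTS = (8, 16, 32, 64, 128, 256)


def build_settings(bin_counts: Iterable[int] = BIN_COUNTS) -> List[Dict[str, int]]:
    """Return one PyRadiomics settings dictionary per quantization level."""
    return [{"binCount": int(b)} for b in bin_counts]


def extract_texture_features(image_path: str, mask_path: str,
                             settings: Dict[str, int]) -> Dict[str, float]:
    """Extract the five texture classes for one image/mask pair.

    Requires the third-party pyradiomics package, imported lazily so that
    the rest of this sketch can be used without it.
    """
    from radiomics import featureextractor

    extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
    extractor.disableAllFeatures()
    for name in TEXTURE_CLASSES:
        extractor.enableFeatureClassByName(name)
    result = extractor.execute(image_path, mask_path)
    # Keep only feature values, dropping the diagnostic entries.
    return {k: float(v) for k, v in result.items()
            if not k.startswith("diagnostics_")}
```

A call such as `extract_texture_features("mass_001.nii.gz", "mass_001_mask.nii.gz", build_settings()[2])` (hypothetical file names) would return the texture features computed with 32 bins.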
Each breast mass was segmented via two different delineation approaches: (i) manual segmentation, performed by two distinct radiologists with more than 5 years of experience; (ii) an automated method based on spatial FCM clustering that was already tested and validated in [[
(
(
(
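Since the ICC is the central agreement measure in this analysis, a small self-contained sketch of one common ICC form may help. The variant below is the two-way random-effects, absolute-agreement, single-measurement coefficient, often written ICC(2,1); this choice is an assumption, because the specific ICC form adopted in the study is not recoverable from the text. Rows are the measured targets (e.g., breast masses) and columns are the repeated measurements (e.g., raters or quantization levels).

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    Assumption: this form is used for illustration only; the study may have
    relied on a different ICC variant.
    """
    n = len(ratings)       # number of targets (rows)
    k = len(ratings[0])    # number of measurements per target (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-measurement mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical feature values for four masses measured by three raters.
feature_values = [
    [0.81, 0.80, 0.83],
    [0.42, 0.45, 0.40],
    [0.65, 0.66, 0.61],
    [0.20, 0.22, 0.25],
]
print(round(icc_2_1(feature_values), 3))
```

Values close to 1 indicate that the variation between masses dominates the variation introduced by the raters, which is the sense in which a feature is called robust here.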
After determining the features showing excellent robustness, we aimed to identify the most relevant features for the analysis at hand by evaluating the dependence on the segmentation method. To this aim, we considered
The Shapiro–Wilk test [[
To evaluate the robustness of the MRI radiomic features, a six-level scale (ranging from 0 to 5) was defined, based on a combination of the ICC and the standard deviation (SD) of the mean percentage differences between the three raters (i.e., auto, man1, man2), according to the conditions in Table 2. Percentage differences were evaluated according to Equation (
(
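The six-level scale of Table 2 can be sketched as a simple lookup. The text states only that the score is based on a combination of the ICC and SD conditions; the sketch below assumes that both conditions of a level must hold at the same time and that the highest satisfied level is assigned, with 0 as the fallback. The function name `robustness_score` is hypothetical.

```python
from typing import List, Tuple

# (score, minimum ICC in %, maximum SD in %), following Table 2.
ROBUSTNESS_SCALE: List[Tuple[int, float, float]] = [
    (5, 90.0, 10.0),   # very high
    (4, 85.0, 20.0),   # high
    (3, 80.0, 30.0),   # medium
    (2, 75.0, 40.0),   # limited
    (1, 70.0, 100.0),  # low
]


def robustness_score(icc_percent: float, sd_percent: float) -> int:
    """Return the highest score whose ICC and SD conditions are both met.

    Assumption: the two conditions are combined with a logical AND; the
    source does not spell out the combination rule.
    """
    for score, min_icc, max_sd in ROBUSTNESS_SCALE:
        if icc_percent >= min_icc and sd_percent <= max_sd:
            return score
    return 0  # very low


print(robustness_score(95.0, 5.0))   # both conditions of level 5 hold
print(robustness_score(92.0, 25.0))  # SD too large for levels 5 and 4
```

Under this reading, a feature with a high ICC but widely spread percentage differences is downgraded to the first level whose SD bound it satisfies.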
Quantization setting dependence was evaluated by means of ICC, as indicated in Section 2.4.1. In order to consider only radiomic features with excellent robustness, the cutoff value was set to
Table 3 summarizes the total of the robust features obtained for each quantization level.
The robustness of the MRI radiomic features was evaluated using a six-level scale (going from 0 to 5) and based on a combination of ICCs and the SD of the mean percentage differences between the three raters (i.e., auto, man1, man2). Percentage differences were evaluated according to Equation (
The values obtained are illustrated, for each of the five feature categories, in the following figures (Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11).
Starting from the previously defined scale (see Table 2), an overall robustness value—defined according to Equation (
(
where:
-
-
- maxValue represents the maximum possible value (i.e., 5);
-
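Because the equation itself is not recoverable here, the following is only a plausible sketch of the overall category index: it assumes that the per-feature scores of a category are summed and normalized by the number of features times maxValue, so that the result lies between 0 and 1. The function name `overall_robustness` and the example scores are hypothetical.

```python
from typing import Sequence


def overall_robustness(scores: Sequence[int], max_value: int = 5) -> float:
    """Normalized category score in [0, 1].

    Assumption: overall index = sum(scores) / (number of features * max_value).
    """
    if not scores:
        raise ValueError("at least one feature score is required")
    if any(s < 0 or s > max_value for s in scores):
        raise ValueError("scores must lie between 0 and max_value")
    return sum(scores) / (len(scores) * max_value)


# Hypothetical per-feature scores for a category with five features.
example_scores = [0, 1, 2, 3, 4]
print(overall_robustness(example_scores))  # 10 / 25
```

A category in which every feature reaches the top score would obtain 1, while a category with uniformly very low scores would obtain 0.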
The aim of this work was to acquire further insights into the dependence of radiomic features' robustness on the quantization level used during feature extraction, and the reproducibility of features in relation to segmented ROIs of breast masses. In particular, first of all, in order to quantify the robustness and to provide a score—for each radiomic feature and for each category (i.e., GLCM, GLRLM, GLSZM, GLDM, NGTDM)—a correlation analysis based on the ICC was carried out to identify the features that are the least dependent on the level of quantization (i.e., the 'binCount' parameter). After determining the best value for 'binCount' and considering the subset containing only the robust features, an additional analysis was carried out in order to assess the reproducibility by comparing, relying upon the ICC, three different segmentations: two that were independently obtained through manual delineation by two radiologists, and one that was obtained using a validated automated segmentation approach [[
Among the GLCM features, Contrast, DifferenceAverage, DifferenceVariance, Imc1, Imc2, and MaximumProbability seem to have poor robustness with respect to the segmentation method. Even if, in the comparison of manual1 vs. manual2, their robustness index is good (most likely due to a higher agreement between the two manual segmentations), in the comparison with the automated method, Imc1 is lower. This denotes a high dependence on the ROI. The category has a medium-to-high overall robustness index (0.45–0.8). Among the GLRLM features, LongRunHighGrayLevelEmphasis obtains the worst score in the comparison between automated and manual segmentations, followed by GrayLevelNonUniformity, RunLengthNonUniformity, and RunVariance. On the other hand, the other features present a very good score in all comparisons, and the overall category score is high (0.6–0.8). Nearly all GLSZM features have a medium/low index, which gives the category a very low overall index, varying in the range 0.36–0.53, depending on the segmentation. Almost all GLDM features have a medium score, except for LargeDependenceHighGrayLevelEmphasis and SmallDependenceHighGrayLevelEmphasis, and the category shows an overall score in the range 0.42–0.67. Finally, all features belonging to the NGTDM category show very low robustness indexes and, consequently, the category obtains an overall score in the range 0.16–0.4.
Considering the dependence on the quantization level, taking into account only the intersection subset among all the values of 'binCount' could be the best selection strategy. Among radiomic categories, GLCM, GLRLM, and GLDM showed the best overall robustness against variations due to the segmentation method.
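The intersection strategy mentioned above amounts to keeping only the features that are robust for every tested 'binCount' value. A minimal sketch follows; the function name `robust_intersection` and the feature sets in the example are hypothetical and do not reproduce the study's actual lists.

```python
from typing import Dict, Set


def robust_intersection(robust_by_bin_count: Dict[int, Set[str]]) -> Set[str]:
    """Features that are robust at every quantization level."""
    if not robust_by_bin_count:
        return set()
    sets = iter(robust_by_bin_count.values())
    common = set(next(sets))  # copy, so the input is not modified
    for current in sets:
        common &= current
    return common


# Hypothetical robust-feature sets for three quantization levels.
robust = {
    8: {"glcm_Contrast", "glrlm_RunEntropy", "gldm_DependenceEntropy"},
    32: {"glcm_Contrast", "glrlm_RunEntropy", "ngtdm_Coarseness"},
    256: {"glcm_Contrast", "glrlm_RunEntropy"},
}
print(sorted(robust_intersection(robust)))
```

Selecting the intersection trades a smaller feature set for independence from any single quantization choice.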
A fair comparison against analogous literature works is not possible, because each one does not necessarily refer to the same disease and, consequently, does not use the same data. Therefore, rather than comparing results directly, we believe it is more appropriate to summarize the published results (see Table 6), in order to provide an overview of the literature and of the parameters considered when analyzing the robustness of the features.
The main focus of our work was on the development of a reliable system in terms of robust radiomic features according to the quantization settings and segmentation approaches. The adopted methods are not computationally expensive since traditional statistical analyses (mostly based on the ICC) are applied after the radiomic feature extraction. This assessment allowed us to propose a lightweight system based on classic machine learning techniques. Moreover, the type of analysis performed disregards variations in the data that could affect the stability of the system. In fact, the analysis performed on the features is computed from the segmented ROI and by quantifying the correlation (via ICC) as the quantization level and segmentation approach change.
Presently, clinical decision support systems (CDSSs) are becoming increasingly prevalent in clinical routines, and are able to assist the work of clinicians [[
Furthermore, as future developments, it would be very interesting to validate the insights gained from this study on a more extensive dataset in order to assess the repeatability of the analysis made and its ability to generalize. A pre-analysis and feature calibration phase—such as the one proposed in this study—is absolutely essential to have an initial set of non-redundant and robust features. In fact, every radiomic study consists of several phases [[
Graph: Figure 1 Examples related to a benign (on the top) and a malignant (on the bottom) breast mass. For each example, the DCE-MRI image area is shown without (on the left) and with (on the right) delineation contours. Segmentation results yielded by the automated delineation (red contour) and manual delineations—performed by the first (green contour) and the second (blue contour) radiologist—are compared. All figures are shown with a 2.5× magnification factor.
Graph: Figure 2 Dependence of the quantization level of GLCM radiomic features, considering different values (i.e., 8, 16, 32, 64, 128, and 256) of the 'binCount' PyRadiomics parameter. The features ClusterProminence, ClusterShade, ClusterTendency, Correlation, JointAverage, and SumAverage were discarded, as the ICC did not overcome the cutoff for any 'binCount' setting.
Graph: Figure 3 Dependence of the quantization level of GLRLM radiomic features, considering different values (i.e., 8, 16, 32, 64, 128, and 256) of the 'binCount' PyRadiomics parameter. The features SumSquares, GrayLevelNonUniformityNormalized, GrayLevelVariance, HighGrayLevelRunEmphasis, LongRunLowGrayLevelEmphasis, LowGrayLevelRunEmphasis, ShortRunHighGrayLevelEmphasis, and ShortRunLowGrayLevelEmphasis were discarded, as the ICC did not overcome the cutoff for any 'binCount' setting.
Graph: Figure 4 Dependence of the quantization level of GLSZM radiomic features, considering different values (i.e., 8, 16, 32, 64, 128, and 256) of the 'binCount' PyRadiomics parameter. The features GrayLevelNonUniformityNormalized, GrayLevelVariance, HighGrayLevelZoneEmphasis, LargeAreaLowGrayLevelEmphasis, LowGrayLevelZoneEmphasis, SmallAreaHighGrayLevelEmphasis, and SmallAreaLowGrayLevelEmphasis were discarded, as the ICC did not overcome the cutoff for any 'binCount' setting.
Graph: Figure 5 Dependence of the quantization level of GLDM radiomic features, considering different values (i.e., 8, 16, 32, 64, 128, and 256) of the 'binCount' PyRadiomics parameter. The features GrayLevelVariance, HighGrayLevelEmphasis, LargeDependenceLowGrayLevelEmphasis, LowGrayLevelEmphasis, and SmallDependenceLowGrayLevelEmphasis were discarded, as the ICC did not overcome the cutoff for any 'binCount' setting.
Graph: Figure 6 Dependence of the quantization level of NGTDM radiomic features, considering different values (i.e., 8, 16, 32, 64, 128, and 256) of the 'binCount' PyRadiomics parameter.
Graph: Figure 7 Robustness of the GLCM radiomic features obtained on DCE-MRI images using automated segmentation against two manual segmentations from two independent readers.
Graph: Figure 8 Robustness of the GLRLM radiomic features obtained on DCE-MRI images using automated segmentation against two manual segmentations from two independent readers.
Graph: Figure 9 Robustness of the GLSZM radiomic features obtained on DCE-MRI images using automated segmentation against two manual segmentations from two independent readers.
Graph: Figure 10 Robustness of the GLDM radiomic features obtained on DCE-MRI images using automated segmentation against two manual segmentations from two independent readers.
Graph: Figure 11 Robustness of the NGTDM radiomic features obtained on DCE-MRI images using automated segmentation against two manual segmentations from two independent readers.
Table 1 Some relevant characteristics concerning the DCE-MRI imaging protocol.
Protocol Characteristic | Value
series | Ax VIBRANT mphase
MR scanner | GE Signa HDxt
magnetic field | 1.5 Tesla
repetition time | (37.72–56.92) ms
echo time | (17.64–26.80) ms
flip angle | 10°
matrix size | pixels
slice thickness | (2–3) mm
spacing between slices | (1–1.5) mm
pixel spacing | (0.6875–0.7422) mm
Table 2 Conditions to evaluate the robustness of each radiomic feature category.
Score (Robustness) | ICC Condition | SD Condition
5 (very high) | ≥90% | ≤10%
4 (high) | ≥85% | ≤20%
3 (medium) | ≥80% | ≤30%
2 (limited) | ≥75% | ≤40%
1 (low) | ≥70% | ≤100%
0 (very low) | <70% | >100%
Table 3 Summary of robust features (ICC > 0.9) considering different quantization levels. Bold values represent the setting guaranteeing the maximum number of robust features.
Quantization Level ('binCount') | Robust Features Number | Robust Features Percentage (%)
8 | 42 | 85.7
16 | 38 | 77.6
32 | 43 | 87.8
64 | 39 | 79.6
128 | 37 | 75.5
256 | 34 | 69.4
Table 4 Overall feature category robustness across the three segmentation raters.
Category | Auto vs. Man1 | Auto vs. Man2 | Man1 vs. Man2
GLCM | 0.48 | 0.45 | 0.8
GLRLM | 0.6 | 0.63 | 0.8
GLSZM | 0.42 | 0.36 | 0.53
GLDM | 0.42 | 0.42 | 0.67
NGTDM | 0.36 | 0.16 | 0.4
Table 5 Percentage of features with robustness index
Category | Auto vs. Man1 | Auto vs. Man2 | Man1 vs. Man2
GLCM | 50% | 50% | 100%
GLRLM | 50% | 50% | 100%
GLSZM | 33.3% | 0% | 44.4%
GLDM | 33.3% | 33.3% | 100%
NGTDM | 0% | 0% | 0%
Table 6 Overview of the state-of-the-art results, along with the considered settings, to assess the robustness of radiomic features.
Related Work | Imaging, Disease (# Samples) | Dependence | Extraction Tool (# Features) | Main Findings
Shafiq-Ul-Hassan et al. [ | CT, phantom (8 by 8 CT scanners) | voxel size; gray levels | in-house tool (213) | In total, 150 features are reproducible across voxel sizes
Escudero Sanchez et al. [ | CT, liver cancer (43) | slice thickness | PyRadiomics (107) | In total, 75–90% of features are highly robust
Whitney et al. [ | MRI (DCE-MRI), breast cancer (612) | magnetic field strength | PyRadiomics (38) | In total, 5 features are robust across field strength
Scalco et al. [ | MR (T2w), prostate cancer (14) | image signal normalization | PyRadiomics (91) | In total, 60% of features have a poor reproducibility
De Farias et al. [ | CT, various lesion types (10,000 slices) + validation on NSCLC (17,938 slices) | super-resolution | PyRadiomics (75) | In total, 10 texture features have excellent robustness
Zwanenburg et al. [ | CT, NSCLC (31) + HNSCC (19) | image perturbation | N.A. (4032) | In total, 2310 (57.3%) NSCLC features are robust; 582 (14.4%) HNSCC features are robust; 454 (11.3%) features are robust in both cohorts
Mottola et al. [ | CT, RCC (98) + CK (93) | image resampling and perturbation | in-house tool (32) | In total, 94.6% and 87.7% of features achieve the best reproducibility in RCC and CK
Tixier et al. [ | MRI (FLAIR, T1w), glioblastoma (98) | segmentation method | PyRadiomics (108) | IH and GLCM features are the most robust; GLSZM features have a mixed robustness
Granzier et al. [ | MR (T1w), breast cancer (102) | inter-observer segmentation variability | RadiomiX (1328) + PyRadiomics (833) | In total, 41.6% (552/1328) and 32.8% (273/833) of all RadiomiX and PyRadiomics features, respectively, are robust
Le et al. [ | CTA, culprit lesions in carotid arteries (41) | inter-observer segmentation variability; image configurations | PyRadiomics (93) | In total, 55.9% (52/93) of features have excellent robustness; 33.3% (31/93) of features have moderate robustness; 10.8% (10/93) of features have poor robustness
This study | MRI (DCE-MRI), breast cancer (111) | quantization level; inter-observer segmentation variability | PyRadiomics (49) | In total, 87.8% (43/49) of features were robust (ICC ) with binCount = 32; GLCM and GLRLM features have high robustness; GLDM features have moderate robustness
Conceptualization, C.M. and L.R.; methodology, C.M. and L.R.; software, C.M.; validation, C.M. and L.R.; formal analysis, C.M. and L.R.; investigation, C.M.; resources, M.D., A.O. and I.D.; data curation, M.D., A.O. and I.D.; writing—original draft preparation, C.M. and L.R.; writing—review and editing, M.D., A.O., V.C. and T.V.B.; visualization, C.M. and L.R.; supervision, V.C. and T.V.B.; project administration, T.V.B. All authors have read and agreed to the published version of the manuscript.
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of "Azienda Ospedaliera Universitaria Policlinico P.Giaccone" of Palermo, Italy (protocol code n.1/2020-15/01/2020).
Retrospective data collection was approved by the Ethics Committee. The requirement for evidence of informed consent was waived because of the retrospective nature of our study.
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
CK Contralateral normal kidney
CT Computed tomography
CTA Computed tomography angiography
DCE-MRI Dynamic contrast-enhanced magnetic resonance imaging
EVK Enhancement variance kinetics
FCM Fuzzy c-means
FO First-order
GLCM Gray-level co-occurrence matrix
GLDM Gray-level dependence matrix
GLRLM Gray-level run length matrix
GLSZM Gray-level size zone matrix
HNSCC Head and neck squamous cell carcinoma
IBSI Image Biomarker Standardization Initiative
ICC Intra-class correlation coefficient
IH Intensity histogram
IVH Intensity–volume histogram
KCA Kinetic curve assessment
MR Magnetic resonance
NGTDM Neighboring gray tone difference matrix
NIfTI Neuroimaging Informatics Technology Initiative
NSCLC Non-small cell lung cancer
RCC Renal cell carcinoma
ROI Region of interest
sFCM Spatial fuzzy c-means
T1w T1 weighted
T2w T2 weighted
By Carmelo Militello; Leonardo Rundo; Mariangela Dimarco; Alessia Orlando; Ildebrando D'Angelo; Vincenzo Conti and Tommaso Vincenzo Bartolotta