Radiomics-based machine learning for glioma grade classification: a multicenter study with SHapley Additive exPlanations interpretability analysis
Highlight box
Key findings
• Using data from 905 patients at multiple centers, we built and externally validated a transparent radiomics-based extreme gradient boosting model for preoperative World Health Organization grading of gliomas on routine contrast-enhanced T1-weighted magnetic resonance imaging (MRI).
• The model showed consistently high discrimination: area under the curve 0.983 in the training set, 0.897 in the internal testing set, and 0.834 and 0.880 in two independent external cohorts, with overall accuracies of 82.3–94.3%.
• SHapley Additive exPlanations (SHAP) analysis identified 14 key radiomic features—mainly texture-heterogeneity metrics extracted with Laplacian-of-Gaussian and wavelet filters—as the principal drivers of model output.
What is known and what is new?
• Conventional MRI grading of gliomas is only moderately accurate (55–77%) and shows considerable inter-observer variability (15–25%).
• Earlier radiomics studies rarely included external validation, interpretability analyses, or testing across geographically diverse populations.
• This work delivers the first interpretable, externally validated machine-learning model that maintains performance across international cohorts and molecular subtypes.
What is the implication, and what should change now?
• The model substantially outperforms conventional imaging assessment and provides transparent, explainable predictions essential for clinical trust and regulatory approval.
• Prospective validation in real-world clinical workflows is needed to assess impact on treatment decisions, patient outcomes, and cost-effectiveness before routine clinical implementation.
Introduction
Gliomas account for approximately 80% of all malignant primary brain tumors, with an annual incidence of 6.0 per 100,000 individuals worldwide (1,2). The World Health Organization (WHO) classification system stratifies gliomas into grades I, II, III, and IV, with 5-year survival rates decreasing dramatically from 68% for grade II to 9.9% for grade IV tumors (3,4). The recent 2021 WHO classification system further integrated molecular markers, including isocitrate dehydrogenase (IDH) mutations, 1p/19q codeletion, and O6-methylguanine-DNA methyltransferase (MGMT) methylation status, which are crucial for accurate diagnosis and treatment planning (5,6).
Accurate preoperative grading is fundamental for treatment planning, as it directly influences the surgical approach, adjuvant therapy selection, and patient counseling. Grade IV glioblastomas require aggressive multimodal therapy, including maximal safe resection, concurrent chemoradiotherapy, and maintenance temozolomide, while grade II tumors may be managed with observation or limited intervention (7,8). The clinical impact of accurate preoperative grading is profound; it can reduce unnecessary aggressive interventions by 25% in low-grade cases and prevent treatment delays in 40% of high-grade cases (9).
However, conventional magnetic resonance imaging (MRI) assessment faces significant limitations in distinguishing among WHO grades, particularly between grade II and III gliomas, where diagnostic accuracy ranges from 55% to 77% (10,11). Traditional imaging features such as contrast enhancement, necrosis, and peritumoral edema demonstrate substantial overlap between grades, leading to diagnostic uncertainty in up to 30% of cases (12,13). Moreover, inter-observer variability in radiological interpretation ranges from 15% to 25%, affecting treatment decision consistency (14).
Model interpretability is crucial in clinical artificial intelligence. First, clinicians need to understand the basis of predictions to use them confidently in treatment decisions. Second, regulatory frameworks prioritize explainability: agencies such as the Food and Drug Administration (FDA) require interpretable models for medical device approval, especially in high-risk diagnostic or therapeutic settings. Third, transparency helps clinicians detect potential algorithmic biases and recognize cases in which predictions may be unreliable.
Several prediction models for glioma grading have been developed; however, these models have significant limitations. For example, most rely on single-center cohorts with limited sample sizes (n<500), lack external validation across different populations, or employ traditional statistical methods that cannot capture complex non-linear relationships (15,16). Critically, existing models lack interpretability, presenting a “black-box” problem that hinders their clinical adoption and regulatory approval, as evidenced by an implementation rate of <5% in routine clinical practice (17,18).
Radiomics, an emerging field that involves the extraction of quantitative features from medical images, has shown promise in improving diagnostic accuracy and treatment planning in oncology (19,20). By analyzing texture, shape, and first-order statistical features invisible to the human eye, radiomics can provide objective and reproducible biomarkers for tumor characterization (21,22). Recent advances in multiparametric MRI radiomics, which extend beyond traditional morphological assessment to capture underlying biological heterogeneity, have shown significant potential in predicting treatment response and prognosis across different glioma subtypes (23).
Machine learning (ML) algorithms can effectively integrate multiple radiomics features to build robust predictive models (24). The SHapley Additive exPlanation (SHAP) method addresses the interpretability limitation by providing both global and local explanations for model predictions (25). Recent radiomics studies support the clinical adoption of interpretable ML models, which have been shown to demonstrate superior performance in tumor differentiation while maintaining the model transparency crucial for clinical decision-making (26).
Therefore, robust, interpretable prediction models that provide accurate, individualized glioma grading are urgently needed to optimize treatment strategies and improve patient outcomes. This study aimed to: (I) extract comprehensive radiomics features from preoperative T1-weighted contrast-enhanced MRI; (II) develop and compare multiple ML models; (III) validate model performance across multiple independent cohorts spanning diverse geographic regions and ethnic groups; and (IV) provide model interpretability using a SHAP analysis. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/rc).
Methods
Study design and ethical considerations
This multicenter retrospective study developed and validated an interpretable ML model for automated WHO grading of gliomas using preoperative contrast-enhanced T1-weighted MRI radiomics features. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and approved by the Institutional Review Board of Nantong University (No. 2024-L215). Given the retrospective nature and use of de-identified data, the requirement for informed consent was waived.
Study population
Between 2005 and 2024, we enrolled 905 patients with histopathologically confirmed gliomas from three independent institutions: The Cancer Genome Atlas (TCGA) cohort (n=329), University of California San Francisco (UCSF) cohort (n=482), and Nantong University Affiliated Hospital (NTUA) cohort (n=94). Patients were included in the study if they met the following inclusion criteria: (I) had a histopathologically confirmed diagnosis of glioma according to the WHO classification; (II) had preoperative contrast-enhanced T1-weighted MRI scans available; (III) had complete WHO grading information (grade II, III, or IV) available; and (IV) had molecular marker information, including IDH mutation status, 1p/19q codeletion, and MGMT methylation status, available. Patients were excluded if they met any of the following exclusion criteria: (I) had missing preoperative contrast-enhanced T1-weighted MRI images; (II) had no WHO grading information; and/or (III) had poor-quality images precluding reliable radiomics analysis.
Primary outcome and sample size determination
The primary outcome was glioma grade, dichotomized into low-grade gliomas (LGGs; WHO grade II) and high-grade gliomas (HGGs; WHO grades III–IV), as determined by postoperative histopathological diagnosis. To ensure methodological rigor, outcome labels remained blinded during feature processing and model training, guaranteeing independence between predictor handling and outcome assessment.
Sample size calculation followed the “10 events per variable” rule for ML models. Assuming 14 final radiomics features and an estimated 60% HGG prevalence, a minimum of approximately 234 cases was required for robust model development (14 features × 10 events per feature ÷ 0.60 ≈ 234). To ensure adequate statistical power for external validation and subgroup analyses, we targeted enrollment of at least 400 cases across external testing cohorts, ultimately exceeding this threshold with 576 external validation cases.
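The events-per-variable arithmetic can be sanity-checked in a few lines of Python (a minimal sketch; the function name is ours, and a rounding guard is needed because 0.60 is not exactly representable in binary floating point):

```python
import math

def min_sample_size(n_features: int, events_per_variable: int, prevalence: float) -> int:
    """Minimum cohort size under the 'events per variable' (EPV) rule:
    the outcome events must number at least EPV x n_features, so the
    total cohort is events / prevalence (rounded up). The round() call
    guards against float artifacts such as 120/0.6 -> 200.00000000000003."""
    required_events = n_features * events_per_variable
    return math.ceil(round(required_events / prevalence, 6))

print(min_sample_size(12, 10, 0.60))  # 120 events / 0.60 -> 200 cases
print(min_sample_size(14, 10, 0.60))  # 140 events / 0.60 -> 234 cases
```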
Patient and public involvement
Patients and public representatives were not directly involved in the study design, conduct, or reporting phases due to: (I) the retrospective nature utilizing existing de-identified datasets; (II) the technical complexity of radiomics methodology requiring specialized expertise; and (III) institutional constraints during the study period. However, we acknowledge this limitation and plan to incorporate patient advisory groups in future prospective validation studies to guide clinical implementation strategies, ensure patient-centered outcome measures, and facilitate community dissemination of findings.
Clinical variables and molecular markers
The following clinical data were extracted from electronic medical records: age at diagnosis, gender, WHO grade (II, III, or IV), tumor location, and surgical extent. The analyzed molecular markers included IDH1/2 mutation status (wild-type vs. mutant), 1p/19q codeletion status (intact vs. codeleted), and MGMT promoter methylation status (unmethylated vs. methylated). These molecular markers were determined using standard protocols including immunohistochemistry, fluorescence in situ hybridization, and pyrosequencing as appropriate (27).
Image acquisition and preprocessing
All the patients underwent preoperative MRI, including contrast-enhanced T1-weighted sequences. The images were converted to Neuroimaging Informatics Technology Initiative (NIfTI) format and underwent a standardized preprocessing pipeline, including: (I) N4 bias-field correction to eliminate intensity inhomogeneity; (II) resampling to isotropic 1×1×1 mm3 voxel spacing using trilinear interpolation; and (III) intensity normalization using Z-score standardization.
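Step (III) of the pipeline can be shown in isolation. The sketch below is illustrative only (real preprocessing operates on 3D NIfTI volumes, typically via tools such as SimpleITK or ANTs); it applies Z-score standardization to a flat list of voxel intensities:

```python
import statistics

def zscore_normalize(voxels):
    """Z-score intensity standardization (preprocessing step III):
    shift intensities to zero mean and scale to unit standard deviation."""
    mu = statistics.fmean(voxels)
    sigma = statistics.pstdev(voxels)
    return [(v - mu) / sigma for v in voxels]

# Toy 1D intensity profile; real inputs are full 3D volumes
normed = zscore_normalize([100.0, 120.0, 140.0, 160.0, 180.0])
print([round(v, 3) for v in normed])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After normalization the list has mean 0 and unit standard deviation, making intensities comparable across scanners and acquisition protocols.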
Data quality control: all the imaging data underwent systematic quality assessment, including: (I) motion artifact evaluation using automated detection algorithms; (II) contrast injection adequacy verification based on an enhancement ratio >150% in normal vascular structures; (III) image completeness confirmation to ensure full brain coverage; and (IV) standardized anonymization using protocols to maintain patient privacy while preserving essential clinical variables. Tumor regions of interest (ROIs) were manually delineated on contrast-enhanced T1-weighted images by two experienced neuroradiologists (each with >10 years of experience) who were blinded to the histopathological results. For quality control, a subset of 100 cases underwent independent segmentation by both radiologists, achieving excellent inter-rater reliability (intraclass correlation coefficient >0.90).
Radiomics feature extraction
Radiomics features were extracted using the PyRadiomics package (version 3.0.1) following Image Biomarker Standardization Initiative (IBSI) guidelines (28). A total of 1,197 quantitative features were extracted from each tumor ROI:
• Shape features (N=14): three-dimensional (3D) geometric properties, including the volume, surface area, sphericity, and compactness;
• First-order features (N=17): statistical descriptors of the voxel intensity distribution, including the mean, median, standard deviation, skewness, and kurtosis;
• Second-order texture features (N=72): gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), gray-level dependence matrix (GLDM), and neighboring gray-tone difference matrix (NGTDM) features;
• Wavelet features (N=728): features extracted from eight wavelet decompositions;
• Laplacian of Gaussian (LoG) features (N=364): features extracted using LoG filters with sigma values ranging from 1.0 to 5.0 mm;
• Square and square root filtered features (N=2): mathematical transformations capturing different intensity ranges.
Feature selection
Feature selection was performed using recursive feature elimination (RFE) with 10-fold cross-validation repeated 10 times to ensure robust selection. The RFE algorithm systematically removes the features with the lowest importance rankings until optimal performance is achieved, as measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Highly correlated features (Pearson correlation coefficient >0.9) were identified, and of each correlated pair, the feature with the higher clinical relevance was retained.
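The correlation-pruning step can be sketched as follows. This is a simplified stand-in: it keeps the first feature encountered rather than ranking by clinical relevance as the study did, and the feature names and values are hypothetical:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def prune_correlated(features, threshold=0.9):
    """Greedy filter: walk features in order and drop any feature whose
    |r| with an already-kept feature exceeds the threshold.
    `features` maps feature name -> list of values across patients."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

feats = {
    "glcm_contrast":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "glcm_contrast_2":  [2.1, 4.0, 6.2, 8.1, 10.0],  # near-duplicate, |r| ~ 1
    "shape_sphericity": [0.9, 0.1, 0.7, 0.2, 0.5],
}
print(prune_correlated(feats, threshold=0.9))  # ['glcm_contrast', 'shape_sphericity']
```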
ML model development
The performance of 10 ML algorithms in glioma grading prediction was compared: AdaBoost (AB), extra trees (ET), stochastic gradient boosting (SGBT), logistic regression (LR), light gradient boosting machine (LightGBM), neural network (NNET), Naive Bayes (NB), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Hyperparameter optimization was performed using 10-fold cross-validation, repeated 10 times, combined with a grid search across predefined parameter spaces. The optimal hyperparameters were selected based on the maximum AUC performance during cross-validation.
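The grid search described above can be sketched generically. In the study it wrapped repeated 10-fold cross-validation; the toy scorer below merely stands in for that step (the parameter names mimic XGBoost's, but the scoring function is fabricated for illustration):

```python
import itertools

def grid_search(param_grid, cv_score):
    """Exhaustive grid search: evaluate every parameter combination with a
    cross-validation scorer and keep the one with the highest mean AUC."""
    best_params, best_auc = None, -1.0
    keys = list(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = cv_score(params)
        if score > best_auc:
            best_params, best_auc = params, score
    return best_params, best_auc

# Toy scorer standing in for repeated 10-fold cross-validation
def toy_cv_score(params):
    return 0.9 - 0.01 * abs(params["max_depth"] - 4) - 0.1 * abs(params["eta"] - 0.1)

grid = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}
print(grid_search(grid, toy_cv_score))  # ({'max_depth': 4, 'eta': 0.1}, 0.9)
```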
Model performance comparison
Model performance was comprehensively evaluated based on the following performance metrics: (I) discrimination: The AUC of the ROC curve with 95% confidence intervals (CIs) was calculated using 1,000 bootstrap resamples, while the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1-score were calculated at optimal probability thresholds determined by Youden’s index; (II) calibration: calibration plots comparing predicted probabilities to observed frequencies across deciles of predicted risk were assessed, while the Hosmer-Lemeshow test was used to evaluate goodness-of-fit, with P>0.05 indicating adequate calibration; (III) clinical utility: a decision curve analysis (DCA) was conducted to quantify the net benefit across different probability thresholds compared to “treat all” and “treat none” strategies; and (IV) statistical comparison: DeLong’s test was used to compare the AUC values between models, while McNemar’s test was used to evaluate differences in classification accuracy for paired models.
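The threshold-selection step in (I) can be illustrated with a small, self-contained sketch of Youden's index (toy labels and probabilities, not study data):

```python
def youden_threshold(y_true, y_prob):
    """Scan every observed probability as a candidate cut-off and return
    the one maximizing Youden's J = sensitivity + specificity - 1."""
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

y     = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9]
t, j = youden_threshold(y, probs)
print(t, round(j, 3))  # 0.4 0.75
```

Sensitivity, specificity, PPV, NPV, accuracy, and the F1-score are then all read off the confusion matrix at the chosen cut-off.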
Fairness assessment
To evaluate algorithmic fairness, we assessed model performance across key demographic and clinical subgroups including gender (male vs. female), age groups (≤60 vs. >60 years), and molecular marker status (IDH wild-type vs. mutant). Performance metrics including AUC, sensitivity, and specificity were calculated separately for each subgroup. Statistical comparisons between subgroups were performed using DeLong’s test for AUC and McNemar’s test for sensitivity/specificity. P values <0.05 indicated significant performance differences, suggesting potential algorithmic bias.
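The subgroup evaluation reduces to computing a discrimination metric per stratum. A minimal sketch, using the Mann-Whitney formulation of the AUC (group labels and scores below are invented for illustration):

```python
def auc(y_true, y_score):
    """AUC as the Mann-Whitney probability that a random positive case
    is scored above a random negative case (ties count one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auc(records):
    """records: (group, label, score) tuples; returns the AUC per group."""
    groups = {}
    for g, y, s in records:
        groups.setdefault(g, ([], []))
        groups[g][0].append(y)
        groups[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in groups.items()}

data = [
    ("male", 1, 0.9), ("male", 0, 0.2), ("male", 1, 0.7), ("male", 0, 0.4),
    ("female", 1, 0.8), ("female", 0, 0.3), ("female", 1, 0.45), ("female", 0, 0.5),
]
print(subgroup_auc(data))  # {'male': 1.0, 'female': 0.75}
```

A formal bias assessment then compares these subgroup AUCs statistically, e.g., with DeLong's test as described above.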
Model interpretability analysis
The optimal model underwent comprehensive interpretability analysis using SHAP, a game theory-based approach providing both global and local explanations (29). SHAP values quantify each feature’s contribution to individual predictions, enabling global importance ranking, feature impact visualization, and individual case analysis.
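The global-importance step reduces to averaging absolute SHAP values per feature. In practice the `shap` library computes the value matrix from the fitted model; here the matrix is hard-coded with hypothetical numbers and abbreviated feature names:

```python
def global_importance(shap_values, feature_names):
    """Global ranking by mean absolute SHAP value across patients.
    shap_values: one row per patient, one column per feature."""
    n = len(shap_values)
    means = [sum(abs(row[j]) for row in shap_values) / n
             for j in range(len(feature_names))]
    return sorted(zip(feature_names, means), key=lambda x: -x[1])

# Hypothetical SHAP matrix: three features over four patients
names = ["log_sigma_5_0_RMS", "wavelet_LLH_LAHGLE", "shape_Sphericity"]
vals = [
    [ 1.2, -0.1, 0.05],
    [-0.9,  0.2, 0.00],
    [ 1.1, -0.2, 0.10],
    [-1.0,  0.1, 0.05],
]
for name, m in global_importance(vals, names):
    print(name, round(m, 3))
```

Because SHAP values are signed, the per-patient rows also drive the local waterfall and force plots, while the mean of their absolute values yields the global bar ranking.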
Statistical analysis
The continuous variables are expressed as the mean ± standard deviation or the median (interquartile range), depending on distribution normality as assessed using the Shapiro-Wilk test. The categorical variables are presented as the frequency and percentage. Inter-group comparisons for continuous variables were performed using the independent t-test for normally distributed data or the Mann-Whitney U test for non-normally distributed data. The Chi-squared test or Fisher’s exact test was used for the categorical variables as appropriate.
Feature selection was conducted using a multi-step process. Initially, a Pearson correlation analysis was conducted to identify and remove highly correlated features (correlation coefficient |r| >0.9) to mitigate multicollinearity. Subsequently, a univariate analysis using an independent t-test was applied to select features with significant differences between glioma grades (P<0.05). Finally, RFE with cross-validation was implemented using the XGBoost algorithm to rank and select the optimal subset of features based on their contribution to model performance, minimizing overfitting while maximizing discriminative power. Predictions for patients in the validation cohorts were obtained by applying the final trained models to independent validation data. Outcome labels were withheld, and no information from the validation sets was used during model training or feature selection.
Model performance was evaluated using multiple metrics, including the AUC of the ROC curve with the 95% CI calculated using DeLong’s method, accuracy, sensitivity, specificity, PPV, NPV, and F1 score. Model calibration was assessed using calibration curves and the Hosmer-Lemeshow goodness-of-fit test, with P>0.05 indicating adequate calibration. Clinical utility was quantified by a DCA, comparing net benefit across threshold probabilities. For interpretability, SHAP values were computed to quantify individual feature contributions to predictions, with the mean absolute SHAP values used for global feature importance ranking. Local explanations were visualized through waterfall and dependence plots. Statistical significance was set at P<0.05 (two-tailed). All analyses were performed using Python 3.12 with libraries including scikit-learn (version 1.4.0), XGBoost (version 2.0.3), and SHAP (version 0.44.0).
Results
Baseline clinical characteristics
A total of 905 glioma patients from three independent cohorts were included in this study (Figure 1). The TCGA cohort was divided into a training set (n=230, 70%) and an internal testing set (n=99, 30%), while the UCSF (n=482) and NTUA (n=94) cohorts served as two independent external testing sets for assessing model generalizability across diverse populations and institutional practices.
The demographic and clinicopathological characteristics of all the patients are detailed in Tables 1,2. Several baseline characteristics differed between the training and external validation cohorts: the WHO grade distribution differed significantly (P=0.001), while gender showed a non-significant trend toward imbalance (P=0.07). Specifically, external testing set 2 (NTUA) demonstrated a higher male predominance (73.4% vs. 58.3% in the training set, P=0.02) and a greater proportion of grade IV tumors (68.1% vs. 36.5%, P=0.041). IDH wild-type status was also more prevalent in the external cohorts (86.2% in the NTUA set vs. 48.3% in the training set, P<0.001), reflecting real-world population heterogeneity and strengthening the rigor of the external validation. Table 2 highlights significant differences in age and molecular marker profiles across different WHO grades (all P<0.001).
Table 1
| Characteristic | Total (n=905) | Training set (n=230) | Internal testing set (n=99) | External testing set 1 (n=482) | External testing set 2 (n=94) | P value |
|---|---|---|---|---|---|---|
| Age (years) | 56.0 (45.0–66.0) | 54.0 (39.0–64.0) | 55.0 (41.5–64.0) | 59.0 (47.0–68.0) | 55.1±12.6 | 0.002 |
| Gender | 0.07 | |||||
| Male | 550 (60.8) | 134 (58.3) | 58 (58.6) | 289 (60.0) | 69 (73.4) | |
| Female | 355 (39.2) | 96 (41.7) | 41 (41.4) | 193 (40.0) | 25 (26.6) | |
| WHO grade | 0.001 | |||||
| II | 273 (30.2) | 79 (34.3) | 32 (32.3) | 150 (31.1) | 12 (12.8) | |
| III | 234 (25.9) | 67 (29.1) | 25 (25.3) | 124 (25.7) | 18 (19.1) | |
| IV | 398 (44.0) | 84 (36.5) | 42 (42.4) | 208 (43.2) | 64 (68.1) | |
| IDH mutation status | <0.001 | |||||
| Mutant | 384 (42.4) | 119 (51.7) | 42 (42.4) | 210 (43.6) | 13 (13.8) | |
| Wild-type | 521 (57.6) | 111 (48.3) | 57 (57.6) | 272 (56.4) | 81 (86.2) | |
| 1p/19q codeletion status | <0.001 | |||||
| Intact | 667 (73.7) | 158 (68.7) | 72 (72.7) | 360 (74.7) | 77 (81.9) | |
| Codeletion | 238 (26.3) | 72 (31.3) | 27 (27.3) | 122 (25.3) | 17 (18.1) | |
| MGMT methylation status | 0.04 | |||||
| Methylated | 451 (49.8) | 125 (54.3) | 48 (48.5) | 243 (50.4) | 35 (37.2) | |
| Unmethylated | 454 (50.2) | 105 (45.7) | 51 (51.5) | 239 (49.6) | 59 (62.8) | |
| Histology | <0.001 | |||||
| Astrocytoma | 470 (52.0) | 108 (47.0) | 52 (52.5) | 253 (52.5) | 57 (60.6) | |
| Oligodendroglioma | 238 (26.3) | 72 (31.3) | 27 (27.3) | 122 (25.3) | 17 (18.1) | |
| Oligoastrocytoma | 146 (16.1) | 39 (17.0) | 15 (15.2) | 88 (18.3) | 4 (4.3) | |
| Glioblastoma | 39 (4.3) | 9 (3.9) | 4 (4.0) | 15 (3.1) | 11 (11.7) | |
| Other | 12 (1.3) | 2 (0.9) | 1 (1.0) | 4 (0.8) | 5 (5.3) |
Data are presented as mean ± standard deviation for the normally distributed continuous variables, median (interquartile range) for the non-normally distributed continuous variables, and number (percentage) for the categorical variables. P values were calculated using the Kruskal-Wallis test for the continuous variables, and the Chi-squared test or Fisher’s exact test for the categorical variables as appropriate. IDH, isocitrate dehydrogenase; MGMT, O6-methylguanine-DNA methyltransferase; WHO, World Health Organization.
Table 2
| Characteristic | Overall (N=905) | Grade II (N=136) | Grade III (N=126) | Grade IV (N=643) | P value† |
|---|---|---|---|---|---|
| Age (years) | |||||
| Mean ± standard deviation | 55.1±15.4 | 40.7±12.9 | 47.0±13.7 | 59.8±13.6 | <0.001 |
| Median (Q1–Q3) | 57.0 (45.0–66.0) | 38.5 (30.0–49.5) | 47.5 (36.0–58.0) | 61.0 (52.0–69.0) | |
| Gender | 0.13 | ||||
| Male | 550 (60.8) | 72 (52.9) | 78 (61.9) | 400 (62.2) | |
| Female | 355 (39.2) | 64 (47.1) | 48 (38.1) | 243 (37.8) | |
| IDH mutation status | <0.001 | ||||
| Wild-type | 553 (61.1) | 16 (11.8) | 41 (32.5) | 496 (77.1) | |
| Mutated | 257 (28.4) | 119 (87.5) | 85 (67.5) | 53 (8.2) | |
| Unknown | 95 (10.5) | 1 (0.7) | 0 (0.0) | 94 (14.6) | |
| 1p/19q codeletion status | <0.001 | ||||
| Non-codeleted | 134 (14.8) | 52 (38.2) | 51 (40.5) | 31 (4.8) | |
| Codeleted | 427 (47.2) | 62 (45.6) | 60 (47.6) | 305 (47.4) | |
| Unknown | 344 (38.0) | 22 (16.2) | 15 (11.9) | 307 (47.7) | |
| MGMT promoter methylation | <0.001 | ||||
| Unmethylated | 196 (21.7) | 13 (9.6) | 14 (11.1) | 169 (26.3) | |
| Methylated | 457 (50.5) | 5 (3.7) | 16 (12.7) | 436 (67.8) | |
| Unknown | 252 (27.8) | 118 (86.8) | 96 (76.2) | 38 (5.9) | |
| Histological subtype | <0.001 | ||||
| Anaplastic astrocytoma | 33 (3.6) | 0 (0.0) | 33 (26.2) | 0 (0.0) | |
| Anaplastic oligoastrocytoma | 10 (1.1) | 0 (0.0) | 10 (7.9) | 0 (0.0) | |
| Anaplastic oligodendroglioma | 27 (3.0) | 0 (0.0) | 27 (21.4) | 0 (0.0) | |
| Astrocytoma | 16 (1.8) | 16 (11.8) | 0 (0.0) | 0 (0.0) | |
| Oligoastrocytoma | 14 (1.5) | 14 (10.3) | 0 (0.0) | 0 (0.0) | |
| Oligodendroglioma | 39 (4.3) | 39 (28.7) | 0 (0.0) | 0 (0.0) | |
| Unknown | 766 (84.6) | 67 (49.3) | 56 (44.4) | 643 (100.0) | |
| Dataset allocation | <0.001 | ||||
| Training set | 230 (25.4) | 50 (36.8) | 52 (41.3) | 128 (19.9) | |
| Internal testing set | 99 (10.9) | 19 (14.0) | 18 (14.3) | 62 (9.6) | |
| External testing set 1 | 482 (53.3) | 45 (33.1) | 41 (32.5) | 396 (61.6) | |
| External testing set 2 | 94 (10.4) | 22 (16.2) | 15 (11.9) | 57 (8.9) |
Data are presented as n (%), unless otherwise specified. †, one-way analysis of means for the continuous variable; Pearson’s Chi-squared test for the categorical variables. IDH, isocitrate dehydrogenase; MGMT, O6-methylguanine-DNA methyltransferase; WHO, World Health Organization.
Feature selection and model development
From an initial pool of 1,197 extracted radiomic features, univariate t-tests identified 903 features significantly associated with glioma grade (P<0.05). These candidate features were then subjected to RFE with 10-fold cross-validation repeated 10 times to ensure robust feature selection. This process systematically removed features with the lowest importance rankings until an optimal subset was obtained. Ultimately, 14 optimal radiomics features were retained for model development, demonstrating the highest discriminative power for glioma grading with minimal redundancy (inter-feature correlation <0.85). The selected features included LoG filtered features (n=3), wavelet-transformed texture features (n=10), and shape-based features (n=1).
The performance of 10 ML algorithms (AB, ET, SGBT, LR, LightGBM, NNET, NB, RF, SVM, and XGBoost) in glioma grading prediction was comprehensively evaluated. Hyperparameter optimization was performed using 10-fold cross-validation repeated 10 times combined with a grid search. The internal validation results, as illustrated in Figure 2, revealed that the tree-based ensemble methods, particularly XGBoost, RF, and ET, demonstrated superior and consistent performance across all evaluation metrics, while traditional linear and distance-based algorithms showed an inferior discrimination capacity.
Model performance, external validation and model equity
The comprehensive performance assessment in the cross-validation analysis demonstrated that XGBoost achieved superior discrimination with an AUC of 0.943 (95% CI: 0.926–0.960), a sensitivity of 0.876 (95% CI: 0.834–0.918), and a specificity of 0.892 (95% CI: 0.851–0.933).
The ROC curve analysis across all validation cohorts confirmed the robust performance of the XGBoost model (Figure 3A-3D). In the training set (Figure 3A), XGBoost achieved excellent discrimination (AUC =0.983, 95% CI: 0.968–0.996), followed closely by AB (AUC =0.970) and RF (AUC =0.963). In the internal testing set (Figure 3B), XGBoost maintained superior performance (AUC =0.897, 95% CI: 0.836–0.956), demonstrating good generalizability with a performance drop of less than 6%, suggesting minimal overfitting.
Crucially, in the external validation, XGBoost maintained clinically acceptable performance despite dataset heterogeneity. It achieved an AUC of 0.834 (95% CI: 0.766–0.883) in external testing set 1 (UCSF, Figure 3C) and an AUC of 0.880 (95% CI: 0.771–0.974) in external testing set 2 (NTUA, Figure 3D). Despite the challenging validation scenario in external testing set 2, which had a higher proportion of grade IV tumors (68.1% vs. 36.5% in training), the model remained reliable, thereby validating its robustness across different patient populations and institutional practices.

To evaluate potential algorithmic bias and ensure equitable performance across diverse patient populations, we conducted comprehensive subgroup analyses stratified by demographic and molecular characteristics (Figure S1). The XGBoost model demonstrated remarkable consistency across demographic subgroups. No significant performance differences were observed between male (n=550; AUC 0.902, 95% CI: 0.862–0.934) and female patients (n=355; AUC 0.891, 95% CI: 0.843–0.930; P=0.70, DeLong’s test), nor between younger (<60 years; n=476; AUC 0.877, 95% CI: 0.841–0.912) and older patients (≥60 years; n=429; AUC 0.888, 95% CI: 0.819–0.943; P=0.63). Regarding molecular markers, the model maintained robust performance across MGMT methylation status (methylated: n=457, AUC 0.821, 95% CI: 0.633–0.963 vs. unmethylated: n=196, AUC 0.843, 95% CI: 0.673–0.972; P=0.81). However, performance variations were observed in molecular subgroups that define distinct biological entities. The model achieved superior discrimination in IDH-mutant tumors (n=257; AUC 0.852, 95% CI: 0.805–0.896) compared with IDH wild-type cases (n=553; AUC 0.745, 95% CI: 0.598–0.881; P=0.02). Similarly, enhanced performance was noted in 1p/19q non-codeleted cases (n=512; AUC 0.888, 95% CI: 0.856–0.919) versus codeleted tumors (n=49; AUC 0.728, 95% CI: 0.581–0.868; P=0.045).
These findings reveal two critical insights: first, the minimal performance variation across demographic characteristics (AUC difference <0.02) confirms the absence of algorithmic bias, supporting equitable clinical implementation. Second, the observed performance differences in molecular subgroups likely reflect the inherent biological heterogeneity of gliomas rather than model limitations, as IDH-mutant and 1p/19q-codeleted tumors represent biologically distinct entities with different imaging phenotypes and clinical behaviors.
Taken together, these subgroup results (Figure S1) confirm the model’s demographic-independent reliability and its sustained, albeit variable, discrimination across MGMT methylation, IDH mutation, and 1p/19q codeletion strata, supporting its consideration for routine clinical deployment in heterogeneous patient populations.
Model calibration and clinical utility
The calibration curves across all datasets showed excellent agreement between the predicted probabilities and observed frequencies (Figure 4A-4D). The XGBoost model demonstrated excellent calibration in the training set (Figure 4A), internal testing set (Figure 4B), external testing set 1 (Figure 4C), and external testing set 2 (Figure 4D). The Hosmer-Lemeshow test consistently yielded P values >0.05 for all cohorts, indicating adequate goodness-of-fit.
The DCA further confirmed the clinical utility of the XGBoost model (Figure 4E-4H). It provided a superior net benefit compared to the “treat all” or “treat none” strategies across threshold probabilities of 0.2–0.8 in all the datasets. The model achieved a maximum net benefit of 0.42 at a threshold probability of 0.45 (Figure 4E). The parallel coordinate plots of the evaluation metrics (Figure 4I-4L) visually summarized XGBoost’s consistent superiority in terms of accuracy, sensitivity, specificity, PPV, NPV, and F1 score.
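The net-benefit quantity underlying the DCA can be computed directly from a confusion matrix at each threshold probability. A minimal sketch (toy labels and probabilities, not the study cohort):

```python
def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at threshold probability pt:
    NB = TP/N - FP/N * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """'Treat all' reference strategy: every patient classified positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

y = [1, 1, 1, 0, 0, 0, 1, 0]
p = [0.9, 0.8, 0.6, 0.55, 0.3, 0.2, 0.7, 0.1]
print(round(net_benefit(y, p, 0.5), 3), round(net_benefit_treat_all(y, 0.5), 3))  # 0.375 0.0
```

Sweeping the threshold over a grid and plotting the model's net benefit against the "treat all" and "treat none" (NB = 0) references reproduces the decision curves in Figure 4E-4H.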
Heatmap analysis of model variables
A heatmap was generated to visually represent the XGBoost model’s performance in predicting glioma grade (Figure 5). This visualization displayed the distributions of the actual values of various predictive factors (e.g., age, WHO grade, and IDH mutation status) across different patients, alongside the XGBoost predicted value and the actual outcome. The heatmap revealed the model’s ability to differentiate between LGG and HGG patients, demonstrating consistent performance across training and testing sets. This showed the model’s good prediction accuracy and generalizability, suggesting its utility for early clinical identification of high-risk glioma patients.
Model interpretability analysis
To elucidate the “black-box” nature of the ML model and enhance clinical trust, we employed the SHAP method to provide both global and local interpretability for the optimal XGBoost model’s predictions (Figure 6). Global feature importance was quantified using the mean absolute SHAP values, revealing the key drivers of the model’s overall predictions. The SHAP summary bar plot (Figure 6A,6B) identified log_sigma_5_0_mm_3D_firstorder_RootMeanSquared as the most influential predictor (mean |SHAP| =1.066), followed by log_sigma_5_0_mm_3D_glszm_SmallAreaLowGrayLevelEmphasis (mean |SHAP| =0.235) and wavelet_LLH_glszm_LargeAreaHighGrayLevelEmphasis (mean |SHAP| =0.133). These findings underscore the model’s reliance on quantitative measures of spatial heterogeneity and textural complexity, which are known to be correlated with tumor biology.
At the individual patient level, the SHAP analysis provided transparent, case-specific explanations crucial for clinical decision-making. The waterfall plot (Figure 6C) detailed how each feature contributed to a single patient’s prediction. For instance, in a correctly classified LGG case, the log_sigma_5_0_mm_3D_firstorder_RootMeanSquared feature contributed a SHAP value of −1.14, thereby lowering the model’s output probability, while the wavelet_LHH_firstorder_Skewness feature contributed positively (+0.15), demonstrating the model’s ability to integrate competing evidence. This process was further visualized by a force plot (Figure 6D), which showed the cumulative feature contributions pushing the prediction from a baseline to a final output. Finally, dependence plots (Figure 6E) revealed the nuanced, often non-linear relationships between individual feature values and their impact on the model’s output, identifying specific thresholds where a feature’s influence shifts toward predicting an LGG tumor.
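The waterfall and force plots both visualize SHAP’s additivity property: the model output for a patient equals a baseline (the expected output) plus the sum of all per-feature contributions. The sketch below illustrates this with hypothetical numbers loosely modeled on the LGG case described above; the baseline and the aggregated “other features” term are assumptions for illustration.

```python
import math

# Hypothetical local explanation for one patient, on the log-odds scale.
base_value = 0.40                 # assumed expected model output (baseline)
contributions = {
    "log_sigma_5_0_mm_3D_firstorder_RootMeanSquared": -1.14,  # pushes toward LGG
    "wavelet_LHH_firstorder_Skewness": 0.15,                  # pushes toward HGG
    "other_features_combined": -0.21,                         # assumed remainder
}

# Additivity: prediction = base value + sum of SHAP contributions.
logit = base_value + sum(contributions.values())
probability = 1 / (1 + math.exp(-logit))   # log-odds -> predicted HGG probability
```

Because the negative contributions outweigh the positive one, the final probability falls below 0.5, i.e., the model classifies the case as LGG, which is exactly the cumulative “push” a force plot renders graphically.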
Discussion
This multicenter study successfully developed and validated an interpretable radiomics-based ML model for the automated WHO grading of gliomas. Our XGBoost model showed robust performance across multiple independent cohorts, achieving excellent discrimination with AUCs ranging from 0.834 to 0.983, good calibration, and significant clinical utility. The integration of the SHAP analysis provides unprecedented interpretability for clinical decision-making.
Our model’s performance (AUC =0.834–0.983) compares favorably to that of recent studies while addressing their critical limitations. Cho et al. developed a model with an AUC of 0.89 using 285 patients, but it lacked external validation (30). Sun et al.’s model achieved similar internal performance (AUC =0.91) but was not validated across multiple centers (31). Consistent with our findings, recent studies have demonstrated the value of radiomics-based ML in predicting molecular markers and survival outcomes in HGG patients, achieving comparable performance metrics while emphasizing the importance of model interpretability for clinical translation (32). Sudre et al. developed a dynamic susceptibility contrast (DSC)-MRI radiomics model with an AUC of 0.85, but it was limited to single-sequence analysis (33). Importantly, our study addressed the critical gap of external validation across geographically diverse populations, maintaining model performance across three independent cohorts representing different healthcare systems and imaging protocols. Unlike previous studies that focused primarily on discrimination metrics, we comprehensively evaluated calibration, clinical utility, and interpretability. Our DCA showed a clinically meaningful net benefit, while our SHAP analysis provides the first detailed explanation of feature contributions in glioma grading models, addressing the “black-box” criticism that has limited the clinical application of such models (34).
The predominance of LoG and wavelet-transformed features in our model reflects underlying tumor biology. HGGs exhibit increased cellular density, enhanced angiogenesis, and greater spatial heterogeneity, which manifest as distinct textural patterns captured by these mathematical transformations (35,36). The log_sigma_5_0_mm_3D_firstorder_RootMeanSquared feature, which was identified as the most important predictor, quantifies intensity variation in tumor regions, correlating with the increased metabolic activity and cellular proliferation characteristic of grade IV glioblastomas (37).
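To make the top-ranked feature concrete, the sketch below shows how a quantity like log_sigma_5_0_mm_3D_firstorder_RootMeanSquared is derived: apply a Laplacian-of-Gaussian filter to the volume, then compute the first-order root-mean-square over tumor-mask voxels. The volume and mask are synthetic, and sigma is given in voxels (the 5 mm scale assumes 1 mm isotropic spacing); in the actual pipeline PyRadiomics handles resampling, filtering, and masking.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# Toy contrast-enhanced T1 sub-volume and tumor ROI (synthetic data).
rng = np.random.default_rng(0)
volume = rng.normal(loc=100.0, scale=10.0, size=(32, 32, 32))
mask = np.zeros(volume.shape, dtype=bool)
mask[8:24, 8:24, 8:24] = True

# Laplacian-of-Gaussian filter; sigma = 5 voxels (~5 mm at 1 mm spacing,
# an assumption for this sketch). Larger sigma emphasizes coarser texture.
filtered = gaussian_laplace(volume, sigma=5.0)

# First-order RootMeanSquared over the filtered ROI voxels: a summary of
# intensity variation magnitude within the tumor region.
roi = filtered[mask]
root_mean_squared = np.sqrt(np.mean(roi ** 2))
```

Higher values of this statistic correspond to stronger blob-like intensity variation at the chosen scale, which is the heterogeneity signal the model exploits.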
Wavelet decomposition captures multi-scale texture information: each letter of a sub-band label denotes low-pass (L) or high-pass (H) filtering along one spatial axis, so sub-bands such as high-high-low (HHL) and low-high-high (LHH) isolate high-frequency detail along different spatial directions. The prominence of correlation and small dependence emphasis features suggests that HGGs demonstrate more organized local texture patterns despite their overall heterogeneity, potentially reflecting neoangiogenesis and cellular organization changes during malignant transformation (38).
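The sub-band naming can be demonstrated with a minimal one-level Haar decomposition (PyRadiomics uses a smoother wavelet by default, and axis-to-letter conventions vary between toolkits, so this is an illustrative sketch rather than the study’s exact transform):

```python
import numpy as np

def haar_step(x, axis, band):
    """One-level Haar filtering along one axis: 'L' = low-pass (pairwise sum),
    'H' = high-pass (pairwise difference), each with dyadic downsampling."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    out = (even + odd) / np.sqrt(2) if band == "L" else (even - odd) / np.sqrt(2)
    return np.moveaxis(out, 0, axis)

def wavelet_subband(volume, bands):
    """Apply 'L'/'H' filters along the three axes, e.g. bands='LLH' or 'LHH'."""
    out = volume
    for axis, band in enumerate(bands):
        out = haar_step(out, axis, band)
    return out

# Toy 3D image with a simple intensity ramp.
volume = np.arange(64, dtype=float).reshape(4, 4, 4)
llh = wavelet_subband(volume, "LLH")  # low-pass on two axes, high-pass on one
lhh = wavelet_subband(volume, "LHH")  # low-pass on one axis, high-pass on two
```

Radiomic texture features (GLSZM, GLDM, first-order statistics) are then recomputed on each sub-band image, which is how names like wavelet_LLH_glszm_LargeAreaHighGrayLevelEmphasis arise.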
A SHAP analysis was conducted to address a critical barrier to clinical adoption of ML models—the “black-box” problem. By providing both global and local explanations, our model enables clinicians to understand which imaging features contribute to specific predictions, building trust and facilitating clinical decision-making (26). Recent advances in interpretable ML for glioma analysis have further validated the clinical importance of such transparent approaches (32). This interpretability is particularly important in medical applications where understanding the reasoning behind predictions is crucial for patient care and medicolegal considerations.
The waterfall plots provide case-specific explanations that enhance model transparency and support clinical decision-making by visualizing individual feature contributions to each prediction (39). This approach has been shown to improve clinician confidence in artificial intelligence-assisted diagnosis across various medical imaging applications.
Our model addresses a genuine clinical need, as preoperative glioma grading accuracy directly impacts treatment planning. The 82.3–94.3% accuracy of our model across the validation cohorts represents a substantial improvement over conventional imaging assessment (accuracy: 55–77%) (10,11). The maintained performance of our model across different institutions and patient populations suggests potential for broad clinical deployment.
Our results demonstrated that radiomics features can effectively capture the imaging phenotype associated with different WHO grades, which correlates with underlying molecular characteristics. The relationship between radiomics features and molecular markers such as IDH mutations, 1p/19q codeletion, and MGMT methylation status represents an important area for future investigation. The integration of radiomics with molecular data may further improve prediction accuracy and provide insights into tumor biology. Despite robust multicenter validation, several critical barriers remain before clinical deployment. First, MRI protocol standardization is paramount, as cross-institutional heterogeneity remains the principal constraint on generalizability. Although analyses across TCGA, UCSF, and NTUA indicate protocol-agnostic performance, systematic harmonization is still required through consensus guidelines for radiomics-optimized acquisition parameters and domain-adaptation strategies. Second, automated segmentation is essential to mitigate workflow bottlenecks posed by manual delineation; near-expert performance can be achieved by deploying validated deep learning architectures trained on diverse datasets, complemented by semi-automated refinement tools and quality-control algorithms that flag cases requiring human review. Finally, intuitive user interfaces are needed to present predictions, confidence levels, and case-specific explanations clearly, alongside continuous feedback that iteratively improves usability and strengthens clinical integration.
Future research directions include: (I) the integration of multiparametric MRI sequences [diffusion-weighted imaging (DWI), perfusion, and fluid-attenuated inversion recovery (FLAIR)] to capture additional biological information; (II) prospective validation in real-time clinical settings; (III) model expansion to predict treatment response and survival outcomes; and (IV) the investigation of radiomics-molecular marker interactions for personalized medicine applications.
We acknowledge several limitations in this study. First, while our multicenter design mitigates single-institution biases, the retrospective nature of this study might have introduced selection bias. The variation in MRI protocols across centers, while reflecting real-world conditions, might have introduced technical heterogeneity that could affect model generalization. Larger prospective validation studies across diverse populations are needed to confirm model generalizability. Second, the manual tumor segmentation, while performed by experienced radiologists with excellent inter-rater reliability [intraclass correlation coefficient (ICC) >0.90], introduced potential subjectivity that automated segmentation could address. However, current automated segmentation methods for gliomas remain imperfect, particularly those for non-enhancing tumors. Third, our analysis focused on contrast-enhanced T1-weighted imaging; the integration of multiparametric MRI sequences (DWI, perfusion, and FLAIR) may further enhance model performance but would also increase model complexity and computational requirements. A single-sequence analysis was chosen because it balanced performance with clinical practicality. Fourth, molecular marker integration was limited by data availability across cohorts, but represents an important future direction. The 2021 WHO classification system emphasizes molecular characteristics, and future models should incorporate these markers alongside imaging features for optimal accuracy. Finally, our retrospective design may not fully capture the challenges of prospective implementation, and the clinical impact of using model predictions on treatment decisions and patient outcomes should be evaluated in prospective interventional studies.
Conclusions
In this multicenter, retrospective study, an interpretable radiomics-based machine-learning model for pre-operative WHO grading of gliomas was developed and externally validated. The model exhibited robust discrimination and calibration across both internal and geographically diverse international cohorts, substantially outperforming conventional radiological assessment. SHAP analysis revealed that intratumoral texture heterogeneity features were the dominant predictors, offering transparent explanations that mitigate the “black-box” concern surrounding artificial intelligence. Prospective, workflow-embedded studies are now warranted to confirm clinical utility, determine effects on treatment decisions and patient outcomes, and minimize unnecessary interventions.
Acknowledgments
We would like to express our sincere appreciation to our colleagues and the staff of the Department of Radiotherapy at The Affiliated Hospital of Nantong University for their invaluable support. Our gratitude also extends to the OnekeyAI platform for its technical assistance. We are profoundly grateful to the patients and investigators who contributed data to TCGA and the UCSF, whose participation was indispensable to this work.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/rc
Data Sharing Statement: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/dss
Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/prf
Funding: This work was supported by grants from
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Institutional Review Board of Nantong University (No. 2024-L215). Given the retrospective nature and use of de-identified data, the requirement of informed consent was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Ostrom QT, Patil N, Cioffi G, et al. CBTRUS Statistical Report: Primary Brain and Other Central Nervous System Tumors Diagnosed in the United States in 2013-2017. Neuro Oncol 2020;22:iv1-iv96. [Crossref] [PubMed]
- Siegel RL, Miller KD, Fuchs HE, et al. Cancer statistics, 2022. CA Cancer J Clin 2022;72:7-33. [Crossref] [PubMed]
- Louis DN, Perry A, Wesseling P, et al. The 2021 WHO Classification of Tumors of the Central Nervous System: a summary. Neuro Oncol 2021;23:1231-51. [Crossref] [PubMed]
- Weller M, van den Bent M, Preusser M, et al. EANO guidelines on the diagnosis and treatment of diffuse gliomas of adulthood. Nat Rev Clin Oncol 2021;18:170-86. [Crossref] [PubMed]
- Sahm F, Brandner S, Bertero L, et al. Molecular diagnostic tools for the World Health Organization (WHO) 2021 classification of gliomas, glioneuronal and neuronal tumors; an EANO guideline. Neuro Oncol 2023;25:1731-49. [Crossref] [PubMed]
- Antonelli M, Poliani PL. Adult type diffuse gliomas in the new 2021 WHO Classification. Pathologica 2022;114:397-409. [Crossref] [PubMed]
- Stupp R, Mason WP, van den Bent MJ, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med 2005;352:987-96. [Crossref] [PubMed]
- Buckner JC, Shaw EG, Pugh SL, et al. Radiation plus Procarbazine, CCNU, and Vincristine in Low-Grade Glioma. N Engl J Med 2016;374:1344-55. [Crossref] [PubMed]
- Renovanz M, Hickmann AK, Nadji-Ohl M, et al. Health-related quality of life and distress in elderly vs. younger patients with high-grade glioma-results of a multicenter study. Support Care Cancer 2020;28:5165-75. [Crossref] [PubMed]
- Lasocki A, Gaillard F. Non-Contrast-Enhancing Tumor: A New Frontier in Glioblastoma Research. AJNR Am J Neuroradiol 2019;40:758-65. [Crossref] [PubMed]
- Villanueva-Meyer JE, Mabray MC, Cha S. Current Clinical Brain Tumor Imaging. Neurosurgery 2017;81:397-415. [Crossref] [PubMed]
- Zikou AK, Alexiou GA, Kosta P, et al. Diffusion tensor and dynamic susceptibility contrast MRI in glioblastoma. Clin Neurol Neurosurg 2012;114:607-12. [Crossref] [PubMed]
- Danchaivijitr N, Waldman AD, Tozer DJ, et al. Low-grade gliomas: do changes in rCBV measurements at longitudinal perfusion-weighted MR imaging predict malignant transformation? Radiology 2008;247:170-8. [Crossref] [PubMed]
- Patel P, Baradaran H, Delgado D, et al. MR perfusion-weighted imaging in the evaluation of high-grade gliomas after treatment: a systematic review and meta-analysis. Neuro Oncol 2017;19:118-27. [Crossref] [PubMed]
- Zhang B, Chang K, Ramkissoon S, et al. Multimodal MRI features predict isocitrate dehydrogenase genotype in high-grade gliomas. Neuro Oncol 2017;19:109-17. [Crossref] [PubMed]
- Park JE, Kim HS, Park SY, et al. Prediction of Core Signaling Pathway by Using Diffusion- and Perfusion-based MRI Radiomics and Next-generation Sequencing in Isocitrate Dehydrogenase Wild-type Glioblastoma. Radiology 2020;294:388-97. [Crossref] [PubMed]
- Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell 2019;1:206-15. [Crossref] [PubMed]
- Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56. [Crossref] [PubMed]
- Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 2016;278:563-77. [Crossref] [PubMed]
- Lambin P, Rios-Velazquez E, Leijenaar R, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:441-6. [Crossref] [PubMed]
- Kumar V, Gu Y, Basu S, et al. Radiomics: the process and the challenges. Magn Reson Imaging 2012;30:1234-48. [Crossref] [PubMed]
- Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006. [Crossref] [PubMed]
- Fathi Kazerooni A, Kraya A, Rathi KS, et al. Multiparametric MRI along with machine learning predicts prognosis and treatment response in pediatric low-grade glioma. Nat Commun 2025;16:340. [Crossref] [PubMed]
- Liu Z, Wang S, Dong D, et al. The Applications of Radiomics in Precision Diagnosis and Treatment of Oncology: Opportunities and Challenges. Theranostics 2019;9:1303-22. [Crossref] [PubMed]
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017:4765-74.
- Xia X, Wu W, Tan Q, et al. Interpretable Machine Learning Models for Differentiating Glioblastoma From Solitary Brain Metastasis Using Radiomics. Acad Radiol 2025;32:5388-400. [Crossref] [PubMed]
- Śledzińska P, Bebyn M, Szczerba E, et al. Glioma 2021 WHO Classification: The Superiority of NGS Over IHC in Routine Diagnostics. Mol Diagn Ther 2022;26:699-713. [Crossref] [PubMed]
- Zwanenburg A, Vallières M, Abdalah MA, et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping. Radiology 2020;295:328-38. [Crossref] [PubMed]
- Raptis S, Ilioudis C, Theodorou K. From pixels to prognosis: unveiling radiomics models with SHAP and LIME for enhanced interpretability. Biomed Phys Eng Express 2024;
- Cho HH, Lee SH, Kim J, et al. Classification of the glioma grading using radiomics analysis. PeerJ 2018;6:e5982. [Crossref] [PubMed]
- Sun P, Wang D, Mok VC, et al. Comparison of Feature Selection Methods and Machine Learning Classifiers for Radiomics Analysis in Glioma Grading. IEEE Access 2019;7:102010-20.
- Yu M, Liu J, Zhou W, et al. MRI radiomics based on machine learning in high-grade gliomas as a promising tool for prediction of CD44 expression and overall survival. Sci Rep 2025;15:7433. [Crossref] [PubMed]
- Sudre CH, Panovska-Griffiths J, Sanverdi E, et al. Machine learning assisted DSC-MRI radiomics as a tool for glioma classification by grade and mutation status. BMC Med Inform Decis Mak 2020;20:149. [Crossref] [PubMed]
- Yi F, Yang H, Chen D, et al. XGBoost-SHAP-based interpretable diagnostic framework for alzheimer's disease. BMC Med Inform Decis Mak 2023;23:137. [Crossref] [PubMed]
- Le NQK, Hung TNK, Do DT, et al. Radiomics-based machine learning model for efficiently classifying transcriptome subtypes in glioblastoma patients from MRI. Comput Biol Med 2021;132:104320. [Crossref] [PubMed]
- Kang D, Park JE, Kim YH, et al. Diffusion radiomics as a diagnostic model for atypical manifestation of primary central nervous system lymphoma: development and multicenter external validation. Neuro Oncol 2018;20:1251-61. [Crossref] [PubMed]
- Ellingson BM, Bendszus M, Boxerman J, et al. Consensus recommendations for a standardized Brain Tumor Imaging Protocol in clinical trials. Neuro Oncol 2015;17:1188-98. [Crossref] [PubMed]
- Zhou M, Scott J, Chaudhury B, et al. Radiomics in Brain Tumor: Image Assessment, Quantitative Feature Descriptors, and Machine-Learning Approaches. AJNR Am J Neuroradiol 2018;39:208-16. [Crossref] [PubMed]
- Borys K, Schmitt YA, Nauta M, et al. Explainable AI in medical imaging: An overview for clinical practitioners - Saliency-based XAI approaches. Eur J Radiol 2023;162:110787. [Crossref] [PubMed]
(English Language Editor: L. Huleatt)

