Radiomics-based machine learning for glioma grade classification: a multicenter study with SHapley Additive exPlanations interpretability analysis
Highlight box
Key findings
• Using data from 905 patients at multiple centers, we built and externally validated a transparent radiomics-based extreme gradient boosting model for preoperative World Health Organization grading of gliomas on routine contrast-enhanced T1-weighted magnetic resonance imaging (MRI).
• The model showed consistently high discrimination: area under the curve 0.983 in the training set, 0.897 in the internal testing set, and 0.834 and 0.880 in two independent external cohorts, with overall accuracies of 82.3–94.3%.
• SHapley Additive exPlanations (SHAP) analysis identified 14 key radiomic features—mainly texture-heterogeneity metrics extracted with Laplacian-of-Gaussian and wavelet filters—as the principal drivers of model output.
What is known and what is new?
• Conventional MRI grading of gliomas is only moderately accurate (55–77%) and shows considerable inter-observer variability (15–25%).
• Earlier radiomics studies rarely included external validation, interpretability analyses, or testing across geographically diverse populations.
• This work delivers the first interpretable, externally validated machine-learning model that maintains performance across international cohorts and molecular subtypes.
What is the implication, and what should change now?
• The model substantially outperforms conventional imaging assessment and provides transparent, explainable predictions essential for clinical trust and regulatory approval.
• Prospective validation in real-world clinical workflows is needed to assess impact on treatment decisions, patient outcomes, and cost-effectiveness before routine clinical implementation.
Introduction
Gliomas account for approximately 80% of all malignant primary brain tumors, with an annual incidence of 6.0 per 100,000 individuals worldwide (1,2). The World Health Organization (WHO) classification system stratifies gliomas into grades I, II, III, and IV, with 5-year survival rates decreasing dramatically from 68% for grade II to 9.9% for grade IV tumors (3,4). The recent 2021 WHO classification system further integrated molecular markers, including isocitrate dehydrogenase (IDH) mutations, 1p/19q codeletion, and O6-methylguanine-DNA methyltransferase (MGMT) methylation status, which are crucial for accurate diagnosis and treatment planning (5,6).
Accurate preoperative grading is fundamental for treatment planning, as it directly influences the surgical approach, adjuvant therapy selection, and patient counseling. Grade IV glioblastomas require aggressive multimodal therapy, including maximal safe resection, concurrent chemoradiotherapy, and maintenance temozolomide, while grade II tumors may be managed with observation or limited intervention (7,8). The clinical impact of accurate preoperative grading is profound; it can reduce unnecessary aggressive interventions by 25% in low-grade cases and prevent treatment delays in 40% of high-grade cases (9).
However, conventional magnetic resonance imaging (MRI) assessment faces significant limitations in distinguishing among WHO grades, particularly between grade II and III gliomas, where diagnostic accuracy ranges from 55% to 77% (10,11). Traditional imaging features such as contrast enhancement, necrosis, and peritumoral edema demonstrate substantial overlap between grades, leading to diagnostic uncertainty in up to 30% of cases (12,13). Moreover, inter-observer variability in radiological interpretation ranges from 15% to 25%, affecting treatment decision consistency (14).
Model interpretability is crucial in clinical artificial intelligence. First, clinicians need to understand the basis of predictions to use them confidently in treatment decisions. Second, regulatory frameworks prioritize explainability: agencies such as the Food and Drug Administration (FDA) require interpretable models for medical device approval, especially in high-risk diagnostic or therapeutic settings. Third, transparency helps clinicians detect potential algorithmic biases and recognize cases in which predictions may be unreliable.
Several prediction models for glioma grading have been developed; however, these models have significant limitations. For example, most rely on single-center cohorts with limited sample sizes (n<500), lack external validation across different populations, or employ traditional statistical methods that cannot capture complex non-linear relationships (15,16). Critically, existing models lack interpretability, presenting a “black-box” problem that hinders their clinical adoption and regulatory approval, as evidenced by an implementation rate of <5% in routine clinical practice (17,18).
Radiomics, an emerging field that involves the extraction of quantitative features from medical images, has shown promise in improving diagnostic accuracy and treatment planning in oncology (19,20). By analyzing texture, shape, and first-order statistical features invisible to the human eye, radiomics can provide objective and reproducible biomarkers for tumor characterization (21,22). Recent advances in multiparametric MRI radiomics, which extend beyond traditional morphological assessment to capture underlying biological heterogeneity, have shown significant potential in predicting treatment response and prognosis across different glioma subtypes (23).
Machine learning (ML) algorithms can effectively integrate multiple radiomics features to build robust predictive models (24). The SHapley Additive exPlanation (SHAP) method addresses the interpretability limitation by providing both global and local explanations for model predictions (25). Recent radiomics studies support the clinical adoption of interpretable ML models, which have been shown to demonstrate superior performance in tumor differentiation while maintaining the model transparency crucial for clinical decision-making (26).
Therefore, robust, interpretable prediction models that provide accurate, individualized glioma grading are urgently needed to optimize treatment strategies and improve patient outcomes. This study aimed to: (I) extract comprehensive radiomics features from preoperative T1-weighted contrast-enhanced MRI; (II) develop and compare multiple ML models; (III) validate model performance across multiple independent cohorts spanning diverse geographic regions and ethnic groups; and (IV) provide model interpretability using a SHAP analysis. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/rc).
Methods
Study design and ethical considerations
This multicenter retrospective study developed and validated an interpretable ML model for automated WHO grading of gliomas using preoperative contrast-enhanced T1-weighted MRI radiomics features. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and approved by the Institutional Review Board of Nantong University (No. 2024-L215). Given the retrospective nature and use of de-identified data, the requirement for informed consent was waived.
Study population
Between 2005 and 2024, we enrolled 905 patients with histopathologically confirmed gliomas from three independent institutions: The Cancer Genome Atlas (TCGA) cohort (n=329), University of California San Francisco (UCSF) cohort (n=482), and Nantong University Affiliated Hospital (NTUA) cohort (n=94). Patients were included in the study if they met the following inclusion criteria: (I) had a histopathologically confirmed diagnosis of glioma according to the WHO classification; (II) had preoperative contrast-enhanced T1-weighted MRI scans available; (III) had complete WHO grading information (grade II, III, or IV) available; and (IV) had molecular marker information, including IDH mutation status, 1p/19q codeletion, and MGMT methylation status, available. Patients were excluded if they met any of the following exclusion criteria: (I) had missing preoperative contrast-enhanced T1-weighted MRI images; (II) had no WHO grading information; and/or (III) had poor-quality images precluding reliable radiomics analysis.
Primary outcome and sample size determination
The primary outcome was glioma grade, dichotomized into low-grade gliomas (LGGs; WHO grade II) and high-grade gliomas (HGGs; WHO grades III–IV), as determined by postoperative histopathological diagnosis. To ensure methodological rigor, outcome labels remained blinded during feature processing and model training, guaranteeing independence between predictor handling and outcome assessment.
Sample size calculation followed the “10 events per variable” rule for ML models. Assuming 14 final radiomics features and an estimated 60% HGG prevalence, a minimum of approximately 234 cases was required for robust model development (14 features × 10 events per feature ÷ 0.60 ≈ 234). To ensure adequate statistical power for external validation and subgroup analyses, we targeted enrollment of at least 400 cases across external testing cohorts, ultimately exceeding this threshold with 576 external validation cases.
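The events-per-variable arithmetic can be sanity-checked in a few lines of Python (a minimal sketch; the function name is ours, and a rounding guard is needed because 0.60 is not exactly representable in binary floating point):

```python
import math

def min_sample_size(n_features: int, events_per_variable: int, prevalence: float) -> int:
    """Minimum cohort size under the 'events per variable' (EPV) rule:
    the outcome events must number at least EPV x n_features, so the
    total cohort is events / prevalence (rounded up). The round() call
    guards against float artifacts such as 120/0.6 -> 200.00000000000003."""
    required_events = n_features * events_per_variable
    return math.ceil(round(required_events / prevalence, 6))

print(min_sample_size(12, 10, 0.60))  # 120 events / 0.60 -> 200 cases
print(min_sample_size(14, 10, 0.60))  # 140 events / 0.60 -> 234 cases
```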
Patient and public involvement
Patients and public representatives were not directly involved in the study design, conduct, or reporting phases due to: (I) the retrospective nature utilizing existing de-identified datasets; (II) the technical complexity of radiomics methodology requiring specialized expertise; and (III) institutional constraints during the study period. However, we acknowledge this limitation and plan to incorporate patient advisory groups in future prospective validation studies to guide clinical implementation strategies, ensure patient-centered outcome measures, and facilitate community dissemination of findings.
Clinical variables and molecular markers
The following clinical data were extracted from electronic medical records: age at diagnosis, gender, WHO grade (II, III, or IV), tumor location, and surgical extent. The analyzed molecular markers included IDH1/2 mutation status (wild-type vs. mutant), 1p/19q codeletion status (intact vs. codeleted), and MGMT promoter methylation status (unmethylated vs. methylated). These molecular markers were determined using standard protocols including immunohistochemistry, fluorescence in situ hybridization, and pyrosequencing as appropriate (27).
Image acquisition and preprocessing
All the patients underwent preoperative MRI, including contrast-enhanced T1-weighted sequences. The images were converted to Neuroimaging Informatics Technology Initiative (NIfTI) format and underwent a standardized preprocessing pipeline, including: (I) N4 bias-field correction to eliminate intensity inhomogeneity; (II) resampling to isotropic 1×1×1 mm3 voxel spacing using trilinear interpolation; and (III) intensity normalization using Z-score standardization.
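Step (III) of the pipeline can be shown in isolation. The sketch below is illustrative only (real preprocessing operates on 3D NIfTI volumes, typically via tools such as SimpleITK or ANTs); it applies Z-score standardization to a flat list of voxel intensities:

```python
import statistics

def zscore_normalize(voxels):
    """Z-score intensity standardization (preprocessing step III):
    shift intensities to zero mean and scale to unit standard deviation."""
    mu = statistics.fmean(voxels)
    sigma = statistics.pstdev(voxels)
    return [(v - mu) / sigma for v in voxels]

# Toy 1D intensity profile; real inputs are full 3D volumes
normed = zscore_normalize([100.0, 120.0, 140.0, 160.0, 180.0])
print([round(v, 3) for v in normed])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After normalization the list has mean 0 and unit standard deviation, making intensities comparable across scanners and acquisition protocols.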
Data quality control: all the imaging data underwent systematic quality assessment, including: (I) motion artifact evaluation using automated detection algorithms; (II) contrast injection adequacy verification based on an enhancement ratio >150% in normal vascular structures; (III) image completeness confirmation to ensure full brain coverage; and (IV) standardized anonymization using protocols to maintain patient privacy while preserving essential clinical variables. Tumor regions of interest (ROIs) were manually delineated on contrast-enhanced T1-weighted images by two experienced neuroradiologists (each with >10 years of experience) who were blinded to the histopathological results. For quality control, a subset of 100 cases underwent independent segmentation by both radiologists, achieving excellent inter-rater reliability (intraclass correlation coefficient >0.90).
Radiomics feature extraction
Radiomics features were extracted using the PyRadiomics package (version 3.0.1) following Image Biomarker Standardization Initiative (IBSI) guidelines (28). A total of 1,197 quantitative features were extracted from each tumor ROI:
• Shape features (N=14): three-dimensional (3D) geometric properties, including the volume, surface area, sphericity, and compactness;
• First-order features (N=17): statistical descriptors of the voxel intensity distribution, including the mean, median, standard deviation, skewness, and kurtosis;
• Second-order texture features (N=72): gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), gray-level dependence matrix (GLDM), and neighboring gray-tone difference matrix (NGTDM) features;
• Wavelet features (N=728): features extracted from eight wavelet decompositions;
• Laplacian of Gaussian (LoG) features (N=364): features extracted using LoG filters with sigma values ranging from 1.0 to 5.0 mm;
• Square and square root filtered features (N=2): mathematical transformations capturing different intensity ranges.
Feature selection
Feature selection was performed using recursive feature elimination (RFE) with 10-fold cross-validation repeated 10 times to ensure robust selection. The RFE algorithm systematically removes the features with the lowest importance rankings until optimal performance is achieved, as measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Highly correlated features (Pearson correlation coefficient >0.9) were identified, and of each correlated pair, the feature with the higher clinical relevance was retained.
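The correlation-pruning step can be sketched as follows. This is a simplified stand-in: it keeps the first feature encountered rather than ranking by clinical relevance as the study did, and the feature names and values are hypothetical:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def prune_correlated(features, threshold=0.9):
    """Greedy filter: walk features in order and drop any feature whose
    |r| with an already-kept feature exceeds the threshold.
    `features` maps feature name -> list of values across patients."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

feats = {
    "glcm_contrast":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "glcm_contrast_2":  [2.1, 4.0, 6.2, 8.1, 10.0],  # near-duplicate, |r| ~ 1
    "shape_sphericity": [0.9, 0.1, 0.7, 0.2, 0.5],
}
print(prune_correlated(feats, threshold=0.9))  # ['glcm_contrast', 'shape_sphericity']
```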
ML model development
The performance of 10 ML algorithms in glioma grading prediction was compared: AdaBoost (AB), extra trees (ET), stochastic gradient boosting (SGBT), logistic regression (LR), light gradient boosting machine (LightGBM), neural network (NNET), Naive Bayes (NB), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Hyperparameter optimization was performed using 10-fold cross-validation, repeated 10 times, combined with a grid search across predefined parameter spaces. The optimal hyperparameters were selected based on the maximum AUC performance during cross-validation.
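The grid search described above can be sketched generically. In the study it wrapped repeated 10-fold cross-validation; the toy scorer below merely stands in for that step (the parameter names mimic XGBoost's, but the scoring function is fabricated for illustration):

```python
import itertools

def grid_search(param_grid, cv_score):
    """Exhaustive grid search: evaluate every parameter combination with a
    cross-validation scorer and keep the one with the highest mean AUC."""
    best_params, best_auc = None, -1.0
    keys = list(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = cv_score(params)
        if score > best_auc:
            best_params, best_auc = params, score
    return best_params, best_auc

# Toy scorer standing in for repeated 10-fold cross-validation
def toy_cv_score(params):
    return 0.9 - 0.01 * abs(params["max_depth"] - 4) - 0.1 * abs(params["eta"] - 0.1)

grid = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}
print(grid_search(grid, toy_cv_score))  # ({'max_depth': 4, 'eta': 0.1}, 0.9)
```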
Model performance comparison
Model performance was comprehensively evaluated based on the following performance metrics: (I) discrimination: The AUC of the ROC curve with 95% confidence intervals (CIs) was calculated using 1,000 bootstrap resamples, while the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1-score were calculated at optimal probability thresholds determined by Youden’s index; (II) calibration: calibration plots comparing predicted probabilities to observed frequencies across deciles of predicted risk were assessed, while the Hosmer-Lemeshow test was used to evaluate goodness-of-fit, with P>0.05 indicating adequate calibration; (III) clinical utility: a decision curve analysis (DCA) was conducted to quantify the net benefit across different probability thresholds compared to “treat all” and “treat none” strategies; and (IV) statistical comparison: DeLong’s test was used to compare the AUC values between models, while McNemar’s test was used to evaluate differences in classification accuracy for paired models.
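The threshold-selection step in (I) can be illustrated with a small, self-contained sketch of Youden's index (toy labels and probabilities, not study data):

```python
def youden_threshold(y_true, y_prob):
    """Scan every observed probability as a candidate cut-off and return
    the one maximizing Youden's J = sensitivity + specificity - 1."""
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

y     = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9]
t, j = youden_threshold(y, probs)
print(t, round(j, 3))  # 0.4 0.75
```

Sensitivity, specificity, PPV, NPV, accuracy, and the F1-score are then all read off the confusion matrix at the chosen cut-off.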
Fairness assessment
To evaluate algorithmic fairness, we assessed model performance across key demographic and clinical subgroups including gender (male vs. female), age groups (≤60 vs. >60 years), and molecular marker status (IDH wild-type vs. mutant). Performance metrics including AUC, sensitivity, and specificity were calculated separately for each subgroup. Statistical comparisons between subgroups were performed using DeLong’s test for AUC and McNemar’s test for sensitivity/specificity. P values <0.05 indicated significant performance differences, suggesting potential algorithmic bias.
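The subgroup evaluation reduces to computing a discrimination metric per stratum. A minimal sketch, using the Mann-Whitney formulation of the AUC (group labels and scores below are invented for illustration):

```python
def auc(y_true, y_score):
    """AUC as the Mann-Whitney probability that a random positive case
    is scored above a random negative case (ties count one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auc(records):
    """records: (group, label, score) tuples; returns the AUC per group."""
    groups = {}
    for g, y, s in records:
        groups.setdefault(g, ([], []))
        groups[g][0].append(y)
        groups[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in groups.items()}

data = [
    ("male", 1, 0.9), ("male", 0, 0.2), ("male", 1, 0.7), ("male", 0, 0.4),
    ("female", 1, 0.8), ("female", 0, 0.3), ("female", 1, 0.45), ("female", 0, 0.5),
]
print(subgroup_auc(data))  # {'male': 1.0, 'female': 0.75}
```

A formal bias assessment then compares these subgroup AUCs statistically, e.g., with DeLong's test as described above.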
Model interpretability analysis
The optimal model underwent comprehensive interpretability analysis using SHAP, a game theory-based approach providing both global and local explanations (29). SHAP values quantify each feature’s contribution to individual predictions, enabling global importance ranking, feature impact visualization, and individual case analysis.
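The global-importance step reduces to averaging absolute SHAP values per feature. In practice the `shap` library computes the value matrix from the fitted model; here the matrix is hard-coded with hypothetical numbers and abbreviated feature names:

```python
def global_importance(shap_values, feature_names):
    """Global ranking by mean absolute SHAP value across patients.
    shap_values: one row per patient, one column per feature."""
    n = len(shap_values)
    means = [sum(abs(row[j]) for row in shap_values) / n
             for j in range(len(feature_names))]
    return sorted(zip(feature_names, means), key=lambda x: -x[1])

# Hypothetical SHAP matrix: three features over four patients
names = ["log_sigma_5_0_RMS", "wavelet_LLH_LAHGLE", "shape_Sphericity"]
vals = [
    [ 1.2, -0.1, 0.05],
    [-0.9,  0.2, 0.00],
    [ 1.1, -0.2, 0.10],
    [-1.0,  0.1, 0.05],
]
for name, m in global_importance(vals, names):
    print(name, round(m, 3))
```

Because SHAP values are signed, the per-patient rows also drive the local waterfall and force plots, while the mean of their absolute values yields the global bar ranking.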
Statistical analysis
The continuous variables are expressed as the mean ± standard deviation or the median (interquartile range), depending on distribution normality as assessed using the Shapiro-Wilk test. The categorical variables are presented as the frequency and percentage. Inter-group comparisons for continuous variables were performed using the independent t-test for normally distributed data or the Mann-Whitney U test for non-normally distributed data. The Chi-squared test or Fisher’s exact test was used for the categorical variables as appropriate.
Feature selection was conducted using a multi-step process. Initially, a Pearson correlation analysis was conducted to identify and remove highly correlated features (correlation coefficient |r| >0.9) to mitigate multicollinearity. Subsequently, a univariate analysis using an independent t-test was applied to select features with significant differences between glioma grades (P<0.05). Finally, RFE with cross-validation was implemented using the XGBoost algorithm to rank and select the optimal subset of features based on their contribution to model performance, minimizing overfitting while maximizing discriminative power. Predictions for patients in the validation cohorts were obtained by applying the final trained models to independent validation data. Outcome labels were withheld, and no information from the validation sets was used during model training or feature selection.
Model performance was evaluated using multiple metrics, including the AUC of the ROC curve with the 95% CI calculated using DeLong’s method, accuracy, sensitivity, specificity, PPV, NPV, and F1 score. Model calibration was assessed using calibration curves and the Hosmer-Lemeshow goodness-of-fit test, with P>0.05 indicating adequate calibration. Clinical utility was quantified by a DCA, comparing net benefit across threshold probabilities. For interpretability, SHAP values were computed to quantify individual feature contributions to predictions, with the mean absolute SHAP values used for global feature importance ranking. Local explanations were visualized through waterfall and dependence plots. Statistical significance was set at P<0.05 (two-tailed). All analyses were performed using Python 3.12 with libraries including scikit-learn (version 1.4.0), XGBoost (version 2.0.3), and SHAP (version 0.44.0).
Results
Baseline clinical characteristics
A total of 905 glioma patients from three independent cohorts were included in this study (Figure 1). The TCGA cohort was divided into a training set (n=230, 70%) and an internal testing set (n=99, 30%), while the UCSF (n=482) and NTUA (n=94) cohorts served as two independent external testing sets for assessing model generalizability across diverse populations and institutional practices.
The demographic and clinicopathological characteristics of all the patients are detailed in Tables 1,2. Several baseline characteristics differed between the training and external validation cohorts: the WHO grade distribution differed significantly (P=0.001), while gender showed a non-significant trend toward imbalance (P=0.07). Specifically, external testing set 2 (NTUA) demonstrated a higher male predominance (73.4% vs. 58.3% in the training set, P=0.02) and a greater proportion of grade IV tumors (68.1% vs. 36.5%, P=0.041). IDH wild-type status was also more prevalent in the external cohorts (86.2% in the NTUA set vs. 48.3% in the training set, P<0.001), reflecting real-world population heterogeneity and strengthening the rigor of the external validation. Table 2 highlights significant differences in age and molecular marker profiles across different WHO grades (all P<0.001).
Table 1
| Characteristic | Total (n=905) | Training set (n=230) | Internal testing set (n=99) | External testing set 1 (n=482) | External testing set 2 (n=94) | P value |
|---|---|---|---|---|---|---|
| Age (years) | 56.0 (45.0–66.0) | 54.0 (39.0–64.0) | 55.0 (41.5–64.0) | 59.0 (47.0–68.0) | 55.1±12.6 | 0.002 |
| Gender | 0.07 | |||||
| Male | 550 (60.8) | 134 (58.3) | 58 (58.6) | 289 (60.0) | 69 (73.4) | |
| Female | 355 (39.2) | 96 (41.7) | 41 (41.4) | 193 (40.0) | 25 (26.6) | |
| WHO grade | 0.001 | |||||
| II | 273 (30.2) | 79 (34.3) | 32 (32.3) | 150 (31.1) | 12 (12.8) | |
| III | 234 (25.9) | 67 (29.1) | 25 (25.3) | 124 (25.7) | 18 (19.1) | |
| IV | 398 (44.0) | 84 (36.5) | 42 (42.4) | 208 (43.2) | 64 (68.1) | |
| IDH mutation status | <0.001 | |||||
| Mutant | 384 (42.4) | 119 (51.7) | 42 (42.4) | 210 (43.6) | 13 (13.8) | |
| Wild-type | 521 (57.6) | 111 (48.3) | 57 (57.6) | 272 (56.4) | 81 (86.2) | |
| 1p/19q codeletion status | <0.001 | |||||
| Intact | 667 (73.7) | 158 (68.7) | 72 (72.7) | 360 (74.7) | 77 (81.9) | |
| Codeletion | 238 (26.3) | 72 (31.3) | 27 (27.3) | 122 (25.3) | 17 (18.1) | |
| MGMT methylation status | 0.04 | |||||
| Methylated | 451 (49.8) | 125 (54.3) | 48 (48.5) | 243 (50.4) | 35 (37.2) | |
| Unmethylated | 454 (50.2) | 105 (45.7) | 51 (51.5) | 239 (49.6) | 59 (62.8) | |
| Histology | <0.001 | |||||
| Astrocytoma | 470 (52.0) | 108 (47.0) | 52 (52.5) | 253 (52.5) | 57 (60.6) | |
| Oligodendroglioma | 238 (26.3) | 72 (31.3) | 27 (27.3) | 122 (25.3) | 17 (18.1) | |
| Oligoastrocytoma | 146 (16.1) | 39 (17.0) | 15 (15.2) | 88 (18.3) | 4 (4.3) | |
| Glioblastoma | 39 (4.3) | 9 (3.9) | 4 (4.0) | 15 (3.1) | 11 (11.7) | |
| Other | 12 (1.3) | 2 (0.9) | 1 (1.0) | 4 (0.8) | 5 (5.3) |
Data are presented as mean ± standard deviation for the normally distributed continuous variables, median (interquartile range) for the non-normally distributed continuous variables, and number (percentage) for the categorical variables. P values were calculated using the Kruskal-Wallis test for the continuous variables, and the Chi-squared test or Fisher’s exact test for the categorical variables as appropriate. IDH, isocitrate dehydrogenase; MGMT, O6-methylguanine-DNA methyltransferase; WHO, World Health Organization.
Table 2
| Characteristic | Overall (N=905) | Grade II (N=136) | Grade III (N=126) | Grade IV (N=643) | P value† |
|---|---|---|---|---|---|
| Age (years) | |||||
| Mean ± standard deviation | 55.1±15.4 | 40.7±12.9 | 47.0±13.7 | 59.8±13.6 | <0.001 |
| Median (Q1–Q3) | 57.0 (45.0–66.0) | 38.5 (30.0–49.5) | 47.5 (36.0–58.0) | 61.0 (52.0–69.0) | |
| Gender | 0.13 | ||||
| Male | 550 (60.8) | 72 (52.9) | 78 (61.9) | 400 (62.2) | |
| Female | 355 (39.2) | 64 (47.1) | 48 (38.1) | 243 (37.8) | |
| IDH mutation status | <0.001 | ||||
| Wild-type | 553 (61.1) | 16 (11.8) | 41 (32.5) | 496 (77.1) | |
| Mutated | 257 (28.4) | 119 (87.5) | 85 (67.5) | 53 (8.2) | |
| Unknown | 95 (10.5) | 1 (0.7) | 0 (0.0) | 94 (14.6) | |
| 1p/19q codeletion status | <0.001 | ||||
| Non-codeleted | 134 (14.8) | 52 (38.2) | 51 (40.5) | 31 (4.8) | |
| Codeleted | 427 (47.2) | 62 (45.6) | 60 (47.6) | 305 (47.4) | |
| Unknown | 344 (38.0) | 22 (16.2) | 15 (11.9) | 307 (47.7) | |
| MGMT promoter methylation | <0.001 | ||||
| Unmethylated | 196 (21.7) | 13 (9.6) | 14 (11.1) | 169 (26.3) | |
| Methylated | 457 (50.5) | 5 (3.7) | 16 (12.7) | 436 (67.8) | |
| Unknown | 252 (27.8) | 118 (86.8) | 96 (76.2) | 38 (5.9) | |
| Histological subtype | <0.001 | ||||
| Anaplastic astrocytoma | 33 (3.6) | 0 (0.0) | 33 (26.2) | 0 (0.0) | |
| Anaplastic oligoastrocytoma | 10 (1.1) | 0 (0.0) | 10 (7.9) | 0 (0.0) | |
| Anaplastic oligodendroglioma | 27 (3.0) | 0 (0.0) | 27 (21.4) | 0 (0.0) | |
| Astrocytoma | 16 (1.8) | 16 (11.8) | 0 (0.0) | 0 (0.0) | |
| Oligoastrocytoma | 14 (1.5) | 14 (10.3) | 0 (0.0) | 0 (0.0) | |
| Oligodendroglioma | 39 (4.3) | 39 (28.7) | 0 (0.0) | 0 (0.0) | |
| Unknown | 766 (84.6) | 67 (49.3) | 56 (44.4) | 643 (100.0) | |
| Dataset allocation | <0.001 | ||||
| Training set | 230 (25.4) | 50 (36.8) | 52 (41.3) | 128 (19.9) | |
| Internal testing set | 99 (10.9) | 19 (14.0) | 18 (14.3) | 62 (9.6) | |
| External testing set 1 | 482 (53.3) | 45 (33.1) | 41 (32.5) | 396 (61.6) | |
| External testing set 2 | 94 (10.4) | 22 (16.2) | 15 (11.9) | 57 (8.9) |
Data are presented as n (%), unless otherwise specified. †, one-way analysis of means for the continuous variable; Pearson’s Chi-squared test for the categorical variables. IDH, isocitrate dehydrogenase; MGMT, O6-methylguanine-DNA methyltransferase; WHO, World Health Organization.
Feature selection and model development
From an initial pool of 1,197 extracted radiomic features, univariate t-tests identified 903 features significantly associated with glioma grade (P<0.05). These candidate features were then subjected to RFE with 10-fold cross-validation repeated 10 times to ensure robust feature selection. This process systematically removed features with the lowest importance rankings until an optimal subset was obtained. Ultimately, 14 optimal radiomics features were retained for model development, demonstrating the highest discriminative power for glioma grading with minimal redundancy (inter-feature correlation <0.85). The selected features included LoG filtered features (n=3), wavelet-transformed texture features (n=10), and shape-based features (n=1).
The performance of 10 ML algorithms (AB, ET, SGBT, LR, LightGBM, NNET, NB, RF, SVM, and XGBoost) in glioma grading prediction was comprehensively evaluated. Hyperparameter optimization was performed using 10-fold cross-validation repeated 10 times combined with a grid search. The internal validation results, as illustrated in Figure 2, revealed that the tree-based ensemble methods, particularly XGBoost, RF, and ET, demonstrated superior and consistent performance across all evaluation metrics, while traditional linear and distance-based algorithms showed an inferior discrimination capacity.
Model performance, external validation and model equity
The comprehensive performance assessment in the cross-validation analysis demonstrated that XGBoost achieved superior discrimination with an AUC of 0.943 (95% CI: 0.926–0.960), a sensitivity of 0.876 (95% CI: 0.834–0.918), and a specificity of 0.892 (95% CI: 0.851–0.933).
The ROC curve analysis across all validation cohorts confirmed the robust performance of the XGBoost model (Figure 3A-3D). In the training set (Figure 3A), XGBoost achieved excellent discrimination (AUC =0.983, 95% CI: 0.968–0.996), followed closely by AB (AUC =0.970) and RF (AUC =0.963). In the internal testing set (Figure 3B), XGBoost maintained superior performance (AUC =0.897, 95% CI: 0.836–0.956), demonstrating good generalizability with a performance drop of less than 6%, suggesting minimal overfitting.
Crucially, in the external validation, XGBoost maintained clinically acceptable performance despite dataset heterogeneity. It achieved an AUC of 0.834 (95% CI: 0.766–0.883) in external testing set 1 (UCSF, Figure 3C) and an AUC of 0.880 (95% CI: 0.771–0.974) in external testing set 2 (NTUA, Figure 3D). Despite the challenging validation scenario in external testing set 2, which had a higher proportion of grade IV tumors (68.1% vs. 36.5% in training), the model remained reliable, thereby validating its robustness across different patient populations and institutional practices.

To evaluate potential algorithmic bias and ensure equitable performance across diverse patient populations, we conducted comprehensive subgroup analyses stratified by demographic and molecular characteristics (Figure S1). The XGBoost model demonstrated remarkable consistency across demographic subgroups. No significant performance differences were observed between male (n=550; AUC 0.902, 95% CI: 0.862–0.934) and female patients (n=355; AUC 0.891, 95% CI: 0.843–0.930; P=0.70, DeLong’s test), nor between younger (<60 years; n=476; AUC 0.877, 95% CI: 0.841–0.912) and older patients (≥60 years; n=429; AUC 0.888, 95% CI: 0.819–0.943; P=0.63). Regarding molecular markers, the model maintained robust performance across MGMT methylation status (methylated: n=457, AUC 0.821, 95% CI: 0.633–0.963 vs. unmethylated: n=196, AUC 0.843, 95% CI: 0.673–0.972; P=0.81). However, performance variations were observed in molecular subgroups that define distinct biological entities. The model achieved superior discrimination in IDH-mutant tumors (n=257; AUC 0.852, 95% CI: 0.805–0.896) compared with IDH wild-type cases (n=553; AUC 0.745, 95% CI: 0.598–0.881; P=0.02). Similarly, enhanced performance was noted in 1p/19q non-codeleted cases (n=512; AUC 0.888, 95% CI: 0.856–0.919) versus codeleted tumors (n=49; AUC 0.728, 95% CI: 0.581–0.868; P=0.045).
These findings reveal two critical insights: first, the minimal performance variation across demographic characteristics (AUC difference <0.02) confirms the absence of algorithmic bias, supporting equitable clinical implementation. Second, the observed performance differences in molecular subgroups likely reflect the inherent biological heterogeneity of gliomas rather than model limitations, as IDH-mutant and 1p/19q-codeleted tumors represent biologically distinct entities with different imaging phenotypes and clinical behaviors.
Taken together, these subgroup results (Figure S1) confirm the model’s demographic-independent reliability and its sustained, albeit variable, discrimination across MGMT methylation, IDH mutation, and 1p/19q codeletion strata, supporting its consideration for routine clinical deployment in heterogeneous patient populations.
Model calibration and clinical utility
The calibration curves across all datasets showed excellent agreement between the predicted probabilities and observed frequencies (Figure 4A-4D). The XGBoost model demonstrated excellent calibration in the training set (Figure 4A), internal testing set (Figure 4B), external testing set 1 (Figure 4C), and external testing set 2 (Figure 4D). The Hosmer-Lemeshow test consistently yielded P values >0.05 for all cohorts, indicating adequate goodness-of-fit.
The DCA further confirmed the clinical utility of the XGBoost model (Figure 4E-4H). It provided a superior net benefit compared to the “treat all” or “treat none” strategies across threshold probabilities of 0.2–0.8 in all the datasets. The model achieved a maximum net benefit of 0.42 at a threshold probability of 0.45 (Figure 4E). The parallel coordinate plots of the evaluation metrics (Figure 4I-4L) visually summarized XGBoost’s consistent superiority in terms of accuracy, sensitivity, specificity, PPV, NPV, and F1 score.
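The net-benefit quantity underlying the DCA can be computed directly from a confusion matrix at each threshold probability. A minimal sketch (toy labels and probabilities, not the study cohort):

```python
def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at threshold probability pt:
    NB = TP/N - FP/N * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """'Treat all' reference strategy: every patient classified positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

y = [1, 1, 1, 0, 0, 0, 1, 0]
p = [0.9, 0.8, 0.6, 0.55, 0.3, 0.2, 0.7, 0.1]
print(round(net_benefit(y, p, 0.5), 3), round(net_benefit_treat_all(y, 0.5), 3))  # 0.375 0.0
```

Sweeping the threshold over a grid and plotting the model's net benefit against the "treat all" and "treat none" (NB = 0) references reproduces the decision curves in Figure 4E-4H.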
Heatmap analysis of model variables
A heatmap was generated to visually represent the XGBoost model’s performance in predicting glioma grade (Figure 5). This visualization displayed the distributions of the actual values of various predictive factors (e.g., age, WHO grade, and IDH mutation status) across different patients, alongside the XGBoost predicted value and the actual outcome. The heatmap revealed the model’s ability to differentiate between LGG and HGG patients, demonstrating consistent performance across training and testing sets. This showed the model’s good prediction accuracy and generalizability, suggesting its utility for early clinical identification of high-risk glioma patients.
Model interpretability analysis
To elucidate the “black-box” nature of the ML model and enhance clinical trust, we employed the SHAP method to provide both global and local interpretability for the optimal XGBoost model’s predictions (Figure 6). Global feature importance was quantified using the mean absolute SHAP values, revealing the key drivers of the model’s overall predictions. The SHAP summary bar plot (Figure 6A,6B) identified log_sigma_5_0_mm_3D_firstorder_RootMeanSquared as the most influential predictor (mean |SHAP| =1.066), followed by log_sigma_5_0_mm_3D_glszm_SmallAreaLowGrayLevelEmphasis (mean |SHAP| =0.235) and wavelet_LLH_glszm_LargeAreaHighGrayLevelEmphasis (mean |SHAP| =0.133). These findings underscore the model’s reliance on quantitative measures of spatial heterogeneity and textural complexity, which are known to be correlated with tumor biology.
At the individual patient level, the SHAP analysis provided transparent, case-specific explanations crucial for clinical decision-making. The waterfall plot (Figure 6C) detailed how each feature contributed to a single patient’s prediction. For instance, in a correctly classified LGG case, the log_sigma_5_0_mm_3D_firstorder_RootMeanSquared feature contributed a SHAP value of −1.14, thereby lowering the model’s output probability, while the wavelet_LHH_firstorder_Skewness feature contributed positively (+0.15), demonstrating the model’s ability to integrate competing evidence. This process was further visualized by a force plot (Figure 6D), which showed the cumulative feature contributions pushing the prediction from a baseline to a final output. Finally, dependence plots (Figure 6E) revealed the nuanced, often non-linear relationships between individual feature values and their impact on the model’s output, identifying specific thresholds where a feature’s influence shifts toward predicting an LGG tumor.
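The waterfall and force plots both visualize SHAP’s additivity property: the model output for a patient equals a baseline (the expected output) plus the sum of all per-feature contributions. The sketch below illustrates this with hypothetical numbers loosely modeled on the LGG case described above; the baseline and the aggregated “other features” term are assumptions for illustration.

```python
import math

# Hypothetical local explanation for one patient, on the log-odds scale.
base_value = 0.40                 # assumed expected model output (baseline)
contributions = {
    "log_sigma_5_0_mm_3D_firstorder_RootMeanSquared": -1.14,  # pushes toward LGG
    "wavelet_LHH_firstorder_Skewness": 0.15,                  # pushes toward HGG
    "other_features_combined": -0.21,                         # assumed remainder
}

# Additivity: prediction = base value + sum of SHAP contributions.
logit = base_value + sum(contributions.values())
probability = 1 / (1 + math.exp(-logit))   # log-odds -> predicted HGG probability
```

Because the negative contributions outweigh the positive one, the final probability falls below 0.5, i.e., the model classifies the case as LGG, which is exactly the cumulative “push” a force plot renders graphically.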
Discussion
This multicenter study successfully developed and validated an interpretable radiomics-based ML model for the automated WHO grading of gliomas. Our XGBoost model showed robust performance across multiple independent cohorts, achieving excellent discrimination with AUCs ranging from 0.834 to 0.983, good calibration, and significant clinical utility. The integration of the SHAP analysis provides unprecedented interpretability for clinical decision-making.
Our model’s performance (AUC =0.834–0.983) compares favorably to that of recent studies while addressing their critical limitations. Cho et al. developed a model with an AUC of 0.89 using 285 patients, but it lacked external validation (30). Sun et al.’s model achieved similar internal performance (AUC =0.91) but was not validated across multiple centers (31). Consistent with our findings, recent studies have demonstrated the value of radiomics-based ML in predicting molecular markers and survival outcomes in HGG patients, achieving comparable performance metrics while emphasizing the importance of model interpretability for clinical translation (32). Sudre et al. developed a dynamic susceptibility contrast (DSC)-MRI radiomics model with an AUC of 0.85, but it was limited to single-sequence analysis (33). Importantly, our study addressed the critical gap of external validation across geographically diverse populations, maintaining model performance across three independent cohorts representing different healthcare systems and imaging protocols. Unlike previous studies that focused primarily on discrimination metrics, we comprehensively evaluated calibration, clinical utility, and interpretability. Our DCA showed a clinically meaningful net benefit, while our SHAP analysis provides the first detailed explanation of feature contributions in glioma grading models, addressing the “black-box” criticism that has limited the clinical application of such models (34).
The predominance of LoG and wavelet-transformed features in our model reflects underlying tumor biology. HGGs exhibit increased cellular density, enhanced angiogenesis, and greater spatial heterogeneity, which manifest as distinct textural patterns captured by these mathematical transformations (35,36). The log_sigma_5_0_mm_3D_firstorder_RootMeanSquared feature, which was identified as the most important predictor, quantifies intensity variation in tumor regions, correlating with the increased metabolic activity and cellular proliferation characteristic of grade IV glioblastomas (37).
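To make the top-ranked feature concrete, the sketch below shows how a quantity like log_sigma_5_0_mm_3D_firstorder_RootMeanSquared is derived: apply a Laplacian-of-Gaussian filter to the volume, then compute the first-order root-mean-square over tumor-mask voxels. The volume and mask are synthetic, and sigma is given in voxels (the 5 mm scale assumes 1 mm isotropic spacing); in the actual pipeline PyRadiomics handles resampling, filtering, and masking.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# Toy contrast-enhanced T1 sub-volume and tumor ROI (synthetic data).
rng = np.random.default_rng(0)
volume = rng.normal(loc=100.0, scale=10.0, size=(32, 32, 32))
mask = np.zeros(volume.shape, dtype=bool)
mask[8:24, 8:24, 8:24] = True

# Laplacian-of-Gaussian filter; sigma = 5 voxels (~5 mm at 1 mm spacing,
# an assumption for this sketch). Larger sigma emphasizes coarser texture.
filtered = gaussian_laplace(volume, sigma=5.0)

# First-order RootMeanSquared over the filtered ROI voxels: a summary of
# intensity variation magnitude within the tumor region.
roi = filtered[mask]
root_mean_squared = np.sqrt(np.mean(roi ** 2))
```

Higher values of this statistic correspond to stronger blob-like intensity variation at the chosen scale, which is the heterogeneity signal the model exploits.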
Wavelet decomposition captures multi-scale texture information: each letter of a sub-band label denotes low-pass (L) or high-pass (H) filtering along one spatial axis, so sub-bands such as high-high-low (HHL) and low-high-high (LHH) isolate high-frequency detail along different spatial directions. The prominence of correlation and small dependence emphasis features suggests that HGGs demonstrate more organized local texture patterns despite their overall heterogeneity, potentially reflecting neoangiogenesis and cellular organization changes during malignant transformation (38).
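The sub-band naming can be demonstrated with a minimal one-level Haar decomposition (PyRadiomics uses a smoother wavelet by default, and axis-to-letter conventions vary between toolkits, so this is an illustrative sketch rather than the study’s exact transform):

```python
import numpy as np

def haar_step(x, axis, band):
    """One-level Haar filtering along one axis: 'L' = low-pass (pairwise sum),
    'H' = high-pass (pairwise difference), each with dyadic downsampling."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    out = (even + odd) / np.sqrt(2) if band == "L" else (even - odd) / np.sqrt(2)
    return np.moveaxis(out, 0, axis)

def wavelet_subband(volume, bands):
    """Apply 'L'/'H' filters along the three axes, e.g. bands='LLH' or 'LHH'."""
    out = volume
    for axis, band in enumerate(bands):
        out = haar_step(out, axis, band)
    return out

# Toy 3D image with a simple intensity ramp.
volume = np.arange(64, dtype=float).reshape(4, 4, 4)
llh = wavelet_subband(volume, "LLH")  # low-pass on two axes, high-pass on one
lhh = wavelet_subband(volume, "LHH")  # low-pass on one axis, high-pass on two
```

Radiomic texture features (GLSZM, GLDM, first-order statistics) are then recomputed on each sub-band image, which is how names like wavelet_LLH_glszm_LargeAreaHighGrayLevelEmphasis arise.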
A SHAP analysis was conducted to address a critical barrier to clinical adoption of ML models—the “black-box” problem. By providing both global and local explanations, our model enables clinicians to understand which imaging features contribute to specific predictions, building trust and facilitating clinical decision-making (26). Recent advances in interpretable ML for glioma analysis have further validated the clinical importance of such transparent approaches (32). This interpretability is particularly important in medical applications where understanding the reasoning behind predictions is crucial for patient care and medicolegal considerations.
The waterfall plots provide case-specific explanations that enhance model transparency and support clinical decision-making by visualizing individual feature contributions to each prediction (39). This approach has been shown to improve clinician confidence in artificial intelligence-assisted diagnosis across various medical imaging applications.
Our model addresses a genuine clinical need, as preoperative glioma grading accuracy directly impacts treatment planning. The 82.3–94.3% accuracy of our model across the validation cohorts represents a substantial improvement over conventional imaging assessment (accuracy: 55–77%) (10,11). The maintained performance of our model across different institutions and patient populations suggests potential for broad clinical deployment.
Our results demonstrated that radiomics features can effectively capture the imaging phenotype associated with different WHO grades, which correlates with underlying molecular characteristics. The relationship between radiomics features and molecular markers such as IDH mutations, 1p/19q codeletion, and MGMT methylation status represents an important area for future investigation. The integration of radiomics with molecular data may further improve prediction accuracy and provide insights into tumor biology. Despite robust multicenter validation, several critical barriers remain before clinical deployment. First, MRI protocol standardization is paramount, as cross-institutional heterogeneity remains the principal constraint on generalizability. Although analyses across TCGA, UCSF, and NTUA indicate protocol-agnostic performance, systematic harmonization is still required through consensus guidelines for radiomics-optimized acquisition parameters and domain-adaptation strategies. Second, automated segmentation is essential to mitigate workflow bottlenecks posed by manual delineation; near-expert performance can be achieved by deploying validated deep learning architectures trained on diverse datasets, complemented by semi-automated refinement tools and quality-control algorithms that flag cases requiring human review. Finally, intuitive user interfaces are needed to present predictions, confidence levels, and case-specific explanations clearly, alongside continuous feedback that iteratively improves usability and strengthens clinical integration.
Future research directions include: (I) the integration of multiparametric MRI sequences [diffusion-weighted imaging (DWI), perfusion, and fluid-attenuated inversion recovery (FLAIR)] to capture additional biological information; (II) prospective validation in real-time clinical settings; (III) model expansion to predict treatment response and survival outcomes; and (IV) the investigation of radiomics-molecular marker interactions for personalized medicine applications.
We acknowledge several limitations in this study. First, while our multicenter design mitigates single-institution biases, the retrospective nature of this study might have introduced selection bias. The variation in MRI protocols across centers, while reflecting real-world conditions, might have introduced technical heterogeneity that could affect model generalization. Larger prospective validation studies across diverse populations are needed to confirm model generalizability. Second, the manual tumor segmentation, while performed by experienced radiologists with excellent inter-rater reliability [intraclass correlation coefficient (ICC) >0.90], introduced potential subjectivity that automated segmentation could address. However, current automated segmentation methods for gliomas remain imperfect, particularly those for non-enhancing tumors. Third, our analysis focused on contrast-enhanced T1-weighted imaging; the integration of multiparametric MRI sequences (DWI, perfusion, and FLAIR) may further enhance model performance but would also increase model complexity and computational requirements. A single-sequence analysis was chosen because it balanced performance with clinical practicality. Fourth, molecular marker integration was limited by data availability across cohorts, but represents an important future direction. The 2021 WHO classification system emphasizes molecular characteristics, and future models should incorporate these markers alongside imaging features for optimal accuracy. Finally, our retrospective design may not fully capture the challenges of prospective implementation, and the clinical impact of using model predictions on treatment decisions and patient outcomes should be evaluated in prospective interventional studies.
Conclusions
In this multicenter, retrospective study, an interpretable radiomics-based machine-learning model for pre-operative WHO grading of gliomas was developed and externally validated. The model exhibited robust discrimination and calibration across both internal and geographically diverse international cohorts, substantially outperforming conventional radiological assessment. SHAP analysis revealed that intratumoral texture heterogeneity features were the dominant predictors, offering transparent explanations that mitigate the “black-box” concern surrounding artificial intelligence. Prospective, workflow-embedded studies are now warranted to confirm clinical utility, determine effects on treatment decisions and patient outcomes, and minimize unnecessary interventions.
Acknowledgments
We would like to express our sincere appreciation to our colleagues and the staff of the Department of Radiotherapy at The Affiliated Hospital of Nantong University for their invaluable support. Our gratitude also extends to the OnekeyAI platform for its technical assistance. We are profoundly grateful to the patients and investigators who contributed data to TCGA and the UCSF, whose participation was indispensable to this work.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/rc
Data Sharing Statement: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/dss
Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/prf
Funding: This work was supported by grants from
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-2024/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Institutional Review Board of Nantong University (No. 2024-L215). Given the retrospective nature and use of de-identified data, the requirement of informed consent was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Ostrom QT, Patil N, Cioffi G, et al. CBTRUS Statistical Report: Primary Brain and Other Central Nervous System Tumors Diagnosed in the United States in 2013-2017. Neuro Oncol 2020;22:iv1-iv96. [Crossref] [PubMed]
- Siegel RL, Miller KD, Fuchs HE, et al. Cancer statistics, 2022. CA Cancer J Clin 2022;72:7-33. [Crossref] [PubMed]
- Louis DN, Perry A, Wesseling P, et al. The 2021 WHO Classification of Tumors of the Central Nervous System: a summary. Neuro Oncol 2021;23:1231-51. [Crossref] [PubMed]
- Weller M, van den Bent M, Preusser M, et al. EANO guidelines on the diagnosis and treatment of diffuse gliomas of adulthood. Nat Rev Clin Oncol 2021;18:170-86. [Crossref] [PubMed]
- Sahm F, Brandner S, Bertero L, et al. Molecular diagnostic tools for the World Health Organization (WHO) 2021 classification of gliomas, glioneuronal and neuronal tumors; an EANO guideline. Neuro Oncol 2023;25:1731-49. [Crossref] [PubMed]
- Antonelli M, Poliani PL. Adult type diffuse gliomas in the new 2021 WHO Classification. Pathologica 2022;114:397-409. [Crossref] [PubMed]
- Stupp R, Mason WP, van den Bent MJ, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med 2005;352:987-96. [Crossref] [PubMed]
- Buckner JC, Shaw EG, Pugh SL, et al. Radiation plus Procarbazine, CCNU, and Vincristine in Low-Grade Glioma. N Engl J Med 2016;374:1344-55. [Crossref] [PubMed]
- Renovanz M, Hickmann AK, Nadji-Ohl M, et al. Health-related quality of life and distress in elderly vs. younger patients with high-grade glioma-results of a multicenter study. Support Care Cancer 2020;28:5165-75. [Crossref] [PubMed]
- Lasocki A, Gaillard F. Non-Contrast-Enhancing Tumor: A New Frontier in Glioblastoma Research. AJNR Am J Neuroradiol 2019;40:758-65. [Crossref] [PubMed]
- Villanueva-Meyer JE, Mabray MC, Cha S. Current Clinical Brain Tumor Imaging. Neurosurgery 2017;81:397-415. [Crossref] [PubMed]
- Zikou AK, Alexiou GA, Kosta P, et al. Diffusion tensor and dynamic susceptibility contrast MRI in glioblastoma. Clin Neurol Neurosurg 2012;114:607-12. [Crossref] [PubMed]
- Danchaivijitr N, Waldman AD, Tozer DJ, et al. Low-grade gliomas: do changes in rCBV measurements at longitudinal perfusion-weighted MR imaging predict malignant transformation? Radiology 2008;247:170-8. [Crossref] [PubMed]
- Patel P, Baradaran H, Delgado D, et al. MR perfusion-weighted imaging in the evaluation of high-grade gliomas after treatment: a systematic review and meta-analysis. Neuro Oncol 2017;19:118-27. [Crossref] [PubMed]
- Zhang B, Chang K, Ramkissoon S, et al. Multimodal MRI features predict isocitrate dehydrogenase genotype in high-grade gliomas. Neuro Oncol 2017;19:109-17. [Crossref] [PubMed]
- Park JE, Kim HS, Park SY, et al. Prediction of Core Signaling Pathway by Using Diffusion- and Perfusion-based MRI Radiomics and Next-generation Sequencing in Isocitrate Dehydrogenase Wild-type Glioblastoma. Radiology 2020;294:388-97. [Crossref] [PubMed]
- Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell 2019;1:206-15. [Crossref] [PubMed]
- Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56. [Crossref] [PubMed]
- Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 2016;278:563-77. [Crossref] [PubMed]
- Lambin P, Rios-Velazquez E, Leijenaar R, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:441-6. [Crossref] [PubMed]
- Kumar V, Gu Y, Basu S, et al. Radiomics: the process and the challenges. Magn Reson Imaging 2012;30:1234-48. [Crossref] [PubMed]
- Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006. [Crossref] [PubMed]
- Fathi Kazerooni A, Kraya A, Rathi KS, et al. Multiparametric MRI along with machine learning predicts prognosis and treatment response in pediatric low-grade glioma. Nat Commun 2025;16:340. [Crossref] [PubMed]
- Liu Z, Wang S, Dong D, et al. The Applications of Radiomics in Precision Diagnosis and Treatment of Oncology: Opportunities and Challenges. Theranostics 2019;9:1303-22. [Crossref] [PubMed]
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017:4765-74.
- Xia X, Wu W, Tan Q, et al. Interpretable Machine Learning Models for Differentiating Glioblastoma From Solitary Brain Metastasis Using Radiomics. Acad Radiol 2025;32:5388-400. [Crossref] [PubMed]
- Śledzińska P, Bebyn M, Szczerba E, et al. Glioma 2021 WHO Classification: The Superiority of NGS Over IHC in Routine Diagnostics. Mol Diagn Ther 2022;26:699-713. [Crossref] [PubMed]
- Zwanenburg A, Vallières M, Abdalah MA, et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping. Radiology 2020;295:328-38. [Crossref] [PubMed]
- Raptis S, Ilioudis C, Theodorou K. From pixels to prognosis: unveiling radiomics models with SHAP and LIME for enhanced interpretability. Biomed Phys Eng Express 2024;
- Cho HH, Lee SH, Kim J, et al. Classification of the glioma grading using radiomics analysis. PeerJ 2018;6:e5982. [Crossref] [PubMed]
- Sun P, Wang D, Mok VC, et al. Comparison of Feature Selection Methods and Machine Learning Classifiers for Radiomics Analysis in Glioma Grading. IEEE Access 2019;7:102010-20.
- Yu M, Liu J, Zhou W, et al. MRI radiomics based on machine learning in high-grade gliomas as a promising tool for prediction of CD44 expression and overall survival. Sci Rep 2025;15:7433. [Crossref] [PubMed]
- Sudre CH, Panovska-Griffiths J, Sanverdi E, et al. Machine learning assisted DSC-MRI radiomics as a tool for glioma classification by grade and mutation status. BMC Med Inform Decis Mak 2020;20:149. [Crossref] [PubMed]
- Yi F, Yang H, Chen D, et al. XGBoost-SHAP-based interpretable diagnostic framework for alzheimer's disease. BMC Med Inform Decis Mak 2023;23:137. [Crossref] [PubMed]
- Le NQK, Hung TNK, Do DT, et al. Radiomics-based machine learning model for efficiently classifying transcriptome subtypes in glioblastoma patients from MRI. Comput Biol Med 2021;132:104320. [Crossref] [PubMed]
- Kang D, Park JE, Kim YH, et al. Diffusion radiomics as a diagnostic model for atypical manifestation of primary central nervous system lymphoma: development and multicenter external validation. Neuro Oncol 2018;20:1251-61. [Crossref] [PubMed]
- Ellingson BM, Bendszus M, Boxerman J, et al. Consensus recommendations for a standardized Brain Tumor Imaging Protocol in clinical trials. Neuro Oncol 2015;17:1188-98. [Crossref] [PubMed]
- Zhou M, Scott J, Chaudhury B, et al. Radiomics in Brain Tumor: Image Assessment, Quantitative Feature Descriptors, and Machine-Learning Approaches. AJNR Am J Neuroradiol 2018;39:208-16. [Crossref] [PubMed]
- Borys K, Schmitt YA, Nauta M, et al. Explainable AI in medical imaging: An overview for clinical practitioners - Saliency-based XAI approaches. Eur J Radiol 2023;162:110787. [Crossref] [PubMed]
(English Language Editor: L. Huleatt)

