Explainable machine-learning prediction of overall and cancer-specific survival in adult triple-negative breast cancer using SEER: a comparative study of nomograms and random survival forests
Highlight box
Key findings
• Using Surveillance, Epidemiology, and End Results (SEER) data [2010–2015] from 27,256 adult women with triple-negative breast cancer (TNBC), this study developed and validated Cox-based nomograms and random survival forest (RSF) models to predict 1-, 3-, and 5-year overall survival (OS) and cancer-specific survival (CSS). RSF demonstrated superior discrimination and comparable or better net benefit in decision curve analysis. SHapley Additive exPlanations (SHAP) analysis identified N stage, T stage, and age as key predictors. Both models effectively stratified patients into high- and low-risk groups (log-rank P<0.001).
What is known and what is new?
• TNBC prognosis is heterogeneous, and traditional Cox-based nomograms offer interpretable survival predictions using clinicopathological variables.
• This study systematically compares RSF with Cox nomograms for TNBC using SEER data, showing RSF’s superior discrimination and decision-analytic utility. It introduces SHAP for interpretability of machine learning predictions with comprehensive time-dependent area under the curve, calibration, and decision curve analysis evaluations for both OS and CSS.
What is the implication, and what should change now?
• RSF models with SHAP interpretability may serve as primary tools for individualized TNBC prognostication, while nomograms remain useful for clinical communication. However, given the lack of external validation and the potential for overfitting, these models should not yet be implemented in practice without further testing in independent cohorts. Future work should focus on external validation and model recalibration.
Introduction
Triple-negative breast cancer (TNBC), defined by the absence of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 expression, constitutes a clinically aggressive and biologically heterogeneous subtype of breast cancer (1-4). Despite advances in multimodality therapy, outcomes in real-world practice remain highly variable, even among patients with apparently similar baseline clinicopathological profiles (5,6). This variability reflects not only differences in tumor burden and metastatic patterns captured by conventional staging, but also heterogeneity in patient factors, treatment delivery, and potentially complex and non-linear interactions among routinely recorded determinants (7-9). Consequently, clinicians often face uncertainty when translating multidimensional baseline information into individualized survival expectations that can support counseling, surveillance planning, and risk-adapted management (10,11).
In breast cancer research, traditional statistical models based on multivariate regression have established a central role in risk prediction, such as the Gail, Tyrer-Cuzick, and Breast Cancer Surveillance Consortium models (12), as well as in prognostic assessment, exemplified by Cox model-based nomograms. These models are widely adopted due to their favorable interpretability, relative simplicity, and clinical practicality, forming an essential foundation for current clinical decision-support systems (13). However, when applied to highly heterogeneous subtypes such as TNBC, which is characterized by complex biology and may involve substantial nonlinear interactions and time-varying effects, the inherent flexibility constraints of traditional linear or parametric models can limit their predictive performance (14).
Current prognostic assessment in TNBC is largely anchored to anatomical staging and a limited set of clinicopathological variables (15). While indispensable, these factors do not directly yield patient-level survival probabilities at clinically meaningful horizons (e.g., 1, 3, and 5 years), and they may under-represent non-linear effects or higher-order interactions (16). Cox proportional hazards regression and nomograms provide an interpretable framework that can be readily implemented; however, their functional form may constrain flexibility when predictor-outcome relationships are complex (17). In contrast, random survival forest (RSF) models can accommodate non-linearity and interactions without strong parametric assumptions, offering a potentially more powerful predictive approach (18). Yet, the clinical adoption of machine-learning survival models is frequently tempered by concerns regarding explainability, reproducibility, and whether incremental discrimination gains translate into tangible decision-analytic benefit (19).
Population-based registries provide an opportunity to develop prognostic models at scale that reflect real-world case mix and treatment patterns. The Surveillance, Epidemiology, and End Results (SEER) database offers broad coverage, standardized follow-up, and sufficient sample size to evaluate TNBC prognosis across clinically relevant strata (20). Therefore, this study aims to investigate, within the specific context where traditional predictive models face challenges in TNBC, whether machine learning approaches can offer predictive gains with potential clinical significance. Utilizing the SEER database, we systematically constructed a Cox-based nomogram and an RSF model and compared their predictive performance. The evaluation encompassed multiple dimensions, including discriminative ability, calibration, clinical decision utility, and model interpretability via SHapley Additive exPlanations (SHAP) analysis. Both overall survival (OS) and cancer-specific survival (CSS) were assessed. The objective was to determine whether the machine learning framework can provide superior tools and evidence for the individualized prognostic management of TNBC, building upon the traditional statistical foundation. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2462/rc) (21).
Methods
Data source and study design
This study was a retrospective, population-based cohort analysis using data from the SEER program in the United States (22). Breast cancer cases diagnosed between 2010 and 2015 were extracted using SEER*Stat. The study population was restricted to female patients aged ≥18 years with TNBC.
Study population: inclusion and exclusion criteria
Inclusion criteria
Patients meeting all of the following criteria were included: (I) diagnosis year between 2010 and 2015; (II) female sex and age ≥18 years at diagnosis; (III) primary malignant breast tumor; (IV) molecular subtype consistent with TNBC, i.e., estrogen receptor-negative, progesterone receptor-negative, and human epidermal growth factor receptor 2-negative (23); and (V) available follow-up information sufficient for survival analyses.
Exclusion criteria and data cleaning
A prespecified data cleaning workflow was applied. Cases were excluded if they: (I) had invalid follow-up time (e.g., follow-up time ≤0 or missing); (II) had missing or indeterminate outcome status; or (III) had missing, unknown, or evidently inconsistent coding in key covariates (e.g., core demographic variables, staging/metastatic information, and treatment variables). Extreme or special codes were standardized according to SEER field definitions and harmonized as missing values or retained as separate categories prior to model fitting, as appropriate (24).
Outcomes and follow-up definition
Study endpoints
The study endpoints were OS and CSS. OS was defined as the time (in months) from diagnosis to death from any cause or last follow-up (25). CSS was defined as the time from diagnosis to death attributable to breast cancer; deaths from other causes were treated as censored at the time of death under a cause-specific framework (26). Patients alive at last follow-up were censored at the date of last contact.
Calculation of median follow-up time and censoring rate
Follow‑up time was recorded in months. The overall censoring rate was defined as the proportion of censored subjects among all analyzed individuals. The event indicator was defined separately for each endpoint: for OS, death from any cause was treated as an event (status =1), and subjects who were alive or lost to follow-up were considered censored (status =0); for CSS, death due to cancer was treated as an event (status =1), while all other outcomes (alive, lost to follow-up, or death from non-cancer causes) were considered censored (status =0).
The median follow-up time was estimated using the reverse Kaplan-Meier method: censoring was regarded as the “event” [the survival object was constructed with Surv(time, 1 − status)], a Kaplan-Meier curve was fitted, and the median follow-up time was taken as the median of the resulting survival distribution, with the 95% confidence interval (CI) also reported.
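The reverse Kaplan-Meier idea can be illustrated with a minimal, dependency-free sketch. The toy follow-up times and status values below are hypothetical, and the median is taken as the first time at which the estimated curve falls to 0.5 or below, which is a simplifying assumption and may differ slightly from how R's survival package handles ties at exactly 0.5.

```python
def km_median(times, events):
    """Kaplan-Meier median: first time at which S(t) drops to 0.5 or below."""
    at_risk = len(times)
    surv = 1.0
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n_t = sum(1 for ti in times if ti == t)
        if d > 0:
            surv *= 1.0 - d / at_risk
            if surv <= 0.5:
                return t
        at_risk -= n_t  # subjects with time t leave the risk set
    return None  # median not reached


def reverse_km_median_followup(times, status):
    """Flip the indicator so that censoring becomes the 'event'."""
    flipped = [1 - s for s in status]
    return km_median(times, flipped)


# Hypothetical toy data: follow-up in months; status 1 = death, 0 = censored.
times = [6, 12, 20, 30, 45, 60, 72, 80]
status = [1, 0, 1, 0, 0, 1, 0, 0]

median_followup = reverse_km_median_followup(times, status)
censoring_rate = status.count(0) / len(status)
print(median_followup, censoring_rate)
```

In this toy example the deaths at 6, 20, and 60 months are treated as censored observations in the reversed analysis, so they shrink the risk set without lowering the curve.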
Covariate collection and preprocessing
Candidate prognostic covariates were extracted from SEER and grouped into three domains: (I) demographic characteristics (age and race); (II) tumor-related characteristics [histologic grade, American Joint Committee on Cancer (AJCC) tumor-node-metastasis (TNM) stage, and site-specific distant metastases including brain, bone, lung, and liver]; and (III) treatment-related variables (surgery, radiotherapy, and chemotherapy). Prior to modeling, variables were cleaned and harmonized according to prespecified rules. Multilevel categorical variables were factorized with consistent coding; variables containing special codes or “unknown/not applicable” entries were normalized based on SEER field definitions (either recoded as missing values or retained as separate categories). Rare levels were merged when clinically reasonable to mitigate model instability and potential separation. After the 7:3 split, factor levels were checked and harmonized between the training and test cohorts to ensure consistent encoding and prevent level mismatch during prediction.
Dataset splitting
The eligible cohort was randomly split into a training cohort and a test cohort at a 7:3 ratio for internal validation. The training cohort was used for model development and parameter estimation, whereas the test cohort was used for independent evaluation of generalizability. A fixed random seed was specified in the analysis scripts to ensure reproducibility.
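A seeded 7:3 split of this kind might look like the following sketch. The seed value and the helper name `split_train_test` are illustrative assumptions; the actual seed used in the analysis scripts is not reported.

```python
import random


def split_train_test(ids, train_fraction=0.7, seed=2024):
    """Shuffle with a fixed seed and cut at the requested fraction.

    The seed (2024) is a placeholder, not the study's value.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 27,256 eligible patients, identified here by hypothetical integer ids.
train_ids, test_ids = split_train_test(range(27256))
print(len(train_ids), len(test_ids))
```

With 27,256 records, rounding 70% gives 19,079 training and 8,177 test records, and rerunning with the same seed returns the same partition.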
Prognostic model development
Cox proportional hazards model and nomogram
Cox proportional hazards regression models were fitted in the training cohort separately for OS and CSS. Univariate Cox analyses were first performed to screen candidate variables, followed by multivariable Cox models to identify independent prognostic factors. Hazard ratios (HRs) and 95% CIs were reported. Based on the final multivariable models, nomograms were constructed to estimate 1-, 3-, and 5-year survival probabilities. The Cox linear predictor (LP) was extracted as an individual-level risk measure for downstream stratification analyses. The basic formula of the Cox proportional hazards model is as follows: $h(t \mid X) = h_0(t)\exp(\beta^{\top}X)$, where $h_0(t)$ is the baseline hazard, $X$ is the covariate vector, and $\beta$ is the vector of regression coefficients.
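To make the link between the linear predictor and a nomogram-style survival probability concrete, the sketch below uses the standard Cox relation S(t | X) = S0(t)^exp(LP). The two hazard ratios are the multivariable OS estimates quoted in this article (N3 and chemotherapy); the baseline survival values are hypothetical placeholders, since the fitted baseline is not reported, and the sketch assumes an uncentered linear predictor with all other covariates at their reference levels.

```python
import math

# Hazard ratios quoted in the text for the multivariable OS model.
hazard_ratios = {"n3": 3.977, "chemotherapy": 0.653}
coefficients = {name: math.log(hr) for name, hr in hazard_ratios.items()}

# HYPOTHETICAL baseline survival S0(t) at 12, 36, and 60 months for a
# patient with every covariate at its reference level.
baseline_survival = {12: 0.98, 36: 0.93, 60: 0.90}


def linear_predictor(covariates):
    """LP = sum of coefficient * covariate value (0/1 indicators here)."""
    return sum(coefficients[name] * covariates.get(name, 0) for name in coefficients)


def predicted_survival(covariates):
    """Cox relation: S(t | X) = S0(t) ** exp(LP)."""
    lp = linear_predictor(covariates)
    return {t: s0 ** math.exp(lp) for t, s0 in baseline_survival.items()}


patient = {"n3": 1, "chemotherapy": 1}  # N3 disease, received chemotherapy
lp = linear_predictor(patient)
survival = predicted_survival(patient)
print(round(lp, 3), {t: round(s, 3) for t, s in survival.items()})
```

Because the coefficients are log hazard ratios, exp(LP) for this patient is simply the product of the two hazard ratios.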
RSF
An RSF model was developed as a nonparametric machine-learning approach to capture potential non-linear effects and higher-order interactions. Individual risk scores were generated from the RSF model (e.g., derived from cumulative hazard or survival probability at prespecified time points). RSF and Cox-based models were trained and evaluated under the same training/test split to ensure comparability (27). The ensemble survival function is given by $S(t \mid X) = \frac{1}{B}\sum_{k=1}^{B} S_k(t \mid X)$, where $B$ is the number of trees and $S_k(t \mid X)$ is the survival estimate from the $k$-th tree.
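The ensemble formula stated above amounts to averaging per-tree survival curves at each time point. The sketch below follows that formula as written; the three per-tree curves are hypothetical numbers, and no claim is made about the internal aggregation used by any particular RSF software.

```python
def ensemble_survival(tree_curves):
    """Average per-tree survival estimates time point by time point."""
    n_trees = len(tree_curves)
    n_times = len(tree_curves[0])
    return [
        sum(curve[j] for curve in tree_curves) / n_trees
        for j in range(n_times)
    ]


# Hypothetical survival estimates from B = 3 trees at 1, 3, and 5 years
# for a single patient.
tree_curves = [
    [0.95, 0.80, 0.70],
    [0.90, 0.75, 0.60],
    [0.97, 0.85, 0.65],
]
ensemble = ensemble_survival(tree_curves)
print([round(s, 2) for s in ensemble])
```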
Model interpretability
To enhance transparency of the RSF model, SHAP was used to quantify feature contributions. SHAP values were computed to interpret predictions at the individual level and to summarize global feature importance and directionality, thereby identifying major drivers of prognosis (28).
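The quantity that SHAP estimates can be illustrated with an exact, brute-force Shapley computation on a toy model. Everything in this sketch is an assumption made for illustration: the risk function `toy_risk`, the feature names, and the use of a single reference patient as the baseline (practical tools typically average over a background dataset instead).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating every feature coalition."""
    features = list(instance)
    n = len(features)

    def value(subset):
        # Features in the coalition take the patient's value; others the baseline.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in features}
        return predict(mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(x):
    """HYPOTHETICAL risk score with an N stage by age interaction."""
    return 0.5 * x["n_stage"] + 0.3 * x["t_stage"] + 0.2 * x["n_stage"] * x["older"]


instance = {"n_stage": 3, "t_stage": 2, "older": 1}
baseline = {"n_stage": 0, "t_stage": 1, "older": 0}
phi = shapley_values(toy_risk, instance, baseline)
print({name: round(v, 3) for name, v in phi.items()})
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, and the interaction term is shared equally between the two features involved, which is one reason attribution-based rankings can differ from a ranking of main-effect hazard ratios.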
Model performance and clinical utility
Discrimination
Discrimination was quantified using Harrell’s C-index, with 95% CIs estimated via bootstrap resampling. Time-dependent receiver operating characteristic (ROC) curves were plotted, and areas under the curve (AUCs) were calculated at 1, 3, and 5 years to characterize time-varying discriminative performance.
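For readers unfamiliar with the statistic, a minimal pairwise implementation of Harrell's C-index is sketched below on hypothetical toy data. It assumes that a higher score means higher risk, counts tied scores as half-concordant, and omits the bootstrap step used for the confidence intervals.

```python
def harrell_c_index(times, events, risk_scores):
    """Proportion of comparable pairs in which the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event strictly
            # before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy data: months, event indicator, model risk score.
times = [5, 10, 15, 20, 25]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.6, 0.3, 0.1]

c_index = harrell_c_index(times, events, risk_scores)
print(c_index)
```

Here 7 of the 8 comparable pairs are ordered correctly, giving a C-index of 0.875.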
Calibration
Calibration was assessed at 1, 3, and 5 years using calibration curves. Patients were grouped by quantiles of predicted risk, and Kaplan-Meier estimates were used to obtain observed survival probabilities within each group. Predicted survival probabilities were compared against observed values in both cohorts.
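The grouping procedure described above can be sketched as follows. The data, the number of groups, and the helper names are hypothetical; the sketch pairs the mean predicted survival in each risk group with the Kaplan-Meier estimate at the chosen horizon.

```python
def km_survival_at(times, events, horizon):
    """Kaplan-Meier survival probability at a given time horizon."""
    surv = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        if t > horizon:
            break
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n_t = sum(1 for ti in times if ti == t)
        surv *= 1.0 - d / at_risk
        at_risk -= n_t
    return surv


def calibration_table(pred_surv, times, events, horizon, n_groups):
    """Return (mean predicted, KM observed) survival per quantile group."""
    order = sorted(range(len(pred_surv)), key=lambda i: pred_surv[i])
    size = len(order) // n_groups
    rows = []
    for g in range(n_groups):
        stop = (g + 1) * size if g < n_groups - 1 else len(order)
        idx = order[g * size:stop]
        mean_pred = sum(pred_surv[i] for i in idx) / len(idx)
        observed = km_survival_at(
            [times[i] for i in idx], [events[i] for i in idx], horizon
        )
        rows.append((mean_pred, observed))
    return rows


# Hypothetical predicted 5-year survival, follow-up (months), and status.
pred_surv = [0.30, 0.40, 0.50, 0.60, 0.80, 0.85, 0.90, 0.95]
times = [10, 20, 70, 30, 80, 40, 90, 100]
events = [1, 1, 0, 1, 0, 1, 0, 0]

rows = calibration_table(pred_surv, times, events, horizon=60, n_groups=2)
print([(round(p, 3), round(o, 3)) for p, o in rows])
```

Points lying close to the diagonal (predicted equal to observed) indicate good calibration; in this deliberately small example the lower-risk group is reasonably calibrated while the higher-risk group is not.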
Clinical utility
Clinical utility was evaluated using decision curve analysis (DCA) by quantifying net benefit across a range of clinically plausible threshold probabilities. Net benefit curves were compared with treat-all and treat-none strategies. DCA was performed at 1, 3, and 5 years.
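Net benefit at a threshold probability pt is commonly defined as TP/n − (FP/n) × pt/(1 − pt). The sketch below applies this definition to hypothetical data and, as a simplifying assumption, treats the outcome at the chosen horizon as a fully observed binary variable; survival-specific DCA additionally accounts for censoring.

```python
def net_benefit(predicted_risk, outcome, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    n = len(outcome)
    tp = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predicted_risk, outcome) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcome, threshold):
    """Net benefit when every patient is treated as high risk."""
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted event risks and observed binary outcomes.
predicted_risk = [0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
outcome = [1, 1, 0, 1, 0, 0, 0, 0]

nb_model = net_benefit(predicted_risk, outcome, threshold=0.25)
nb_all = net_benefit_treat_all(outcome, threshold=0.25)
nb_none = 0.0  # treating no one yields zero net benefit by definition
print(round(nb_model, 4), round(nb_all, 4), nb_none)
```

A model is considered useful at a given threshold when its net benefit exceeds both reference strategies, as it does in this toy example.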
Risk stratification
Risk stratification was performed using the median risk score in the training cohort as the cutoff, which was then applied unchanged to the test cohort. Kaplan-Meier curves were plotted to visualize survival differences between high- and low-risk groups, and differences were assessed using the log-rank test.
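The key point of this procedure is that the cutoff is learned once in the training cohort and reused unchanged. A minimal sketch with hypothetical risk scores follows; assigning scores equal to the cutoff to the low-risk group is an assumption, since the tie rule is not stated.

```python
from statistics import median


def assign_risk_groups(scores, cutoff):
    """Label scores above the cutoff as high risk, all others as low risk."""
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical risk scores (e.g., Cox linear predictor or RSF risk score).
train_scores = [0.2, 0.5, 0.9, 1.4, 2.0]
test_scores = [0.1, 0.9, 1.0, 3.0]

cutoff = median(train_scores)  # estimated in the training cohort only
train_groups = assign_risk_groups(train_scores, cutoff)
test_groups = assign_risk_groups(test_scores, cutoff)  # same cutoff, unchanged
print(cutoff, test_groups)
```

Recomputing the median in the test cohort would leak information about that cohort's score distribution into the grouping, which is why the training cutoff is carried over.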
Statistical analysis and software
Categorical variables were summarized as counts (percentages). Baseline characteristics between the training and test cohorts were compared using Pearson’s chi-square test, and Fisher’s exact test was used when any expected cell count was <5. All analyses were conducted in R (version 4.5.2). Cox regression and nomogram construction were performed using survival (version 3.8.6) and rms (version 8.1.0). Time-dependent ROC analyses were conducted using timeROC (version 0.4). RSF modeling was implemented with randomForestSRC (version 3.5.0). Model interpretability was performed using kernelshap (version 0.9.1) and shapviz (version 0.10.3). Unless otherwise specified, all statistical tests were two-sided, and P<0.05 was considered statistically significant.
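As an illustration of the baseline comparison, the sketch below computes Pearson's chi-square test for a single 2×2 table using only the standard library. The counts are the M stage cells reported in Table 1 (M0: 18,668 training and 8,008 test; M1: 411 training and 169 test). Applying Yates' continuity correction is an assumption about the settings used; with it, the P value comes out close to the reported 0.68.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]] (1 df)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    correction = 0.5 if yates else 0.0
    stat = sum(
        max(abs(o - e) - correction, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: M0, M1; columns: training, test (counts from Table 1).
stat, p_value = chi_square_2x2(18668, 8008, 411, 169)
print(round(stat, 3), round(p_value, 2))
```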
Ethics statement
SEER data are publicly available and de-identified. This study did not involve identifiable personal information and is generally exempt from institutional review board approval. Data use complied with the SEER data-use agreement and relevant regulations. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Results
Study population and baseline characteristics
After applying the prespecified inclusion and exclusion criteria, 27,256 female patients aged ≥18 years with TNBC were included for model development and internal validation (Figure 1). The cohort was randomly split into a training cohort (n=19,079) and a test cohort (n=8,177) at a 7:3 ratio. When OS was considered as the endpoint, 8,886 events occurred, with 18,370 censored cases, yielding a censoring rate of 67.4%. The median follow-up time estimated using the reverse Kaplan-Meier method was 118.0 months (95% CI: 117.0–118.0). For CSS, 5,584 events were observed, and 21,672 cases were censored, corresponding to a censoring rate of 79.5%. The median follow-up time estimated by the reverse Kaplan-Meier method was 112.0 months (95% CI: 111.0–112.0).
Baseline demographic, clinicopathological, metastatic, and treatment characteristics were well balanced between the two cohorts (Table 1). No statistically significant differences were observed for age categories (P=0.55), race (P=0.20), grade (P=0.41), T stage (P=0.22), N stage (P=0.95), M stage (P=0.68), brain/bone/lung/liver metastases (all P>0.05), surgery (P=0.60), radiotherapy (P=0.09), or chemotherapy (P=0.23).
Table 1
| Variables | Total | Training | Test | P |
|---|---|---|---|---|
| Age (years) | | | | 0.55 |
| <50 | 8,029 (29.5) | 5,621 (29.5) | 2,408 (29.4) | |
| 50–59 | 6,453 (23.7) | 4,532 (23.8) | 1,921 (23.5) | |
| 60–69 | 6,766 (24.8) | 4,695 (24.6) | 2,071 (25.3) | |
| 70–79 | 4,111 (15.1) | 2,888 (15.1) | 1,223 (15.0) | |
| 80–89 | 1,715 (6.3) | 1,206 (6.3) | 509 (6.2) | |
| ≥90 | 182 (0.7) | 137 (0.7) | 45 (0.6) | |
| Race | | | | 0.20 |
| White | 19,978 (73.3) | 14,056 (73.7) | 5,922 (72.4) | |
| Black | 5,089 (18.7) | 3,513 (18.4) | 1,576 (19.3) | |
| API | 2,019 (7.4) | 1,391 (7.3) | 628 (7.7) | |
| AIAN | 170 (0.6) | 119 (0.6) | 51 (0.6) | |
| Grade | | | | 0.41 |
| I | 581 (2.1) | 411 (2.2) | 170 (2.1) | |
| II | 4,776 (17.5) | 3,371 (17.7) | 1,405 (17.2) | |
| III | 21,852 (80.2) | 15,260 (80.0) | 6,592 (80.6) | |
| IV | 47 (0.2) | 37 (0.2) | 10 (0.1) | |
| T stage | | | | 0.22 |
| T0 | 14 (0.1) | 10 (0.1) | 4 (0.0) | |
| T1 | 12,689 (46.6) | 8,877 (46.5) | 3,812 (46.6) | |
| T2 | 11,419 (41.9) | 7,980 (41.8) | 3,439 (42.1) | |
| T3 | 2,107 (7.7) | 1,460 (7.7) | 647 (7.9) | |
| T4 | 1,027 (3.8) | 752 (3.9) | 275 (3.4) | |
| N stage | | | | 0.95 |
| N0 | 18,508 (67.9) | 12,948 (67.9) | 5,560 (68.0) | |
| N1 | 5,992 (22.0) | 4,194 (22.0) | 1,798 (22.0) | |
| N2 | 1,622 (6.0) | 1,134 (5.9) | 488 (6.0) | |
| N3 | 1,134 (4.2) | 803 (4.2) | 331 (4.0) | |
| M stage | | | | 0.68 |
| M0 | 26,676 (97.9) | 18,668 (97.8) | 8,008 (97.9) | |
| M1 | 580 (2.1) | 411 (2.2) | 169 (2.1) | |
| Brain metastasis | | | | 0.94 |
| Yes | 31 (0.1) | 21 (0.1) | 10 (0.1) | |
| No | 27,225 (99.9) | 19,058 (99.9) | 8,167 (99.9) | |
| Bone metastasis | | | | 0.34 |
| Yes | 206 (0.8) | 151 (0.8) | 55 (0.7) | |
| No | 27,050 (99.2) | 18,928 (99.2) | 8,122 (99.3) | |
| Lung metastasis | | | | 0.90 |
| Yes | 209 (0.8) | 145 (0.8) | 64 (0.8) | |
| No | 27,047 (99.2) | 18,934 (99.2) | 8,113 (99.2) | |
| Liver metastasis | | | | 0.49 |
| Yes | 120 (0.4) | 80 (0.4) | 40 (0.5) | |
| No | 27,136 (99.6) | 18,999 (99.6) | 8,137 (99.5) | |
| Surgery | | | | 0.60 |
| Yes | 26,930 (98.8) | 18,846 (98.8) | 8,084 (98.9) | |
| No | 326 (1.2) | 233 (1.2) | 93 (1.1) | |
| Radiotherapy | | | | 0.09 |
| Yes | 15,517 (56.9) | 10,925 (57.3) | 4,592 (56.2) | |
| No | 11,739 (43.1) | 8,154 (42.7) | 3,585 (43.8) | |
| Chemotherapy | | | | 0.23 |
| Yes | 20,567 (75.5) | 14,357 (75.3) | 6,210 (75.9) | |
| No | 6,689 (24.5) | 4,722 (24.7) | 1,967 (24.1) | |
Data are presented as n (%). P values were calculated using Pearson’s Chi-squared test to compare baseline characteristics between the training and test cohorts. AIAN, American Indian/Alaska Native; API, Asian or Pacific Islander; M, metastasis; N, node; T, tumor.
In the overall cohort, patients aged <50 years accounted for 29.5%, and those aged 50–69 years comprised approximately 48.5%. Most patients were White (73.3%), followed by Black patients (18.7%). Grade III disease predominated (80.2%). Regarding AJCC staging, T1 and T2 disease represented 46.6% and 41.9% of cases, respectively; 67.9% of patients had N0 disease, and 2.1% had M1 disease. Site-specific distant metastases were uncommon (brain 0.1%, bone 0.8%, lung 0.8%, liver 0.4%). Surgery was performed in 98.8% of patients, while radiotherapy and chemotherapy were recorded in 56.9% and 75.5%, respectively (Table 1).
Cox regression-identified prognostic factors and nomogram construction
Univariate and multivariable Cox proportional hazards models were fitted in the training cohort for OS and CSS (Figure 2A-2D). In univariate analyses, increasing age was associated with progressively higher hazards for both OS and CSS (Figure 2A,2B). For OS, compared with age <50 years, the hazard increased across age strata: 50–59 years (HR =1.085, P=0.043), 60–69 years (HR =1.314, P<0.001), 70–79 years (HR =2.126, P<0.001), 80–89 years (HR =4.357, P<0.001), and ≥90 years (HR =9.308, P<0.001) (Figure 2A). A similar age gradient was observed for CSS (e.g., ≥90 years: HR =4.456, P<0.001; Figure 2B).
Tumor burden-related variables demonstrated strong associations with outcomes. In univariate analyses, N stage and M stage were strongly linked to both OS and CSS (OS: N1–N3 HR =2.017–5.944; M1 HR =7.806; all P<0.001; CSS: N1–N3 HR =3.055–10.098; M1 HR =10.689; all P<0.001) (Figure 2A,2B). Site-specific distant metastases (brain, bone, lung, and liver) were each associated with increased hazards for OS and CSS (all P<0.001) (Figure 2A,2B).
In multivariable models, age and nodal status remained independently associated with prognosis (Figure 2C,2D). For OS, age continued to show a dose-response relationship (80–89 years: HR =3.602, P<0.001; ≥90 years: HR =5.716, P<0.001) and N3 remained a strong predictor (HR =3.977, P<0.001) (Figure 2C). For CSS, similar patterns were observed (≥90 years: HR =2.802, P<0.001; N3: HR =5.834, P<0.001) (Figure 2D). The effect of M1 attenuated after adjustment for site-specific metastasis indicators, yet remained significant for both endpoints (OS: HR =1.771, P<0.001; CSS: HR =2.005, P<0.001), supporting partially overlapping but non-redundant prognostic information between AJCC M classification and metastasis-site indicators. Site-specific metastases remained significant in adjusted models (e.g., OS: brain HR =2.114, liver HR =2.914; CSS: brain HR =2.890, liver HR =2.559; all P≤0.002) (Figure 2C,2D).
Treatment-related variables were independently associated with improved survival. Surgery showed a consistent protective association (OS: HR =0.684, P<0.001; CSS: HR =0.684, P<0.001). Radiotherapy and chemotherapy also remained associated with reduced hazards after adjustment (OS: radiotherapy HR =0.863 and chemotherapy HR =0.653; CSS: radiotherapy HR =0.866 and chemotherapy HR =0.770; all P<0.001) (Figure 2C,2D). Race was not a stable independent predictor across models; however, Asian/Pacific Islander race was associated with a lower hazard in the multivariable OS model (HR =0.658, P=0.01) (Figure 2C).
Nomograms were then constructed based on the final multivariable Cox models to estimate individual 1-, 3-, and 5-year survival probabilities for OS and CSS (Figure 3). The Cox LP was extracted as an individual-level risk measure for downstream stratification.
Nomogram performance and risk stratification
The Cox-based nomogram demonstrated robust discrimination and time-dependent predictive performance in both the training and test cohorts (Figure 4). For OS, the nomogram achieved AUCs of 0.839, 0.786, and 0.773 at 1, 3, and 5 years in the training cohort (Figure 4A-4C) and 0.824, 0.793, and 0.761 in the test cohort (Figure 4D-4F). For CSS, the corresponding AUCs were 0.857, 0.814, and 0.796 in the training cohort (Figure 4G-4I) and 0.825, 0.802, and 0.776 in the test cohort (Figure 4J-4L). Furthermore, compared with the AJCC TNM staging system, the nomogram showed significantly improved discrimination for both OS and CSS. In the training cohort, the nomogram yielded higher AUCs than TNM staging at all three time points (OS: 0.839 vs. 0.773, 0.786 vs. 0.762, 0.773 vs. 0.733; CSS: 0.857 vs. 0.822, 0.814 vs. 0.799, 0.796 vs. 0.778). This advantage was maintained in the test cohort, supporting the added prognostic value of the nomogram.
The C-index further supports the discriminative ability of the models: for OS, the C-indices were 0.739 (95% CI: 0.733–0.745) in the training cohort and 0.735 (95% CI: 0.725–0.745) in the test cohort (Figure 5A); for CSS, the C-indices were 0.761 (95% CI: 0.754–0.769) and 0.750 (95% CI: 0.739–0.763), respectively (Figure 5B). Calibration plots showed good agreement between model-predicted survival probabilities and Kaplan-Meier observed survival probabilities at 1, 3, and 5 years in both cohorts, without obvious systematic overestimation or underestimation. Further quantitative assessment of model calibration indicated that the Integrated Calibration Index (ICI) for the Cox nomogram was below 0.02. The calibration slopes corresponding to the calibration curves at each time point are detailed in Table S1.
For clinical stratification, the median LP in the training cohort was used as the cutoff and applied unchanged to the test cohort. Kaplan-Meier curves demonstrated clear separation between high- and low-risk groups for both OS and CSS in the training and test cohorts (all log-rank P<0.001; Figure 6A-6D), supporting the nomograms’ utility for patient-level risk discrimination.
RSF modeling, SHAP interpretability, and comparative performance
The RSF model was trained using the same training and test cohort split as the Cox model to ensure direct comparability (Figure 4). For OS, the RSF model achieved AUC values of 0.907, 0.823, and 0.803 at 1, 3, and 5 years, respectively, in the training cohort, and 0.844, 0.798, and 0.769 in the test cohort (Figure 4A-4F). For CSS, the RSF model achieved AUC values of 0.933, 0.843, and 0.822 in the training cohort, and 0.845, 0.815, and 0.785 in the test cohort (Figure 4G-4L). Compared with the AJCC TNM staging system, the RSF model demonstrated improved discriminative ability. In the training cohort, the RSF model showed higher AUC values than the TNM staging system at all time points (OS: 0.907 vs. 0.773, 0.823 vs. 0.762, 0.803 vs. 0.733; CSS: 0.933 vs. 0.822, 0.843 vs. 0.799, 0.822 vs. 0.778). This advantage was maintained in the test cohort.
The calibration performance of the models was evaluated using calibration curves and related quantitative metrics. The calibration curves indicated good overall agreement between predicted and observed survival probabilities at all time points in both cohorts (Figure 7A,7B). The ICI values for the RSF model were below 0.02. The calibration slopes corresponding to all calibration curves are detailed in Table S1.
SHAP analyses provided complementary interpretability for RSF predictions (Figures 8,9). In the OS model (Figure 8), global importance ranked by mean absolute SHAP values highlighted N stage, T stage, and age as the leading contributors, followed by chemotherapy, M stage, radiotherapy, race, and grade (Figure 8A). The SHAP summary plot indicated that higher tumor burden levels generally aligned with larger SHAP values (higher predicted risk), whereas treatment variables tended to contribute in the risk-reducing direction at the population level (Figure 8B). At the individual level, waterfall/decision plots illustrated how feature contributions accumulated from the baseline prediction to the individual prediction for representative patients (OS: Figure 8C,8D), improving transparency of model outputs. The CSS model exhibited a similar contribution profile, with global feature importance (Figure 9A), SHAP summary plot (Figure 9B), and individual patient-level interpretations (Figure 9C,9D).
RSF-derived risk scores further enabled robust stratification. Using the median training score as the cutoff and applying it to the test cohort, high- and low-risk groups remained well separated for both OS and CSS (all log-rank P<0.001; Figure 10A-10D), supporting the transportability of RSF-based stratification within this internal validation framework.
Decision curve analysis (DCA) for clinical utility
To evaluate potential clinical utility, DCA was performed for OS and CSS at 1, 3, and 5 years in both cohorts, benchmarking against treat-all and treat-none strategies (Figure 11A-11L). Across a broad range of threshold probabilities, both the Cox nomogram and RSF model achieved higher net benefit than treat-all and treat-none, indicating that model-assisted risk stratification may provide added clinical value within corresponding threshold intervals.
For OS, RSF generally demonstrated higher or comparable net benefit relative to the Cox nomogram across most threshold probabilities in both the training cohort (Figure 11A-11C) and the test cohort (Figure 11D-11F), with consistent patterns across 1-, 3-, and 5-year time points. Similar findings were observed for CSS: in both training (Figure 11G-11I) and test cohorts (Figure 11J-11L), both models outperformed baseline strategies, and RSF tended to overlap with or exceed the Cox curve over much of the threshold range.
Taken together, the RSF framework demonstrates overall superior discrimination, SHAP-supported interpretability, reproducible risk stratification, and favorable decision-analytic net benefit. This coherent and clinically oriented evidence chain strengthens the argument for adopting RSF as a primary method for individualized prognostication in population-based TNBC settings, while Cox nomograms remain a valuable and interpretable companion tool.
Discussion
In this SEER-based, population-level cohort of women aged ≥18 years with TNBC, we developed and internally validated two prognostic modeling strategies for both OS and CSS: interpretable Cox proportional hazards-based nomograms and flexible RSF models. The predictive discrimination of the RSF model and the Cox nomogram was broadly comparable. DCA showed that the net benefit of the RSF model was similar to or higher than that of the nomogram across most threshold probabilities. Meanwhile, SHAP analysis provided transparent attribution of the key drivers underlying the predictions of the RSF model. Collectively, these findings support the feasibility of translating routinely captured SEER variables into clinically meaningful, patient-level risk estimates under an internal validation design.
At the clinical-feature level, the multivariable Cox results align with expected TNBC risk architecture (29). Age and tumor burden indicators—particularly nodal stage—remained independently associated with both OS and CSS, and metastasis-related variables contributed additional prognostic information beyond AJCC staging. The attenuation of M1 after incorporating site-specific metastasis indicators likely reflects shared information content between hierarchical staging and metastasis-site variables, rather than inconsistency, and supports retaining anatomically explicit metastasis descriptors for individualized risk estimation. Treatment variables (surgery, radiotherapy, and chemotherapy) were independently associated with lower hazards after adjustment; however, given the observational nature of registry data and limited granularity of treatment regimens, timing, and performance status, these associations should be interpreted as predictive correlates rather than causal effects.
Methodologically, our head-to-head comparison favors RSF as a primary predictive engine in this dataset. RSF achieved higher time-dependent AUCs than Cox nomograms in the training cohort for both OS and CSS and maintained competitive discrimination in the test cohort, consistent with its ability to capture non-linearities and interactions. Further systematic comparison of the prognostic drivers identified by the two models revealed a high degree of consensus regarding the most critical prognostic factors: for both the Cox model and the RSF model, N stage and primary tumor size (T stage) were consistently recognized as key variables for predicting both OS and CSS. This strongly indicates that, despite their methodological differences, both models captured the most fundamental clinicopathological basis determining TNBC prognosis. For other variables (such as specific metastatic sites and treatment modalities), the rankings of importance assigned by the two models differed. This discrepancy primarily stems from intrinsic methodological distinctions: the Cox model assesses pre-specified, independent linear effects (HRs), whereas the RSF model can automatically capture non-linear patterns and complex interactions among variables. Thus, the “importance” presented in the SHAP analysis reflects a variable’s overall contribution within the complex predictive network. A variable with a non-significant effect in the Cox model might be assigned higher importance in the RSF model due to its participation in critical interactions.
This methodological discrepancy also partly explains another key finding of this study: the RSF model demonstrated slightly superior discriminative performance on the training set, likely attributable to its ability to capture these complex non-linear patterns. However, by extension, this may also have led to overfitting on the training data and a relative decline in its generalization capability on the test set. Importantly, DCA extended beyond conventional accuracy metrics by indicating that both models improved net benefit over treat-all and treat-none strategies across broad threshold ranges at 1, 3, and 5 years, with RSF generally matching or exceeding the nomogram across much of the probability spectrum. Together, these results suggest that RSF may provide not only stronger predictive performance but also more favorable decision support under clinically plausible risk thresholds.
Interpretability analyses further strengthened the credibility of the RSF framework (30). SHAP global importance consistently prioritized nodal stage, T stage, and age, followed by treatment variables, aligning with Cox findings and providing convergent evidence across paradigms. At the individual level, waterfall/decision plots offered intuitive decompositions of how patient-specific features shifted predictions from baseline to individualized estimates, potentially facilitating clinician–patient communication and model auditing. Nevertheless, SHAP values represent model-based attributions rather than causal effects and should be interpreted accordingly (31), especially when predictors are correlated or hierarchically related (e.g., AJCC M stage and site-specific metastasis indicators) (32).
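The attribution logic behind such plots can be illustrated with a small, self-contained sketch. The code below computes exact Shapley values by brute-force enumeration for a hypothetical three-variable risk score; it is not the tree-based SHAP procedure applied to the RSF model in this study, and the function `toy_risk`, its coefficients, and the binary predictors are assumptions chosen only for illustration. Features outside a coalition are replaced by a reference value, which is one of several possible conventions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature coalitions.

    Features not in a coalition take their value from `baseline`.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for j in names:
        others = [k for k in names if k != j]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi[j] = total
    return phi


def toy_risk(p):
    """Hypothetical risk score: age matters only through an interaction with N stage."""
    return (0.10 + 0.20 * p["n_stage"] + 0.10 * p["t_stage"]
            + 0.15 * p["n_stage"] * p["age_over_65"])


patient = {"n_stage": 1, "t_stage": 1, "age_over_65": 1}
reference = {"n_stage": 0, "t_stage": 0, "age_over_65": 0}

phi = shapley_values(toy_risk, patient, reference)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
print(f"sum of attributions: {sum(phi.values()):.3f}")
print(f"prediction minus reference: {toy_risk(patient) - toy_risk(reference):.3f}")
```

In this toy setting the attributions add up to the difference between the patient's prediction and the reference prediction, and the age indicator receives a non-zero attribution even though it has no main effect, because the interaction term is shared between the two variables involved. This is the sense in which attributions describe the model rather than causal effects.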
Recent studies have shown that Cox proportional hazards models and machine learning models often exhibit similar predictive performance, with the relative ranking depending on the clinical setting and data type. In kidney transplantation, Truchot et al. found that the predictive performance of the Cox model for graft survival was comparable or even superior to that of machine learning models, indicating that traditional methods remain reliable in specific clinical scenarios (33). For survival prognosis in patients with nasopharyngeal carcinoma, machine learning methods such as RSF and support vector machines outperformed the Cox model, although the differences may be influenced by data structure and feature selection (34). For high-dimensional or complex data types (e.g., radiomics, multimodal data), machine learning models typically demonstrate superior performance, particularly in identifying complex patterns and integrating heterogeneous information (35-37). In addition, the Cox model has been reported to outperform RSF in terms of prediction error curves and AUC stability, for example in predicting mortality in patients with acute tubular necrosis. In summary, machine learning excels in complex pattern recognition, while the Cox model remains an important tool for clinical prediction because of its interpretability and stability with small sample sizes (38).
This study has several limitations. First, both the Cox nomogram and the RSF model were developed and validated using a single retrospective database, and their performance has not yet been tested in independent prospective cohorts; this lack of external validation is the primary constraint on their current generalizability. Second, no formal competing risk analysis was conducted. Although CSS was used as an endpoint to reduce interference from non-cancer deaths, potential misclassification in cause-of-death coding remains a concern. Third, the performance gap between the training and test sets for the RSF model suggests possible overfitting, which may limit its robustness in clinical practice. Fourth, estimates for certain specific metastatic sites may be unstable because of their relatively low frequency in the cohort. Finally, the use of a single predefined data split means that model stability was not further assessed through repeated cross-validation or temporal validation. Therefore, before any potential clinical application, rigorous external validation and necessary recalibration in independent, multicenter prospective cohorts reflecting contemporary treatment practice are required.
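As an illustration of the repeated cross-validation mentioned among the limitations, the following minimal sketch generates repeated K-fold train/test index sets using only the Python standard library. It was not part of this study; the function name `repeated_kfold_indices` and its defaults are hypothetical, and in practice an established implementation would normally be preferred and each split would be used to refit and re-evaluate both models.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated K-fold resampling."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx


# Hypothetical example: 20 patients, 5 folds, 2 repeats.
splits = list(repeated_kfold_indices(n_samples=20, n_splits=5, n_repeats=2))
for repeat, fold, train_idx, test_idx in splits[:2]:
    print(f"repeat={repeat} fold={fold} n_train={len(train_idx)} n_test={len(test_idx)}")
```

The spread of a performance metric (for example, time-dependent AUC) across such splits gives a rough indication of how sensitive the reported results are to the particular partition that was chosen.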
In summary, by integrating RSF flexibility, SHAP-based transparency, and decision-analytic evaluation, this study provides an internally validated and clinically oriented evidence chain for TNBC prognostication using routinely available SEER variables. The overall superiority of RSF in discrimination and its favorable net benefit profile support its potential as a primary modeling strategy, while Cox nomograms remain valuable as an interpretable companion tool for bedside communication and implementation.
Conclusions
In this study, we systematically developed and internally validated two prognostic models, Cox-based nomograms and RSF, using the SEER database to predict OS and CSS in adult patients with TNBC. Our findings demonstrate that RSF models generally outperformed traditional nomograms in terms of discriminative ability and net benefit in decision curve analysis. SHAP analysis enhanced model interpretability and identified nodal stage, T stage, and age as the core prognostic factors. Both models exhibited excellent calibration and effectively stratified patients into distinct risk groups. Although internally validated, these models require rigorous external validation and recalibration in independent prospective cohorts before clinical implementation. This study provides an evidence-based tool for individualized prognosis assessment in TNBC using routinely available clinicopathological variables.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2462/rc
Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2462/prf
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2462/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Kim J, Harper A, McCormack V, et al. Global patterns and trends in breast cancer incidence and mortality across 185 countries. Nat Med 2025;31:1154-62.
- Printz C. Female breast cancer most commonly diagnosed cancer globally. Cancer 2021;127:1952-3.
- Zambelli A, Sgarra R, De Sanctis R, et al. Heterogeneity of triple-negative breast cancer: understanding the Daedalian labyrinth and how it could reveal new drug targets. Expert Opin Ther Targets 2022;26:557-73.
- Teng YT, Wang YA, Dong YH, et al. Five-year survival prognosis of young, middle-aged, and elderly adult female invasive breast cancer patients by clinical and lifestyle characteristics. Breast Cancer Res Treat 2024;205:619-31.
- Caswell-Jin JL, Callahan A, Purington N, et al. Treatment and Monitoring Variability in US Metastatic Breast Cancer Care. JCO Clin Cancer Inform 2021;5:600-14.
- Diao K, Andring LM, Barcenas CH, et al. Contemporary Outcomes After Multimodality Therapy in Patients With Breast Cancer Presenting With Ipsilateral Supraclavicular Node Involvement. Int J Radiat Oncol Biol Phys 2022;112:66-74.
- De Santis P, Perrone M, Guarini C, et al. Early-stage triple negative breast cancer: the therapeutic role of immunotherapy and the prognostic value of pathological complete response. Explor Target Antitumor Ther 2024;5:232-50.
- Punie K, Kurian AW, Ntalla I, et al. Unmet need for previously untreated metastatic triple-negative breast cancer: a real-world study of patients diagnosed from 2011 to 2022 in the United States. Oncologist 2025;30:oyaf034.
- Song P, Liu T, Zhang Y, et al. Traditional Chinese medicine in the treatment of breast Cancer. Mol Cancer 2025;24:209.
- Li H, Wang J, Ming X, et al. Comparison of machine learning and Cox regression models for prognostic analysis in hepatocellular carcinoma patients with distant metastasis. Surg Open Sci 2025;27:36-44.
- Zou H, Zeng D, Xiao L, et al. Bayesian inference and dynamic prediction for multivariate longitudinal and survival data. Ann Appl Stat 2023;17:2574-95.
- Brentnall AR, Harkness EF, Astley SM, et al. Mammographic density adds accuracy to both the Tyrer-Cuzick and Gail breast cancer risk models in a prospective UK screening cohort. Breast Cancer Res 2015;17:147.
- Wei C, Ai H, Mo D, et al. A nomogram based on inflammation and nutritional biomarkers for predicting the survival of breast cancer patients. Front Endocrinol (Lausanne) 2024;15:1388861.
- Jin Y, Zhao M, Su T, et al. Comparing Random Survival Forests and Cox Regression for Nonresponders to Neoadjuvant Chemotherapy Among Patients With Breast Cancer: Multicenter Retrospective Cohort Study. J Med Internet Res 2025;27:e69864.
- Li M, Roder D, D'Onise K, et al. Monitoring TNM stage of female breast cancer and survival across the South Australian population, with national and international TNM benchmarking: A population-based cohort study. BMJ Open 2020;10:e037069.
- Howlader N, Cronin KA, Kurian AW, et al. Differences in Breast Cancer Survival by Molecular Subtypes in the United States. Cancer Epidemiol Biomarkers Prev 2018;27:619-26.
- Hernández-Boluda JC, Mosquera-Orgueira A, Gras L, et al. Use of machine learning techniques to predict poor survival after hematopoietic cell transplantation for myelofibrosis. Blood 2025;145:3139-52.
- Scheffner I, Gietzelt M, Abeling T, et al. Patient Survival After Kidney Transplantation: Important Role of Graft-sustaining Factors as Determined by Predictive Modeling Using Random Survival Forest Analysis. Transplantation 2020;104:1095-107.
- Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Z Med Phys 2019;29:102-27.
- Duggan MA, Anderson WF, Altekruse S, et al. The Surveillance, Epidemiology, and End Results (SEER) Program and Pathology: Toward Strengthening the Critical Relationship. Am J Surg Pathol 2016;40:e94-e102.
- Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
- Qiu Y, Chen Y, Shen H, et al. Triple-negative breast cancer survival prediction: population-based research using the SEER database and an external validation cohort. Front Oncol 2024;14:1388869.
- Pinilla K, Drewett LM, Lucey R, et al. Precision Breast Cancer Medicine: Early Stage Triple Negative Breast Cancer-A Review of Molecular Characterisation, Therapeutic Targets and Future Trends. Front Oncol 2022;12:866889.
- Yang X, Vladmirovich RI, Georgievna PM, et al. Personalized chemotherapy selection for patients with triple-negative breast cancer using deep learning. Front Med (Lausanne) 2024;11:1418800.
- Song Z, Cheng L, Lu L, et al. Development and Validation of the Nomograms for Predicting Overall Survival and Cancer-Specific Survival in Patients With Synovial Sarcoma. Front Endocrinol (Lausanne) 2021;12:764571.
- Guo H, Nie G, Zhao X, et al. A nomogram for cancer-specific survival of lung adenocarcinoma patients: A SEER based analysis. Surg Open Sci 2024;22:13-23.
- Devaux A, Helmer C, Genuer R, et al. Random survival forests with multivariate longitudinal endogenous covariates. Stat Methods Med Res 2023;32:2331-46.
- Wang K, Liu J. Machine learning predictor to investigate treatment modalities and overall survival in HER2+ patients with early-stage breast cancer. Clinics (Sao Paulo) 2025;80:100818.
- Duan D, Yang X, Guo X, et al. Interaction of glaucocalyxin a with glutathione and thioredoxin reductase for triple-negative breast cancer treatment. Bioorg Chem 2025;161:108572.
- Jaeger BC, Welden S, Lenoir K, et al. Accelerated and Interpretable Oblique Random Survival Forests. J Comput Graph Stat 2024;33:192-207.
- Lundberg SM, Erion G, Chen H, et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell 2020;2:56-67.
- Sarica A, Aracri F, Bianco MG, et al. Explainability of random survival forests in predicting conversion risk from mild cognitive impairment to Alzheimer's disease. Brain Inform 2023;10:31.
- Truchot A, Raynaud M, Kamar N, et al. Machine learning does not outperform traditional statistical modelling for kidney allograft failure prediction. Kidney Int 2023;103:936-48.
- Xiao Z, Song Q, Wei Y, et al. Use of survival support vector machine combined with random survival forest to predict the survival of nasopharyngeal carcinoma patients. Transl Cancer Res 2023;12:3581-90.
- Liu L, Zeng C, Liu L, et al. Multimodal CT and MRI Radiomics Integrated with Clinical Models Predict Pathological Complete Response in ESCC Following Neoadjuvant Immunochemotherapy. Tomography 2025;11:130.
- Zhong H, Huang D, Wu J, et al. (18)F-FDG PET/CT based radiomics features improve prediction of prognosis: multiple machine learning algorithms and multimodality applications for multiple myeloma. BMC Med Imaging 2023;23:87.
- Song P, Song F, Shao T, et al. Natural products: promising therapeutics for targeting regulatory immune cells in the tumor microenvironment. Front Pharmacol 2024;15:1481850.
- Shamsutdinova D, Stamate D, Stahl D. Balancing accuracy and Interpretability: An R package assessing complex relationships beyond the Cox model and applications to clinical prediction. Int J Med Inform 2025;194:105700.

