Machine learning-based early survival prediction in early-onset hepatocellular carcinoma: a SEER-based multi-model comparative study

Huan Ma; Zhenpeng Zeng; Chenjie Xiao; Chunyan Li; Xiaosong Tan; Duanming Du; Ying Wu

doi:10.21037/tcr-2025-1-2808

Original Article

Huan Ma¹#, Zhenpeng Zeng¹#, Chenjie Xiao¹, Chunyan Li¹, Xiaosong Tan², Duanming Du¹, Ying Wu¹

¹Department of Interventional Therapy, Shenzhen Second People’s Hospital, The First Affiliated Hospital of Shenzhen University, Shenzhen, China; ²Department of Radiology, The Second Affiliated Hospital of Guangdong Medical University, Zhanjiang, China

Contributions: (I) Conception and design: All authors; (II) Administrative support: D Du, Y Wu; (III) Provision of study materials or patients: H Ma, Z Zeng; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: H Ma, Z Zeng, Y Wu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Duanming Du, MD; Ying Wu, MD. Department of Interventional Therapy, Shenzhen Second People’s Hospital, The First Affiliated Hospital of Shenzhen University, 3002 Huafu Street, Futian District, Shenzhen 518000, China. Email: ming19680101@163.com; wuyingncu@163.com.

Background: Early-onset hepatocellular carcinoma (eHCC) presents distinct clinical challenges; however, prognostic determinants remain inadequately characterized. This study aimed to develop and validate machine learning (ML) models to predict early survival in patients with eHCC and to identify key prognostic factors.

Methods: Patients diagnosed with eHCC were identified from the Surveillance, Epidemiology, and End Results (SEER) database [2010–2021] based on predefined inclusion criteria. The cohort was randomly partitioned into training (70%) and validation (30%) sets. Five ML algorithms were implemented: multilayer perceptron (MLP), logistic regression (LR), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). Additionally, a conventional model based on the American Joint Committee on Cancer (AJCC) staging system was constructed as a benchmark for comparison with the ML-based models. Model discrimination was assessed by the area under the receiver operating characteristic curve (AUC). Calibration curves and complementary metrics quantified predictive accuracy. The SHapley Additive exPlanations (SHAP) framework was employed to rank variable importance and interpret model predictions.

Results: A total of 1,922 patients with eHCC were included, of whom 1,345 were assigned to the training cohort and 577 to the validation cohort. Thirteen clinicopathological and treatment-related variables were incorporated into the analysis, including age, sex, race, pathological grade, surgical status, AJCC stage, SEER stage, tumor size, alpha-fetoprotein (AFP) level, marital status, radiotherapy, chemotherapy, and macrovascular invasion. The RF model demonstrated optimal performance in the validation cohort, achieving an AUC of 0.80, accuracy of 0.72, precision of 0.71, recall of 0.81, and F1 score of 0.75. This substantially outperformed the AJCC staging-based model with AUC of 0.66, representing a ΔAUC of 0.14 and 21.2% relative improvement. SHAP analysis identified surgical intervention as the dominant predictor in the RF model.

Conclusions: The RF model demonstrated robust discriminative ability for predicting early survival in eHCC patients. This validated model can facilitate risk stratification and inform individualized therapeutic decisions for patients with eHCC.

Keywords: Machine learning (ML); early-onset hepatocellular carcinoma (eHCC); Surveillance, Epidemiology, and End Results database (SEER database)

Submitted Dec 16, 2025. Accepted for publication Feb 04, 2026. Published online Mar 24, 2026.

Highlight box

Key findings

• A random forest (RF) prognostic model was developed by systematically comparing multiple machine learning (ML) algorithms. The model incorporated demographic characteristics, clinicopathological features, therapeutic interventions, and survival outcomes from early-onset hepatocellular carcinoma (eHCC) patients in the Surveillance, Epidemiology, and End Results (SEER) database [2010–2021]. The RF model achieved an area under the receiver operating characteristic curve (AUC) of 0.80 in validation, demonstrating strong predictive performance.

What is known and what is new?

• eHCC presents distinct clinical profiles compared to late-onset disease. Traditional prognostic models inadequately capture nonlinear relationships between predictive variables and survival outcomes. Currently, no validated ML tools exist for short-term survival prediction in eHCC populations.

• Five ML algorithms (multilayer perceptron, logistic regression, support vector machine, RF and extreme gradient boosting) were developed and validated using SEER data, with the RF model demonstrating optimal performance. For comparative analysis, a conventional prognostic model based on the American Joint Committee on Cancer staging system was also constructed to serve as a clinical benchmark against the ML-based approaches. SHapley Additive exPlanations analysis revealed surgical intervention as the dominant prognostic determinant. The developed RF model offers a translatable framework for early survival assessment in clinical settings.

What is the implication, and what should change now?

• A comprehensive analysis of demographic characteristics, clinicopathological features, treatment modalities, and survival outcomes in eHCC patients was conducted. Multiple ML algorithms were developed and systematically compared to identify the optimal predictive model for early survival assessment. This model enables clinicians to effectively identify high-risk eHCC patients, supporting personalized treatment planning, improving survival rates, and enhancing patient quality of life.

Introduction

Background

Early-onset hepatocellular carcinoma (eHCC) refers to hepatocellular carcinoma (HCC) diagnosed in patients aged ≤50 years. Some classification schemes adopt sex-specific age thresholds (≤40 years for females and ≤50 years for males). The global incidence of eHCC has steadily increased, now accounting for over 15% of all HCC diagnoses. This epidemiologic shift toward younger populations significantly increases disease burden among affected groups (1,2).

Clinical and pathological evidence suggests that eHCC exhibits greater biological aggressiveness compared to late-onset HCC. These tumors are typically larger, display increased postoperative recurrence and metastasis, and exhibit reduced capsule formation—characteristics associated with poorer survival outcomes (3,4). Wu documented increased susceptibility to microvascular invasion in eHCC, particularly in patients with larger tumors or advanced disease stages (5). Liu et al. reported that eHCC often arises in non-cirrhotic livers and presents with more prominent symptoms, such as abdominal distension, pain, and jaundice (6).

Current prognostic tools do not adequately address the unique clinical profile of eHCC. Traditional staging systems were predominantly developed from late-onset HCC cohorts. Therefore, these systems may inadequately identify risk factors specific to younger patients. Machine learning (ML) algorithms provide an advantage in modeling complex, nonlinear relationships between predictive variables, a capability currently underutilized for survival prediction in eHCC. Developing an eHCC-specific prognostic model could improve risk stratification and guide treatment selection in this growing patient population.

Rationale and knowledge gap

Survival analyses in eHCC populations have largely relied on traditional statistical methods. These methods exhibit limited performance with high-dimensional datasets containing nonlinear variable relationships and complex interactions (7-10). Kuang et al. and Zhang et al. previously developed Cox proportional hazards models for eHCC prognosis using SEER database records and institutional cohorts, respectively (8,11). However, conventional regression approaches assume proportional hazards over time and linear relationships between covariates and log-hazard ratios. Such assumptions are frequently violated in heterogeneous cancer populations.

ML algorithms overcome these limitations by employing data-driven pattern recognition. These approaches accommodate nonlinearity, high-dimensional feature spaces, and complex interaction effects without pre-specified functional forms (12-14). Despite these methodological advancements, no validated ML framework currently exists for early survival prediction in eHCC patients.

The eHCC population demonstrates distinct tumor biology, treatment responses, and survival trajectories compared to late-onset cases. Therefore, developing an eHCC-specific ML model addresses this unmet clinical need. Such a model would refine risk stratification and support evidence-based therapeutic decisions for younger HCC patients.

Objective

This study developed and validated ML models for predicting early survival in eHCC patients using the Surveillance, Epidemiology, and End Results (SEER) database [2010–2021]. Six models, including five ML algorithms—multilayer perceptron (MLP), logistic regression (LR), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost)—and a conventional American Joint Committee on Cancer (AJCC) staging-based model, were trained and compared to identify the optimal predictive model. The SHapley Additive exPlanations (SHAP) framework was applied to the best-performing algorithm to quantify variable contributions and identify primary prognostic determinants. Model validation was performed by evaluating discriminative ability using the area under the receiver operating characteristic curve (AUC), as well as calibration through calibration plots. The resulting prognostic tool aims to facilitate risk stratification and inform treatment decisions for younger HCC patients, addressing the current absence of validated prediction instruments specific to eHCC. We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2808/rc).

Methods

Data collection

Patient records were extracted from the SEER database for eHCC cases diagnosed between 2010 and 2021. Case identification followed International Classification of Diseases for Oncology, Third Edition (ICD-O-3) criteria: primary site code C22.0 (liver and intrahepatic bile duct) and histology codes 8170/3–8175/3 (hepatocellular carcinoma variants). Clinical data were collected from patients across approximately 17 U.S. administrative regions (15). Inclusion criteria were age ≥18 years at diagnosis, eHCC as the primary malignancy, and complete data for all candidate predictor variables. Patients aged <18 years (n=121), deaths from non-natural causes (n=109), and cases with missing critical data (n=1,837) were excluded, resulting in an analytic cohort of 1,922 patients. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Extracted variables included demographic characteristics (age, sex, race, marital status), tumor features (AJCC stage, SEER summary stage, pathological grade, tumor size, macrovascular invasion status), therapeutic interventions (surgery, radiotherapy, chemotherapy), and survival metrics (vital status, survival duration). Macrovascular invasion was defined according to SEER documentation standards as tumor infiltration involving major hepatic vessels (portal vein, hepatic vein, or inferior vena cava), confirmed through imaging or pathological assessment. Alpha-fetoprotein (AFP) levels at diagnosis were categorized using established clinical thresholds. The primary endpoint was 1-year overall survival, defined as a binary outcome (alive versus deceased from any cause at 12 months post-diagnosis).

Data cleaning

Missing values were addressed through a hybrid imputation strategy to ensure data integrity and analytical consistency. Categorical variables with missing data were imputed using mode imputation, while continuous variables were imputed using multivariate imputation by chained equations (MICE). Outliers in continuous variables were identified and managed according to clinical plausibility criteria.
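The hybrid strategy can be sketched as follows. This is a minimal illustration, not the authors' code: the toy arrays and column meanings are hypothetical, and scikit-learn's IterativeImputer is used here as a stand-in for MICE (it is inspired by chained equations but returns a single imputed dataset rather than multiple imputations).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical continuous predictors: age (years) and tumor size (mm).
continuous = np.array([
    [43.0, 60.0],
    [38.0, np.nan],
    [49.0, 120.0],
    [np.nan, 35.0],
    [45.0, 80.0],
])

# Hypothetical integer-coded categorical predictor (e.g., pathological grade).
categorical = np.array([[2.0], [2.0], [np.nan], [3.0], [2.0]])

# Mode imputation: missing categories take the most frequent observed value.
mode_imputer = SimpleImputer(strategy="most_frequent")
categorical_filled = mode_imputer.fit_transform(categorical)

# Chained-equation style imputation: each continuous column is regressed
# on the others and missing entries are filled iteratively.
chained_imputer = IterativeImputer(max_iter=10, random_state=0)
continuous_filled = chained_imputer.fit_transform(continuous)
```

In a real pipeline the imputers would be fitted on the training set only and then applied to the validation set, so that no information leaks from the held-out data.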

Data segmentation

To enhance model generalizability, the dataset was randomly partitioned into training (70%) and validation (30%) sets using stratified random sampling.
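A stratified 70/30 split of this kind can be sketched with scikit-learn. The feature matrix, outcome vector, and random seeds below are hypothetical placeholders; only the cohort size (1,922) and the split ratio come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 13 predictors and the binary 1-year outcome.
X = rng.normal(size=(1922, 13))
y = (rng.random(1922) < 0.55).astype(int)

# stratify=y keeps the proportion of survivors nearly identical in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

train_rate = y_train.mean()
val_rate = y_val.mean()
```

With 1,922 records and test_size=0.30, scikit-learn rounds the validation set up to 577 records, leaving 1,345 for training.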

Statistical analysis

Categorical variables were reported as frequencies (n) and percentages (%), and group differences were assessed using the Chi-squared test. Continuous variables (age, tumor size) followed a normal distribution; they were reported as mean ± standard deviation (SD) and compared between groups using independent-sample t-tests. Missing values (<30% for all variables) were handled using a differential imputation strategy: categorical variables were imputed using mode imputation, while continuous variables were imputed using MICE (Figure S1). Continuous variables were subsequently standardized using z-score transformation. Statistical analyses and visualizations were performed using Python (version 3.10) and R (version 4.2.1). Statistical significance was set at a two-tailed P<0.05.
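As an illustration of the group comparison for a categorical variable, the sketch below applies a Chi-squared test to the sex distribution reported in Table 1. It assumes SciPy and its default continuity correction for 2×2 tables; whether the authors applied that correction is not stated, so this is an assumption.

```python
from scipy.stats import chi2_contingency

# Counts from Table 1: rows are training and validation sets,
# columns are male and female patients.
sex_counts = [
    [989, 356],  # training set (N=1,345)
    [422, 155],  # validation set (N=577)
]

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(sex_counts)

balanced = p_value > 0.05  # no significant difference between the two sets
```

A P value well above 0.05 here indicates that the random split did not produce a detectable imbalance in sex between the two sets.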

Feature selection

Univariate feature selection was performed on the standardized training set using the SelectKBest algorithm with ANOVA F-tests to rank candidate predictors by discriminative capacity. The F-statistic measures the ratio of between-group variance to within-group variance for each feature, providing a parametric assessment of association strength with the binary survival outcome. Features with higher F-values indicated better separation between survival groups and were prioritized for model inclusion.
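The ranking step can be sketched with scikit-learn's SelectKBest and f_classif. The synthetic data and feature names below are hypothetical and serve only to show how F-scores order the predictors.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 400

# Hypothetical binary outcome (1 = alive at 12 months).
y = rng.integers(0, 2, size=n)

# One feature that tracks the outcome and two pure-noise features.
informative = y + rng.normal(scale=0.5, size=n)
noise = rng.normal(size=(n, 2))
X = np.column_stack([informative, noise])
feature_names = ["informative", "noise_a", "noise_b"]

# k="all" keeps every feature; only the F-scores are needed for ranking.
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)

# Sort features from highest to lowest F-statistic.
ranking = sorted(
    zip(feature_names, selector.scores_), key=lambda item: item[1], reverse=True
)
```

Because the F-test is univariate, it scores each feature in isolation and does not account for redundancy between correlated predictors such as two staging systems.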

Parameter selection and modeling

Hyperparameter tuning for all ML models was conducted using a grid search approach combined with cross-validation. Multiple hyperparameter combinations were systematically evaluated, and model performance was assessed through five-fold cross-validation to enhance predictive capability.
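A grid search with five-fold cross-validation might look like the following sketch for the RF model. The parameter grid, the synthetic data, and the use of AUC as the selection metric are assumptions for illustration; the text does not report the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a training set with 13 predictors.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

# Hypothetical grid; each combination is scored by 5-fold cross-validated AUC.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```

By default GridSearchCV refits the best configuration on the full training data, so search.best_estimator_ can then be evaluated once on the held-out validation set.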

Validation

Model discrimination was quantified using the area under the receiver operating characteristic curve (AUC-ROC), along with accuracy, precision, recall, and F1-score. Each algorithm was trained on the complete training cohort (n=1,345) and validated on the independent validation set (n=577) to evaluate generalizability. The model achieving the highest validation AUC-ROC underwent comprehensive calibration assessment through calibration plots.
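The metrics named above can be computed as in this sketch, which uses scikit-learn and a small hypothetical set of labels and predicted probabilities. The 0.5 classification threshold is an assumption; the text does not state the threshold used.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical outcomes (1 = alive at 12 months) and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])

# Threshold-based metrics need hard class labels.
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # AUC is threshold-free and uses the probabilities directly.
    "auc": roc_auc_score(y_true, y_prob),
}
```

In this toy example three of four survivors and three of four deaths are classified correctly, so accuracy, precision, recall, and F1 all equal 0.75, while the AUC is 0.875.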

The best-performing model was analyzed for interpretability using SHAP. SHAP values decompose each prediction into additive contributions from individual features based on cooperative game theory. This approach allows quantification of variable importance and directional effects. Global feature importance was ranked by mean absolute SHAP values across the validation cohort, identifying primary prognostic determinants. SHAP dependence plots illustrated variable-outcome relationships by plotting SHAP values against feature values, colored by interacting variables to reveal effect modifiers. A heatmap was generated to visualize correlations among different variables.
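The additive decomposition behind SHAP can be illustrated with a brute-force Shapley computation on a toy model. This sketch is not the SHAP library and not the study's RF model: the three-feature scoring function, the background rows, and the patient vector are all hypothetical.

```python
from itertools import permutations
from statistics import mean

def model(x):
    """Hypothetical risk score with a stage-by-size interaction term."""
    surgery, stage, size = x
    return 0.5 + 0.2 * surgery - 0.1 * stage - 0.05 * size * stage

# Hypothetical background data used to average out "missing" features.
background = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
patient = (1, 1, 1)

def value(coalition):
    """Mean output when coalition features take the patient's values
    and the remaining features are drawn from the background rows."""
    return mean(
        model(tuple(patient[i] if i in coalition else row[i] for i in range(3)))
        for row in background
    )

def shapley_values():
    """Average each feature's marginal contribution over all orderings."""
    n = len(patient)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seen = set()
        for i in order:
            before = value(seen)
            seen.add(i)
            phi[i] += (value(seen) - before) / len(orders)
    return phi

phi = shapley_values()
baseline = value(set())      # expected output over the background data
prediction = model(patient)  # output for the explained patient
```

The contributions satisfy baseline + sum(phi) = prediction, which is the property that lets a waterfall plot walk from the expected value to an individual prediction. Tree-specific algorithms obtain the same kind of values far more efficiently than this exhaustive enumeration.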

Results

Population distribution

A total of 1,922 eHCC patients were included in this study. The dataset was randomly divided into a training set (n=1,345) and a validation set (n=577) at a 7:3 ratio. The mean survival time in the cohort was 27.7 months. Among all included patients, 73.4% (n=1,411) were male and 26.6% (n=511) were female. No statistically significant differences in baseline characteristics were observed between training and validation sets (P>0.05, Table 1).

Table 1

Baseline distribution of demographic and clinical characteristics in patients with eHCC in training and validation sets

Variables	Overall (N=1,922)	Training set (N=1,345)	Validation set (N=577)	P value
Age (years)	43.0±7.60	43.0±7.59	42.9±7.62	0.77
Survival months	27.7±35.1	28.0±35.3	27.0±34.6	0.54
Sex				0.90
Male	1,411 (73.4)	989 (73.5)	422 (73.1)
Female	511 (26.6)	356 (26.5)	155 (26.9)
Race				0.47
White	1,146 (59.6)	795 (59.1)	351 (60.8)
Black	233 (12.1)	171 (12.7)	62 (10.7)
Other	543 (28.3)	379 (28.2)	164 (28.4)
Grade				0.15
Well differentiated	414 (21.5)	297 (22.0)	117 (20.3)
Moderately differentiated	1,164 (60.6)	799 (59.4)	365 (63.3)
Poorly differentiated	320 (16.6)	228 (17.0)	92 (15.9)
Undifferentiated	24 (1.3)	21 (1.6)	3 (0.5)
Surgery				0.70
No	1,088 (56.6)	757 (56.3)	331 (57.4)
Yes	834 (43.4)	588 (43.7)	246 (42.6)
Radiotherapy				>0.99
No/unknown	1,711 (89.0)	1,197 (89.0)	514 (89.1)
Yes	211 (11.0)	148 (11.0)	63 (10.9)
Chemotherapy				0.97
No	1,333 (69.4)	932 (69.3)	401 (69.5)
Yes	589 (30.6)	413 (30.7)	176 (30.5)
AFP				0.43
Positive	1,353 (70.4)	939 (69.8)	414 (71.8)
Negative	569 (29.6)	406 (30.2)	163 (28.2)
Tumor size (mm)	67.6±49.4	66.9±48.6	69.3±51.4	0.34
SEER stage				0.55
Localized	1,130 (58.8)	798 (59.3)	332 (57.5)
Regional	511 (26.6)	358 (26.6)	153 (26.5)
Distant	281 (14.6)	189 (14.1)	92 (16.0)
Marital status				0.32
Single	736 (38.3)	525 (39.0)	211 (36.6)
Married	290 (15.1)	208 (15.5)	82 (14.2)
Other	896 (46.6)	612 (45.5)	284 (49.2)
AJCC T stage				0.13
T1	1,011 (52.6)	725 (53.9)	286 (49.6)
T2	396 (20.6)	266 (19.8)	130 (22.5)
T3	243 (12.6)	159 (11.8)	84 (14.6)
T4	272 (14.2)	195 (14.5)	77 (13.3)
AJCC N stage				0.71
N0	1,777 (92.5)	1,246 (92.6)	531 (92.0)
N1	145 (7.5)	99 (7.4)	46 (8.0)
AJCC M stage				0.25
M0	1,654 (86.1)	1,166 (86.7)	488 (84.6)
M1	268 (13.9)	179 (13.3)	89 (15.4)
AJCC stage				0.06
I	947 (49.3)	680 (50.6)	267 (46.3)
II	482 (25.1)	320 (23.8)	162 (28.1)
III	225 (11.7)	166 (12.3)	59 (10.2)
IV	268 (13.9)	179 (13.3)	89 (15.4)
Macrovascular invasion				0.32
No	1,699 (88.4)	1,182 (87.9)	517 (89.6)
Yes	223 (11.6)	163 (12.1)	60 (10.4)

Data are presented as mean ± standard deviation or n (%). AFP, alpha-fetoprotein; AJCC, American Joint Committee on Cancer; eHCC, early-onset hepatocellular carcinoma; M, metastasis; N, node; SEER, Surveillance, Epidemiology, and End Results; T, tumor.

Feature selection

Univariate feature selection was performed on the standardized training set using the SelectKBest method combined with the analysis of variance (ANOVA) F-test. A horizontal bar chart was generated to visually rank features according to their F-scores, with bar length indicating the strength of association between each feature and the survival outcome. The results showed that surgery, SEER stage, and AJCC stage exhibited stronger correlations with survival compared to other predictors (Figure 1).

Figure 1 Feature importance ranking based on F-statistics (training set). AFP, alpha-fetoprotein; AJCC, American Joint Committee on Cancer; SEER, Surveillance, Epidemiology, and End Results.

Validation

Five ML algorithms and a conventional AJCC staging-based model were employed to construct one-year survival prediction models for eHCC patients, using 13 clinical features. The RF model demonstrated superior discriminative performance in both the training set (AUC =0.81) and the validation set (AUC =0.80), whereas the AJCC staging-based model showed comparatively lower discriminative capacity with AUCs of 0.67 and 0.66 in the training and validation sets, respectively (Figure 2). Calibration curve analysis indicated that the RF model exhibited no significant overestimation or underestimation of predicted outcomes (Figure 3).

Figure 2 ROC curves for six models: (A) training set and (B) validation set. AJCC, American Joint Committee on Cancer; AUC, area under the receiver operating characteristic curve; MLP, multilayer perceptron; ROC, receiver operating characteristic; SVM, support vector machine; XGBoost, extreme gradient boosting.

Figure 3 Performance evaluation of the RF model: (A) ROC curves in the training and validation sets; (B) calibration curve in the validation set. AUC, area under the receiver operating characteristic curve; RF, random forest; ROC, receiver operating characteristic.

The RF model achieved strong performance metrics in both datasets. In the training set, the model attained an accuracy of 0.76, precision of 0.74, recall of 0.85, and F1-score of 0.79. Corresponding metrics in the validation set were 0.72, 0.71, 0.81, and 0.75, respectively, giving the RF model the highest recall and F1-score among the evaluated algorithms (Tables 2,3).

Table 2

Comparative performance analysis of all ML models (training set)

Model	Accuracy	Precision	Recall	F1-score	AUC-ROC
RF	0.76	0.74	0.85	0.79	0.81
XGBoost	0.76	0.74	0.84	0.78	0.81
MLP	0.75	0.73	0.82	0.77	0.79
SVM	0.74	0.74	0.76	0.75	0.81
LR	0.74	0.75	0.77	0.76	0.81

AUC, area under the receiver operating characteristic curve; LR, logistic regression; ML, machine learning; MLP, multilayer perceptron; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machine; XGBoost, extreme gradient boosting.

Table 3

Comparative performance analysis of all ML models (validation set)

Model	Accuracy	Precision	Recall	F1-score	AUC-ROC
RF	0.72	0.71	0.81	0.75	0.80
XGBoost	0.70	0.70	0.79	0.74	0.80
MLP	0.70	0.69	0.76	0.72	0.76
SVM	0.72	0.74	0.71	0.73	0.79
LR	0.71	0.72	0.73	0.73	0.79

AUC, area under the receiver operating characteristic curve; LR, logistic regression; ML, machine learning; MLP, multilayer perceptron; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machine; XGBoost, extreme gradient boosting.

Interpretability analysis of the RF model using SHAP

SHAP analysis was exclusively applied to the validation cohort (n=577) to quantify feature contributions, thus preventing potential overfitting bias inherent in training set interpretations. Predictions were decomposed into additive SHAP values, reflecting individual feature contributions derived from cooperative game theory. This facilitated both population-level importance rankings and patient-specific predictions.

Global feature importance was visualized using SHAP summary plots (Figure 4). Each point represents a patient’s feature value and its impact on model predictions. Horizontal displacement quantifies SHAP magnitude, where positive values increase survival probability, and negative values decrease it. Color gradients encode raw feature values (blue for low, red for high). Vertical ranking by mean absolute SHAP value positioned surgery at the top, followed by SEER stage and AJCC stage. The bidirectional distribution of SHAP values for continuous variables, such as tumor size, illustrated nonlinear relationships. Larger tumors (red points) clustered in negative SHAP regions, confirming adverse prognostic effects. Smaller tumors (blue points) showed heterogeneous effects influenced by interactions.

Figure 4 SHAP summary plot for the RF model. The x-axis represents SHAP values, quantifying both the magnitude and direction of each feature’s contribution to predicted outcomes. Each data point represents an individual patient sample, with the color gradient (blue to red) indicating actual feature values from low to high. Features with positive SHAP values increase the probability of 1-year survival, whereas negative values indicate an increased risk of mortality. The y-axis ranks features by overall importance in descending order. AFP, alpha-fetoprotein; AJCC, American Joint Committee on Cancer; RF, random forest; SEER, Surveillance, Epidemiology, and End Results; SHAP, SHapley Additive exPlanations.

The SHAP waterfall plot for a representative patient (Sample 2, Figure 5) decomposes the model prediction into individual feature contributions, displayed as a sequence of steps from the baseline expected value to the final output. Features are vertically ordered by absolute SHAP magnitude, and horizontal bar length represents contribution strength. Red bars denote positive SHAP values (favoring survival), whereas blue bars denote negative values (favoring mortality).

Figure 5 SHAP waterfall plot for the RF model. This plot provides individualized prediction interpretation for a representative patient. The value f(x) indicates the final log-odds prediction for survival. For Patient 2, surgery (SHAP =+0.2) significantly increased predicted survival probability, while advanced SEER stage (SHAP =−0.08) decreased it. The plot demonstrates how cumulative feature effects shape the final prediction. AFP, alpha-fetoprotein; AJCC, American Joint Committee on Cancer; RF, random forest; SEER, Surveillance, Epidemiology, and End Results; SHAP, SHapley Additive exPlanations.

Surgery exerted the strongest protective effect (SHAP =+0.20), reflecting the patient’s surgical intervention status. This contribution significantly elevated the predicted log-odds above baseline. Conversely, several adverse prognostic factors accumulated negative SHAP values: SEER stage (−0.08), consistent with regional or distant disease; macrovascular invasion (−0.08), indicating advanced local progression; and tumor size (−0.06), representing dimensions exceeding cohort medians. Additionally, AJCC stage (−0.05), race (−0.04), AFP level (−0.04), histological grade (−0.04), and sex (−0.02) further reduced predicted survival. Four minor-impact features collectively contributed +0.04.

Adding the positive (+0.24) and negative (−0.41) contributions to the population baseline (E[f(X)] =0.525) gave a final log-odds prediction of f(x) =0.358; the small difference from the sum of the rounded values (0.355) is presumably due to rounding of the individual SHAP values. Logistic transformation [σ(0.358) = 1/(1 + e^−0.358)] yielded a survival probability of 58.9%, categorizing the patient as intermediate risk. Despite the protective effect of surgery, the combined negative contributions from staging, vascular involvement, and other tumor characteristics outweighed it, highlighting how the model integrates competing prognostic signals.
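The arithmetic for this patient can be retraced with a short script. It uses only the rounded per-feature SHAP values and the baseline reported above; because those inputs are rounded to two decimals, the recomputed log-odds can differ slightly from the reported f(x) =0.358. The dictionary keys are labels chosen for illustration.

```python
import math

baseline = 0.525  # reported population expected value E[f(X)]

# Rounded per-feature SHAP values as reported for the representative patient.
contributions = {
    "surgery": 0.20,
    "four minor features (combined)": 0.04,
    "SEER stage": -0.08,
    "macrovascular invasion": -0.08,
    "tumor size": -0.06,
    "AJCC stage": -0.05,
    "race": -0.04,
    "AFP level": -0.04,
    "histological grade": -0.04,
    "sex": -0.02,
}

positive = sum(v for v in contributions.values() if v > 0)
negative = sum(v for v in contributions.values() if v < 0)

# Additivity: prediction = baseline + sum of all feature contributions.
log_odds = baseline + positive + negative

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

recomputed_probability = sigmoid(log_odds)
reported_probability = sigmoid(0.358)  # using the reported f(x)
```

The logistic transform of the reported f(x) =0.358 reproduces the stated survival probability of 58.9%.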

The waterfall visualization clearly illustrates each variable’s directional impact, providing clinical transparency. SHAP decomposition reveals that although surgery substantially improved survival predictions, concurrent adverse tumor characteristics and advanced staging reduced overall survival probability. Such patient-level explanations allow clinicians to identify modifiable risk factors and understand prediction rationales, addressing interpretability limitations inherent in ensemble tree models.

The feature correlation heatmap summarized pairwise associations among the predictors, with each cell giving the numerical strength of the association. SEER stage exhibited the strongest correlation with AJCC stage (Figure 6).

Figure 6 Feature correlation heatmap of the RF model. The heatmap displays pairwise correlation coefficients among the 13 clinical features. Color intensity indicates the strength and direction of correlations between variables. AFP, alpha-fetoprotein; AJCC, American Joint Committee on Cancer; RF, random forest; SEER, Surveillance, Epidemiology, and End Results.

Discussion

Key findings

eHCC exhibits distinct clinical and biological characteristics compared to late-onset HCC. Pathological analyses indicate that eHCC often presents with larger tumors, increased vascular invasion, and poorly differentiated histology (4,15). These aggressive features highlight the need for prognostic tools tailored to younger patients, rather than models derived primarily from older cohorts.

Most previous prognostic studies on eHCC used Cox proportional hazards regression to identify survival determinants (16). Although informative, these traditional models rely on linear assumptions and cannot adequately capture complex interactions among multiple clinical variables. The multifactorial nature of eHCC outcomes, influenced by patient characteristics, tumor biology, and treatment response, likely exceeds the predictive capacity of conventional regression. Consequently, traditional methods may underutilize prognostic information available in clinical databases.

ML algorithms offer advantages in handling high-dimensional datasets. These algorithms detect nonlinear relationships and variable interactions without requiring predefined functional forms (12,13). By integrating demographic data, tumor characteristics, and treatment variables, ML models can achieve superior discriminative performance (17).

This study developed and compared five ML algorithms alongside a conventional AJCC staging-based model for predicting survival in eHCC, identifying the RF model as optimal. The AJCC staging system, while widely utilized in clinical practice, demonstrated limited discriminative capacity with AUCs of 0.67 and 0.66 in the training and validation sets, respectively. This result aligns with recent evidence demonstrating the superior predictive accuracy of ML methods across various cancers (18-20). The RF model incorporates 13 routinely available clinical variables from standard diagnostic evaluations. Thus, it is practical for integration into clinical decision-making.

Explanations of findings

This analysis identified surgical intervention, tumor stage, pathological grade, and AFP levels as primary prognostic determinants in eHCC patients, consistent with previous studies (7,8,16). Surgical resection was strongly associated with improved survival outcomes. SEER staging, incorporating regional lymph node involvement, provided significant prognostic discrimination. Tumor characteristics, including lymph node metastasis, advanced AJCC stage (III/IV), and increased tumor diameter, were strongly associated with higher mortality risk, which is consistent with findings from previous studies (21). AFP levels, obtained through routine serological tests, offered independent prognostic information beyond anatomical staging alone.

SHAP is a model-agnostic explainable artificial intelligence method that has been widely applied in previous studies to interpret complex ML models (22). This framework enhances interpretability by measuring how individual features influence predictions relative to baseline expectations. SHAP analysis quantified each variable’s marginal contribution to predictions. Analysis revealed that certain feature combinations increase mortality risk multiplicatively, rather than additively. Patients with large tumors and elevated AFP who were ineligible for surgery exhibited higher mortality risks than predicted by summing individual effects alone. These high-risk profiles identify potential therapeutic targets. For instance, neoadjuvant protocols could downstage tumors, making surgery feasible, while adjuvant interventions might address residual disease post-resection. Postoperative transcatheter arterial chemoembolization has demonstrated survival benefits in selected high-risk groups (23). This supports early adjunctive treatments guided by model-based risk stratification.

The RF algorithm demonstrated superior discriminative performance compared to other tested ML methods. Model calibration showed acceptable consistency between predicted probabilities and observed outcomes across risk strata. As the model exclusively utilizes variables obtained during routine clinical evaluation, no additional diagnostic procedures or specialized testing infrastructure are necessary. This feature allows easy integration into existing clinical workflows, supporting treatment decisions through quantitative risk assessment.

Limitations

This study is subject to inherent limitations of the SEER database, which lacks critical clinical variables such as genetic information, comorbidities, detailed treatment data, and key indicators of hepatic functional reserve, including cirrhosis status and Child-Pugh components. As these parameters are essential for surgical eligibility assessment, liver transplantation decision-making, and perioperative risk stratification in hepatocellular carcinoma, their absence may limit the model’s applicability in real-world clinical decision-making, particularly for high-risk patients (24,25). Additionally, the cohort includes fibrolamellar carcinoma (FLC), a biologically distinct entity whose molecular and clinical features differ substantially from conventional hepatocellular carcinoma (26-29). The inclusion of FLC cases could theoretically introduce heterogeneity that affects model performance. However, FLC represented a small proportion of the overall cohort in this study (n=57, <3%), and such limited representation is unlikely to produce a detectable impact on model training or survival prediction. Given this limited sample size, any potential confounding effect attributable to FLC is largely diluted by the predominance of conventional HCC cases. Finally, due to the constraints of real-world conditions and data availability, validation relied solely on internal validation within the SEER dataset. External validation using independent, multicenter cohorts from diverse geographic and clinical settings is required to confirm the model’s generalizability and reliability.

Conclusions

This study developed and validated an ML model based on the RF algorithm, which demonstrated robust predictive performance for early (1-year) survival outcomes in patients with eHCC. The resulting model provides evidence-based guidance to clinicians for formulating individualized treatment strategies.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2808/rc

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2808/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2808/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Guo C, Liu Z, Lin C, et al. Global epidemiology of early-onset liver cancer attributable to specific aetiologies and risk factors from 2010 to 2019. J Glob Health 2023;13:04167. [Crossref] [PubMed]
Li Y, Zhang Z, Shi J, et al. Risk factors for naturally-occurring early-onset hepatocellular carcinoma in patients with HBV-associated liver cirrhosis in China. Int J Clin Exp Med 2015;8:1205-12. [PubMed]
Zhang ZY, Guan J, Wang XP, et al. Outcomes of adolescent and young patients with hepatocellular carcinoma after curative liver resection: a retrospective study. World J Surg Oncol 2022;20:210. [Crossref] [PubMed]
Ren J, Tong YM, Cui RX, et al. Comparison of survival between adolescent and young adult vs older patients with hepatocellular carcinoma. World J Gastrointest Oncol 2020;12:1394-406. [Crossref] [PubMed]
Wu ZQ. Comparison of tumor characteristics and prognosis between early onset and non-early-onset HCC; 2022. doi: 10.27232/d.cnki.gnchu.2022.000400.
Liu SL, Li B, Wang YP, et al. Clinical features of hepatitis B virus-related early-onset and late-onset liver cancer: a comparative analysis. J Clin Hepatol 2025;41:1837-44.
Guan Y, Gan Y, An J. Clinical Characteristics and Prognosis of Early-Onset Hepatocellular Carcinoma: A Retrospective Cohort Study Based on Population Data. Dig Dis Sci 2024;69:3563-73. [Crossref] [PubMed]
Kuang T, Ma W, Zhang J, et al. Construction of a Nomogram to Predict Overall Survival in Patients with Early-Onset Hepatocellular Carcinoma: A Retrospective Cohort Study. Cancers (Basel) 2023;15:5310. [Crossref] [PubMed]
Li L, Liu ZP. Detecting prognostic biomarkers of breast cancer by regularized Cox proportional hazards models. J Transl Med 2021;19:514. [Crossref] [PubMed]
Kar İ, Kocaman G, İbrahimov F, et al. Comparison of deep learning-based recurrence-free survival with random survival forest and Cox proportional hazard models in Stage-I NSCLC patients. Cancer Med 2023;12:19272-8. [Crossref] [PubMed]
Zhang W, Tan Y, Shen S, et al. Prognostic nomogram for hepatocellular carcinoma in adolescent and young adult patients after hepatectomy. Oncotarget 2017;8:106393-404. [Crossref] [PubMed]
Deo RC. Machine Learning in Medicine. Circulation 2015;132:1920-30. [Crossref] [PubMed]
Handelman GS, Kok HK, Chandra RV, et al. eDoctor: machine learning and the future of medicine. J Intern Med 2018;284:603-19. [Crossref] [PubMed]
Rigatti SJ. Random Forest. J Insur Med 2017;47:31-9. [Crossref] [PubMed]
Cheah DS, Tsai K, Kawano F, et al. Are More Young, Western Patients Also Developing Hepatocellular Carcinoma? Ann Surg Oncol 2025;32:3924-32. [Crossref] [PubMed]
Yang R, Yu X, Zeng P. Construction and validation of a SEER-based prognostic nomogram for young and middle-aged males patients with hepatocellular carcinoma. J Cancer Res Clin Oncol 2023;149:10099-108. [Crossref] [PubMed]
Pantanowitz L, Pearce T, Abukhiran I, et al. Nongenerative Artificial Intelligence in Medicine: Advancements and Applications in Supervised and Unsupervised Machine Learning. Mod Pathol 2025;38:100680. [Crossref] [PubMed]
Yaqoob A, Verma NK, Mir MA, et al. SGA-Driven feature selection and random forest classification for enhanced breast cancer diagnosis: A comparative study. Sci Rep 2025;15:10944. [Crossref] [PubMed]
Yan L, Ye Q, Shi B, et al. Random forest-based model for the recurrence prediction of borderline ovarian tumor: clinical development and validation. J Cancer Res Clin Oncol 2025;151:160. [Crossref] [PubMed]
Jin Y, Lan A, Dai Y, et al. Development and testing of a random forest-based machine learning model for predicting events among breast cancer patients with a poor response to neoadjuvant chemotherapy. Eur J Med Res 2023;28:394. [Crossref] [PubMed]
Guo H, Chen X, Li R, et al. Clinical Implications and Novel Insights into Adolescent Primary Liver Cancer: A Nightmare for Adolescents? J Hepatocell Carcinoma 2025;12:2513-40. [Crossref] [PubMed]
Ma B, Zheng K, Lee FC, et al. Racial Disparities in Comorbidity Patterns of Early-Onset Liver Cancer: A Machine Learning Analysis. Cancer Control 2025;32:10732748251363687. [Crossref] [PubMed]
Zhu R, Chen SR, Zou MQ, Yang YF. Clinicopathological characteristics of early-onset hepatocellular carcinoma and its comparison with non-early-onset hepatocellular carcinoma. Chinese Hepatology 2025;30:1093-6.
Melandro F, Centonze L, Celsa C, et al. Role of Liver Function in the Multiparametric Assessment of Hepatocellular Carcinoma. Medicina (Kaunas) 2026;62:138. [Crossref] [PubMed]
Hayata N, Hosui A, Kurahashi T, et al. Impact of Serum Albumin Levels on Prognosis and Recurrence in Patients with Hepatocellular Carcinoma. Cancers (Basel) 2025;17:2971. [Crossref] [PubMed]
Long D Jr, Chan M, Han M, et al. Proteo-metabolomics and patient tumor slice experiments point to amino acid centrality for rewired mitochondria in fibrolamellar carcinoma. Cell Rep Med 2024;5:101699. [Crossref] [PubMed]
Aziz H, Brown ZJ, Panid Madani S, et al. Fibrolamellar Hepatocellular Carcinoma: Comprehensive Review of Diagnosis, Imaging, and Management. J Am Coll Surg 2023;236:399-410. [Crossref] [PubMed]
Hackenbruch C, Bauer J, Heitmann JS, et al. FusionVAC22_01: a phase I clinical trial evaluating a DNAJB1-PRKACA fusion transcript-based peptide vaccine combined with immune checkpoint inhibition for fibrolamellar hepatocellular carcinoma and other tumor entities carrying the oncogenic driver fusion. Front Oncol 2024;14:1367450. [Crossref] [PubMed]
Chan M, Zhu S, Nukaya M, et al. DNAJ-PKAc fusion heightens PLK1 inhibitor sensitivity in fibrolamellar carcinoma. Gut 2025;74:1680-93. [Crossref] [PubMed]

Cite this article as: Ma H, Zeng Z, Xiao C, Li C, Tan X, Du D, Wu Y. Machine learning-based early survival prediction in early-onset hepatocellular carcinoma: a SEER-based multi-model comparative study. Transl Cancer Res 2026;15(4):310. doi: 10.21037/tcr-2025-1-2808
