Integrating innovative multiomics and machine learning strategies for prognostic biomarker discovery in hepatocellular carcinoma: guiding personalized treatment strategies with single-cell analysis
Highlight box
Key findings
• Multi-omics consensus clustering identifies two distinct hepatocellular carcinoma (HCC) subtypes (CS1/CS2) with significantly different prognosis.
• CS1 subtype shows higher immune infiltration but poorer survival, linked to an immunosuppressive microenvironment.
• A 3-gene risk score (CS1RS: SMS, PSMD1, YBX1) is constructed using 13 machine learning algorithms.
• CS1RS predicts overall survival effectively [training area under the curve (AUC): 0.73/0.67/0.67; validation AUC: 0.71/0.73/0.73] and is an independent prognostic factor (hazard ratio =1.77, P<0.001).
• Protein expression of the three genes is validated via Human Protein Atlas immunohistochemistry.
What is known and what is new?
• HCC is heterogeneous, and reliable prognostic biomarkers are lacking.
• This study integrates multi-omics and single-cell analysis to reveal CS1 biology and develops a robust machine-learning-derived 3-gene signature for risk stratification and personalized treatment guidance.
What is the implication, and what should change now?
• The CS1RS nomogram provides a clinically applicable tool for HCC prognosis prediction.
• The signature may help identify patients who could benefit from targeted therapies (e.g., axitinib, nilotinib), though these predictions remain hypothesis-generating and require experimental validation.
Introduction
Hepatocellular carcinoma (HCC) is the third most deadly cancer globally, with a high mortality rate and difficulties in diagnosis and treatment. The number of cases and deaths is constantly rising, adding a heavy burden to global health (1,2). Finding immunotherapies, such as checkpoint inhibitors, vaccines, cell infusions, etc., to activate the body’s immune response against tumor cells is one way to deal with HCC (3-6). There have been advances in hepatocellular tumor research, but the prognosis of patients is not satisfactory: the median survival time of patients with metastatic HCC is about 9 months, and the 5-year survival rate is about 10% (7) .These results indicate the value of finding new treatment methods and prognostic markers. Because the immunotherapy regimen has economic burdens and potential serious side effects, it is necessary to rely on a large amount of multi-dimensional biological data and advanced machine learning techniques to construct a biomarker system, which is used to predict the prognosis of hepatocellular tumor patients and optimize the immunotherapy regimen.
By systematically analyzing and integrating multi-dimensional omics information, we have a profound understanding of the complex pathological mechanisms of HCC, and promote the improvement and optimization of the molecular classification system. In clinical practice, discovering new highly specific biomarkers is the key to evaluating prognosis and screening potential benefit populations for immunotherapy, and has far-reaching guiding value for improving the individualization of diagnosis and treatment efficiency. Based on this, we analyze the expression patterns of messenger RNA (mRNA), long noncoding RNA (lncRNA), microRNA (miRNA), genomic variations, and epigenetic genome DNA methylation information of HCC, cross-integrate multi-omics data, construct and validate comprehensive cancer subtypes (CS) classification system for HCC. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2870/rc).
Methods
Data collection
We downloaded the RNA sequencing (RNA-seq) expression profiles and the clinical data of patients with liver hepatocellular carcinoma (LIHC) from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium Data Portal (ICGC) (https://dcc.icgc.org). The single-cell dataset (GSE166635) used in this study was retrieved from the public database GEO (https://www.ncbi.nlm.nih.gov/geo/), which included two HCC samples sequenced on the platform GPL24676 Illumina NovaSeq 6,000 (Homo sapiens). Both samples were utilized for subsequent single-cell analyses. The unit of inference for conclusions derived from single-cell RNA sequencing (scRNA-seq) is at the sample level, based on sample-level summaries.
Integrative clustering of multiomics data for HCC subtype classification
Using the MOVICS package, our study focused on five omics levels: mRNA, lncRNA, miRNA, DNA methylation and gene mutations. A two-step approach of “MAD (top 1,500 features) + univariate Cox regression (P<0.05)” was employed to screen for mRNA, lncRNA, miRNA and methylation features, whilst gene mutation features were selected based on a mutation frequency of ≥5%; Missing values were primarily handled by direct deletion; for the miRNA omics data, additional filtering was performed using the ‘na.omit’ function. Furthermore, this study included only samples with complete multi-omics data and did not involve the imputation of missing omics data. We then utilized the getMOIC function to apply 10 streamlined clustering algorithms (iClusterBayes, IntNMF, ConsensusClustering, LRAcluster, MoCluster, COCA, NEMO, PINSPlus, CIMLR, and SNF) to classify patients with LIHC. After obtaining the clustering results from each algorithm, we used the getConsensusMOIC function to integrate the results and generate a robust CS. To display the prognostic outcomes of the different CSs, we drew Kaplan-Meier (KM) curves.
Molecular characterization and differential analysis of CSs
In this study, the single-sample gene set enrichment analysis (ssGSEA) was used to quantify the enrichment levels of various molecular features in different samples, so as to deeply analyze the specific molecular characteristics of CSs. Subsequently, we analyzed the differences among the subtypes regarding potential regulators related to cancer chromatin remodeling and 23 specific transcription factors. This analysis elucidated the relationship between the constructed subtypes and chromatin remodeling, as well as differences in immune checkpoints and immune infiltration. After conducting differential expression analysis across the subtypes, we selected the top 20 upregulated genes for each subtype to generate a gene expression heatmap. We then employed these 20 genes as a classifier to subtype the validation set and displayed the results using KM curves. Utilizing the nearest template prediction (NTP) algorithm, we assessed the classification accuracy. Finally, we selected the genes highly expressed in subtypes with poorer prognosis as CS-related genes for subsequent analysis.
Single-cell RNA-seq analysis data collection and processing
Single-cell RNA sequencing data of two HCC samples from the GSE16635 database was used, and a systematic analysis was carried out with the help of the Seurat bioinformatics toolkit (8). In the data preprocessing stage, First, a preliminary screening of low-quality data is carried out using the criteria min.cells =3 (the gene must be detected in at least 3 cells) and min.features =250 (at least 250 genes must be detected in a single cell); Building on this, stricter thresholds were applied (number of detected genes: 100–7,500; proportion of mitochondrial gene expression <15%; UMI count >1,000) to filter out low-quality cells such as apoptotic cells and empty droplets. Following data cleaning, the expression matrix was normalised using the LogNormalize method, and sequencing technical biases were eliminated through data scaling; Subsequently, 2,000 highly variable genes were identified using the vst method. Principal component analysis (PCA) was performed on this set of highly variable genes to reduce dimensionality, and the core dimensions for clustering were determined by combining the PCA scatter plot with the elbow plot; finally, through clustering with a resolution gradient of 0.1–1.0 and visual analysis of the clustering tree, the optimal resolution was selected and cell clustering was completed. To mitigate batch effects across the seven samples, we utilized the Harmony package. After we generated cell clusters using the FindClusters and FindNeighbors functions, we visualized these clusters using the t-distributed stochastic neighbor embedding (t-SNE) approach. Finally, we conducted cell type analysis based on the cells’ respective marker genes, using the AddModuleScore function in the Seurat package to measure the activity of specific gene sets in individual cells. To identify differentially expressed genes (DEGs) between two groups, we utilized the FindMarkers function within the Seurat package. Using the Wilcoxon test, we evaluated the statistical significance of the DEGs [adjusted P(p.adj) <0.05]. We considered genes showing differential expression between cells with high and low cluster of subtypes 1 (CS1) scores relevant at the single-cell transcriptome level and subjected them to further weighted gene co-expression network analysis (WGCNA) analysis at the bulk transcriptome level. At the same time, we used the CellChat package in R language to systematically explore the intercellular communication patterns (9). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
ssGSEA and gene set enrichment analysis
ssGSEA is an important tool to evaluate the changing trend of expression levels of specific gene sets within a single biological sample. Its core lies in quantifying the degree of overall upregulation or downregulation of the gene set in the sample. In this study, we performed ssGSEA analysis with the help of the GSVA package in R language (10) to calculate the CS1 score of each HCC case sample in the TCGA database. In order to explore the biological pathways potentially associated with this feature, we used the limma package (11) to perform quantitative calculation on the gene set variation analysis (GSVA) score of the top canonical pathways. To elucidate the biological processes (BPs), cellular components (CCs), and molecular functions (MFs) associated with the different risk subgroups, we performed gene set enrichment analysis (GSEA) of GO gene sets between the two risk groups according to the criteria of false discovery rate <0.25 and |normalized enrichment score| >1 employing the R package clusterProfiler.
WGCNA using WGCNA
We applied the R package WGCNA (12) to perform WGCNA using bulk RNA-seq data from the TCGA-LIHC dataset. After determining an appropriate soft threshold to ensure the construction of a scale-free network, we transformed the weighted adjacency matrix into a topological overlap matrix and derived the dissimilarity. For gene clustering and module detection, we employed the dynamic tree cut method and then selected the module with the strongest correlation with the CS1 score for subsequent investigation.
Construction of a CS1RS in HCC using integrative machine learning approaches
Within the TCGA bulk RNA-seq dataset, we employed the R package limma to conduct differential expression analysis, comparing normal and tumor samples applying thresholds of |logfold change (logFC)| >1.5 and P.adj <0.05. We then determined the intersection between the DEGs identified in this analysis and the genes found in the CS1-related module through WGCNA. As these intersected genes are considered to play a role in CS1 at both the bulk and single-cell transcriptome levels, we labeled them CS1-related genes (CS1RGs).
To create a prognostic model with high predictive power, we adhered to three steps. First, we conducted univariate Cox regression analysis to identify CS1RGs with potential prognostic significance in the TCGA-LIHC dataset. Second, after designating the TCGA-LIHC dataset as the training set and the ICGC dataset as the external validation set, and ten types of machine learning algorithms are selected: LASSO regression, Ridge regression, forward stepwise Cox model, Cox Boost ensemble algorithm, random survival forest (RSF), elastic net regression method, Cox-partial least squares regression, supervised PCA, generalized boosted regression modeling, survival support vector machine. In the hepatocellular training dataset of the TCGA-LIHC database, 10-fold cross-validation is used to test the model performance, and the algorithms are integrated and optimized for high-dimensional biomarker feature selection as well as the construction of an accurate prediction model. Finally, we calculated the concordance index (C-index) for each model to evaluate predictive performance in both the training and external validation sets. By selecting the combination of algorithms that exhibited both strong performance and clinical translational potential, we developed a signature termed the CS1-related signature (CS1RS) that predicts overall survival (OS) in patients with HCC. Calculate the correlation coefficients between genes in the CS1RS dataset using Spearman’s correlation analysis, and visualise the results using the ‘circlize’ package.
Survival analysis and construction of a predictive nomogram
Based on the median CS1RS risk score, divide the TCGA training set and ICGC validation set into high-risk groups and low-risk groups. Use the R package survminer to carry out KM survival analysis to see if there are obvious differences in OS between the high-risk group and the low-risk group. Use the time receiver operating characteristic (ROC) package in R for ROC curve analysis to evaluate the sensitivity and specificity of CS1RS for OS prediction in HCC patients. Additionally, we determined the area under the curve (AUC) for the CS1RS and compared it against other clinical parameters. We also examined the association between the CS1RS and other clinical characteristics, including age, sex, grade, and tumor, node, and metastasis stage. Univariate and multivariate Cox regression analyses were performed on the TCGA-LIHC dataset to verify whether CS1RS can serve as an independent prognostic indicator for survival prediction in patients with HCC. Combined with CS1RS and clinicopathological factors, a nomogram was constructed to improve the accuracy and predictive ability of the model. The accuracy, discriminative ability and clinical net benefit of the nomogram were evaluated by plotting receiver operating characteristic (ROC) curves and calibration curves, calculating the concordance index (C-index), and conducting decision curve analysis (DCA).
Analysis of genomic variation between CS1RS risk subgroups
The Mutant Allele Tumor Heterogeneity (MATH) score has been confirmed to provide a measurable and quantitative assessment of intra-tumor heterogeneity (ITH) in different tumors, including brain and hepatocellular tumors, by analyzing the distribution of mutant alleles using whole-exome sequencing of tumor and matched normal DNA samples (13-15). The MATH scores of all subjects were calculated using the previously validated method (16), and survival analysis was conducted based on each score. To identify somatic mutations associated with CS1RS, a waterfall plot was generated using the maftools package in R. After dividing the subjects into high- and low-risk groups, their mutation distribution was presented. Copy number variation analysis will be conducted by focusing on the genes with the most significant differences between the high-risk and low-risk groups.
Comprehensive analysis of immune characteristics
To analyze the immune landscape associated with the CS1RS in HCC, we calculated the immune score, Estimation of STromal and Immune cells in MAlignant Tumour tissues using Expression data (ESTIMATE) score, and tumor purity using the ESTIMATE algorithm to assess immune status in the high- and low-risk groups. After analyzing differences in immune-related pathway activities using GSEA, we quantified the abundance of tumor microenvironment (TME)-infiltrated cell types using Cell-type Identification By Estimating Relative Subsets Of RNA Transcripts (CIBERSORT) (17) and ssGSEA, making comparisons between the risk groups. We evaluated correlations between TME-infiltrated cells and the CS1RGs before drawing a Venn diagram to identify overlapping cell types based on differential, correlation, and survival analyses.
Significance of the CS1RS in drug sensitivity
The role of CS1RS in tumor chemotherapy response. To promote personalized medicine, the pRRophetic package (18) of R was used to evaluate the chemotherapy drug sensitivity of HCC patient populations based on different CS1RS expression levels. The tissue gene expression was compared with the cell line data and the drug half maximal inhibitory concentration (IC50) values were calculated. The Wilcoxon test was used to analyze the IC50 differences between the high-/low-risk subgroups, and a P<0.05 was set as the standard for statistically significant.
Immunohistochemistry
By accessing the Human Protein Atlas (https://www.proteinatlas.org/) database, we obtained immunohistochemical results for HCC tissue and normal liver tissue, and compared the expression levels of SMS, PSMD1 and YBX1 in normal liver tissue and HCC, thereby further validating the results.
Statistical analysis
For the statistical analysis of the TCGA and ICGC data, we used R v4.2.2 and v4.3. 1 software. When comparing two samples, we analyzed data with normal distribution and uniform variance using Student’s t-tests and data with uneven variances using the Wilcoxon test. We set statistical significance at P<0.05.
Results
Multiomics consensus prognosis-related molecular subtypes of HCC
Utilizing 10 multiomics ensemble clustering methods, we successfully identified two CSs. We based the decision regarding subtype number on a comprehensive assessment that included consideration of cluster prediction index, gap statistic, and silhouette score. Using a consensus ensemble approach, we selected the final subtypes, all of which exhibited unique molecular expression patterns across mRNA, lncRNA, miRNA, epigenetic methylation, and somatic mutations (Figure 1A-1C). The classification system based on CS1 and CS2 was strongly linked to OS (P<0.001; Figure 1D), with CS1 showing the poorest survival outcomes among the identified subtypes.
Partitioning of HCC integrative consensus molecular subtypes
In this study, the mol enrich module of ssGSEA was used to measure the enrichment level and find the differences between the two types of CSs. CS2 is highly expressed in the EGFR ligand pathway and gene sets such as ACEVEDO_LIVER_CANCER_DN, and CS1 focuses on the cell cycle and DNA replication pathways (Figure 2A). Analysis of 23 HCC-specific transcription factors (TFs) found that there is an association between transcriptomic variation and chromatin remodeling (chrom remo) (Figure 2B). The status of key regulatory proteins is associated with CSs, thus confirming the importance of virus biology (biol imp). In CS2, the functions of androgen receptor (AR), estrogen receptor 1 (ESR1), human epidermal growth factor receptor 2 (ERBB2), and retinoic acid X receptor α (RXRA) are enhanced, and in CS1, the expressions of fibroblast growth factor receptor 3 (FGFR3), peroxisome proliferator-activated receptor γ (PPARG), retinoic acid receptor γ (RARG), and GATA binding protein 3 (GATA3) are increased. There are differences in the regulatory mechanisms of chromatin regulatory activity patterns among cancer subpopulations. KDM5C and others are highly active in CS1; CLOCK and others are highly expressed in CS2. Quantitatively from the tumor microenvironment, the infiltration of immune cells in the CS1 model is significantly increased, and the CS2 model is slightly increased (Figure 2C). Based on the differential expression analysis of subpopulations, 20 genes specifically upregulated in each subpopulation are selected as classifiers, and verified in external cohorts to confirm the stability of subpopulations. NTP classifies each sample in the external cohort as one of the identified CSs. In the GSE14520 cohort, CS1 has the worst prognosis (P<0.001), and it is roughly the same in other external cohorts (Figure 2D,2E). The consistency of CSs is evaluated using NTP and partitioning around the centroid (P<0.001; Figure 2F-2H). Finally, 100 genes with high expression in CS1 are marked as CS1-related genes.
CS1 characteristic in single-cell transcriptome
Using the Harmony package to address batch effects and integrate the two samples, we analyzed single-cell RNA-seq data from 24,832 cells derived from two HCC samples. We then performed dimensionality reduction using PCA and t-SNE on the top 2,000 most variable genes, followed by clustering that grouped cells into 21 distinct clusters at a resolution of 0.4 (Figure S1A). Based on specific marker genes, we classified the cells into 10 major types: B cells, endothelial cells, fibroblasts, HCC, macrophages, mast cells, monocytes, plasma cells, T cells, and tumor-associated endothelial cells (Figure 3A). The top four marker genes for each cell type are displayed in Figure 3B. Employing the AddModuleScore function in Seurat, we assessed the expression of 100 CS1RGs across the 10 cell types in the cell-level statements (Figure 3C). Notably, we observed significantly higher CS1 activity in endothelial cells and HCC (Figure 3D). Based on this activity, we categorized the cells into high- and low-CS1 groups, leading to the identification of 2,028 DEGs for further analysis.
Identification of CS1-related genes and their prognostic significance
To calculate CS1 activity scores for all TCGA–LIHC samples, we employed ssGSEA. Subsequently, we used these scores as phenotype data for WGCNA analysis, which we conducted to discover modules closely linked to CS1 scores. We employed the 2,028 CS1-related DEGs that we discovered at the single-cell RNA sequencing level to construct a co-expression network after removing outlier samples (Figure 4A). To ensure a scale-free topology, we set the optimal soft threshold to power =8 (R2=0.85) (Figure S1B). We identified a total of 5 modules with a minimum gene count of 50 per module (Figure 4B). Our results revealed that the MEyellow module was highly correlated with the CS1 score in bulk RNA-seq data [correlation (cor) =0.63] (Figure 4C). When we drew a scatter plot to illustrate gene significance against module membership for the yellow module, it revealed a significant correlation (cor =0.66, P=29e−43) (Figure 4D), suggesting that the genes in this module may hold functional importance related to CS1.
Figure 4E highlights the DEGs between tumor and normal liver tissues in TCGA-LIHC bulk RNA-seq (|logFC| >1.5, P.adj <0.05). By identifying the intersection of the 135 genes in the yellow module with the DEGs from the bulk RNA-seq data, we pinpointed 52 CS1RGs (Figure 4F), which play a role in CS1 across both bulk and single-cell transcriptome levels. GO enrichment analysis of CS1Rgenes revealed significant enrichment in BPs such as mRNA 3’-UTR binding, single-stranded RNA binding, and protein folding; CCs such as the spliceosome and microtubules; and MFs such as RNA splicing, mRNA splicing, and RNA localization (Figure 4G).
Univariate Cox regression analysis of 52 CS1RGs revealed 21 significant genes (P<0.001). When we compared the TCGA gene list with the ICGC dataset to develop and validate the model further, we identified 18 overlapping genes. The outcomes of the univariate Cox analysis and the gene interrelationship analysis are shown in Figure 4H. When we analyzed the CNV frequency among these 18 genes, we found that HDGF exhibited a copy number increase exceeding 15%, as shown in Figure 4I.
Development and validation of a CS1RS using machine learning models
To build a CS1RS, we employed various machine learning algorithms to assess 18 prognostic genes derived from univariate Cox regression analysis and calculated the C index across both the training and validation sets, as shown in Figure 5A. Among the 13 models, we selected the RSF and stepwise Cox regression (StepCox) [backward] + RSF models, which exhibited strong predictive performance in both the training and external validation sets. The RSF model incorporated all 18 genes, and although the StepCox [backward] + RSF model used only 3 genes, it delivered similar predictive effectiveness. Therefore, we identified the StepCox [backward] + RSF model as the most accurate and relevant model for clinical translation. For each patient, we calculated a risk score by weighting the expression levels of the three genes with the corresponding regression coefficients from the model, allowing us to categorize patients into high-risk and low-risk groups using the median risk score. Notably, as the risk score increased, there was a corresponding rise in the mortality rate and expression of the three genes across both the training and validation sets (Figure 5B-5G). Patients in the high-risk group demonstrated significantly poorer OS than those in the low-risk group across both the training and external validation sets (P<0.001, log-rank test (Figure 5H,5I). ROC curve analysis indicated that the CS1RS achieved AUC scores of 0.73, 0.67, and 0.67 in the training set and 0.71, 0.73, and 0.73 in the ICGC dataset for the 1-, 2-, and 3-year periods, respectively (Figure 5J,5K). These findings demonstrate the strong discriminatory performance of the CS1RS. Supplementary Figure 1C illustrates the strong interactions between the three selected CS1RS, demonstrating the synergistic role of these genes in the pathophysiology of HCC.
The CS1RS as an independent prognostic factor and development of a nomogram for predicting OS in HCC
We conducted univariate and multivariate Cox regression analyses of data from the TCGA-LIHC dataset to determine whether the CS1RS is an independent prognostic factor for HCC (Figure 6A,6B). Both the univariate analysis [hazard ratio (HR) 1.768, confidence interval (CI): 1.472–2.123, P<0.001] and multivariate analysis (HR 1.77, CI: 1.365–2.295, P<0.001) confirmed CS1RS as an independent predictor of OS, affirming its strong prognostic capacity in patients with HCC. To enhance the clinical utility of the CS1RS, we created a nomogram that integrates the CS1RS with other clinical characteristics (Figure 6C). The calibration curves display strong agreement between the nomogram, s predictions and actual outcomes (Figure 6D), while the C index shows stable and superior predictive performance for OS compared to other clinical features across 1- to 10-year timepoints (Figure 6E). DCA confirmed that the nomogram provides greater clinical benefit than other clinical characteristics (Figure 6F).
Exploration of the potential mechanisms of the CS1RS
To better understand the molecular mechanisms underlying the CS1RS and HCC, we performed functional enrichment analysis. GSEA revealed that the low-risk group was enriched in bile acid metabolism, coagulation, and fatty acid metabolism (Figure 7A), while the high-risk group was enriched in allograft rejection, E2F targets, G2M checkpoint, mTORC1 signaling, and MYC targets V1 (Figure 7B). The GSVA results confirmed that the high-risk group had increased activity in MYC targets V1, G2M checkpoint, and mTORC1 signaling, whereas the low-risk group showed elevated activity in bile acid, xenobiotic, and fatty acid metabolism (Figure 7C). Correlation analysis between the CS1RS and hallmark pathways further validated these results (Figure 7D), confirming the strong association between the CS1RS and cancer-related processes and metabolic pathways.
Integration of ITH and genomic mutations for prognostic assessment in HCC
ITH is a genomic characteristic of cancer driven by genetic mutations and associated with increased malignancy and resistance to chemotherapy (19,20). Utilizing the MATH algorithm to quantify ITH in patients with HCC, we observed that high-risk patients had higher MATH scores, indicating greater heterogeneity (Figure 8A). When we examined the correlation between ITH and prognosis, we found that higher MATH scores were associated with significantly poorer OS (log-rank test, P=0.23) (Figure 8B). Combining ITH with the CS1RS indicated that the high-risk + high MATH group had significantly poorer prognosis than the low-risk + low MATH group (log-rank test, P=0.0012 (Figure 8C), highlighting the enhanced predictive power of this combined prognostic approach.
To investigate the genomic differences between the CS1RS subgroups, we compared the mutation landscapes of the high- and low-risk groups and observed distinct mutation patterns (Figure 8D,8E). TP53, which encodes the tumor suppressor p53 protein, becomes a key regulator of cancer-promoting genes when mutated and modifies the tumor immune microenvironment to aid immune evasion (21,22). The mutation frequency of TP53 was 44% in the high-risk group, significantly higher than the 14% observed in the low-risk group. We discovered mutations in CTNNB1, which promotes liver cancer progression through immune evasion (23), in 27% of the high-risk group compared to 25% of the low-risk group. Analyzing the CNVs in the most divergent genes between the immunogenic cell death-related signature subgroups, we observed CNV gains in FLG, RYR2, and HMCN1 and CNV losses in ALB, XIRP2, AXIN1, and CTNNB1 (Figure 8F).
Association between the CS1RS and single-cell characteristics
We investigated the role of the CS1RS in the TME at the single-cell transcriptome level by examining SMS, PSMD1, and YBX1 expression in various cell types. The analysis revealed that these genes are primarily expressed in immune cells, such as macrophages and monocytes, and in HCC tumor cells (Figure 9A). Kyoto Encyclopedia of Genes and Genomes enrichment analysis of DEGs in the high- and low-risk cell groups revealed significant enrichment in phagosomes and lysosomes (Figure 9B), revealing that these organelles may be closely related to the occurrence of HCC, while GSEA highlighted the GO terms enriched in low-risk cells (Figure 9C). Key enriched pathways include angiogenesis, coagulation, complement, MYC targets V1, and the reactive oxygen species pathways.
Immune infiltration analysis and correlation with risk subgroups in HCC
Applying the ESTIMATE algorithm, we assessed immune infiltration in HCC samples by calculating the stromal, immune, and ESTIMATE scores for the CS1RS risk subgroups. We discovered that the high-risk group had significantly elevated immune, ESTIMATE, and tumor purity scores (Figure 10A). Applying the ssGSEA algorithm to assess immune-related pathways, we found that the high-risk group had notably increased activity in the hematopoietic cell lineage, natural kill (NK) cell-mediated cytotoxicity, B and T cell receptor pathways, and leukocyte trans-endothelial migration (Figure 10B).
To explore differences in immune cell infiltration between the high- and low-risk groups, we quantified the immune cell abundance in each sample using the CIBERSORT algorithm (Figure 10C). We observed that the high-risk group had a higher infiltration of CD4 memory activated T cells, regulatory T cells, M0 macrophages, and neutrophils, whereas the low-risk group had a higher prevalence of non-anticancer cells, including naïve B cells, resting CD4 memory T cells, resting NK cells, and resting mast cells. These findings remained consistent when we analyzed them with the ssGSEA and Xcell algorithms (Figure 10D). Furthermore, we found that three genes within the CS1RS were positively correlated with tumor-infiltrating immune cells, showing strong correlations with M0 macrophages and activated CD4 memory T cells (Figure 10E). These results further demonstrated that the CS1RS signature genes were significantly associated with tumor invasion, proliferation, and malignant phenotypes, and were highly correlated with tumor-promoting immune cells such as M0 macrophages and activated CD4 memory T cells. The CS1RS does not merely reflect alterations in cellular composition or tumor purity, but simultaneously captures the intrinsic malignancy of tumor cells and the characteristics of the tumor-promoting immune microenvironment.
Spearman’s correlation analysis conducted to identify immune cell types significantly associated with CS1RS revealed 10 cell types (P<0.05) (Figure 10F). Assessment of the impact of TME cell infiltration on OS in patients with HCC indicated that eight TME cell types were significantly correlated with patient prognosis (log-rank test, P<0.05), underscoring the vital role of TME cell infiltration in HCC prognosis (Figure S2). These results emphasize the importance of these eight immune cell types in HCC prognosis and progression.
Analysis of the correlation between the CS1RS and drug sensitivity
When we evaluated the sensitivity of the CS1RS risk subgroups to the tyrosine kinase inhibitors axitinib, gefitinib, nilotinib, and pazopanib, we observed that the IC50 for axitinib and gefitinib were markedly lower in the low-risk group, suggesting higher sensitivity (Figure 11A,11B). The IC50 values of nilotinib and pazopanib in the high-risk group were relatively low, which means that drug sensitivity is improved (Figure 11C,11D). Additionally, we observed a direct correlation between lower risk scores and reduced IC50 values for axitinib and gefitinib (Figure 11E,11F). The risk score is negatively correlated with IC50 (Figure 11G,11H). The study shows that low-risk patients may benefit more from axitinib and gefitinib, and nilotinib and pazopanib are more effective for high-risk patients.
Validation of SMS/PSMD1/YBX1 expression by immunohistochemistry
Immunohistochemical results from the Human Protein Atlas (Figure 12) show that, compared with normal liver tissue, PSMD1 and YBX1 exhibit enhanced immunoreactivity in HCC tissue, with more extensive positive areas and stronger staining intensity, whereas SMS shows weak expression or is negative in both normal and cancerous tissue. Specifically, PSMD1 exhibited moderate cytoplasmic staining in normal hepatocytes and was strongly positive in some HCC cases; YBX1 exhibited moderate cytoplasmic/membrane staining in normal liver tissue and was strongly positive in 25-75% of tumour cells; SMS was weakly expressed in normal hepatocytes and was negative in HCC cells.
Discussion
Our study presents a cutting-edge approach to discovering prognostic biomarkers in HCC by integrating multiomics data with advanced machine learning and single-cell analysis techniques. Our research employed the MOVICS package to combine mRNA, lncRNA, miRNA, and methylation data, resulting in robust and precise HCC subtype classification. By using 10 advanced clustering algorithms to form consensus subtypes, the detailed experimental procedure is shown in Figure S3, our study enhanced the clinical relevance and robustness of these classifications, facilitating improved patient stratification and treatment outcomes. We achieved molecular characterization of these subtypes using ssGSEA, focusing on key molecular features, particularly in chromatin remodeling and immune response pathways. This approach, combined with differential expression analysis, aided us in identifying biomarkers associated with poor prognosis, which is vital for the development of targeted therapies.
Moreover, our incorporation of single-cell RNA sequencing data enabled the identification of cell-type-specific gene expression patterns within the TME. This high-resolution analysis uncovered cellular dynamics that drive HCC progression, allowing for the discovery of biomarkers that are both prognostic and predictive of therapeutic responses at the single-cell level. Our construction of a CS1RS using a rigorous machine learning framework, followed by its validation across multiple datasets to ensure its accuracy and clinical applicability, also greatly contributed to the field. Using algorithms such as CIBERSORT and ESTIMATE to analyze immune cell infiltration and immunogenicity, we were able to confirm the biomarkers’ potential in guiding immune checkpoint inhibitor therapy and drug sensitivity.
To define prognostic predictions and the immune landscape in patients with HCC, we developed three CS1 features, aiming to further elucidate the potential mechanisms of CS1 in hepatocarcinogenesis. Using machine learning algorithms, we selected the most effective StepCox [backward] + RSF models, identifying three CS1 models that accurately stratified patients with HCC into distinct high- and low-risk groups (SMS, PSMD1, and YBX1) as independent prognostic variables for HCC. Current research demonstrates that SMS overexpression in HCC enhances immunosuppression within the TME by upregulating immune checkpoints and increasing the infiltration of immunosuppressive cells, thereby reducing the efficacy of immune checkpoint blockade therapies and contributing to tumor progression (24). PSMD1 is overexpressed in HCC, promoting tumor growth by regulating protein degradation and ubiquitination (25). YBX1 has also been confirmed to play a critical role in HCC malignancy and poor outcomes by enhancing tumor cell survival and proliferation, as well as by interacting with circRNA-SORE, which prevents YBX1 degradation, leading to increased drug resistance, tumor growth, and progression (26). In accordance with these findings, we found that high expression levels of SMS, PSMD1, and YBX1 were associated with poor prognosis in patients with HCC. Consequently, the high levels of immune infiltration and strong expression of immune checkpoints observed in the CS1 subtype, despite being associated with a poor prognosis, may be linked to the role of SMS in reducing the efficacy of immune checkpoint inhibitor therapy, as well as to the role of YBX1 in enhancing tumour resistance and promoting tumour growth and progression. ROC curve and calibration curve analyses confirmed that the three CS1 features demonstrated high accuracy in prognostic prediction. To demonstrate the practical application and enhance the predictive potential of these CS1 features in determining HCC prognosis, we developed a nomogram plot incorporating clinical parameters and risk levels. Our findings provide evidence that these three CS1 features can guide personalized immunotherapy and targeted therapy in patients with HCC to improve outcomes.
Omics approaches and machine learning algorithms are widely used tools in cancer research, having demonstrated notable efficacy in identifying diagnostic and prognostic biomarkers and features (27-29). The CS1RS that we identified using these methods exhibits robust performance, as evidenced by ROC curve analysis, highlighting its potential as a powerful prognostic tool in HCC. With AUC values of 0.73, 0.67, and 0.67 at 1-, 2-, and 3-year intervals, respectively, in the training set and consistent results in the ICGC dataset, the CS1RS exhibits remarkable discriminatory capabilities. Its performance was superior to that of traditional clinical characteristics, such as age, gender, cancer stage, and cancer grade, further emphasizing the value of integrating multiomics and machine learning approaches in clinical practice. The significant correlations observed between the CS1RS and advanced clinical stages suggest that this signature could be particularly valuable in identifying high-risk patients, who may benefit from more effective treatment strategies. Additionally, the strong prognostic ability of the CS1RS across various subgroups, as confirmed by KM curve analysis, indicates its broad applicability across diverse patient populations. Given the confirmation of these findings by external databases such as the ICGC, the CS1RS could serve as a reliable biomarker for guiding personalized treatment strategies in HCC, potentially improving patient outcomes through more tailored therapeutic interventions.
In this study, we conducted functional enrichment analysis to explore the molecular mechanisms linking the CS1RS with prognosis in HCC. GSEA analysis revealed that the low-risk group showed enrichment in bile acid metabolism, coagulation, and fatty acid metabolism, while the high-risk group exhibited significant enrichment in graft rejection, E2F targets, G2M checkpoint, mTORC1 signaling, and MYC targets V1. These findings suggest that the significant differences in BPs and metabolic pathways between the risk groups may be closely related to their prognoses. Enrichment of bile acid metabolism and fatty acid metabolism in the low-risk group suggests a more stable metabolic state, which plays a critical role in normal liver function. Previous studies have shown that dysregulation of bile acid and fatty acid metabolism is closely associated with liver diseases, and abnormalities in these pathways may influence tumorigenesis and progression (30-32). Therefore, enrichment of these metabolic pathways in the low-risk group may reflect lower tumor malignancy and a better prognosis. In contrast, enrichment of E2F targets, G2M checkpoint, mTORC1 signaling, and MYC targets V1 in the high-risk group suggests that these patients may experience more active proliferative signaling and abnormal cell cycle regulation. E2F targets and the G2M checkpoint play key roles in cell cycle regulation and tumor proliferation, and their aberrant activation is often associated with rapid tumor growth and poor prognosis (33,34). Additionally, the enrichment of mTORC1 signaling and MYC targets V1 further supports the possibility of tumor growth and metabolic dysregulation in the high-risk group, as these pathways are widely recognized as critical regulators of cell growth and metabolic reprogramming in various cancer types (35,36).
The GSVA results confirmed these findings, showing that the high-risk group exhibited greater activity in tumor growth and cell cycle-related pathways such as MYC_TARGETS_V1 and G2M_CHECKPOINT, while the low-risk group exhibited greater activity in metabolism-related pathways such as BILE_ACID_METABOLISM, XENOBIOTIC_METABOLISM, and FATTY_ACID_METABOLISM. These results underscore the close relationship between metabolic differences and tumor biological behavior across different risk groups. Additionally, the correlation analysis results between the CS1RS and tumor marker pathway scores support these conclusions, indicating that the CS1RS is closely related to cancer-associated BPs and metabolic pathways. These findings not only reveal the molecular mechanisms underlying HCC prognosis but also provide a more precise foundation for guiding personalized treatment and developing new therapeutic targets based on the CS1RS in the future.
We analyzed ITH in patients with HCC and explored its relationship with prognosis and the mutational landscape. Driven by differences in gene expression, ITH influences tumor growth, invasion, and metastasis (37). Previous studies have shown that ITH not only increases genetic diversity, promoting cancer progression, but also contributes to drug resistance (38). When we quantified ITH in patients with HCC using the MATH algorithm, we found that patients in the high-risk group had significantly higher MATH scores, as well as that those with higher MATH scores exhibited significantly shorter OS than those with lower scores. Combining ITH with the CS1RS, we observed that patients in the high-risk + high MATH group had significantly worse prognosis than those in the low-risk + low MATH group, suggesting that these two indicators together provide a more accurate assessment of HCC prognosis than that of one indicator alone. In this study, the high-risk group in the CS1RS analysis exhibited elevated immune scores, ESTIMATE scores and tumour purity simultaneously. This may be attributed to the fact that, in the high-risk group, not only was there increased immune cell infiltration, but the tumour cells themselves also exhibited greater malignancy and proliferative activity. Consequently, within the overall transcriptomic profile, both tumour-intrinsic and immune-infiltration signals were elevated; this requires further clarification through cellular and animal experiments.
Moreover, our incorporation of single-cell RNA sequencing data enabled the identification of cell-type-specific gene expression patterns within the TME. This high-resolution analysis uncovered cellular dynamics that drive HCC progression, allowing for the discovery of biomarkers that are both prognostic and predictive of therapeutic responses at the single-cell level. Our construction of a CS1RS using a rigorous machine learning framework, followed by its validation across multiple datasets to ensure its accuracy and clinical applicability, also greatly contributed to the field. Using algorithms such as CIBERSORT and ESTIMATE to analyze immune cell infiltration and immunogenicity, we were able to confirm the biomarkers’ potential in guiding immune checkpoint inhibitor therapy and drug sensitivity.
Although this study made several notable findings regarding the integration of multiomics and machine learning approaches for prognostic biomarker discovery in HCC, several limitations should be acknowledged. First, the use of a limited sample size in single-cell analysis may not have fully captured tumor heterogeneity, limiting the generalizability of the findings. Second, the interpretability of machine learning models might have been insufficient, potentially affecting their clinical applicability. Finally, the technical challenges of integrating different omics data could have affected the reproducibility and reliability of the results. Further validation is needed for clinical translation, particularly in assessing the generalizability and effectiveness of the findings across diverse patient populations.
Conclusions
The innovative integration of multiomics data, single-cell analysis, and advanced machine learning techniques in this study provides a comprehensive framework for discovering and validating prognostic biomarkers in HCC. These innovations are poised to significantly impact personalized treatment strategies, offering hope for improved outcomes in patients with HCC.
Acknowledgments
We thank LetPub (www.letpub.com.cn) for its linguistic assistance during the preparation of this manuscript.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2870/rc
Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2870/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2870/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- McGlynn KA, Petrick JL, El-Serag HB. Epidemiology of Hepatocellular Carcinoma. Hepatology 2021;73:4-13.
- Sung H, Ferlay J, Siegel RL, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49. [Crossref] [PubMed]
- Teng W, Wu TC, Lin SM. Hepatocellular carcinoma systemic treatment update: From early to advanced stage. Biomed J 2025;48:100815. [Crossref] [PubMed]
- Xu F, Jin T, Zhu Y, et al. Immune checkpoint therapy in liver cancer. J Exp Clin Cancer Res 2018;37:110. [Crossref] [PubMed]
- Xu G, Feng D, Yao Y, et al. Listeria-based hepatocellular carcinoma vaccine facilitates anti-PD-1 therapy by regulating macrophage polarization. Oncogene 2020;39:1429-44. [Crossref] [PubMed]
- Bujak JK, Pingwara R, Nelson MH, et al. Adoptive cell transfer: new perspective treatment in veterinary oncology. Acta Vet Scand 2018;60:60. [Crossref] [PubMed]
- Kim DY. Changing etiology and epidemiology of hepatocellular carcinoma: Asia and worldwide. J Liver Cancer 2024;24:62-70. [Crossref] [PubMed]
- Stuart T, Butler A, Hoffman P, et al. Comprehensive Integration of Single-Cell Data. Cell 2019;177:1888-1902.e21. [Crossref] [PubMed]
- Jin S, Guerrero-Juarez CF, Zhang L, et al. Inference and analysis of cell-cell communication using CellChat. Nat Commun 2021;12:1088. [Crossref] [PubMed]
- Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 2013;14:7. [Crossref] [PubMed]
- Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43:e47. [Crossref] [PubMed]
- Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559. [Crossref] [PubMed]
- Peterson RK, Ng R, Ludwig NN, Jacobson LA. Working Memory and Processing Speed Predict Math Skills in Pediatric Brain Tumor Survivors. J Pediatr Hematol Oncol 2023;45:e350-5. [Crossref] [PubMed]
- Liu A, Gao Y, Wang Q, et al. The heterogeneity and clonal evolution analysis of the advanced prostate cancer with castration resistance. J Transl Med 2023;21:641. [Crossref] [PubMed]
- Li L, Chen C, Wang X. DITHER: an algorithm for Defining IntraTumor Heterogeneity based on EntRopy. Brief Bioinform 2021;22:bbab202. [Crossref] [PubMed]
- Ritz T, Stueven AK, Demes M, et al. Proteomic subtyping highlights tumor heterogeneity of human hepatocellular carcinoma. Virchows Arch 2025;487:959-969. [Crossref] [PubMed]
- Chen B, Khodadoust MS, Liu CL, et al. Profiling Tumor Infiltrating Immune Cells with CIBERSORT. Methods Mol Biol 2018;1711:243-59. [Crossref] [PubMed]
- Geeleher P, Cox N, Huang RS. pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels. PLoS One 2014;9:e107468. [Crossref] [PubMed]
- Safri F, Nguyen R, Zerehpooshnesfchi S, et al. Heterogeneity of hepatocellular carcinoma: from mechanisms to clinical implications. Cancer Gene Ther 2024;31:1105-12. [Crossref] [PubMed]
- Zhang X, Wang Y, Li H, et al. Proteomic subtyping highlights tumor heterogeneity of human hepatocellular carcinoma. Virchows Arch 2025;487:959-969.
- Hassin O, Oren M. Drugging p53 in cancer: one protein, many targets. Nat Rev Drug Discov 2023;22:127-44. [Crossref] [PubMed]
- Veeraiyan D, Kurpad V, Munirathnam V, et al. Immune infiltration in TP53 missense mutant contributes to poor prognosis in hepatocellular carcinoma, unlike CTNNB1 mutations. Malig Spect 2025;2:117-27.
- Cai N, Cheng K, Ma Y, et al. Targeting MMP9 in CTNNB1 mutant hepatocellular carcinoma restores CD8(+) T cell-mediated antitumour immunity and improves anti-PD-1 efficacy. Gut 2024;73:985-99. [Crossref] [PubMed]
- Xiang L, Piao L, Wang D, et al. Overexpression of SMS in the tumor microenvironment is associated with immunosuppression in hepatocellular carcinoma. Front Immunol 2022;13:974241. [Crossref] [PubMed]
- Kim MY, Park ER, Cho EH, et al. Depletion of proteasome subunit PSMD1 induces cancer cell death via protein ubiquitination and DNA damage, irrespective of p53 status. Sci Rep 2024;14:7997. [Crossref] [PubMed]
- Xu J, Ji L, Liang Y, et al. CircRNA-SORE mediates sorafenib resistance in hepatocellular carcinoma by stabilizing YBX1. Signal Transduct Target Ther 2020;5:298. [Crossref] [PubMed]
- Liu J, Shi Y, Zhang Y. Multi-omics identification of an immunogenic cell death-related signature for clear cell renal cell carcinoma in the context of 3P medicine and based on a 101-combination machine learning computational framework. EPMA J 2023;14:275-305. [Crossref] [PubMed]
- Chu G, Ji X, Wang Y, et al. Integrated multiomics analysis and machine learning refine molecular subtypes and prognosis for muscle-invasive urothelial cancer. Mol Ther Nucleic Acids 2023;33:110-26. [Crossref] [PubMed]
- Liu Z, Liu L, Weng S, et al. Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer. Nat Commun 2022;13:816. [Crossref] [PubMed]
- Lu Y, Feng X, Wang Z, et al. Bile acid metabolism and hepatocellular carcinoma: mechanisms of drug resistance and intervention strategies. Precis Clin Med 2025;8:pbaf020. [Crossref] [PubMed]
- Najt CP, Adhikari S, Heden TD, et al. Organelle interactions compartmentalize hepatic fatty acid trafficking and metabolism. Cell Reports 2023;42:112435.
- Seo J, Jeong DW, Park JW, et al. Fatty-acid-induced FABP5/HIF-1 reprograms lipid metabolism and enhances the proliferation of liver cancer cells. Commun Biol 2020;3:638. [Crossref] [PubMed]
- Jia X, Zhu X, Chen S, et al. Comprehensive multi-omics analyses expose a precision therapy strategy that targets replication stress in hepatocellular carcinoma using WEE1 inhibition. J Adv Res 2025;78:497-515. [Crossref] [PubMed]
- Huang P, Wen F, Li Q. LncRNAs and E2F1: interactions and regulatory networks involved in cancer progression. Front Pharmacol 2024;15:1371036.
- Liu HJ, Du H, Khabibullin D, et al. mTORC1 upregulates B7-H3/CD276 to inhibit antitumor T cells and drive tumor immune evasion. Nat Commun 2023;14:1214. [Crossref] [PubMed]
- Schulze A, Oshi M, Endo I, et al. MYC Targets Scores Are Associated with Cancer Aggressiveness and Poor Survival in ER-Positive Primary and Metastatic Breast Cancer. Int J Mol Sci 2020;21:8127. [Crossref] [PubMed]
- Gavish A, Tyler M, Greenwald AC, et al. Hallmarks of transcriptional intratumour heterogeneity across a thousand tumours. Nature 2023;618:598-606. [Crossref] [PubMed]
- Dentro SC, Leshchiner I, Haase K, et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 2021;184:2239-2254.e39. [Crossref] [PubMed]
- Voskarides K, Giannopoulou N. The Role of TP53 in Adaptation and Evolution. Cells 2023;12:512. [Crossref] [PubMed]
- Oversoe SK, Clement MS, Weber B, et al. Combining tissue and circulating tumor DNA increases the detection rate of a CTNNB1 mutation in hepatocellular carcinoma. BMC Cancer 2021;21:376. [Crossref] [PubMed]

