Interpretable machine learning-based survival prediction and key gene identification in cancer using gene expression and clinical data
Original Article

Shicai Liu1, Han Zhang2

1School of Medical Information, Wannan Medical College, Wuhu, China; 2School of Basic Medical Sciences, Wannan Medical College, Wuhu, China

Contributions: (I) Conception and design: S Liu; (II) Administrative support: S Liu; (III) Provision of study materials: Both authors; (IV) Collection and assembly of data: Both authors; (V) Data analysis and interpretation: Both authors; (VI) Manuscript writing: Both authors; (VII) Final approval of manuscript: Both authors.

Correspondence to: Shicai Liu, PhD. School of Medical Information, Wannan Medical College, No. 22 Wenchang West Road, Wuhu 241002, China. Email: liushicainj@163.com; Han Zhang, MS. School of Basic Medical Sciences, Wannan Medical College, No. 22 Wenchang West Road, Wuhu 241002, China. Email: zhanghan@wnmc.edu.cn.

Background: Gastrointestinal cancer is a common malignant tumor with high incidence and poor prognosis. Accurate prediction of prognosis can improve the treatment of cancer patients, but the clinical features currently used provide insufficient information. This study aimed to establish an efficient survival prediction model for gastrointestinal cancer based on gene expression and clinical data.

Methods: Based on the gastrointestinal cancer samples in The Cancer Genome Atlas, we established efficient gastrointestinal cancer survival prediction models with gene expression profiling data as input molecular features. A series of bioinformatics methods were applied to conduct a comprehensive analysis of the identified gastrointestinal cancer-related genes. The molecular mechanism by which newly identified gastrointestinal cancer-related genes mediate cancer occurrence was preliminarily explored.

Results: The random forest-based model (I) achieved an accuracy of 94.98% with a Matthews correlation coefficient (MCC) of 0.8995. The support vector machine-based model (II) achieved an accuracy of 94.98% with an MCC of 0.9000. Survival differed significantly between the two identified subtypes (S1 and S2, with 3-year survival rates of ≥75% and ≤45%, respectively), and these subtypes had independent predictive value for patient survival. The models constructed in this study are inherently interpretable. Twenty key genes related to gastrointestinal cancer were identified. Comprehensive functional analysis provided important clues to the potential mechanisms by which these cancer-related genes act in tumor initiation and progression. Most importantly, drug target prediction for these genes identified potential targeted drugs for seven of them (NR3C1, HNF4A, DNAAF9, CDX2, ATP2B4, RBMS3, LIFR).

Conclusions: The findings of this study hold significant implications for predicting survival and treatment decisions in gastrointestinal cancer.

Keywords: Gastrointestinal cancer; interpretable machine learning; transcriptomics; survival prediction; prognosis


Submitted Oct 09, 2025. Accepted for publication Dec 23, 2025. Published online Feb 04, 2026.

doi: 10.21037/tcr-2025-aw-2200


Highlight box

Key findings

• Developed two highly accurate survival prediction models for gastrointestinal cancer using gene expression data.

• Identified two distinct molecular subtypes with significantly different survival outcomes.

• Successfully identified 20 key genes significantly associated with gastrointestinal cancer prognosis.

• Predicted potential targeted drugs for 7 key genes (NR3C1, HNF4A, DNAAF9, CDX2, ATP2B4, RBMS3, LIFR).

What is known and what is new?

• Clinical features alone provide insufficient information for accurate gastrointestinal cancer prognosis prediction. Gene expression profiling holds potential but requires robust modeling.

• This study establishes novel molecular feature-based models with exceptional accuracy and inherent interpretability. It defines prognostically critical subtypes validated for independent predictive value and integrates findings with functional analysis and drug target prediction, directly linking prognosis to potential therapeutic strategies for specific genes.

What is the implication, and what should change now?

• The models and identified subtypes provide powerful tools for personalized survival prediction, while the key genes and their predicted targeted drugs offer direct leads for developing novel therapeutic strategies tailored to molecular profiles.


Introduction

Cancer represents a significant global public health challenge (1). According to the latest data, there were close to 20 million new cancer cases worldwide in 2022 (The Global Cancer Observatory, https://gco.iarc.fr/). In 2022, 49.2% of new cancer cases and 56.1% of cancer deaths occurred in Asia, which accounted for 59.2% of the world’s population. Compared with other regions, Asia's share of cancer deaths (56.1%) is higher than its share of new cases (49.2%). Among the ten cancers with the highest incidence and mortality, five are gastrointestinal cancers, namely liver, stomach, esophageal, pancreatic, and colorectal cancer (2). In China, according to the 2022 cancer statistics, gastrointestinal cancer accounted for 33.52% of new cancer cases and 44.10% of cancer deaths (3). Cancer has become a major burden on the Chinese healthcare system, and gastrointestinal cancer is the most common cause of death among cancer patients (4). The incidence of gastrointestinal cancer is high in China, and most patients are already at an advanced stage when diagnosed (5,6). Our laboratory previously established a two-layer early diagnosis model for gastrointestinal cancer (7). Although early-stage cancers can be treated effectively with surgery and radiation, advanced cancers are often uncontrollable (8).

Accurate survival prediction following a cancer diagnosis is crucial for both medical professionals and patients. The prognosis largely depends on the malignancy of the cancer cells, making it essential to assess the disease’s severity and anticipate its progression (9). Reliable survival prediction enables patients to establish realistic goals, pursue timely preventive measures, and receive appropriate treatments, thus minimizing the risk of making ineffective treatment choices, such as overtreatment. However, predicting survival is a difficult task because of the high level of heterogeneity of cancer cells and the complex etiology. In general, cancer patients are divided into different groups according to tumor staging systems, such as Tumor-Node-Metastasis (TNM) staging (10).

However, molecular features are rarely considered in this staging system, and there are a number of patient subtypes with different survival outcomes within a given TNM staging. In addition, clinical features (e.g., TNM staging) do not provide sufficient predictive information to accurately predict survival. For instance, an analysis of several clinical factors from The Cancer Genome Atlas (TCGA) dataset (such as age at initial diagnosis, gender, race, and clinical stage) in patients with gastrointestinal cancer revealed that while clinical stage was the only factor significantly associated with survival, its ability to discriminate survival outcomes was limited (Table 1). As a result, there is a pressing need for the development of more effective cancer survival prediction strategies.

Table 1

Clinical information

| Characteristics | Patients (n=1,710) (100%) | Log-rank P value |
|---|---|---|
| Age (mean ± SD, years) | 64.55±12.42 | |
| Gender, n (%) | | |
| Female | 651 (38.07) | 0.20 |
| Male | 1,059 (61.93) | |
| Stage, n (%) | | |
| Stage I | 382 (22.34) | <0.0001 |
| Stage II | 643 (37.60) | |
| Stage III | 466 (27.25) | |
| Stage IV | 152 (8.89) | |
| Grade, n (%) | | |
| Grade 1 | 111 (6.49) | 0.003 |
| Grade 2 | 480 (28.07) | |
| Grade 3 | 438 (25.61) | |
| Grade 4 | 16 (0.94) | |
| Race, n (%) | | |
| Asian | 294 (17.19) | 0.08 |
| White | 997 (58.30) | |
| Black or African American | 109 (6.37) | |
| American Indian or Alaska Native | 2 (0.12) | |
| Native Hawaiian or other Pacific Islander | 1 (0.06) | |
| Risk factors, n | | |
| HBV | 95 | |
| HCV | 52 | |
| Alcohol | Yes: 280; no: 111 | 0.40 |
| Smoking | 300 | |
| Family history of cancer | Yes: 77; no: 314 | 0.20 |
| Others | 48 | |
| Survival, n | | |
| Alive | 1,133 | |
| Dead | 577 | |

HBV, hepatitis B virus; HCV, hepatitis C virus; SD, standard deviation.

Molecular features (such as gene expression levels, non-coding RNA expression, and gene mutations) carry extensive information about cancer cells, including their degree of malignancy, metastatic ability, and sensitivity to treatment. Some cancers (e.g., colorectal, cervical, and breast cancer) have been classified into different subtypes using molecular profiles from cancer data repositories (e.g., TCGA) (11-13). This underscores the potential of molecular signature-based prediction models to improve the accuracy of tumor survival prediction. However, most studies did not use survival information when defining molecular subtypes (14). Instead, survival data were often used retrospectively to evaluate the clinical relevance of these subtypes (15). As a result, some molecular subtypes display similar survival patterns, which makes them redundant in terms of survival differences (16). Gastrointestinal cancer research therefore urgently needs new approaches that discover molecular subtypes from omics data in a survival-sensitive way. Furthermore, machine learning offers solutions to improve the accuracy of cancer survival predictions (17).

In this study, based on the gastrointestinal cancer samples in TCGA, we established efficient gastrointestinal cancer survival prediction models with gene expression profiling data as input molecular features. We found a significant difference in survival between the two subtypes. These subtypes have independent predictive value for patient survival. Twenty key genes related to gastrointestinal cancer were successfully identified. The comprehensive functional analysis in this study provides important clues for elucidating the potential mechanisms of action of the selected cancer-related genes in tumor initiation and progression. Most importantly, we conducted drug target predictions for these genes and successfully identified potential targeted drugs for seven genes. In conclusion, the study findings provide valuable insights into the molecular basis of gastrointestinal cancer survival and suggest potential avenues for personalized treatment strategies. The accuracy and prognostic value of the developed models suggested that they could be used as a valuable tool for guiding clinical decision-making in the management of gastrointestinal cancer patients. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2200/rc).


Methods

Data source and collation

The gene expression profiles and related clinical information of gastrointestinal cancer samples were downloaded from the TCGA Data Portal (https://cancergenome.nih.gov/), comprising 1,710 gastrointestinal cancer tissue samples (Figure 1). The RNA expression data were normalized as Fragments Per Kilobase per Million (FPKM), a normalization method for raw count data. To prepare for analysis, the FPKM-normalized gene expression data were log-transformed. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
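The log transformation mentioned above can be illustrated with a minimal sketch. The text does not state the logarithm base or whether a pseudocount was added, so the base-2 logarithm and the pseudocount of 1 used here are assumptions, and the function name `log2_fpkm` is hypothetical.

```python
import math

def log2_fpkm(fpkm_values, pseudocount=1.0):
    """Return log2(FPKM + pseudocount) for each value.

    The base (2) and the pseudocount (1) are illustrative assumptions;
    the pseudocount keeps zero expression values finite.
    """
    return [math.log2(v + pseudocount) for v in fpkm_values]

example_fpkm = [0.0, 1.0, 3.0, 1023.0]
transformed = log2_fpkm(example_fpkm)
print(transformed)  # [0.0, 1.0, 2.0, 10.0]
```

A pseudocount is a common choice because many genes have an FPKM of zero in a given sample, for which the logarithm is undefined.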

Figure 1 Methods flowchart. Flow diagram outlining the methods of this work. MDA, mean decrease in accuracy; PCA, principal component analysis; S1, subtype 1; S2, subtype 2.

Subtypes identification

First, we used principal component analysis (PCA) to reduce the dimensionality of the data. The gene expression dataset of gastrointestinal cancer samples, comprising 37,218 identifiers and 1,710 samples, was used as the input for the PCA. Following dimensionality reduction, we constructed a univariate Cox proportional hazards (Cox-PH) model for each of the transformed features. Features that resulted in a significant Cox-PH model (log-rank P value <0.05) were selected for further analysis. The Cox-PH model was implemented using the survival package in R. Next, we used these selected features to cluster the samples with the K-means algorithm, implemented with the factoextra package in R. The optimal number of clusters was determined using two methods: “silhouette” (average silhouette width) and “wss” (total within-cluster sum of squares).
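The cluster-number selection described above can be sketched as follows. This is only an illustration on simulated two-dimensional points, not the study's pipeline: the study used the factoextra package in R on survival-associated PCA features, whereas the code below is a plain Lloyd's k-means written in Python, and all function names and the toy data are assumptions.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def wss(points, labels, centers):
    """Total within-cluster sum of squared distances."""
    return sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

def mean_silhouette(points, labels):
    """Average silhouette width over all points."""
    used = sorted(set(labels))
    if len(used) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        mean_d = {}
        for c in used:
            d = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels)) if l == c and j != i]
            mean_d[c] = sum(d) / len(d) if d else None
        a = mean_d[labels[i]]
        if a is None:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        b = min(d for c, d in mean_d.items() if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two well-separated groups (hypothetical, for illustration only).
rng = random.Random(1)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(20)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(20)]

wss_by_k, sil_by_k = {}, {}
for k in range(2, 5):
    labels, centers = kmeans(points, k)
    wss_by_k[k] = wss(points, labels, centers)
    sil_by_k[k] = mean_silhouette(points, labels)

best_k = max(sil_by_k, key=sil_by_k.get)
print("best k by silhouette:", best_k)
```

The silhouette criterion picks the k with the largest average width, while the “wss” criterion is usually read from the elbow of the curve of within-cluster sum of squares against k.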

Kaplan-Meier analysis

Kaplan-Meier analysis was employed to estimate the survival rates of stratified patient groups and generate survival curves. This analysis was conducted using the survival package in R, with log-rank P values reported for each comparison.
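For readers unfamiliar with the estimator, a minimal sketch of the Kaplan-Meier product-limit calculation is given below. The study used the survival package in R; this Python function, its name, and the six-patient example are illustrative assumptions only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

# Hypothetical follow-up data (arbitrary time units).
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored patients leave the risk set without changing the survival estimate, which is what distinguishes this estimator from a simple proportion of survivors.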

Feature selection

In selecting key features for classification, we applied the mean decrease in accuracy (MDA), calculated using the randomForest package (18). MDA reflects the average reduction in classification accuracy on Out-Of-Bag (OOB) samples when the values of a specific feature are randomly shuffled. A higher MDA value indicates greater importance of the feature. Thus, permutation-based MDA serves as an effective method for assessing the contribution of each feature to the classification performance.
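The permutation idea behind MDA can be sketched in a few lines. Note the simplifications: the randomForest package computes the decrease on the OOB samples of each tree, whereas this sketch uses a single held-out set and a nearest-centroid classifier as a stand-in for the forest. All names and the simulated data are assumptions.

```python
import random

def nearest_centroid_fit(X, y):
    """Compute one centroid per class (a stand-in for a trained forest)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, centroids[l])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == l for x, l in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
            drops.append(base - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Simulated data: feature 0 separates the classes, feature 1 is pure noise.
rng = random.Random(0)

def make_data(n):
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(4.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(100)
model = nearest_centroid_fit(X_train, y_train)
imp = permutation_importance(model, X_test, y_test)
print("importance per feature:", imp)
```

Shuffling the informative feature destroys its link to the label and accuracy falls sharply, while shuffling the noise feature leaves accuracy almost unchanged.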

Prediction model

Machine learning techniques, particularly random forest (RF) and support vector machine (SVM), have gained prominence in biomedical research (19-21). In this study, RF and SVM were each used to model the multivariate relationship between gene expression and the survival subtypes of cancer patients. Both models were implemented in R: the RF algorithm was run with the randomForest package, and the SVM with the e1071 package. These packages provide efficient and reliable implementations of the respective algorithms.

RF (18) is an ensemble learning method. It builds multiple decision trees, each grown from a bootstrap sample of the training data. At each node, the algorithm considers only a random subset of the predictor variables as candidates for the split. This approach preserves the strength of the individual trees while reducing the correlation between them. As a result, RF maintains the predictive power of the model while mitigating the overfitting risk associated with highly correlated trees.

SVM (22) is a supervised learning algorithm. Its basic principle is to map the data into a high-dimensional space in which they become linearly separable, using a kernel function, and then to find the optimal linear decision function in that space. SVM seeks the hyperplane with the maximum margin between the two classes of samples.

Evaluating performance

Once the models were developed, their performance was evaluated using several metrics: accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). These metrics were calculated with the following formulas:

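$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$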

Here, TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively.
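As a worked example of these four metrics, the sketch below computes them from a confusion matrix. The article does not report confusion matrices, so the counts used here are hypothetical; they were chosen so that the rounded outputs agree with the values reported for model (I) on the training dataset.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical counts (not taken from the article).
m = classification_metrics(tp=116, fp=6, tn=111, fn=6)
for name, value in m.items():
    print(name, round(value, 4))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.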

In addition to these standard evaluation metrics, we assessed the accuracy of survival prediction within the identified subgroups using three additional measures: concordance index (C-index), Log-rank P value of Cox-PH, and Brier score.
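Two of these survival-specific measures can be sketched briefly. The code below implements Harrell's C-index and a simplified Brier score at a fixed time point. It is an assumption-laden illustration: practical Brier score implementations weight for censoring (inverse probability of censoring weighting), whereas this sketch simply skips patients censored before the evaluation time, and the four-patient example is hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i has an observed event
    before patient j's follow-up time. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def brier_score(times, events, surv_prob, t):
    """Mean squared error between predicted survival at t and observed status.

    Simplification: patients censored at or before t are skipped rather
    than reweighted.
    """
    squared_errors = []
    for ti, ei, p in zip(times, events, surv_prob):
        if ti <= t and ei == 0:
            continue  # status at t is unknown
        alive = 1.0 if ti > t else 0.0
        squared_errors.append((p - alive) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical patients: follow-up time, event indicator, predicted risk,
# and predicted probability of surviving beyond t = 5.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.4, 0.1]
surv_at_5 = [0.2, 0.6, 0.7, 0.9]

c_index = concordance_index(times, events, risk)
brier = brier_score(times, events, surv_at_5, t=5)
print(round(c_index, 4), round(brier, 4))  # 0.8 0.125
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, whereas a lower Brier score indicates better calibrated survival probabilities.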

The model’s performance was further validated using a five-fold cross-validation technique. To ensure the robustness of the results, an independent dataset consisting of 513 samples, randomly selected from a pool of 1,710 cancer samples, was created. These samples were not involved in the model’s training or parameter optimization process, providing an unbiased evaluation of the model's predictive capabilities.
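The data partitioning can be sketched as follows. The sample counts (1,710 in total, 513 held out) come from the text; applying the five folds to the remaining 1,197 samples, the random seed, and the function name are assumptions made for illustration.

```python
import random

def holdout_and_folds(n_samples, n_holdout, n_folds, seed=0):
    """Randomly set aside a hold-out set, then split the rest into folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]
    # Deal the remaining indices round-robin into n_folds groups.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdout, folds

holdout, folds = holdout_and_folds(n_samples=1710, n_holdout=513, n_folds=5)

for i, validation_fold in enumerate(folds):
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(f"fold {i}: train={len(training)}, validate={len(validation_fold)}")
```

Keeping the hold-out indices out of every fold is what makes the independent evaluation unbiased with respect to training and parameter tuning.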

Functional analysis

Gene Ontology (GO) analysis was performed on the target gene list using DAVID (23). DAVID integrates various biological database resources to identify terms that are significantly enriched at the three ontological levels of Biological Process, Cellular Component, and Molecular Function.
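The statistical idea behind term enrichment can be illustrated with a hypergeometric tail probability. DAVID reports a modified Fisher exact P value (the EASE score), so the plain hypergeometric test below is a simplified stand-in rather than DAVID's exact calculation, and all numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_list, n_hits):
    """P(X >= n_hits) for X ~ Hypergeometric.

    n_background: genes in the background set
    n_annotated:  background genes annotated with the term
    n_list:       genes in the query list
    n_hits:       query genes annotated with the term
    """
    upper = min(n_annotated, n_list)
    favourable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_list - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / comb(n_background, n_list)

# Hypothetical example: 3 of 20 query genes carry a term that annotates
# 200 of 20,000 background genes.
p_value = hypergeom_enrichment_p(20000, 200, 20, 3)
print(f"enrichment P value: {p_value:.2e}")
```

A small P value means the term appears in the gene list more often than expected by chance given its frequency in the background.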

Pathway analysis was performed by the Reactome Knowledgebase (https://reactome.org) (24). Reactome is a high-quality, manually curated database that contains rich annotations for classic signaling, metabolic pathways, and biological processes.

The STRING database (25) was used to construct a protein-protein interaction (PPI) network for the target gene list. STRING integrates large-scale PPI information from various sources, including experimental verification, database mining, co-expression analysis, and literature prediction.

Drug prediction

Drug prediction was performed using the Drug-Gene Interaction Database (DGIdb, https://dgidb.org) (26). DGIdb integrates drug-gene interaction, target, and related annotation information from multiple authoritative databases, providing strong support for drug repositioning and targeted therapy research.

Gene expression analysis

Gene expression analysis across different TCGA cancers was performed using the Gene_DE module of TIMER3.0 (Tumor Immune Estimation Resource 3.0) (27). TIMER3.0 is freely accessible at https://compbio.cn/timer3/. Single-cell RNA expression analysis was conducted using TISCH2 (Tumor Immune Single-cell Hub 2; http://tisch.comp-genomics.org/), a comprehensive database focused on the tumor microenvironment (28).

Statistical analysis

All statistical analyses were performed using R software (version 4.3.0) and its associated packages. A P value less than 0.05 was considered statistically significant.


Results

Clinical information analysis

We downloaded the clinical characteristics and survival information of gastrointestinal cancer patients from TCGA; detailed clinical information is shown in Table 1. Among the clinical variables analyzed, only stage (log-rank P value <0.0001) and grade (log-rank P value =0.003) were significantly associated with survival.

Subtypes identification

As outlined in the “Methods” section, we applied PCA to reduce the dimensionality of the high-dimensional data, resulting in 100 transformed features. We then performed univariate Cox-PH regression analysis on each of these 100 features. Fifteen features were significantly associated with survival (log-rank P value <0.05). K-means clustering was then performed on these 15 features, with k ranging from 2 to 8. Both the “silhouette” and “wss” metrics gave their best scores at k=2 (Figure 2), indicating that k=2 is optimal. In addition, survival analysis for k from 2 to 8 showed that k=2 divided the patients into two groups with significantly different survival (Figure 3). Therefore, we used the two clusters obtained at k=2 as the classification labels for the subsequent supervised machine learning.

Figure 2 Selection of the best sub cluster K. (A) Selection of the best sub cluster K according to “silhouette” score. (B) Selection of the best sub cluster K according to “wss” score.
Figure 3 Kaplan-Meier plots show the separation of subtypes in terms of survival profiles from k=2 to 8. S1, subtype 1; S2, subtype 2; S3, subtype 3; S4, subtype 4; S5, subtype 5; S6, subtype 6; S7, subtype 7; S8, subtype 8.

Feature selection

For feature selection, we used the MDA method. After filtering, 20 genes identified by MDA were retained for further analysis (Figure 4A). In the heatmap of this gene set, the two subtypes showed clearly distinct expression patterns (Figure 4B). Database and literature searches showed that 16 of the 20 genes identified in this study are already known to be related to gastrointestinal cancer. We therefore infer that the remaining 4 genes may also be related to gastrointestinal cancer (Figure 4C). Detailed results of the database and literature searches are provided in Table S1.

Figure 4 Feature selection. (A) Feature (gene set) selection with MDA. (B) Heatmap analysis of the set of biomarker candidates between S1 and S2. (C) The 20 genes related to gastrointestinal cancer discovered in this study were compared with those reported in the literature. MDA, mean decrease in accuracy; S1, subtype 1; S2, subtype 2.

Survival prediction model

We used the two subtypes identified above as labels and the 100 genes obtained by screening as input features, and constructed survival prediction models with RF and SVM, respectively. Figure 1 shows the whole modeling process. The performance statistics of the models are presented in Table 2. Both the RF-based model (I) (accuracy of 94.98%, with sensitivity, specificity, MCC, C-index, Brier score, and log-rank P value of 95.08%, 94.87%, 0.8995, 0.7245, 0.1558, and 1.00e−5, respectively) and the SVM-based model (II) (accuracy of 94.98%, with sensitivity, specificity, MCC, C-index, Brier score, and log-rank P value of 96.72%, 93.16%, 0.9000, 0.7473, 0.1543, and 4.00e−6, respectively) showed good predictive performance. On the independent dataset, the RF-based model (I) performed strongly, achieving an accuracy of 96.10%, with MCC, C-index, Brier score, and log-rank P value of 0.9221, 0.7200, 0.1986, and 9.00e−8, respectively (Table 3). Similarly, the SVM-based model (II) achieved an accuracy of 94.74%, with MCC, C-index, Brier score, and log-rank P value of 0.8944, 0.7044, 0.1957, and 2.00e−8, respectively (Table 3). We also conducted survival analysis on the predicted subtypes (Figure 5). The two subtypes differed significantly in survival, which further supports the reliability of our results. To further assess the stability of the models, we performed five-fold cross-validation (Table 4). The RF-based model (I) and SVM-based model (II) retained good predictive performance, with accuracies of 94.45% and 94.76% and corresponding C-index values of 0.6592 and 0.6727, respectively. These findings indicate that the classification models based on cluster labels were robust and effective in predicting survival-specific clusters.

Table 2

Performance of the models on training dataset

| Models | Accuracy | Sensitivity | Specificity | MCC | C-index | Brier score | Log-rank P value |
|---|---|---|---|---|---|---|---|
| RF-based model (I) | 94.98% | 95.08% | 94.87% | 0.8995 | 0.7245 | 0.1558 | 1.00e−5 |
| SVM-based model (II) | 94.98% | 96.72% | 93.16% | 0.9000 | 0.7473 | 0.1543 | 4.00e−6 |

RF, random forest; SVM, support vector machine; MCC, Matthews correlation coefficient; C-index, concordance index.

Table 3

Performance of the models on independent dataset

| Models | Accuracy | Sensitivity | Specificity | MCC | C-index | Brier score | Log-rank P value |
|---|---|---|---|---|---|---|---|
| RF-based model (I) | 96.10% | 97.11% | 95.20% | 0.9221 | 0.7200 | 0.1986 | 9.00e−8 |
| SVM-based model (II) | 94.74% | 94.21% | 95.20% | 0.8944 | 0.7046 | 0.1957 | 2.00e−8 |

RF, random forest; SVM, support vector machine; MCC, Matthews correlation coefficient; C-index, concordance index.

Figure 5 Independent data sets were used for survival analysis based on the predicted results of RF-based model (I) and SVM-based model (II) respectively. RF, random forest; SVM, support vector machine; S1, subtype 1; S2, subtype 2.

Table 4

The results of five-fold cross-validation for models

| Models | Accuracy | Sensitivity | Specificity | MCC | C-index | Brier score | Log-rank P value |
|---|---|---|---|---|---|---|---|
| RF-based model (I) | 94.45% (0.0228) | 94.21% (0.0256) | 94.71% (0.0252) | 0.8891 (0.0456) | 0.6592 (0.0623) | 0.1804 (0.0337) | 0.0079 |
| SVM-based model (II) | 94.76% (0.0110) | 94.26% (0.0196) | 95.31% (0.0141) | 0.8955 (0.0219) | 0.6727 (0.0583) | 0.1821 (0.0320) | 0.0065 |

Data are presented as mean (standard deviation). RF, random forest; SVM, support vector machine; MCC, Matthews correlation coefficient; C-index, concordance index.

Functional analysis

To understand the potential mechanisms of the genes used as input features of the survival prediction models, we performed functional analysis on these genes (biomarkers).

The GO analysis showed that, in the biological process category, terms such as positive regulation of cell population proliferation (GO:0008284), Wnt signaling pathway (GO:0016055), positive regulation of transcription by RNA polymerase II (GO:0045944), hippo signaling (GO:0035329), and digestive tract development (GO:0048565) were significantly enriched (Figure 6A). This suggests that these cancer-related genes participate in biological processes potentially related to the occurrence and development of cancer, such as development and the regulation of cell proliferation. In the cellular component category, the plasma membrane (GO:0005886) was enriched (Figure 6A), which may reflect the function or localization of the corresponding gene products at the plasma membrane. In the molecular function category, terms such as sequence-specific double-stranded DNA binding (GO:1990837), DNA-binding transcription factor activity (GO:0003700), methyl-CpG binding (GO:0008327), and RNA polymerase II transcription regulatory region sequence-specific DNA binding (GO:0000977) were significantly enriched (Figure 6A). These results indicate that the cancer-related genes play important roles in DNA binding and transcriptional regulation, and that they may be involved in the occurrence and development of cancer by influencing processes such as gene transcription.

Figure 6 Functional analysis based on the 20 genes identified. (A) Gene Ontology analysis. (B) Pathway analysis. (C) Protein-protein interaction analysis. BP, biological processes; CC, cellular components; GO, Gene Ontology; MF, molecular functions; 3D, three-dimensional.

Pathway analysis showed that the pathways enriched by these genes include Gene expression (Transcription), Generic Transcription Pathway, Nuclear Receptor transcription pathway, Signaling by WNT in cancer, Signaling by Hippo, etc. (Figure 6B), all of which are crucial in the occurrence and development of cancer.

PPI analysis revealed the interrelationships of these cancer-related proteins at the molecular level (Figure 6C), suggesting that they may have synergistic or regulatory relationships in the related biological processes or molecular mechanisms of cancer occurrence and development. This provides an intuitive interaction network basis for further in-depth exploration of cancer-related molecular pathways, biological functions, and potential therapeutic targets. The results indicated that there were interactions among HNF4A, NR3C1, CDX1, CDX2 and RNF43.

Targeted drug prediction

The DGIdb (https://dgidb.org) was used to predict drugs targeting the identified genes. Detailed prediction results are given in Table S2. We identified potential drugs targeting seven genes (NR3C1, HNF4A, DNAAF9, CDX2, ATP2B4, RBMS3, LIFR). In particular, 134 drugs interacted with NR3C1, 16 with HNF4A, 3 with DNAAF9, 3 with CDX2, 2 with ATP2B4, 1 with RBMS3, and 1 with LIFR. These findings hold significant potential for the development of new therapeutic strategies. Among the identified compounds, several approved drugs deserve special attention (Table 5); they are approved for anti-tumor use or for digestive system diseases. Whether these drugs improve or aggravate gastrointestinal cancer may depend on the setting and the dosage. Further research is crucial for understanding their specific roles in gastrointestinal cancer.

Table 5

Some results of drug predictions for the identified genes

| Gene | Drug | Regulatory approval | Indication | Interaction score |
|---|---|---|---|---|
| HNF4A | DOCETAXEL ANHYDROUS | Approved | Antineoplastic agent | 0.0786 |
| HNF4A | TOREMIFENE | Approved | SERM, antineoplastic agent | 0.2510 |
| CDX2 | TEGAFUR | Approved | Antineoplastic agent | 4.3503 |
| NR3C1 | AZACITIDINE | Approved | Antineoplastic agent | 0.0097 |
| NR3C1 | URSODIOL | Approved | For prevention of recurrence of colorectal polyps | 0.0487 |
| NR3C1 | FLUOXYMESTERONE | Approved | Anabolic agents; antineoplastic agents | 0.0974 |
| DNAAF9 | PEGINTERFERON ALFA-2A | Approved | Antineoplastic agents; immunomodulatory agents, for treatment of hepatitis B and C | 1.2001 |

SERM, selective estrogen receptor modulator.

Gene expression analysis of the newly identified genes

To systematically evaluate the potential role and generality of the four newly identified gastrointestinal cancer-related genes (AHCY, NIBAN1, SPART, and TRABD2A) in the occurrence and development of cancer, we conducted a bulk RNA-seq analysis at the pan-cancer level. This analysis aimed to reveal how the overall expression of these genes changes in tissue samples of various cancer types, including gastrointestinal cancer, and to assess how prevalent their dysregulation is. In a large pan-cancer cohort covering 39 major cancer types, the four genes showed significant expression dysregulation in most cancers, especially gastrointestinal cancers (Figure 7A-7D). This pan-cancer evidence suggests that functional abnormalities of these four genes may be common events across multiple cancer types and may play an important and widespread role in tumor biology. To examine the specific cellular context in which this dysregulation occurs, and in particular its contribution to the core cell population that drives the malignant phenotype, namely the tumor cells themselves, we then conducted a high-resolution single-cell RNA sequencing (scRNA-seq) analysis focused on gastrointestinal cancers. The scRNA-seq results showed that the dysregulated expression of these four genes was not uniformly distributed across cell types but was highly enriched in the population identified as malignant tumor cells, especially for AHCY (Figure 8A-8F). This specific and pronounced expression change in malignant cells supports a direct association between AHCY and the core malignant characteristics of cancer cells (such as uncontrolled proliferation, survival, invasion, or metabolic reprogramming). AHCY is therefore likely to act as a key effector molecule or regulatory factor that directly participates in maintaining and promoting the malignant phenotype of cancer cells.

Figure 7 Bulk RNA-seq analysis at the pan-cancer level of four newly identified genes. (A) Bulk RNA-seq analysis of AHCY. (B) Bulk RNA-seq analysis of NIBAN1. (C) Bulk RNA-seq analysis of SPART. (D) Bulk RNA-seq analysis of TRABD2A. *, P<0.05; **, P<0.01; ***, P<0.001; ns, P>0.05. ns, not significant; TPM, transcripts per million.
Figure 8 Single-cell RNA expression analysis of four newly discovered genes. (A) Single-cell RNA expression analysis of four newly discovered genes in CHOL. (B) Single-cell RNA expression analysis of four newly discovered genes in CRC. (C) Single-cell RNA expression analysis of four newly discovered genes in ESCA. (D) Single-cell RNA expression analysis of four newly discovered genes in LIHC. (E) Single-cell RNA expression analysis of four newly discovered genes in PAAD. (F) Single-cell RNA expression analysis of four newly discovered genes in STAD. B, B cells; CD4Tconv, conventional CD4 T cells; CD8T, CD8 T cells; CD8Tex, exhausted CD8 T cells; CHOL, cholangiocarcinoma; CRC, colorectal cancer; DC, dendritic cells; ESCA, esophageal cancer; LIHC, liver hepatocellular carcinoma; NK, natural killer cells; PAAD, pancreatic adenocarcinoma; STAD, stomach adenocarcinoma.

Discussion

In this study, we developed a cancer survival prediction model using cancer expression data, clinical information, and machine learning techniques. This model effectively classified gastrointestinal cancer into two molecular subtypes (S1 and S2), with corresponding 3-year survival rates of ≥75% and ≤45%, respectively.

Heterogeneity remains a major challenge in understanding the etiology of cancer. To characterize heterogeneity between patients, considerable work has been done to identify molecular subtypes of cancer. However, most studies did not use survival information when defining subtypes (14). Instead, survival information was typically used retrospectively to evaluate the clinical relevance of the subtypes (15). As a result, some molecular subtypes display similar survival patterns, making them redundant in terms of survival differences (16). To address this problem, we incorporated patient survival into the subtype identification process. We used PCA to reduce the dimensionality of the data, as described in the “Methods” section. This process generated 100 transformed features, which were then subjected to univariate Cox-PH regression analysis. Fifteen of these features were significantly associated with survival (log-rank P value <0.05). K-means clustering was then performed on these fifteen features. Both the silhouette and within-cluster sum of squares (WSS) metrics were best at k=2 (Figure 2), indicating that k=2 is optimal. In addition, when survival analysis was performed for k between 2 and 8, k=2 separated the patients into two groups with significantly different survival (Figure 3). Therefore, we used the cluster assignments at k=2 as the class labels for the subsequent supervised machine learning.
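The choice of k by silhouette score can be illustrated with a minimal sketch. It is not the authors' pipeline: the twelve two-dimensional points are hypothetical stand-ins for the fifteen survival-associated PCA features, the `kmeans` and `silhouette` helpers are written here for illustration only, and the WSS criterion and the survival comparison across k are omitted.

```python
import math
import random


def kmeans(points, k, seed=0, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for name, other in clusters.items() if name != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two well-separated groups of six points each.
offsets = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.4), (0.3, 0.8)]
points = offsets + [(x + 10.0, y + 10.0) for x, y in offsets]

# Score every k from 2 to 8, as in the range examined in the text.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 9)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy data the silhouette score peaks at k=2 because any larger k has to split one of the two compact groups. On real expression data the curve is usually flatter, which is one reason to check a second criterion as well.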

Accurately predicting the survival of cancer patients is important for treatment decisions. At present, most molecular-based survival prediction models classify patients into two groups with distinct survival outcomes (29,30). For example, Chai et al. (30) established a prognostic model that divided gastrointestinal cancer patients into two groups, achieving an average C-index of 0.6364. However, this level of accuracy is insufficient for making reliable treatment decisions.
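For readers unfamiliar with the C-index, the following minimal sketch shows one common definition (Harrell's concordance index). The function name, the toy follow-up times, the event indicators, and the risk scores are all hypothetical, and the half-credit rule for tied risk scores is a common convention rather than something stated in this article.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs whose risk ordering matches survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # shorter survival, higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = death, 0 = censored),
# and a predicted risk score per patient.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.1, 0.2]

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a C-index of 0.6364 indicates only modest discrimination.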

In this study, we established two prediction models using RF and SVM, respectively (Table 2). RF-based model (I) achieved an accuracy of 94.98%, with sensitivity, specificity, MCC, C-index, Brier score, and log-rank P value of 95.08%, 94.87%, 0.8995, 0.7245, 0.1558, and 1.00e−5, respectively. SVM-based model (II) achieved an accuracy of 94.98%, with sensitivity, specificity, MCC, C-index, Brier score, and log-rank P value of 96.72%, 93.16%, 0.9000, 0.7473, 0.1543, and 4.00e−6, respectively. On the independent datasets, RF-based model (I) (accuracy of 96.10%, with MCC, C-index, and Brier score of 0.9221, 0.7200, and 0.1986, respectively) and SVM-based model (II) (accuracy of 94.74%, with MCC, C-index, and Brier score of 0.8944, 0.7046, and 0.1957, respectively) also performed well (Table 3). Survival analysis further showed that the models stably divided cancer patients into two groups (S1 and S2, with 3-year survival rates of ≥75% and ≤45%, respectively) (Figure 5), which improves the practical utility of the models. In addition, five-fold cross-validation yielded accuracies of 94.45% and 94.76%, with corresponding C-index values of 0.6592 and 0.6727, respectively (Table 4). Together, these results indicate that the models have good internal stability and external predictive ability. Moreover, compared with the model of Chai et al. (30), our models performed better, with the C-index increasing by 13.8%.
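The relationship between the reported classification metrics can be checked with a short sketch. The confusion-matrix counts below are hypothetical: they do not appear in this section and were chosen only because they reproduce the rates reported for RF-based model (I). The final line assumes that the 13.8% figure is a relative gain of the model (I) C-index (0.7245) over 0.6364, which is an inference from the numbers and not a statement in the text.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    total = tp + fn + tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # MCC is defined as 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts, consistent with (not taken from) the rates reported for model (I).
rf = binary_metrics(tp=116, fn=6, tn=111, fp=6)
print({name: round(value, 4) for name, value in rf.items()})

# Assumed reading of the 13.8% figure: relative C-index gain over Chai et al.
relative_gain = (0.7245 - 0.6364) / 0.6364
print(round(relative_gain * 100, 1))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.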

The models constructed in this study are inherently interpretable. Rudin systematically criticized the overuse of “black box” machine learning models in high-stakes decision-making, arguing that explainability should be achieved not through post-hoc explanation but by directly constructing inherently interpretable models (31). This study designed and employed inherently interpretable models from the outset. An inherently interpretable model has a constrained and transparent structure that allows users to understand its decision-making logic directly. Among the 20 genes identified in this study, 16 have previously been reported to be related to gastrointestinal cancer, such as CDX1 (32,33), RNF43 (34-38), and EHD2 (39,40), which supports the reliability of our results. We therefore infer that the remaining four genes (AHCY, NIBAN1, SPART, and TRABD2A) are newly discovered gastrointestinal cancer-related genes. Furthermore, by integrating pan-cancer analysis with single-cell data, this study showed that these four genes are widely dysregulated in cancers, especially gastrointestinal cancers. More importantly, localizing this dysregulation to malignant cells provides cell-level evidence that these genes are directly involved in driving the malignant phenotype of cancer cells. This lays a foundation for further exploring their value as diagnostic markers or therapeutic targets, and in particular for in-depth analysis of the role of AHCY in the metabolism and epigenetic regulation of cancer cells.

The functional analysis in this study provides important clues to the potential mechanisms by which the selected cancer-related genes act in tumor initiation and progression. GO analysis showed that these genes were significantly enriched in key biological processes such as positive regulation of cell population proliferation, Wnt signaling pathway, positive regulation of transcription by RNA polymerase II, hippo signaling, and digestive tract development. These processes are closely related to abnormal proliferation of tumor cells, cell fate determination, and the development of tissues and organs (especially the digestive tract), suggesting that these genes play an important role in driving or maintaining the malignant phenotype of tumors. At the molecular function level, the significant enrichment of sequence-specific double-stranded DNA binding, DNA-binding transcription factor activity, and RNA polymerase II transcription regulatory region sequence-specific DNA binding further indicates that these genes influence cancer progression mainly by regulating the transcriptional programs of core genes. Their role may also involve epigenetic regulation (such as methyl-CpG binding). The significant enrichment of the plasma membrane in the cellular component analysis suggests that some gene products function at the cell surface or are involved in signal reception, and are closely related to cancer-related cell communication or microenvironmental interactions. Pathway analysis provides a more systematic perspective. The genes were enriched in core pathways such as Gene expression (Transcription), Generic Transcription Pathway, Nuclear Receptor transcription pathway, Signaling by WNT in cancer, and Signaling by Hippo. The WNT and Hippo signaling pathways have a recognized core role in regulating cell proliferation, differentiation, and tissue homeostasis, and their dysregulation is a hallmark of various cancers, especially gastrointestinal cancer (41,42). Enrichment in generic and nuclear receptor-mediated transcriptional pathways further supports the hypothesis that these genes drive cancer progression by directly or indirectly regulating transcriptional networks. PPI analysis revealed close interactions among these marker proteins, particularly between key nodes such as HNF4A, NR3C1, CDX1, CDX2, and RNF43. This supports their potential synergistic role in shared biological processes such as development and transcriptional regulation, and lays the foundation for analyzing the functional modules or regulatory cascades they form. This interaction network provides a structural framework for subsequent in-depth exploration of the specific mechanisms of cancer-related molecular pathways and for identifying potential target groups for intervention.
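The notion of "significant enrichment" can be made concrete with a minimal over-representation sketch based on the hypergeometric distribution. This is not the procedure used by the enrichment tool in this study: DAVID reports an EASE score, a more conservative variant of the Fisher exact test. The function name and all counts other than the 20-gene list size are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / comb(population, sample)


# Hypothetical numbers: a 20,000-gene background, 200 genes annotated to a term,
# and 3 of the 20 selected genes carrying that annotation.
p_value = hypergeom_enrichment_p(population=20000, annotated=200, sample=20, hits=3)
print(p_value)
```

With these toy numbers only about 0.2 annotated genes are expected by chance in a 20-gene list, so three hits give a small tail probability. In practice the p-values are also corrected for the number of terms tested.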

Most importantly, we used the DGIdb database to predict drug targets for these genes and identified potential targeted drugs for seven genes (NR3C1, HNF4A, DNAAF9, CDX2, ATP2B4, RBMS3, and LIFR). Among them, NR3C1 and HNF4A showed strong druggability, interacting with 134 and 16 known drugs, respectively. Notably, some of the predicted interacting compounds are established drugs that have been approved for anti-tumor or other indications (such as digestive system diseases). These findings have translational value. On the one hand, they provide clues to possible unrecognized effects of existing drugs in gastrointestinal cancer (which may improve or exacerbate the condition, depending on dose and context). On the other hand, and more importantly, they reveal the potential of repurposing existing drugs to target the marker genes identified in this study, providing an attractive entry point for developing new treatment strategies for gastrointestinal cancer. However, these predicted drug-gene interactions still need to be verified through rigorous in vitro and in vivo experiments, and their specific effects and mechanisms of action in the microenvironment of specific gastrointestinal cancer types should be further explored. These findings are in line with the research trends emphasized in recent bioinformatics studies of gastrointestinal cancer biomarkers, such as the work of Al-Assi et al. (43), Bareja et al. (44), and Zhang et al. (45), which highlighted the translational potential of computationally identified biomarkers. Just as their research linked bioinformatics-discovered genes to clinicopathological features or therapeutic relevance, our drug-gene interaction predictions extend this direction, providing a more direct translational path from biomarker discovery to actionable therapeutic strategies. This also complements the work of Deng et al. (46) and Zhang et al. (47) on regulatory mechanisms in gastrointestinal cancer: while their studies focused on the regulatory networks of cancer-related genes, our work provides a practical extension by matching these candidate genes to available drugs, which could accelerate the translation of such biomarker research into clinical practice.

This study has limitations. First, although we constructed high-accuracy survival prediction models for gastrointestinal cancer and identified molecular subtypes (S1 and S2) with significant survival differences and independent predictive value, the generalization ability and stability of the models still need to be verified in larger, prospective, independent cohorts to evaluate their practical application in a broader population. Second, although twenty key genes related to gastrointestinal cancer were identified, we did not perform in vitro or in vivo functional experiments to clarify their specific biological functions and mechanisms in the occurrence, development, metastasis, or treatment response of cancer. Subsequent experimental studies (such as functional verification) are needed.


Conclusions

In conclusion, we established efficient gastrointestinal cancer survival prediction models with gene expression profiling data as input molecular features. RF-based model (I) had an accuracy of 94.98% with an MCC of 0.8995, and SVM-based model (II) had an accuracy of 94.98% with an MCC of 0.9000. We found a significant difference in survival between the two subtypes (S1 and S2, with 3-year survival rates of ≥75% and ≤45%, respectively). These subtypes have independent predictive value for patient survival. Twenty key genes related to gastrointestinal cancer were identified. The findings of our study have important implications for both survival prediction and treatment decisions in gastrointestinal cancer.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2200/rc

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2200/prf

Funding: This work was supported by the Talent Scientific Research Start-up Foundation of Wannan Medical College (No. WYRCQD2023045).

Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-aw-2200/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Siegel RL, Kratzer TB, Giaquinto AN, et al. Cancer statistics, 2025. CA Cancer J Clin 2025;75:10-45.
  2. Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229-63.
  3. Han B, Zheng R, Zeng H, et al. Cancer incidence and mortality in China, 2022. J Natl Cancer Cent 2024;4:47-53.
  4. Wang S, Zheng R, Li J, et al. Global, regional, and national lifetime risks of developing and dying from gastrointestinal cancers in 185 countries: a population-based systematic analysis of GLOBOCAN. Lancet Gastroenterol Hepatol 2024;9:229-37.
  5. Lu Z, Chen Y, Liu D, et al. The landscape of cancer research and cancer care in China. Nat Med 2023;29:3022-32.
  6. Qi J, Li M, Wang L, et al. National and subnational trends in cancer burden in China, 2005-20: an analysis of national mortality surveillance data. Lancet Public Health 2023;8:e943-55.
  7. Liu SC, Tang HL, Liu HD, et al. Multi-label Learning for the Diagnosis of Cancer and Identification of Novel Biomarkers with High-throughput Omics. Current Bioinformatics 2021;16:261-73.
  8. Zhan T, Betge J, Schulte N, et al. Digestive cancers: mechanisms, therapeutics and management. Signal Transduct Target Ther 2025;10:24.
  9. Ben-Hamo R, Jacob Berger A, Gavert N, et al. Predicting and affecting response to cancer therapy based on pathway-level biomarkers. Nat Commun 2020;11:3296.
  10. Delattre JF, Selcen Oguz Erdogan A, Cohen R, et al. A comprehensive overview of tumour deposits in colorectal cancer: Towards a next TNM classification. Cancer Treat Rev 2022;103:102325.
  11. Comprehensive molecular portraits of human breast tumours. Nature 2012;490:61-70.
  12. Cancer Genome Atlas Research Network. Integrated genomic and molecular characterization of cervical cancer. Nature 2017;543:378-84.
  13. Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012;487:330-7.
  14. Huang S, Chaudhary K, Garmire LX. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Front Genet 2017;8:84.
  15. Liu G, Dong C, Liu L. Integrated Multiple "-omics" Data Reveal Subtypes of Hepatocellular Carcinoma. PLoS One 2016;11:e0165457.
  16. Boyault S, Rickman DS, de Reyniès A, et al. Transcriptome classification of HCC is related to gene alterations and to new therapeutic targets. Hepatology 2007;45:42-52.
  17. Arjmand B, Hamidpour SK, Tayanloo-Beik A, et al. Machine Learning: A New Prospect in Multi-Omics Data Analysis of Cancer. Front Genet 2022;13:824451.
  18. Breiman L. Random Forests. Machine Learning 2001;45:5-32.
  19. Nguyen J, Jézéquel P, Gillois P, et al. Random Forest of Perfect Trees: Concept, Performance, Applications and Perspectives. Bioinformatics 2021;37:2165-74.
  20. Alghamdi W, Attique M, Alzahrani E, et al. LBCEPred: a machine learning model to predict linear B-cell epitopes. Brief Bioinform 2022;23:bbac035.
  21. Basith S, Hasan MM, Lee G, et al. Integrative machine learning framework for the identification of cell-specific enhancers from the human genome. Brief Bioinform 2021;22:bbab252.
  22. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2011;2:1-27.
  23. Sherman BT, Hao M, Qiu J, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 2022;50:W216.
  24. Gillespie M, Jassal B, Stephan R, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res 2022;50:D687-92.
  25. Szklarczyk D, Kirsch R, Koutrouli M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 2023;51:D638-46.
  26. Cannon M, Stevenson J, Stahl K, et al. DGIdb 5.0: rebuilding the drug-gene interaction database for precision medicine and drug discovery platforms. Nucleic Acids Res 2024;52:D1227-35.
  27. Cui H, Zhao G, Lu Y, et al. TIMER3: an enhanced resource for tumor immune analysis. Nucleic Acids Res 2025;53:W534.
  28. Han Y, Wang Y, Dong X, et al. TISCH2: expanded datasets and new tools for single-cell transcriptome analyses of the tumor microenvironment. Nucleic Acids Res 2023;51:D1425-31.
  29. Wang Y, Zhu GQ, Tian D, et al. Comprehensive analysis of tumor immune microenvironment and prognosis of m6A-related lncRNAs in gastric cancer. BMC Cancer 2022;22:316.
  30. Chai H, Zhou X, Zhang Z, et al. Integrating multi-omics data through deep learning for accurate cancer prognosis prediction. Comput Biol Med 2021;134:104481.
  31. Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell 2019;1:206-15.
  32. Aoki K, Nitta A, Igarashi A. CDX1 and CDX2 suppress colon cancer stemness by inhibiting beta-catenin-facilitated formation of Pol II-DSIF-PAF1C complex. Cell Death Dis 2025;16:408.
  33. Khayyat A, Esmaeil Pour MA, Poursina O, et al. Evaluations of Biomarkers CDX1 and CDX2 in Gastric Cancer Prognosis: A Meta-analysis. Int J Mol Cell Med 2024;13:1-19.
  34. Huang K, Ding S, Chen K, et al. RNF43 mutation as a predictor of immunotherapeutic efficacy in colorectal cancer. Am J Cancer Res 2023;13:5549-58.
  35. Hosein AN, Dangol G, Okumura T, et al. Loss of Rnf43 Accelerates Kras-Mediated Neoplasia and Remodels the Tumor Immune Microenvironment in Pancreatic Adenocarcinoma. Gastroenterology 2022;162:1303-1318.e18.
  36. Neumeyer V, Brutau-Abia A, Allgäuer M, et al. Loss of RNF43 Function Contributes to Gastric Carcinogenesis by Impairing DNA Damage Response. Cell Mol Gastroenterol Hepatol 2021;11:1071-94.
  37. Belenguer G, Mastrogiovanni G, Pacini C, et al. RNF43/ZNRF3 loss predisposes to hepatocellular-carcinoma by impairing liver regeneration and altering the liver lipid metabolic ground-state. Nat Commun 2022;13:334.
  38. Li S, Niu J, Smits R. RNF43 and ZNRF3: Versatile regulators at the membrane and their role in cancer. Biochim Biophys Acta Rev Cancer 2024;1879:189217.
  39. Guan C, Lu C, Xiao M, et al. EHD2 Overexpression Suppresses the Proliferation, Migration, and Invasion in Human Colon Cancer. Cancer Invest 2021;39:297-309.
  40. Zhang MS, Cui JD, Lee D, et al. Hypoxia-induced macropinocytosis represents a metabolic route for liver cancer. Nat Commun 2022;13:954.
  41. Prajapati D, Ambere G, Mathure D, et al. Evolutionary conservation and cancer implications of the WNT signaling pathway. Med Oncol 2025;42:434.
  42. Chan SW, Ong C, Hong W. The recent advances and implications in cancer therapy for the hippo pathway. Curr Opin Cell Biol 2025;93:102476.
  43. Al-Assi G, Abdulsahib WK, Mustafa WW, et al. Computational identification and validation of non-coding RNA biomarkers in gastrointestinal cancer. Funct Integr Genomics 2025;25:247.
  44. Bareja C, Dwivedi K, Uboveja A, et al. Identification and clinicopathological analysis of potential p73-regulated biomarkers in colorectal cancer via integrative bioinformatics. Sci Rep 2024;14:9894.
  45. Zhang J, Wang X, Li Z, et al. Comprehensive bioinformatics analysis was used to identify and verify differentially expressed genes in targeted therapy of colon cancer. Sci Rep 2025;15:14922.
  46. Deng ZJ, OuYang LY, Guo JP, et al. CircZFAND6 suppresses gastric cancer metastasis and reduces resistance to TKI therapy. Mol Cancer 2025;24:305.
  47. Zhang L, Li Y, Fu C, et al. Exploration and validation of ceRNA regulatory networks in colorectal cancer based on associations whole transcriptome sequencing. Sci Rep 2024;14:20446.
Cite this article as: Liu S, Zhang H. Interpretable machine learning-based survival prediction and key gene identification in cancer using gene expression and clinical data. Transl Cancer Res 2026;15(2):96. doi: 10.21037/tcr-2025-aw-2200
