Construction of a diagnostic model for colorectal cancer based on exosome-related genes: integration of immune cell differentials and molecular docking
Original Article

Yulai Yin1#, Li Li2,3#, Shuang Liu4#, Yixuan Xie4#, Yuwei Li2,3, Yuan Gao5, Chen Xu2,3, Yan Wang6

1School of Medicine, Nankai University, Tianjin, China; 2Department of Colorectal Surgery, Tianjin Union Medical Center, The First Affiliated Hospital of Nankai University, Nankai University, Tianjin, China; 3Tianjin Institute of Coloproctology, Tianjin, China; 4School of Integrative Medicine, Tianjin University of Traditional Chinese Medicine, Tianjin, China; 5Department of Colorectal Surgery, Inner Mongolia Autonomous Region Hospital of Traditional Chinese Medicine, Hohhot, China; 6Department of Traditional Chinese Medicine, Shanghai Pudong New Area People’s Hospital, Shanghai, China

Contributions: (I) Conception and design: Y Yin, L Li, S Liu; (II) Administrative support: Y Li, Y Gao, C Xu, Y Wang; (III) Provision of study materials or patients: Y Yin; (IV) Collection and assembly of data: Y Yin, Y Xie; (V) Data analysis and interpretation: Y Yin, L Li; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Yuan Gao, MD, PhD. Department of Colorectal Surgery, Inner Mongolia Autonomous Region Hospital of Traditional Chinese Medicine, No. 11, Jiankang Street, Xincheng District, Hohhot 010050, China. Email: gaoyuan_0524@sina.com; Chen Xu, MD, PhD. Department of Colorectal Surgery, Tianjin Union Medical Center, The First Affiliated Hospital of Nankai University, Nankai University, No. 190, Jieyuan Road, Hongqiao District, Tianjin 300121, China; Tianjin Institute of Coloproctology, Tianjin 300121, China. Email: xc198129@163.com; Yan Wang, MD, PhD. Department of Traditional Chinese Medicine, Shanghai Pudong New Area People’s Hospital, No. 490, Chuanhuan South Road, Chuansha New Town, Pudong New District, Shanghai 201299, China. Email: 41703965@qq.com.

Background: Colorectal cancer (CRC) is one of the most common malignancies of the digestive tract, with conventional clinical diagnoses often made at advanced stages. There is an urgent need for a genetic diagnostic model to predict the onset of CRC at early stages, thereby reducing the disease burden associated with it. This study aimed to construct and validate a CRC diagnostic model based on exosome-related genes and to explore potential target drugs.

Methods: Gene expression differences between CRC and normal groups were first analyzed using datasets from the Gene Expression Omnibus (GEO) database, and CRC-related genes were identified. Exosome-related genes were then obtained through GeneCards and literature searches. The intersection of CRC-related genes and exosome-related genes was analyzed, and the intersection genes underwent Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Set Enrichment Analysis (GSEA) enrichment analysis. Three machine learning models were used to screen the model genes. The model was validated using both the training and validation datasets. Potential target drugs were identified based on the number of enriched model genes and were subjected to molecular docking analysis.

Results: Gene expression analysis from the GEO dataset identified 316 differentially expressed genes associated with CRC. A total of 1,052 exosome-related genes were found through GeneCards and literature searches. The intersection of CRC-related genes and exosome-related genes yielded 21 intersection genes. These intersection genes were enriched in pathways such as interleukin-17 (IL-17) signaling and microRNA in cancer. Three machine learning methods identified four model intersection genes: EXOSC4, MMP9, ABCB1, and SOX2. Based on the number of enriched model genes, butyrate, dimethyl sulfoxide, and vorinostat were identified as potential target drugs.

Conclusions: EXOSC4, MMP9, ABCB1, and SOX2 are important exosome-related biomarkers for the diagnosis of CRC, and butyrate, dimethyl sulfoxide, and vorinostat have the potential to serve as targeted therapeutic drugs for CRC.

Keywords: Exosome; colorectal cancer (CRC); machine learning; molecular docking; immune cells


Submitted Dec 13, 2025. Accepted for publication Mar 04, 2026. Published online Mar 27, 2026.

doi: 10.21037/tcr-2025-1-2789


Highlight box

Key findings

• Constructed and validated a colorectal cancer (CRC) diagnostic model based on exosome-related genes.

• Identified key model genes (EXOSC4, MMP9, ABCB1, SOX2) associated with CRC diagnosis.

• Performed gene enrichment analysis and machine learning to screen potential model genes.

• Identified butyrate, dimethyl sulfoxide, and vorinostat as potential targeted drugs for CRC.

• Demonstrated the importance of exosome-related genes and immune cell interactions in CRC prognosis.

What is known and what is new?

• Identified key model genes (EXOSC4, MMP9, ABCB1, SOX2) associated with CRC diagnosis.

• Identified butyrate, dimethyl sulfoxide, and vorinostat as potential targeted drugs for CRC.

• This study provides valuable insights for the early diagnosis and treatment of CRC. Further validation through experimental studies is necessary.

What is the implication, and what should change now?

• This study provides valuable insights for the early diagnosis and treatment of CRC. Further validation through experimental studies is necessary.


Introduction

Colorectal cancer (CRC) (1) is a malignancy originating in the colon and rectum, predominantly presenting as adenocarcinoma. Because early symptoms and signs are often absent, it is frequently diagnosed at later stages, when symptoms such as weight loss, hematochezia, and constipation appear. Together with its biological propensity for liver metastasis, this has made CRC widely recognized by the general public (2). Exosomes (3,4) have become a prominent research topic in recent years and are being studied extensively across disciplines. These small membrane vesicles are secreted by various cell types and contain a range of biomolecules, including lipids, proteins, RNAs [such as messenger RNA (mRNA) and microRNA], and certain small-molecule metabolites, which play a crucial role in intercellular signaling and information exchange. Exosomes are widely distributed in human body fluids such as blood, urine, saliva, breast milk, cerebrospinal fluid, feces, and pleural and peritoneal fluids. By interacting with recipient cells, exosomes participate in intercellular communication and regulate biological processes such as immune responses, cell proliferation, migration, and metastasis. Because they carry molecular cargo from their parent cells, exosomes can serve as biomarkers for early disease diagnosis. Therefore, this study aims to investigate the potential of exosome-related genes as early diagnostic markers for CRC and to construct and validate a diagnostic model for CRC, providing a reference for the further analysis of targeted drugs. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2789/rc).


Methods

Materials

The Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/gds) was accessed to search for “CRC” with the following filters: experimental studies, expression profiling by array, and species limited to humans. Four datasets were selected based on sample size and inclusion of both tumor and normal groups: GSE164191 (5), GSE126092 (6), GSE110224 (7), and GSE87211 (8). The Series Matrix File and GPL files for each dataset were downloaded.

Data preprocessing

The Series Matrix File and GPL files were preprocessed to ensure that gene symbols corresponded with gene names in the gene expression matrix. Data correction was performed using R software (version 4.2.2) and the limma package. Subsequently, datasets GSE164191, GSE126092, and GSE110224 were merged and batch-corrected to serve as the training set for machine learning model construction.

Differential gene expression analysis

Differential gene expression analysis was conducted between the tumor and normal groups using the batch-corrected merged data. Genes were considered differentially expressed at an absolute log2 fold change (FC) greater than 0.585 (FC >1.5) and a corrected P value below 0.05. Genes with significant differential expression were identified and their expression levels were output. A heatmap was generated for the top 50 upregulated and downregulated genes. A volcano plot was also created using the same filtering criteria.
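The relationship between the two forms of the threshold can be checked numerically, since log2(1.5) is approximately 0.585. The following is a minimal Python sketch of the filtering rule, assuming that log2 FC and corrected P values have already been computed (for example by limma). The function name, the strict inequalities, and the toy values are illustrative assumptions, not the authors' code or results.

```python
import math

LOG2FC_CUTOFF = 0.585   # absolute log2 FC threshold stated in the text
ADJ_P_CUTOFF = 0.05     # corrected P value threshold stated in the text


def classify_gene(log2_fc, adj_p):
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    if adj_p >= ADJ_P_CUTOFF or abs(log2_fc) <= LOG2FC_CUTOFF:
        return "ns"
    return "up" if log2_fc > 0 else "down"


# The cutoff corresponds to a 1.5-fold change on the linear scale.
cutoff_check = math.log2(1.5)
fold_change_equivalent = 2 ** LOG2FC_CUTOFF

# Toy values for illustration only; these are not results from the study.
toy_results = {
    "GENE_A": (1.20, 0.001),
    "GENE_B": (-0.90, 0.010),
    "GENE_C": (0.30, 0.001),
    "GENE_D": (2.00, 0.200),
}
labels = {gene: classify_gene(fc, p) for gene, (fc, p) in toy_results.items()}
```

The same rule determines the coloring of a volcano plot: "up" and "down" genes are highlighted, and "ns" genes are drawn in gray.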

Exosome-related gene identification and intersection genes

Exosome-related genes were retrieved from GeneCards (https://www.genecards.org/) by searching for “Exosome” and downloaded for integration with genes identified in the literature (9). The merged list of exosome-related genes is provided in Supplementary Material 1 available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2789-1.xls. The intersection of these exosome-related genes with the differentially expressed genes was determined to obtain the final set of intersection genes.

Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Set Enrichment Analysis (GSEA) enrichment analysis

First, the intersection of exosome-related and differentially expressed genes was analyzed for enrichment in biological processes, cellular components, and molecular functions using GO analysis. The P value threshold for GO analysis was set to 0.05, with a corrected P value threshold of 0.05. KEGG pathway enrichment analysis was then performed with a P value threshold of 0.05 and a corrected P value threshold of 1 (i.e., no additional filtering on the corrected P value). Both GO and KEGG analyses were visualized using bar plots and bubble charts. Additionally, GSEA was conducted on the differential gene expression results, and the top five most significantly enriched pathways were visualized.
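The text does not state which statistical test underlies the GO and KEGG enrichment. Over-representation analysis is commonly based on the hypergeometric distribution, and the sketch below assumes that test purely for illustration. The function name, the background size, the term size, and the hit count are hypothetical; only the gene-list size of 21 comes from the study.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: number of background genes
    K: number of background genes annotated to the term
    n: number of genes in the query list
    k: number of query genes annotated to the term
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical example: 21 query genes, 5 of which fall in a term of
# 300 genes, against an assumed background of 20,000 genes.
example_p = hypergeom_upper_tail(k=5, n=21, K=300, N=20000)
```

Under this assumption, a term passes the filter when its P value and corrected P value fall below the thresholds given above.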

Univariate logistic regression for model gene exploration

The intersection genes obtained from exosome-related and differentially expressed genes were subjected to univariate logistic regression with disease status as the outcome variable. Genes with statistically significant associations were selected as the initial gene set for further model construction.

Machine learning for model gene selection

Lasso, Support Vector Machine (SVM), and Random Forest machine learning methods were each used to identify candidate model genes (10). For Lasso, alpha =1 was set to apply full L1 regularization, and 10-fold cross-validation was used to select the optimal regularization parameter. For SVM, the SVM-Recursive Feature Elimination (SVM-RFE) algorithm was employed, with 10 features eliminated at each step and 10-fold cross-validation. For Random Forest, 500 decision trees were constructed, and genes with an importance score greater than 2 were selected. The genes identified by all three methods were taken as the model genes. Differential expression analysis of these genes was conducted using the batch-corrected merged data, followed by visualization via box plots and correlation heatmaps.

Model construction and validation

The genes identified through machine learning were used to construct a diagnostic model. The merged dataset was used as the training set, and GSE87211 served as the validation set. Receiver operating characteristic (ROC) curves for individual genes and the overall model were plotted. Additionally, a nomogram was constructed using the merged dataset. Calibration curves and decision curves were plotted using both the merged training set and the GSE87211 validation set.
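The area under the ROC curve can be read as the probability that a randomly chosen tumor sample receives a higher model score than a randomly chosen normal sample. The sketch below computes it directly from that definition in plain Python. The function name and the toy scores are hypothetical, and the software the authors used for ROC analysis is not specified in the text.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random case outscores a random control.

    labels: 1 for tumor (case), 0 for normal (control).
    Ties between a case and a control count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy scores for illustration only; these are not data from the study.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

The pairwise loop is quadratic in sample size, which is acceptable for datasets of a few hundred samples; a rank-based formula gives the same value more efficiently.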

Immune cell differential and correlation analysis

Immune cell scores were assigned to the samples using immune cell-related gene data after batch effect correction. Differential expression analysis of immune cells between the tumor and normal groups was performed, and results were visualized using box plots. Correlation analysis was then conducted between the immune cell distribution and model genes to assess the relationship between model gene expression and immune cell populations.

Targeted drug selection and molecular docking

Targeted drugs were selected by downloading drug data from the DSigDB database (11) (https://dsigdb.tanlab.org/DSigDBv1.0/download.html), filtering drugs based on their association with three or more model genes. Molecular docking was performed using the CB-Dock2 online platform (12) (https://cadd.labshare.cn/cb-dock2/php/index.php), with blind docking enabled. The docking results with the lowest free energy were visualized.
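The "three or more model genes" rule amounts to counting, for each drug, how many of its associated genes are model genes. In the minimal Python sketch below, the gene sets for butyrate, dimethyl sulfoxide, and vorinostat are taken from the drug-gene pairs shown in the docking figure (Figure 6). The entry named hypothetical_drug and the function name are invented for contrast and do not come from DSigDB.

```python
MODEL_GENES = {"EXOSC4", "MMP9", "ABCB1", "SOX2"}

# Drug-gene pairs as listed in the docking panels of Figure 6.
# "hypothetical_drug" is an invented entry for contrast, not DSigDB data.
drug_targets = {
    "butyrate": {"MMP9", "ABCB1", "SOX2"},
    "dimethyl sulfoxide": {"MMP9", "ABCB1", "SOX2"},
    "vorinostat": {"EXOSC4", "MMP9", "ABCB1"},
    "hypothetical_drug": {"MMP9", "TP53"},
}


def select_drugs(targets, model_genes, min_hits=3):
    """Return drugs associated with at least min_hits model genes."""
    return sorted(
        drug for drug, genes in targets.items()
        if len(genes & model_genes) >= min_hits
    )


selected = select_drugs(drug_targets, MODEL_GENES)
```

Lowering min_hits to 2 would also admit drugs that share only two model genes, which is the looser criterion mentioned for the gene-drug network.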

Statistical methods

Bioinformatics analysis and visualization were performed using R 4.3.3 and R 4.4.2, with a P value of <0.05 considered statistically significant. The R packages “limma”, “dplyr”, “pheatmap”, “ggplot2”, “igraph”, “ggrepel”, “ggvenn”, “reshape2”, and “ggpubr” were used in this study. *, P<0.05; **, P<0.01; ***, P<0.001; ****, P<0.0001. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.


Results

Data preprocessing and differential expression analysis

The datasets GSE164191, GSE126092, and GSE110224 were merged and batch-corrected, and the data were visualized before and after correction. Before correction, the expression distributions differed markedly among datasets (Figure 1A), and principal component analysis (PCA) showed samples clustering by dataset (Figure 1C). After batch correction, no obvious differences were observed among the three datasets (Figure 1B), and PCA showed that samples no longer clustered by dataset (Figure 1D). Differential gene expression analysis was conducted on the merged data with batch effects removed, comparing gene expression between the tumor and normal groups. The analysis revealed significant differences in gene expression between the tumor and normal groups (Figure 1E,1F).

Figure 1 Data preprocessing and differential expression analysis. (A) Boxplot before batch effect removal (different colors represent different datasets, with each individual box corresponding to an independent sample). (B) Boxplot after batch effect removal (different colors represent different datasets, with each individual box corresponding to an independent sample). (C) PCA before batch effect removal (different colors represent different datasets, with each individual point corresponding to an independent sample). (D) PCA after batch effect removal (different colors represent different datasets, with each individual point corresponding to an independent sample). (E) Differential expression heatmap of the merged dataset after batch effect removal (color intensity towards red indicates higher gene expression, while color intensity towards blue indicates lower gene expression. The color-coded lines at the top correspond to different datasets and groups). (F) Differential expression volcano plot of the merged dataset after batch effect removal (gray points represent genes with no significant expression differences, red points represent genes upregulated in the tumor group, and blue points represent genes downregulated in the tumor group). FC, fold change; PC, principal component; PCA, principal component analysis.

Exosome-related intersection gene identification and enrichment analysis

The intersection of differentially expressed genes and exosome-related genes was identified, yielding 21 intersection genes (Figure 2A). GO enrichment analysis of these intersection genes revealed the highest enrichment in the biological process “extracellular matrix organization” (GO: 0071492) (Figure 2B). Significantly enriched biological processes included responses to exogenous stimuli, collagen catabolic processes, and cellular responses to ultraviolet A radiation. In terms of cellular components, specific granules and phagocytic vesicle membranes were notably enriched. Regarding molecular functions, metalloprotease activity and endopeptidase activity were significantly enriched (Figure 2C,2D). The gene enrichment network diagram highlighted genes such as MMP9, MMP7, and MMP1, which were closely associated with multiple enriched biological processes, including collagen degradation and response to exogenous stimuli. Additionally, genes such as SFRP1, CRYAB, and IL1RN were found to be strongly linked with cellular responses to ultraviolet radiation and exogenous stimuli. Genes with higher fold-change values, such as MMP9 and CHI3L1, played particularly prominent roles in these processes (Figure 2E).

Figure 2 Exosome-related intersection gene identification and enrichment analysis. (A) Venn diagram of exosome-related genes and DEG (the red circle represents DEG, and the blue circle represents exosome-related genes). (B) Circular plot of GO enrichment analysis (the outermost circle represents GO pathways, the second circle shows the total number of genes in each pathway, the third circle shows the number of intersection genes enriched in each pathway, and the fourth circle displays the proportion of intersection genes in each pathway relative to all related genes in the pathway). (C) Bar plot of GO enrichment analysis. (D) Bubble plot of GO enrichment analysis. (E) GO enrichment network plot (darker blue indicates higher fold change values for genes, and larger circles indicate more genes enriched in the respective pathway). (F) Bar plot of KEGG enrichment analysis. (G) Bubble plot of KEGG enrichment analysis. (H) KEGG enrichment network plot (darker blue indicates higher fold change values for genes, and larger circles indicate more genes enriched in the respective pathway). (I) GSEA of differential expression genes (the curve above the dashed line represents pathways enriched in the colorectal cancer group, while the curve below the dashed line represents pathways enriched in the normal group). DEG, differentially expressed genes; GO, Gene Ontology; GSEA, Gene Set Enrichment Analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

KEGG enrichment analysis indicated that the intersection genes were significantly associated with the interleukin-17 (IL-17) signaling pathway and the microRNAs in cancer pathway (Figure 2F,2G). The gene enrichment network diagram revealed that pathways such as the Wnt signaling pathway, microRNAs in cancer, and bladder cancer contained genes with notable expression changes, suggesting that these pathways may play significant roles in gene regulation. The IL-17 signaling pathway and the relaxin signaling pathway also showed significant enrichment, with pronounced gene expression changes, indicating their potential involvement in immune responses and related diseases. Genes such as MMP9, MMP1, and MMP7 were strongly associated with multiple pathways, including bladder cancer, IL-17 signaling, and microRNAs in cancer, highlighting their important roles in cancer development, immune responses, and signal transduction. Moreover, genes such as SFRP1, LCN2, and CA9 also demonstrated strong associations with these pathways, suggesting their potential roles in tumor regulation and pathway modulation (Figure 2H).

Additionally, GSEA enrichment analysis was performed on the results of the differential gene expression analysis, visualizing the top five most significantly enriched pathways. The IL-17 signaling pathway, fatty acid degradation, bladder cancer, and cell cycle pathways were significantly enriched in the CRC group (Figure 2I).

Model gene selection

Univariate logistic regression was performed to explore model genes, and genes with statistically significant associations were selected as the initial gene set for further model construction. All 21 intersection genes were significant, so the initial gene set was identical to the set of exosome-related intersection genes. Three machine learning methods (Lasso, SVM, and Random Forest) were then employed to select model genes from the initial gene set.

Lasso regression (13) analysis revealed that the smallest binomial deviance occurred when 10 genes were included (Figure 3A,3B). The 10 model genes selected by Lasso regression were EXOSC4, MMP9, ABCB1, ENPP4, PDCD4, CA9, CP, ALDH1A1, DPEP1, and SOX2. SVM analysis showed that when the number of feature genes was 6, the 10-fold cross-validation accuracy was highest, and the 10-fold cross-validation error rate was lowest (Figure 3C,3D). The 6 model genes selected by SVM were EXOSC4, MMP9, ABCB1, SOX2, S100A11, and LCN2. Random Forest analysis indicated that as the number of trees in the Random Forest increased, the model’s prediction error gradually decreased, improving both stability and accuracy (Figure 3E). Genes with importance scores greater than 2 were selected, resulting in 20 genes. Among them, EXOSC4 was the most important feature with an importance score of 10, followed by ABCB1, MMP9, and other features, which made significant contributions to the model’s prediction. Other features, such as SFRP1 and NXPE4, had lower importance (Figure 3F). The intersection of the three machine learning methods identified 4 model genes: EXOSC4, MMP9, ABCB1, and SOX2 (Figure 3G).
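The overlap reported above can be reproduced from the gene lists given in the text. The sketch below intersects the Lasso and SVM selections in Python. The 20 Random Forest genes are not enumerated in the text, so they are omitted here; the reported final result implies that all four shared genes were also among them.

```python
# Genes selected by Lasso regression, as listed in the text.
lasso_genes = {
    "EXOSC4", "MMP9", "ABCB1", "ENPP4", "PDCD4",
    "CA9", "CP", "ALDH1A1", "DPEP1", "SOX2",
}

# Genes selected by SVM-RFE, as listed in the text.
svm_genes = {"EXOSC4", "MMP9", "ABCB1", "SOX2", "S100A11", "LCN2"}

# Genes selected by both methods.
shared_genes = lasso_genes & svm_genes

# Genes selected by SVM-RFE but not retained by Lasso.
svm_only_genes = svm_genes - lasso_genes
```

Here shared_genes contains EXOSC4, MMP9, ABCB1, and SOX2, matching the four model genes, while S100A11 and LCN2 are dropped because Lasso did not retain them.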

Figure 3 Model gene selection. (A) LASSO path plot. (B) Binomial deviance plot of the Lasso model (the dashed line indicates the model gene number and lambda value corresponding to the minimum binomial deviance). (C) 10-fold cross-validation accuracy plot of SVM. (D) 10-fold cross-validation error rate plot of SVM. (E) Plot showing the relationship between the error rate and the number of trees in the RF model. (F) RF feature importance plot (the redder the circle, the higher the importance of the gene). (G) Venn diagram of model genes from the three machine learning methods. (H) Chromosomal distribution circular plot of model genes. (I) Correlation heatmap of model genes (the redder the color, the stronger the positive correlation between genes; the bluer the color, the stronger the negative correlation between genes). (J) Differential expression boxplot of model genes (blue represents the control group, red represents the tumor group). *, denotes P<0.05; **, denotes P<0.01; ***, denotes P<0.001. CV, Cross-Validation; LASSO, Least Absolute Shrinkage and Selection Operator; RF, Random Forest; SVM, Support Vector Machine.

Chromosomal localization of the 4 model genes showed that EXOSC4 is located on chromosome 8, ABCB1 on chromosome 7, MMP9 on chromosome 20, and SOX2 on chromosome 3 (Figure 3H). Correlation analysis of the 4 genes using the merged dataset revealed a significant positive correlation between the expression of EXOSC4 and MMP9, a significant negative correlation between EXOSC4 and ABCB1, and a significant negative correlation between MMP9 and ABCB1 (Figure 3I). Furthermore, all 4 genes were significantly differentially expressed in the merged dataset (Figure 3J).

Model construction and validation

The combined dataset was used as the training set, and GSE87211 served as the validation set to plot the ROC curves for the model genes and the overall ROC curve. The results showed that the overall ROC curve for the training set had an area under the curve (AUC) [95% confidence interval (CI)] value of 0.902 (0.850–0.941) (Figure 4A,4B), while the overall ROC curve for the validation set had an AUC (95% CI) value of 0.961 (0.938–0.979) (Figure 4C,4D). A nomogram was constructed using the training set data (Figure 4E). The calibration curve and decision curve for the training set indicated that the CRC prediction model, constructed using the four genes (EXOSC4, MMP9, ABCB1, and SOX2), exhibited high predictive efficacy (Figure 4F,4G). Similarly, the calibration curve and decision curve for the validation set also demonstrated that the CRC prediction model constructed with the four genes (EXOSC4, MMP9, ABCB1, and SOX2) maintained high predictive efficacy (Figure 4H,4I).

Figure 4 Model construction and validation. (A) ROC curve for model genes in the training set. (B) Overall ROC curve for the training set model. (C) ROC curve for model genes in the validation set. (D) Overall ROC curve for the validation set model. (E) Nomogram for the training set (higher total scores indicate a higher risk of disease occurrence). (F) Calibration curve for the training set (the closer the corrected curve is to the 45° diagonal line, the more robust and accurate the model). (G) Decision curve for the training set (the more the red line is above the horizontal line, the better the model’s predictive performance and net benefit). (H) Calibration curve for the validation set (the closer the corrected curve is to the 45° diagonal line, the more robust and accurate the model). (I) Decision curve for the validation set (the more the red line is above the horizontal line, the better the model’s predictive performance and net benefit). AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic.
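Decision curves such as those in Figure 4G and 4I plot net benefit against the threshold probability. Under the commonly used definition, net benefit at a threshold is the true-positive rate per sample minus the false-positive rate per sample weighted by the odds of the threshold. The sketch below is a plain-Python illustration of that definition. The function name and toy values are hypothetical, and the authors' implementation is not described in the text.

```python
def net_benefit(probabilities, labels, threshold):
    """Net benefit of treating samples whose predicted risk >= threshold.

    labels: 1 for tumor, 0 for normal.
    threshold: threshold probability, strictly between 0 and 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    true_pos = sum(1 for p, y in zip(probabilities, labels)
                   if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(probabilities, labels)
                    if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


# Toy predictions for illustration only; these are not data from the study.
toy_probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_net_benefit = net_benefit(toy_probabilities, toy_labels, threshold=0.5)
```

Evaluating the function over a grid of thresholds and comparing it with the "treat all" and "treat none" strategies yields the curves shown in a decision curve plot.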

Immune cell analysis and gene-drug network construction

The merged dataset, with batch effects removed, was used to score immune cells based on immune cell-related genes. Differential expression analysis of immune cells was conducted between the tumor and normal groups, and box plots were generated. The results showed significant differences in various immune cells between the tumor and normal groups, indicating differences in the immune microenvironment of the two groups (Figure 5A). Correlation analysis was then performed between the immune cell distribution and the model genes to examine the correlation between model genes and immune cell populations. The results revealed that MMP9 was significantly positively correlated with various immune cells, including CD56 bright natural killer cells, central memory CD8 T cells, immature dendritic cells, and macrophages, highlighting the important role of MMP9 in the tumor immunity of CRC (Figure 5B).

Figure 5 Gene-drug network construction. (A) Differential immune cell composition in the combined dataset (asterisks above indicate significant differences in immune cells between the tumor and normal groups). *, denotes P<0.05; **, denotes P<0.01; ***, denotes P<0.001. (B) Correlation heatmap of model genes and immune cells (the redder the square, the stronger the positive correlation between the two; the bluer the square, the stronger the negative correlation between the two). (C) Bar plot of drug-gene enrichment analysis. (D) Bubble plot of drug-gene enrichment analysis (the larger the circle, the more model genes enriched in that drug). (E) Gene-drug network diagram (orange circles represent model genes, purple squares represent drugs, and the lines indicate associations and enrichment between the genes and drugs).

For the gene-drug enrichment analysis of the four model genes, the results indicated that butyrate, dimethyl sulfoxide, and vorinostat were enriched with three or more model genes, suggesting these as potential targeted drugs for CRC treatment (Figure 5C,5D). The gene-drug network diagram showed that a total of 612 drugs were enriched in the model genes, with 84 drugs enriched in two or more model genes. The model genes were positioned at the central location of the gene-drug network (Figure 5E), demonstrating their feasibility as drug targets.

Gene-drug molecular docking

Molecular docking was performed between butyrate, dimethyl sulfoxide, and vorinostat (each enriched with three or more model genes) and the proteins encoded by the corresponding model genes (MMP9, ABCB1, SOX2, and EXOSC4). Docking of butyrate with MMP9 indicated strong binding affinity, suggesting that butyrate may inhibit MMP9 enzymatic activity, thereby disrupting extracellular matrix degradation and hindering cell migration, with possible antitumor and anti-inflammatory effects (Figure 6A). Docking with ABCB1 also showed good binding, indicating that butyrate may enhance drug absorption or efficacy by influencing the role of ABCB1 in drug efflux (Figure 6B). Docking with SOX2 suggested that butyrate could affect SOX2 function and thereby potentially influence tumor immune evasion and stem cell maintenance (Figure 6C). Docking of dimethyl sulfoxide with MMP9 showed strong affinity, suggesting that dimethyl sulfoxide may regulate MMP9 activity to inhibit tumor cell invasion and metastasis (Figure 6D). Docking with ABCB1 indicated that dimethyl sulfoxide could enhance the accumulation of certain drugs within cells, potentially helping to overcome drug resistance (Figure 6E). Docking with SOX2 suggested that dimethyl sulfoxide may regulate SOX2 function, affecting cell proliferation, differentiation, and immune evasion (Figure 6F). Docking of vorinostat with EXOSC4 demonstrated strong binding, suggesting that vorinostat may interfere with the role of EXOSC4 in RNA degradation, thereby modulating gene expression and affecting cell function and immune responses (Figure 6G). Docking with MMP9 indicated that vorinostat could reduce tumor cell invasiveness by inhibiting MMP9 activity (Figure 6H). Docking with ABCB1 suggested that vorinostat might enhance drug accumulation within cells, improving therapeutic efficacy (Figure 6I). Overall, the predicted binding of these drugs to the proteins encoded by the model genes suggests that they may contribute to antitumor activity, immune modulation, and drug delivery, providing theoretical support for tumor immunotherapy and drug combination strategies.

Figure 6 Gene-drug molecular docking. (A) Molecular docking of butyrate with MMP9. (B) Molecular docking of butyrate with ABCB1. (C) Molecular docking of butyrate with SOX2. (D) Molecular docking of dimethyl sulfoxide with MMP9. (E) Molecular docking of dimethyl sulfoxide with ABCB1. (F) Molecular docking of dimethyl sulfoxide with SOX2. (G) Molecular docking of vorinostat with EXOSC4. (H) Molecular docking of vorinostat with MMP9. (I) Molecular docking of vorinostat with ABCB1.

Discussion

CRC, as a malignant tumor of the colon and rectum, poses a significant disease burden. According to Global Burden of Disease (GBD) 2021 data (14), the burden of early-onset CRC increased significantly worldwide from 1990 to 2021, with the incidence rising from 5.43/100,000 to 6.13/100,000 [average annual percent change (AAPC) =0.39] and the prevalence increasing from 29.65/100,000 to 38.86/100,000 (AAPC =0.87). Because early symptoms and signs are often absent, CRC is frequently diagnosed at later stages. Early diagnosis of CRC is crucial for timely surgical intervention and for improving patient prognosis. Exosomes are widely distributed in human body fluids such as saliva, blood, breast milk, urine, and feces, offering the advantage of minimally invasive or non-invasive collection. Therefore, constructing a diagnostic model for CRC based on exosome-related genes is important for early diagnosis and intervention, with the aim of improving prognosis.

In this study, four genes (EXOSC4, MMP9, ABCB1, and SOX2) were identified as important exosome-related biomarkers for the diagnosis of CRC. Currently, research on EXOSC4 in the context of CRC is limited. Zhang et al. (15) found that elevated expression of EXOSC2, EXOSC3, EXOSC8, and EXOSC9 was significantly correlated with poor overall survival (OS), disease-specific survival (DSS), and progression-free interval (PFI) in head and neck squamous cell carcinoma. High expression of EXOSC2, EXOSC4, EXOSC5, and EXOSC9 was associated with advanced clinical stages, lymphovascular invasion, and poor treatment outcomes in head and neck squamous cell carcinoma, suggesting that the exosome complex (EXOSC) family plays a key role in RNA metabolism. This finding aligns with the present study, in which higher EXOSC4 expression indicated a higher risk of CRC, and is consistent with a key role of EXOSC4 in RNA metabolism. Veljkovic et al. (16) found that MMP9 promotes CRC progression through the ROS/NF-κB signaling pathway. This finding is consistent with the present study, in which higher MMP9 expression correlated with higher CRC risk. Additionally, Li et al. (17) showed that RAD23B promotes CRC metastasis through the Talin1/integrin/PI3K/AKT/MMP9 axis, indicating that MMP9 acts downstream of RAD23B in promoting CRC metastasis.

Existing studies of ABCB1 in CRC (17,18) suggest that ABCB1-mediated multidrug resistance (MDR) reduces the efficacy of CRC chemotherapy, whereas studies on the correlation between ABCB1 expression and CRC risk are limited. This study provides evidence of a negative correlation between ABCB1 expression and the risk of CRC. Hu et al. (19) reported that the VDR-SOX2 signaling pathway promotes the stemness and malignancy of CRC in an acidic microenvironment, which seems to contradict the present finding of a negative correlation between SOX2 expression and CRC risk. Two factors may explain this discrepancy. First, our study focuses on early-onset CRC, in which the hypoxic and acidic tumor microenvironment is less pronounced, which may weaken the proliferation-promoting role of SOX2. Second, our study examines exosome-related gene expression, whereas Hu et al. (19) used CRC tissue samples. Gene expression in tumor tissue reflects the biological characteristics of the tumor itself, whereas exosome gene expression is influenced by multiple cell populations in the tumor microenvironment (including immune cells and fibroblasts) and, when measured in blood, may reflect systemic changes rather than localized tumor features. SOX2 expression in exosomes may therefore capture systemic changes rather than local tumor characteristics alone.
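The directions of association described above (higher risk with higher EXOSC4 and MMP9, lower risk with higher ABCB1 and SOX2) can be illustrated with a minimal logistic risk score. This is a sketch under explicit assumptions: the logistic form, the standardized expression inputs, the intercept, and every coefficient value below are hypothetical and chosen only to show the signs; they are not the fitted model from this study.

```python
import math

# HYPOTHETICAL coefficients: only the signs follow the associations in the text.
# The magnitudes are illustrative and are NOT the study's fitted values.
COEFFICIENTS = {"EXOSC4": 0.8, "MMP9": 0.6, "ABCB1": -0.7, "SOX2": -0.5}
INTERCEPT = 0.0  # hypothetical


def crc_probability(expression: dict[str, float]) -> float:
    """Map standardized (z-scored) expression of the four genes to a probability."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[gene] * expression[gene] for gene in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


# A sample at the cohort mean for every gene, then one gene raised by 1 SD
baseline = {gene: 0.0 for gene in COEFFICIENTS}
high_mmp9 = {**baseline, "MMP9": 1.0}
high_sox2 = {**baseline, "SOX2": 1.0}

print(crc_probability(baseline))
print(crc_probability(high_mmp9))   # above baseline: positive coefficient
print(crc_probability(high_sox2))   # below baseline: negative coefficient
```

With these assumed signs, raising MMP9 moves the predicted probability above the baseline and raising SOX2 moves it below, mirroring the positive and negative correlations reported in the text.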

The use of EXOSC4, MMP9, ABCB1, and SOX2 as key exosome-related biomarkers for the diagnosis of CRC offers several advantages:

  • Strong biological relevance: EXOSC4, MMP9, ABCB1, and SOX2 have each been linked to the initiation or progression of cancer, including CRC. As exosome-related biomarkers, they are supported by biological evidence and can reflect the molecular characteristics and microenvironmental changes associated with the tumor.
  • Potential for early diagnosis: these four genes may exhibit differential expression at early stages of cancer, making a diagnostic model based on these genes suitable for accurate screening in the early phases of CRC, which could contribute to higher early detection rates and improved prognosis.
  • Diverse mechanisms: these genes are involved in various biological processes, such as cell migration, invasiveness, and drug resistance. Therefore, a diagnostic model based on these genes can account for multiple mechanisms of cancer biology, providing robust diagnostic capabilities.

However, there are several limitations to this approach:

  • Limited information from individual genes: while these genes are significant in CRC, any single gene on its own may lack the sensitivity and specificity needed to address the complexity of the disease. Additional biomarkers or clinical features may be necessary to achieve more accurate diagnostic results.
  • Lack of experimental validation: although these genes have been shown to be relevant in numerous studies, their clinical applicability has not been fully validated. Experimental validation and clinical trials are essential steps before this model can be applied in practice.
  • Challenges with exosome-based biomarkers: the use of exosomes as diagnostic biomarkers in cancer is still in the research stage. The collection, stability, and biological interpretation of exosomes could present challenges for the widespread application of this diagnostic model.

Compared with traditional single tumor markers, diagnostic models based on exosome-related genes may reflect the tumor’s biological characteristics more comprehensively and could offer higher sensitivity and specificity. By integrating immune cell differentials and molecular docking, this approach links exosome-related genes to the tumor microenvironment, which may improve the model’s precision. Such models could complement existing diagnostic methods in clinical practice, particularly in the early screening and monitoring of CRC. Additionally, molecular docking-based predictions of potential targeted drugs provide new avenues for personalized cancer treatment.

This study identified three potential targeted drugs for CRC based on the number of enriched model genes: butyrate, dimethyl sulfoxide, and vorinostat. Peng et al. (20) demonstrated that butyric acid alleviates CRC by mediating pyroptosis of M1 tumor-associated macrophages, supporting the efficacy of butyrate against CRC, which is consistent with the findings of this study. Tunçer et al. (21) showed that dimethyl sulfoxide treatment results in growth inhibition and reduced ROS formation, lending support to dimethyl sulfoxide as a candidate anti-CRC agent. Vorinostat, a histone deacetylase (HDAC) inhibitor approved for cutaneous T-cell lymphoma rather than an antifungal drug, has been less studied in the context of CRC. Hazrat Bilal’s review on the impact of fungi on cancer initiation, progression, and therapeutic responses highlighted the involvement of fungal metabolites in carcinogenesis, with antifungal drugs potentially interacting with anticancer drugs, including triggering adverse effects and affecting immune responses; because that review concerns antifungal agents, it bears on vorinostat only indirectly. Vorinostat therefore remains a computationally predicted candidate targeted drug for CRC that requires direct experimental support.

This study constructed and validated a CRC prediction model based on exosome-related genes and predicted potential targeted drugs, which may have clinical implications. However, the study has some limitations: (I) the gene expression differences between the tumor and normal groups were not validated in basic experiments; (II) the anticancer effects of the predicted targeted drugs were not validated in CRC cells cultured with drug-containing serum.

This study identifies EXOSC4, MMP9, ABCB1, and SOX2 as key exosome-related biomarkers for CRC, with potential for early detection and accurate diagnosis. These genes are implicated in critical pathways, such as IL-17 signaling and microRNA regulation in cancer, and may serve as markers for understanding tumor progression and therapeutic responses. Moreover, the identification of butyrate, dimethyl sulfoxide, and vorinostat as potential targeted drugs highlights possible therapeutic avenues for CRC and for personalized treatment strategies. The practical significance of this study lies in introducing exosome-related biomarkers as potential tools for non-invasive screening and early detection of CRC. With further development, these findings could help guide clinical decision-making and the design of targeted therapies.

However, several limitations must be considered. The biomarkers and candidate drugs were identified by bioinformatics analysis, so further experimental validation is required. The lack of extensive clinical data limits the direct application of these findings to clinical practice. Additionally, the role of these biomarkers in the dynamic tumor microenvironment and their interactions with other molecular pathways need further investigation. Future research should validate these biomarkers in larger, more representative patient cohorts and confirm their diagnostic and therapeutic significance experimentally. Exploring the mechanisms underlying the interactions between the model genes and the potential targeted drugs will also be crucial for understanding their therapeutic effects and for developing personalized treatment regimens for CRC patients.

In conclusion, while the findings of this study provide a promising foundation for the diagnosis and treatment of CRC, ongoing research is essential to fully translate these discoveries into clinical applications and realize their therapeutic potential.


Conclusions

EXOSC4, MMP9, ABCB1, and SOX2 are important exosome-related biomarkers for the diagnosis of CRC, and butyrate, dimethyl sulfoxide, and vorinostat hold promise as targeted therapeutic drugs for CRC.


Acknowledgments

Thanks to Mr. Chen Xu for his outstanding contribution to this article.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2789/rc

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2789/prf

Funding: The study was supported by the Pudong New Area Health Commission Discipline Leader Training Project (No. PWRd2022-15).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2789/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Cao M, Li H, Sun D, et al. Cancer burden of major cancers in China: A need for sustainable actions. Cancer Commun (Lond) 2020;40:205-10.
  2. Pinheiro M, Moreira DN, Ghidini M. Colon and rectal cancer: An emergent public health problem. World J Gastroenterol 2024;30:644-51.
  3. Abdel-Bar HM, Tandiono S, Liam-Or R, et al. Optimizing Exosome Lipid Hybrid Nanoparticles for Enhanced siRNA Delivery and Improved Therapeutic Anticancer Efficacy In Vivo. ACS Nano 2025;19:42658-74.
  4. Pan Y, Zhang B, Li Z, et al. Extracellular vesicles-associated tRNA-derived fragments: Emerging insights into cancer progression and clinical application potential. Genes Dis 2026;13:101682.
  5. Sun Z, Xia W, Lyu Y, et al. Immune-related gene expression signatures in colorectal cancer. Oncol Lett 2021;22:543.
  6. Chen Z, Ren R, Wan D, et al. Hsa_circ_101555 functions as a competing endogenous RNA of miR-597-5p to promote colorectal cancer progression. Oncogene 2019;38:6017-34.
  7. Vlachavas EI, Pilalis E, Papadodima O, et al. Radiogenomic Analysis of F-18-Fluorodeoxyglucose Positron Emission Tomography and Gene Expression Data Elucidates the Epidemiological Complexity of Colorectal Cancer Landscape. Comput Struct Biotechnol J 2019;17:177-85.
  8. Hu Y, Gaedcke J, Emons G, et al. Colorectal cancer susceptibility loci as predictive markers of rectal cancer prognosis after surgery. Genes Chromosomes Cancer 2018;57:140-9.
  9. Hu L, Wan S, Song X. Association between SQSTM1 dysregulation and risk in alopecia areata: a Mendelian randomization study. Front Immunol 2025;16:1652444.
  10. Yang X, Xiao Y, Liu D, et al. Factors Influencing Adoption of Large Language Models in Health Care: Multicenter Cross-Sectional Mixed Methods Observational Study. J Med Internet Res 2025;27:e84918.
  11. Yao B, Yang D, Fu C, et al. A core stemness-associated module reveals PLK1, NUF2, KIF23, CDCA8, TOP2A, CENPF, AURKA, and ASPM as key genes in rectal cancer. Eur J Med Res 2025;31:75.
  12. Hoang NM, Park K. Integrated approaches to testing and assessment for the endocrine disrupting activity of tartrazine based on adverse outcome pathways and OECD frameworks. Food Chem Toxicol 2026;208:115867.
  13. Ranstam J, Cook JA. LASSO regression. Journal of British Surgery 2018;105:1348.
  14. Meng Y, Tan Z, Zhen J, et al. Global, regional, and national burden of early-onset colorectal cancer from 1990 to 2021: a systematic analysis based on the global burden of disease study 2021. BMC Med 2025;23:34.
  15. Zhang X, Zhao M, Chu T, et al. Comprehensive bioinformatics analysis of EXOSC family genes in head and neck squamous cell carcinoma. Sci Rep 2025;15:30361.
  16. Veljkovic A, Stanojevic G, Brankovic B, et al. MMP-9 Activation via ROS/NF-κB Signaling in Colorectal Cancer Progression: Molecular Insights and Prognostic-Therapeutic Perspectives. Curr Issues Mol Biol 2025;47:557.
  17. Li YC, Xiong YM, Long ZP, et al. ML210 Antagonizes ABCB1- Not ABCG2-Mediated Multidrug Resistance in Colorectal Cancer. Biomedicines 2025;13:1245.
  18. Liu YS, Hsu HC, Tseng KC, et al. Lgr5 promotes cancer stemness and confers chemoresistance through ABCB1 in colorectal cancer. Biomed Pharmacother 2013;67:791-9.
  19. Hu PS, Li T, Lin JF, et al. VDR-SOX2 signaling promotes colorectal cancer stemness and malignancy in an acidic microenvironment. Signal Transduct Target Ther 2020;5:183.
  20. Peng S, Liu Z, Song Z, et al. Vinegar-processed frankincense extracts alleviate colorectal cancer by butyric acid mediating M1 tumor-associated macrophage pyroptosis. Chin Med 2025;20:208.
  21. Tunçer S, Gurbanov R, Sheraj I, et al. Low dose dimethyl sulfoxide driven gross molecular changes have the potential to interfere with various cellular processes. Sci Rep 2018;8:14828.
Cite this article as: Yin Y, Li L, Liu S, Xie Y, Li Y, Gao Y, Xu C, Wang Y. Construction of a diagnostic model for colorectal cancer based on exosome-related genes: integration of immune cell differentials and molecular docking. Transl Cancer Res 2026;15(4):290. doi: 10.21037/tcr-2025-1-2789
