Developing a prognostic risk model based on circulating tumor cell genes to predict prognosis and provide potential therapeutic strategies in colorectal cancer

Yupeng Zheng; Mian Yang; Hongyi Yi; Tao Peng; Jiaze Sun; Jiazi Yu

doi:10.21037/tcr-2024-2268

Original Article

Department of Colon Anorectal Surgery, Ningbo Medical Center LiHuiLi Hospital, Ningbo, China

Contributions: (I) Conception and design: Y Zheng, J Yu; (II) Administrative support: J Yu, M Yang; (III) Provision of study materials or patients: Y Zheng, H Yi; (IV) Collection and assembly of data: Y Zheng, J Yu, M Yang, T Peng, J Sun; (V) Data analysis and interpretation: Y Zheng, H Yi; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Jiazi Yu, MD, PhD. Department of Colon Anorectal Surgery, Ningbo Medical Center LiHuiLi Hospital, No. 1111 Jiangnan Road, Ningbo 315000, China. Email: lhlyujiazi@nbu.edu.cn.

Background: Colorectal cancer (CRC) is a major cause of cancer-related deaths worldwide. Understanding the genetic and molecular alterations in CRC can improve patient outcomes. Circulating tumor cells (CTCs) are crucial in cancer metastasis and progression. Analyzing the differentially expressed genes (DEGs) between CTCs and CRC may provide us with new therapeutic strategies. Therefore, this study aims to analyze these DEGs to construct a prognostic risk model that predicts the outcomes of CRC patients and guides clinical treatment.

Methods: We analyzed The Cancer Genome Atlas (TCGA) database to identify 1,727 DEGs between CRC and normal samples, and GSE82198 data to find 3,564 DEGs between CTCs and primary CRC samples. Using enrichment analysis, least absolute shrinkage and selection operator (LASSO) regression, and stepwise Cox regression, we derived eight model genes to construct a prognostic risk model. Various algorithms were employed in the immune microenvironment analysis. Integrating clinical factors with risk grouping, we developed a nomogram. We assessed chemotherapy sensitivity and epithelial-mesenchymal transition (EMT) scores in high-/low-risk groups and explored model gene expression at the single-cell level.

Results: We constructed a prognostic risk model for CRC based on eight DEGs of CTCs. The model effectively predicted treatment outcomes and correlated closely with actual prognosis. Through immune microenvironment analysis, we revealed differences in immune cell infiltration and checkpoint gene expression among different risk groups. Moreover, patients in the high-risk group showed higher sensitivity to chemotherapy drugs compared to those in the low-risk group.

Conclusions: The prognosis model based on CTCs’ DEGs can effectively predict patient outcomes, facilitating precision treatment for patients. This model holds significant guiding implications for immunotherapy and chemotherapy in CRC, offering potential strategies for the clinical treatment of CRC.

Keywords: Colorectal cancer (CRC); circulating tumor cell (CTC); differentially expressed genes (DEGs); prognosis

Submitted Nov 15, 2024. Accepted for publication Mar 19, 2025. Published online May 16, 2025.

Highlight box

Key findings

• The study conducted a comprehensive analysis of differential genes related to circulating tumor cells (CTCs) in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases and developed a prognostic model for colorectal cancer (CRC) patients.

• A comprehensive immune microenvironment analysis was conducted, revealing differences in immune cell infiltration and checkpoint gene expression across risk groups.

• Prognostic models can predict the sensitivity of chemotherapy and immunotherapy in CRC patients, providing enhanced guidance for treatment decisions.

• Single-cell level analysis revealed that all eight model genes exhibited varying degrees of expression, with sushi repeat containing protein X-linked (SRPX) showing particularly prominent expression, potentially guiding future targeted therapy directions.

What is known and what is new?

• CTCs are closely linked to the metastasis of CRC, and constructing a prognostic model based on CTC-related genes can aid in predicting patient outcomes.

• The study conducted a comprehensive analysis of differential genes related to CTCs in the TCGA and GEO databases and developed a prognostic model for CRC.

What is the implication, and what should change now?

• We identified eight differential genes related to CTCs to construct a prognostic model, which demonstrated good sensitivity and specificity, as well as predictive value for sensitivity to chemotherapy and immunotherapy.

Introduction

According to statistics from the Global Cancer Epidemiology Database (GLOBOCAN 2020), an estimated 1.9 million new cases of colorectal cancer (CRC) and 935,000 deaths occurred globally in 2020, making CRC the third most common malignant tumor by incidence and the second by mortality (1). Furthermore, some studies indicate that metastasis is the primary cause of mortality among CRC patients, with the liver being the most common site of metastatic CRC (mCRC) (2-4). Consequently, the prognosis of CRC patients remains a clinically significant issue worth exploring.

In recent years, with the advancement of bioinformatics technology and the improvement of public databases, an ever-increasing number of prognostic markers for CRC have been discovered, and the prognostic models constructed on the basis of these markers can effectively predict the overall survival of CRC patients (5,6). However, the precise molecular mechanisms underlying CRC metastasis remain poorly understood. Circulating tumor cells (CTCs) are the direct cause of cancer metastasis, and these cells drive metastasis by disseminating from primary tumors to seed metastases in distant organs (7). Some studies suggest that these rare cells can serve as cancer biomarkers to predict prognosis or help identify potential therapeutic targets (8-12). Moreover, the differences in gene expression between primary CRC cells and circulating tumor cells are closely related to tumor metastasis (13). Therefore, further exploration of the differentially expressed genes (DEGs) in CTCs may provide a more accurate assessment of patient prognosis and facilitate precision therapy for CRC patients.

In this study, we downloaded gene datasets for CRC and CTCs from the public databases The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). Through comprehensive analysis, we identified the DEGs between CRC and normal, as well as between CTCs and primary CRC, and extracted the intersecting genes for inclusion in the study. Subsequently, we used univariate Cox regression and least absolute shrinkage and selection operator (LASSO) regression to obtain eight CTCs’ DEGs. Based on these eight genes, we developed a prognostic model for CRC, which demonstrated favorable results in both the receiver operating characteristic (ROC) curves and the Kaplan-Meier curve analyses. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2024-2268/rc).

Methods

Data collection

The gene expression level data and clinical information for patients were obtained from the TCGA database (http://gdc.cancer.gov). The inclusion criteria for patient samples were as follows: (I) the dataset provides complete gene expression profiles, and (II) the dataset provides comprehensive clinical information, including age, sex, TNM staging, and overall survival. A total of 600 CRC samples were included, and 49 normal samples were also obtained from the TCGA database. Additionally, the GSE72970 dataset from the GEO (https://www.ncbi.nlm.nih.gov/) database was utilized as a validation set, incorporating 124 CRC samples. From the GSE82198 dataset, three CTC samples and three primary CRC samples were included. Single-cell data were sourced from the GSE146771 dataset. Immunotherapy data were derived from the GSE91061 dataset, which includes complete clinical information for 101 patients with malignant melanoma, encompassing overall survival data, follow-up information, and immunotherapy efficacy data. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Screening for significant DEGs

Based on the TCGA database and the GSE82198 dataset, differential analysis was conducted for the CRC vs. normal and CTC vs. primary CRC comparisons using the “limma” package of R, with the Benjamini & Hochberg method used for multiple testing correction. The two groups of DEGs were screened with a threshold of adjusted P value <0.05 and |log2FC| ≥1. Then, the “clusterProfiler” package of R was used for Gene Ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the two groups of DEGs, again with Benjamini & Hochberg correction. DEGs were screened in both groups using thresholds of adjusted P value <0.05 and count >10. Finally, the significant DEGs were obtained by intersecting the two sets of screened DEGs.
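The screening and intersection steps described above can be illustrated with a minimal Python sketch. It is not the authors' limma pipeline: it assumes that per-gene log2 fold changes and raw P values are already available, and the gene names, values, and the helper names `benjamini_hochberg` and `screen_degs` are hypothetical and used only for illustration.

```python
from typing import Dict, List, Set, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices sorted by ascending raw P value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_degs(stats: Dict[str, Tuple[float, float]],
                p_cut: float = 0.05, lfc_cut: float = 1.0) -> Set[str]:
    """stats maps gene -> (log2FC, raw P value).

    Keeps genes with adjusted P < p_cut and |log2FC| >= lfc_cut.
    """
    genes = list(stats)
    adj = benjamini_hochberg([stats[g][1] for g in genes])
    return {g for g, q in zip(genes, adj)
            if q < p_cut and abs(stats[g][0]) >= lfc_cut}


# Made-up statistics for two comparisons (hypothetical genes and values).
crc_vs_normal = {"GENE_A": (2.1, 0.001), "GENE_B": (-1.4, 0.004),
                 "GENE_C": (0.3, 0.0005), "GENE_D": (1.8, 0.60)}
ctc_vs_primary = {"GENE_A": (-1.2, 0.002), "GENE_B": (0.5, 0.001),
                  "GENE_C": (1.6, 0.010), "GENE_D": (2.5, 0.003)}

degs_crc = screen_degs(crc_vs_normal)
degs_ctc = screen_degs(ctc_vs_primary)
# Genes that pass the thresholds in both comparisons.
shared = degs_crc & degs_ctc
print(sorted(shared))
```

In this toy input only one gene passes both filters, which mirrors how the intersecting gene set is expected to be much smaller than either DEG list.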

Construction and validation of prognostic risk model

The significant DEGs were subjected to univariate Cox regression analysis using the “survival” package of R. A threshold of P value <0.05 was chosen to select genes whose expression levels are significantly associated with prognosis. The LASSO regression algorithm was then applied to these prognostically significant genes using the “glmnet” package of R, with the penalty parameter tuned through 10-fold cross-validation to select key genes. Subsequently, the “survminer” package of R was used to conduct stepwise Cox regression analysis on the key genes to select the model genes. We multiplied the expression levels of the model genes by the corresponding regression coefficients obtained from the stepwise Cox regression analysis and performed an exponential operation to derive a comprehensive risk score. The specific formula is as follows:

$\text{Risk score} = h(t, X) = h_0(t) \times \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)$ [1]

In this formula, β represents the regression coefficients, h(t, X) represents the hazard rate associated with the covariate [gene expression level (X)] at a specific time (t), while h0(t) denotes the baseline hazard rate, reflecting the underlying risk in the absence of covariate effects. Using the risk score calculation formula, we calculated the risk score for each sample in the training set, which is the TCGA dataset. Subsequently, based on the median risk score, we divided the samples into a high-risk group (risk score greater than the median) and a low-risk group (risk score less than or equal to the median). To evaluate the prognostic predictive ability of different risk groups, we plotted Kaplan-Meier survival curves and time-dependent ROC (tROC) curves, and calculated their concordance index (C-index) values and area under the curve (AUC) values. Furthermore, we utilized samples from the GEO dataset as a validation set. Within the validation samples, we employed the stepwise Cox regression analysis to compute the regression coefficients corresponding to the model genes derived from the training set. Thereafter, the regression coefficients corresponding to the model genes and their expression levels were substituted into the risk score calculation formula to calculate the risk score for each sample. Based on the median risk score, we once again divided the samples into a high-risk group and a low-risk group, and plotted Kaplan-Meier survival curves and tROC curves to further validate the effectiveness of the model.
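As a rough illustration of the risk score and median split described above, the following Python sketch computes the exponential term of the formula for a few samples. It assumes that the baseline hazard h0(t) is shared by all samples and can therefore be dropped when samples are only compared with each other; the coefficients, gene names, and expression values are hypothetical and are not the fitted values of the model.

```python
import math
from statistics import median
from typing import Dict

# Hypothetical regression coefficients (not the fitted model values).
coefficients: Dict[str, float] = {"GENE_1": 0.30, "GENE_2": -0.20}


def risk_score(expression: Dict[str, float], coefs: Dict[str, float]) -> float:
    """exp(sum of beta_i * X_i); the shared baseline hazard is omitted (assumption)."""
    linear_predictor = sum(coefs[g] * expression[g] for g in coefs)
    return math.exp(linear_predictor)


def split_by_median(scores: Dict[str, float]) -> Dict[str, str]:
    """High risk: score above the median; low risk: at or below the median."""
    cutoff = median(scores.values())
    return {s: ("high" if v > cutoff else "low") for s, v in scores.items()}


# Made-up expression profiles for four samples.
samples = {
    "S1": {"GENE_1": 1.0, "GENE_2": 2.0},
    "S2": {"GENE_1": 3.0, "GENE_2": 1.0},
    "S3": {"GENE_1": 2.0, "GENE_2": 2.0},
    "S4": {"GENE_1": 0.0, "GENE_2": 1.0},
}

scores = {name: risk_score(expr, coefficients) for name, expr in samples.items()}
groups = split_by_median(scores)
print(groups)
```

Because the grouping depends only on the ordering of the scores, omitting h0(t) does not change which samples fall above or below the median.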

Immunological analysis

This study used the cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT) algorithm (14), microenvironment cell populations counter (MCPcounter) algorithm (15), and single-sample gene set enrichment analysis (ssGSEA) algorithm (16) to respectively assess the proportions of immune cell types in the analyzed tumor samples. Subsequently, based on the estimation of stromal and immune cells in malignant tumors using expression data (ESTIMATE) algorithm, we utilized the “ESTIMATE” package in R to calculate the ESTIMATE score, immune score, and stromal score. The differences in infiltration among different risk groups were assessed using the Wilcoxon test. Additionally, immune checkpoint gene expression data were extracted based on TCGA expression data. The expression differences of immune checkpoint genes among different risk groups were compared using the Wilcoxon test. Furthermore, the ssGSEA algorithm calculated hallmark gene sets enrichment scores. The correlation between the risk score and hallmark gene set enrichment scores was then analyzed.
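The group comparisons above rely on the Wilcoxon rank-sum (Mann-Whitney) test. A minimal Python sketch of that test is given below under simplifying assumptions: it uses the large-sample normal approximation without continuity or tie corrections, so its P values can differ from those of R's `wilcox.test`, which may use exact methods for small samples. The infiltration values and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rank_with_ties(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided P value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Made-up immune infiltration scores for two risk groups.
high_risk = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16]
low_risk = [0.21, 0.25, 0.19, 0.27, 0.22, 0.24]
u_stat, p_val = wilcoxon_rank_sum(high_risk, low_risk)
print(u_stat, p_val)
```

A rank-based test of this kind is a natural choice for infiltration scores and estimated IC50 values because it does not assume normally distributed data.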

Construction of nomogram

To further investigate the prognostic independence between clinical prognostic factors and risk group, we included clinical factors and risk group from the training set in univariate and multivariate Cox regression analyses. We used a P value threshold of less than 0.05 to screen for independent prognostic factors and to generate forest plots. Based on those above-screened independent prognostic factors, we constructed a nomogram using the “rms” package of R. Based on the nomogram model, calibration curves for one, three, and five years were plotted.

Drug sensitivity analysis

We estimated how sensitive each patient is to chemotherapy drugs using the Genomics of Drug Sensitivity in Cancer (GDSC; https://www.cancerrxgene.org/) database. We quantified the half-maximal inhibitory concentration (IC50) using the “pRRophetic” package of R. Then, we compared the IC50 differences of 138 chemotherapy drugs between different risk groups using the Wilcoxon test.

Prediction of immunotherapy response

We utilized the Tumor Immune Dysfunction and Exclusion (TIDE) database (http://tide.dfci.harvard.edu/) to predict each patient’s response to immune checkpoint therapy and represented it using TIDE scores. Based on the response of patients to immunotherapy in the GSE91061 dataset, patients were classified into progressive disease (PD), stable disease (SD), partial response (PR), and complete response (CR) groups. We computed the risk score for patients in the GSE91061 dataset using normalized raw counts. Subsequently, we stratified the samples into high- and low-risk groups using the median risk score as the threshold. Finally, we analyzed the relationship between the efficacy of programmed death-ligand 1 (PD-L1) inhibitors and the risk grouping.

Correlation between prognostic models and epithelial-mesenchymal transition (EMT)

To explore the association between the prognostic model and EMT, we searched the MSigDB database (http://www.gsea-msigdb.org/gsea/msigdb/index.jsp) for EMT-related gene sets using “HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION” as the search term. Using the ssGSEA algorithm, we calculated the EMT score and compared the differences in EMT scores between the high- and low-risk groups. We then divided the samples into high and low EMT score groups using the optimal cutoff value. Finally, we used Kaplan-Meier analysis in R to assess the association between EMT score group and actual prognosis.
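For readers less familiar with Kaplan-Meier analysis, the product-limit estimator behind the survival curves can be sketched in a few lines of Python. This is a bare illustration with made-up follow-up times, not the R workflow used in the study; the function name `kaplan_meier` and the data are hypothetical, and no confidence intervals or log-rank test are computed.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the event (death) was observed and 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Made-up follow-up data (months) for one EMT score group.
follow_up = [1, 2, 3, 4]
observed = [1, 0, 1, 1]  # the subject at month 2 is censored
print(kaplan_meier(follow_up, observed))
```

Computing one such curve per group (for example, high versus low EMT score) and plotting them together gives the kind of comparison described above.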

Single-cell analysis

We leveraged the Tumor Immune Single Cell Hub 2 (TISCH2), a single-cell RNA sequencing database focused on the tumor microenvironment (TME), to analyze the expression of model genes obtained from TCGA across different cell types. This further revealed variations in the TME among patients with CRC, thereby explaining the heterogeneity of CRC to some extent.

Statistical analysis

We identified the DEGs of CTCs through differential analysis. Subsequently, we obtained eight model genes using univariate and multivariate Cox regressions as well as LASSO regression. We used Kaplan-Meier curves and ROC curves at one-year, three-year, and five-year time points to evaluate the prognostic predictive ability of the model in the training cohort, and we further validated the feasibility of the model in the validation cohort using the same methods. Additionally, we constructed a nomogram that incorporated clinical risk factors and further explored the application of this prognostic risk model in the context of the immune microenvironment, immunotherapy, chemotherapy drug sensitivity, EMT levels, and single-cell analysis. Graph plotting and statistical analysis were conducted using R Studio 4.0.4. The distinction between the two groups was determined through either the paired two-tailed Student’s t-test or the Mann-Whitney-Wilcoxon test. Statistical significance was established at P<0.05.

Results

Clinical characteristics of patients

As presented in Table 1, this study encompassed data from 600 CRC patients obtained from the TCGA database. The mean age of the cohort was 66.1 years, comprising 289 males and 311 females. In terms of the pathologic T classification, the distribution of patients was as follows: 20 patients were classified as T1, 118 as T2, 401 as T3, and 61 as T4. Regarding the pathologic N classification, the patient distribution was as follows: 339 patients were categorized as N0, 153 as N1, and 108 as N2. According to the pathologic M classification, the study included 515 patients with M0 and 85 patients with M1. For the stage classification, the distribution was as follows: 113 patients were in Stage I, 221 in Stage II, 181 in Stage III, and 85 in Stage IV.

Table 1

Clinical characteristics of patients with colorectal cancer in The Cancer Genome Atlas

Characteristic	Value (n=600)
Age (years), mean ± SD	66.1±12.1
Gender, n (%)
Male	289 (48.2)
Female	311 (51.8)
Pathologic T, n (%)
T1	20 (3.3)
T2	118 (19.7)
T3	401 (66.8)
T4	61 (10.2)
Pathologic N, n (%)
N0	339 (56.5)
N1	153 (25.5)
N2	108 (18.0)
Pathologic M, n (%)
M0	515 (85.8)
M1	85 (14.2)
Stage, n (%)
I	113 (18.8)
II	221 (36.8)
III	181 (30.2)
IV	85 (14.2)

SD, standard deviation.

Identification of DEGs and enrichment analysis

We identified 3,564 DEGs between CTCs and primary CRC samples from the GSE82198 dataset (Figure 1A), as well as 1,727 DEGs between CRC and normal samples from the TCGA database (Figure 1B). Subsequently, GO and KEGG pathway enrichment analyses were conducted for the DEGs in each group to explore the functional terms associated with the key genes. The enrichment analysis results displayed the top seven genes ranked by adjusted P value (Figure 1C,1D). Additionally, 409 intersecting genes were obtained between the two sets of DEGs (Figure 1E). Finally, a heatmap of the intersecting genes was generated, as shown in Figure 1F,1G.

Figure 1 Comparative analysis and functional enrichment of DEGs in CRC and CTCs. (A) Volcano plots comparing CTCs versus primary CRC group; (B) volcano plots comparing CRC versus normal group, showing the effect size (log₂FC) against −log₁₀(P value), with blue and red points representing downregulated and upregulated factors, respectively; (C) enrichment bubble plots for DEGs from CTCs versus primary CRC group; (D) enrichment bubble plots for DEGs from CRC versus normal group, with the vertical axis indicating pathway names and the color intensity reflecting the significance of the P values; (E) Venn diagram depicting the intersection of DEGs between the two groups; (F) heatmap of intersected genes based on CRC versus normal group; (G) heatmap of intersected genes based on CTCs versus primary CRC group. BP, biological processes; CC, cellular components; CRC, colorectal cancer; CTCs, circulating tumor cells; DEGs, differentially expressed genes; FC, fold change; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MF, molecular functions.

Construction and validation of the prognostic model

We conducted a univariate Cox regression analysis on the intersecting genes, thereby identifying 49 genes significantly associated with the prognosis of CRC (Figure 2A). Using the LASSO regression algorithm, we filtered down to 24 key genes (Figure 2B,2C). Moreover, we utilized a stepwise Cox regression algorithm to derive the optimal gene combination, ultimately obtaining eight model genes (Figure 2D). We constructed a risk model based on eight model genes. In the training set, we calculated the risk score for each sample and divided the patients into high-risk and low-risk groups using the median risk score value of 0.95 as a threshold (available online: https://cdn.amegroups.cn/static/public/tcr-2024-2268-1.pdf). The distribution of patients’ risk scores and survival times is shown in Figure 3A. We plotted Kaplan-Meier curves to assess the association between the classification of high-risk and low-risk groups and the actual prognosis of CRC patients (Figure 3B). Based on the risk model, we plotted ROC curves for one year, three years, and five years, with AUC values of 0.686 (95% CI: 0.611–0.754), 0.712 (95% CI: 0.646–0.772), and 0.739 (95% CI: 0.654–0.822), respectively (Figure 3C). We also calculated the C-index for the risk model, which is 0.688. In the validation set, we also calculated the risk score for each sample and divided the samples into high-risk and low-risk groups using the median risk score value of 1.01 as a threshold (Table S1). The distribution of patients’ risk scores and survival times is shown in Figure 3D. We also plotted Kaplan-Meier curves to further validate the model (Figure 3E). Additionally, we generated ROC curves for one year, three years, and five years, with AUC values of 0.666 (95% CI: 0.564–0.762), 0.718 (95% CI: 0.615–0.819), and 0.737 (95% CI: 0.578–0.862), respectively (Figure 3F). The C-index value is 0.669 for the risk model.
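To make the reported C-index values easier to interpret, the following Python sketch shows one common way to compute Harrell's concordance index from survival times, event indicators, and risk scores. It is a simplified illustration with made-up data, and its handling of ties may differ from the R implementation used for the reported values; the function name `concordance_index` is hypothetical.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[int],
                      scores: Sequence[float]) -> float:
    """Fraction of comparable pairs in which the higher score has the shorter survival.

    A pair (i, j) is comparable when subject i had an observed event
    before subject j's follow-up time. Tied scores count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up data: survival time (months), event indicator, and risk score.
surv_time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [3.0, 0.5, 1.5, 2.0]
print(concordance_index(surv_time, event, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so values near 0.67 to 0.69 indicate moderate discrimination.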

Figure 2 Selection of model genes using LASSO and Cox regression. (A) Forest plot of genes significantly associated with the prognosis of colorectal cancer; (B) distribution of LASSO coefficients; (C) likelihood deviance of LASSO coefficients distribution, with two vertical dashed lines representing lambda.min (left line) and lambda.1se (right line); (D) forest plot of the multivariate Cox regression for the eight model genes. *, P<0.05; **, P<0.01; ***, P<0.001. AIC, Akaike information criterion; CI, confidence interval; LASSO, least absolute shrinkage and selection operator.

Figure 3 Prognostic analysis of colorectal cancer using risk scores in TCGA and GEO datasets. (A) Distribution of risk scores (upper plot) and survival status (lower plot) in the TCGA dataset; (B) Kaplan-Meier curves related to the prognosis of colorectal cancer based on the risk score prediction model in the TCGA dataset; (C) ROC curves for the gene prognostic features at one year, three years, and five years in the TCGA dataset; (D) distribution of risk scores (upper plot) and survival status (lower plot) in the GEO dataset; (E) Kaplan-Meier curves related to the prognosis of colorectal cancer based on the risk score prediction model in the GEO dataset; (F) ROC curves for the gene prognostic features at one year, three years, and five years in the GEO dataset. AUC, area under the curve; GEO, Gene Expression Omnibus; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.

Analysis of the immune microenvironment

In the training set, we employed the CIBERSORT algorithm to calculate the proportions of 22 types of immune cells in each sample. We then compared the differences in the proportions of various immune cells among different risk groups, identifying ten types of differentially infiltrating immune cells (DICs, Figure 4A). The ssGSEA algorithm was used to calculate the proportions of 28 types of immune cells in each sample. Similarly, we compared the differences in immune cell proportions among different risk groups, identifying 11 types of DICs (Figure 4B). Using the ESTIMATE algorithm, we computed immune and stromal scores and analyzed the differences in these scores regarding infiltration levels among different risk groups (Figure 4C). The MCPcounter algorithm revealed significant differences in four types of immune cells (Figure 4D). Additionally, we compared the differences in the expression of immune checkpoint genes among different risk groups, identifying seven types of significantly different immune checkpoint genes (Figure 4E). As shown in Figure 4F, we computed the correlation between risk scores and enrichment scores of hallmark gene sets.

Figure 4 Immune analysis between different risk groups. (A) Comparison of immune cell types with significant differences between different risk groups using CIBERSORT algorithms; (B) comparison of immune cell types with significant differences between different risk groups using ssGSEA algorithms; (C) comparison of immune cell types with significant differences between different risk groups using ESTIMATE algorithms; (D) heatmap of immune cells with significant differences between different risk groups based on the MCPcounter algorithm; (E) comparison of immune checkpoint genes with significant differences between different risk groups; (F) heatmap of the correlation between risk scores and gene sets. -, P>0.05; ., P≈0.05; *, P<0.05; **, P<0.01; ***, P<0.001; ****, P<0.0001. CIBERSORT, cell-type identification by estimating relative subsets of RNA transcripts; ESTIMATE, estimation of stromal and immune cells in malignant tumors using expression data; MCPcounter, microenvironment cell populations counter; ssGSEA, single-sample gene set enrichment analysis.

Nomogram model based on independent prognostic factors

We conducted univariate and multivariate Cox regression analyses on clinical factors and risk group of CRC samples in the training set to identify significant independent prognostic factors, as shown in Figure 5A,5B. To further analyze the correlation between these independent prognostic factors (age, pathologic T, stage, and risk group) and survival prognosis, we incorporated them into the construction of the nomogram model, as depicted in Figure 5C. The calibration curves for one, three, and five years were plotted to validate the nomogram (Figure 5D).

Figure 5 Screening independent risk factors in the TCGA dataset and constructing and validating a nomogram model. (A) The forest plot of univariate Cox regression analysis. (B) The forest plot of multivariate Cox regression analysis. (C) Nomogram for predicting survival rates based on independent prognostic factors. (D) Calibration plot comparing the one-year, three-year, and five-year survival rates predicted by the nomogram with the actual survival rates. CI, confidence interval; TCGA, The Cancer Genome Atlas.