Integrated analysis of uterine leiomyosarcoma and leiomyoma utilizing TCGA and GEO data: a WGCNA and machine learning approach

Zixin Yang; Fan Yang; Fanlin Li; Ying Zheng

doi:10.21037/tcr-2024-2465

Original Article

Zixin Yang^1,2#, Fan Yang^1,2#, Fanlin Li^1,2, Ying Zheng^1,2

¹Department of Obstetrics and Gynaecology, West China Second University Hospital, Sichuan University, Chengdu, China; ²Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, West China Second University Hospital, Sichuan University, Chengdu, China

Contributions: (I) Conception and design: Z Yang, F Yang, Y Zheng; (II) Administrative support: Y Zheng; (III) Provision of study materials or patients: Z Yang, F Yang; (IV) Collection and assembly of data: Z Yang, F Yang; (V) Data analysis and interpretation: Z Yang, F Yang, F Li; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work as co-first authors.

Correspondence to: Ying Zheng, MD. Department of Obstetrics and Gynaecology, West China Second University Hospital, Sichuan University, No. 20, Renmin South Road, Chengdu 610041, China; Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, West China Second University Hospital, Sichuan University, Chengdu, China. Email: zhy_chd@126.com.

Background: Uterine sarcoma is a gynecological mesenchymal tumor with an elusive pathogenesis. Uterine leiomyosarcoma (LMS), its most common subtype, is highly aggressive and carries a poor prognosis, and its genomic landscape remains unclear. In rare cases, LMS arises from leiomyoma (LM). We explored the genomic relationship between LMS and LM using public microarray data from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). Using bioinformatics analysis tools, we aimed to provide molecular insight into the pathogenesis of LMS and to discover novel predictive biomarkers for this disease.

Methods: Differentially expressed genes (DEGs) between LMS and LM were screened by analyzing three GEO datasets (GSE764, GSE68312 and GSE64763) together with TCGA data. A protein-protein interaction (PPI) network was constructed, and hub genes were identified using the cytoHubba plug-in of the Cytoscape software. In addition, weighted gene co-expression network analysis (WGCNA) was performed to identify hub genes. We took the intersection of the hub genes generated from the PPI network and WGCNA. Subsequently, random forest (RF) and support vector machine (SVM) algorithms were used to screen for key genes as predictive biomarkers. Finally, we constructed a nomogram with these genes.

Results: A total of 37 hub genes were selected using WGCNA. A total of 245 DEGs were identified; 63 DEGs were upregulated, and 182 DEGs were downregulated. Functional enrichment analysis revealed that these genes were mainly associated with the cell cycle, extracellular matrix receptor interactions and oocyte meiosis. The final hub genes were CENPA, KIF2C, TTK, MELK and CDC20. Gene set enrichment analysis (GSEA) revealed that these genes were mostly enriched in the cell cycle, mismatch repair and amino sugar and nucleotide sugar metabolism. Tumor-infiltrating immune cell analysis indicated that these genes did not have an obvious correlation with immune cells.

Conclusions: CENPA, KIF2C, TTK, MELK and CDC20 were key genes significantly associated with LMS and LM. Functional enrichment analysis and tumor-infiltrating immune cell analysis indicated that these genes might be correlated with tumor proliferation, which might shed light on the possible pathogenesis and predictive biomarkers of LMS.

Keywords: Uterine sarcoma; uterine leiomyoma; bioinformatics analysis; biomarkers; machine learning

Submitted Dec 05, 2024. Accepted for publication Mar 13, 2025. Published online May 13, 2025.

Highlight box

Key findings

• CENPA, KIF2C, TTK, MELK, and CDC20 were identified as key genes significantly associated with leiomyosarcoma (LMS) and leiomyoma (LM).

• Functional enrichment and immune cell analysis suggest their potential role in tumor proliferation, offering new insights into LMS pathogenesis and predictive biomarkers.

What is known and what is new?

• Uterine LMS is the most common uterine sarcoma subtype, characterized by high aggressiveness and poor prognosis. The diagnosis of LMS remains challenging due to its clinical similarities with benign LM.

• This study integrated data from two public cancer databases to identify LMS-associated genes and functional modules. Through weighted gene co-expression network analysis (WGCNA), protein-protein interaction (PPI) networks, and machine learning algorithms, we identified novel predictive biomarkers for LMS.

What is the implication, and what should change now?

• Our study suggests that CENPA, MELK, KIF2C, TTK, and CDC20 may represent potential biomarkers for uterine LMS. The findings provide preliminary insights into the molecular mechanisms underlying LMS development and its association with tumor proliferation. Future studies should include experimental validation to further investigate the biological relevance of these candidate biomarkers. Additional research exploring the therapeutic potential of these key genes in LMS treatment would be valuable. Moreover, integrating multi-omics data and analyzing more diverse patient cohorts could help improve the diagnostic accuracy and generalizability of these findings.

Introduction

Uterine sarcoma is a rare and severe gynecologic tumor that derives from mesenchymal cells and constitutes approximately 8% of uterine malignancies (1). The histological landscape of uterine sarcoma is complex, with the histological subtype being one of the main factors contributing to different clinical features and prognosis (2). Uterine leiomyosarcoma (LMS) is the most common subtype, accounting for approximately 70% of uterine sarcomas (3). LMS is a highly aggressive tumor and, despite its rarity, accounts for a considerable number of uterine cancer deaths. The prognosis of LMS patients is poor regardless of stage: even when the disease is limited to the uterus, 5-year overall survival rates are 51% for stage I patients and 25% for stage II patients (4). The pathogenesis of LMS remains elusive. Although most LMS occur de novo, rare cases of LMS arising from leiomyoma (LM) have been reported (5-8). LM is the most common type of uterine tumor. Because LMS and LM share similar signs and symptoms, diagnosing LMS and detecting the malignant tendency of LM at an early stage are difficult (4).

With the rapid development of genetic microarray technology, large quantities of cancer genetic data have been collected. A growing number of public cancer-related databases have catalyzed the application of bioinformatics analysis (9), which is thought to provide a better understanding of disease mechanisms and relationships between diseases (10). Biological data are large and complex, and machine learning, a process of fitting models to data or identifying groupings within data (11), is well suited to processing them. Despite the growing number of studies applying machine learning to bioinformatics analysis, it has not yet been applied to the analysis of public databases in the study of LM and LMS. This study therefore aims to provide molecular insight into LMS and to discover potential predictive biomarkers.

In the present study, cross-cohort data aggregation was conducted using three datasets from the Gene Expression Omnibus (GEO) database and data from The Cancer Genome Atlas (TCGA) database. We assessed the batch effect using principal component analysis (PCA). We then identified differentially expressed genes (DEGs) between LMS tissues and LM tissues and detected key modules through weighted gene co-expression network analysis (WGCNA). Protein-protein interaction (PPI) networks were constructed to identify hub genes. Genes identified as hub genes by both the PPI networks and WGCNA were extracted for further model fitting by random forest (RF) and support vector machine (SVM) methods. We present the final predictive biomarkers with a nomogram. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2024-2465/rc).

Methods

Data acquisition and processing

We used the keywords “leiomyosarcoma” and “leiomyoma” to search the GEO database and “myomatous neoplasms” to search the TCGA website. We included GEO datasets containing both LMS and LM samples. The gene expression profiles GSE764, GSE68312 (12) and GSE64763 (13) were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/, accessed September 22, 2023) (14), and the TCGA data were downloaded from the Genomic Data Commons (GDC) data portal (https://www.cancer.gov/ccg/research/genome-sequencing/tcga, accessed November 21, 2023). A total of 74 LMS samples and 37 LM samples were obtained. We then normalized the GEO data and performed log₂ transformation of the TCGA data, which yielded a single gene expression matrix combining the GEO and TCGA data. To assess and reduce possible batch effects, we used the ComBat function in the R package “sva” (https://cran.r-project.org/) (15), and principal component analysis (PCA) was used to visualize the correction.
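The two preprocessing ideas in this step, log₂ transformation and removal of dataset-specific shifts, can be illustrated with a minimal Python sketch. This is not the ComBat procedure used in the study: per-batch mean centering is a much simpler stand-in that only removes additive batch offsets, and the function names and the pseudocount of 1 are assumptions made for illustration.

```python
import math

def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each expression value.

    The pseudocount is an assumption of this sketch (it avoids log2(0));
    the source does not state which offset, if any, was used.
    """
    return [math.log2(v + pseudocount) for v in values]

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values of one gene.

    Simplified stand-in for batch correction, NOT ComBat: it removes only
    an additive per-batch offset and ignores variance differences.
    """
    totals = {}
    for v, b in zip(values, batches):
        s, n = totals.get(b, (0.0, 0))
        totals[b] = (s + v, n + 1)
    means = {b: s / n for b, (s, n) in totals.items()}
    return [v - means[b] for v, b in zip(values, batches)]
```

Note that centering of this kind would also remove genuine biological differences if a batch contained only one tumor type, which is one reason model-based methods such as ComBat are generally preferred in practice.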

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. As the GEO database and TCGA database are public and do not contain sensitive personal information, informed consent and institutional review board approval were not required.

WGCNA

The merged data were analyzed with the “WGCNA” package (16). WGCNA represents gene expression data as a similarity network. The topological overlap matrix (TOM), which measures how strongly pairs of genes are interconnected, is used to identify modules of highly co-expressed genes. We used automatic network construction and module detection and set the soft-thresholding power to 3. The minimum module size was set to 10. Among all the modules analyzed, the module with the largest absolute correlation coefficient with the trait and a statistically significant P value (P<0.05) was identified as the key module and subjected to further analysis. The hub genes were defined as those with a gene significance (GS) >0.3 and a module membership (MM) >0.8.
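The soft-thresholding step and the hub gene filter described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the “WGCNA” implementation: it assumes an unsigned network, in which the adjacency is the absolute Pearson correlation raised to the chosen power, and it assumes that the GS and MM cutoffs apply to absolute values. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expr, power=3):
    """Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** power.

    expr maps a gene name to its expression values across samples.
    """
    genes = list(expr)
    return {
        g: {h: 1.0 if g == h else abs(pearson(expr[g], expr[h])) ** power
            for h in genes}
        for g in genes
    }

def select_hub_genes(gs, mm, gs_cut=0.3, mm_cut=0.8):
    """Keep genes whose |GS| and |MM| both exceed the cutoffs.

    Using absolute values is an assumption of this sketch.
    """
    return sorted(g for g in gs if abs(gs[g]) > gs_cut and abs(mm[g]) > mm_cut)
```

Raising correlations to a power greater than 1 suppresses weak correlations much more than strong ones, so that the resulting network approximates a scale-free topology.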

Identification of DEGs

We utilized the R package “limma” from Bioconductor to screen for DEGs in the merged data (17,18). DEGs were screened based on the absolute value of the log₂ fold change (FC) and statistical significance, with thresholds set at |log₂ FC| >1.0 and P<0.05.
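As an illustration of how these thresholds partition genes, the following Python sketch classifies genes as upregulated or downregulated from a table of log₂ FC and P values. It does not reproduce the “limma” model fitting; the input format and the function name are assumptions, and any gene names passed to it are placeholders.

```python
def classify_degs(results, lfc_cut=1.0, p_cut=0.05):
    """Split genes into up- and downregulated DEGs.

    results maps a gene name to a (log2_fc, p_value) pair. A gene is a DEG
    when |log2 FC| > lfc_cut and P < p_cut, matching the thresholds in the text.
    """
    up, down = [], []
    for gene, (log_fc, p_value) in results.items():
        if p_value < p_cut and abs(log_fc) > lfc_cut:
            (up if log_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)
```

A gene with a large fold change but P≥0.05, or a significant gene with |log₂ FC| ≤1, is left out of both lists, which corresponds to the black dots in a volcano plot.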

PPI network construction and analysis

We used the Search Tool for the Retrieval of Interacting Genes (STRING; http://string-db.org) (version 12.0) (19) to construct and analyze the PPI network of the obtained DEGs. STRING is an online resource that visualizes protein function and interactions. The PPI files obtained from the STRING website were imported into Cytoscape (http://www.cytoscape.org) (version 3.10.2) (20) for further analysis. The MCODE plug-in for Cytoscape was used to identify the functional modules in the PPI network (21). The selection criteria were as follows: K-core =2, degree cutoff =2, max depth =100, and node score cutoff =0.2. We then utilized the cytoHubba plug-in to identify hub genes among the DEGs (22). cytoHubba was applied to calculate the 10 nodes that ranked the highest when using the maximal clique centrality (MCC) method.

Ultimately, the overlapping genes obtained by WGCNA and cytoHubba were identified as hub genes and visualized in a Venn diagram using Sangerbox (http://sangerbox.com/home.html) (23).
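To make the MCC ranking and the subsequent overlap step concrete, the sketch below scores nodes in a small undirected network. It assumes the commonly cited definition of MCC, in which the score of a node is the sum of (|C| − 1)! over all maximal cliques C that contain it; this is an illustrative reimplementation and not the cytoHubba code, and the function names are hypothetical.

```python
from math import factorial

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def mcc_scores(edges):
    """MCC(v) = sum of (|C| - 1)! over maximal cliques C containing v.

    edges is an iterable of (node_a, node_b) pairs; self-loops are ignored.
    """
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    scores = {v: 0 for v in adj}
    for clique in maximal_cliques(adj):
        for v in clique:
            scores[v] += factorial(len(clique) - 1)
    return scores

def top_hubs(scores, n):
    """Return the n highest-scoring nodes (ties broken alphabetically)."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]
```

The final hub set is then a plain set intersection, for example `set(top_hubs(scores, n)) & wgcna_hubs`, where `wgcna_hubs` is a hypothetical set holding the genes that passed the GS and MM filter. Because the factorial grows quickly, nodes that belong to large, densely connected cliques dominate the ranking.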

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses

The Database for Annotation, Visualization, and Integrated Discovery (DAVID, http://david.ncifcrf.gov) (24) is an online bioinformatics database that integrates biological data and analysis tools, providing users with a comprehensive set of gene and protein functional annotation information. GO annotation and KEGG pathway enrichment analyses of the identified hub genes were performed using DAVID. GO analysis included biological process (BP), cellular component (CC), and molecular function (MF) term enrichment. The results were considered statistically significant if P<0.05.

Predictive biomarker identification

RF is a powerful machine learning algorithm that excels in both classification and regression tasks. It is based on ensemble learning, in which numerous decision trees are trained and collectively make predictions. SVM is also a powerful machine learning algorithm; it is based on statistical learning theory and aims to find a decision boundary that separates points from different categories (10). We used the R package “caret” to run these two algorithms and used 10-fold cross-validation to select the parameters for RF (25). The RF gene importance ranking plot and the support vector machine recursive feature elimination (SVM-RFE) cross-validation error plot were plotted. In the RF analysis, we selected genes with a Mean Decrease Gini Index greater than 3. For the SVM approach, genes were ranked based on their log FC values, and the top 30 genes with the largest absolute values were input into the model. From the SVM-RFE cross-validation plot, we identified the number of genes at which the root mean squared error (RMSE) was minimized and selected the same number of top-ranked genes. Finally, the overlapping genes identified by both methods were chosen as the key genes for further analysis. Functionally related genes were identified by GeneMANIA (http://genemania.org) (26).
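The final selection rule, which combines the RF importance threshold with the SVM-RFE optimum, can be expressed compactly. The Python sketch below assumes that the model outputs are already available as plain dictionaries and lists; it does not train either model, and the function and argument names are hypothetical.

```python
def select_key_genes(gini, svm_ranking, rmse_by_n, gini_cut=3.0):
    """Intersect RF-selected and SVM-RFE-selected genes.

    gini        maps gene -> Mean Decrease Gini from the random forest.
    svm_ranking lists genes from most to least important according to SVM-RFE.
    rmse_by_n   maps a number of retained genes -> cross-validated RMSE.
    """
    # RF criterion: importance above the cutoff.
    rf_genes = {g for g, score in gini.items() if score > gini_cut}
    # SVM-RFE criterion: the subset size with the lowest cross-validation error.
    best_n = min(rmse_by_n, key=rmse_by_n.get)
    svm_genes = set(svm_ranking[:best_n])
    return sorted(rf_genes & svm_genes)
```

Requiring agreement between two different model families is a conservative design choice: a gene must be informative for both a tree ensemble and a margin-based classifier to be retained.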

Tumor-infiltrating immune cell analysis

The Tumor Immune Estimation Resource (TIMER, https://cistrome.shinyapps.io/timer/) (27) is an online tool that can be used to comprehensively analyze tumor immune microenvironment data from the TCGA database. It can be used to study the composition and function of immune cells in the tumor microenvironment, as well as to identify biomarkers associated with immune infiltration. We applied TIMER to study the potential correlation between the expression of selected hub genes and the infiltration of tumor-infiltrating immune cells, including B cells, CD4⁺ T cells, CD8⁺ T cells, neutrophils, macrophages, and dendritic cells. We also analyzed the relationship between the expression of the target genes and tumor purity.

Gene set enrichment analysis (GSEA)

For GSEA, we obtained the GSEA software (version 3.0) from the GSEA website (http://software.broadinstitute.org/gsea/index.jsp) (28). We performed single-gene GSEA for each key gene individually, dividing the samples into high-expression (≥50%) and low-expression (<50%) groups based on the expression level of the respective gene. The c2.cp.kegg.v7.4.symbols.gmt subset was downloaded from the Molecular Signatures Database (http://www.gsea-msigdb.org/gsea/downloads.jsp) (29) to evaluate the relevant pathways and molecular mechanisms. Based on the gene expression profile and phenotypic grouping, the minimum gene set size was set to 5 and the maximum to 5,000, and 1,000 resamplings were performed. A P value of <0.05 and a false discovery rate (FDR) of <0.25 were considered to indicate statistical significance.
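The grouping used for single-gene GSEA amounts to a median split of the samples on the expression of one gene. A minimal Python sketch follows, assuming that expression values are supplied per sample; the function name is hypothetical, and samples exactly at the median are placed in the high-expression group, consistent with the ≥50% rule stated above.

```python
from statistics import median

def split_by_median(expression):
    """Split samples into high (>= median) and low (< median) expression groups.

    expression maps a sample ID to the expression value of a single gene.
    """
    cutoff = median(expression.values())
    high = sorted(s for s, v in expression.items() if v >= cutoff)
    low = sorted(s for s, v in expression.items() if v < cutoff)
    return high, low
```

The two groups then serve as the phenotype labels for the enrichment run of that gene.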

Nomogram construction and verification

We constructed a nomogram using the R packages “rms” and “rmda” to illustrate the predictive ability of the selected hub genes. The calibration curve and decision curve were plotted to analyze the performance of the nomogram. The predictive performance of the models was evaluated through receiver operating characteristic (ROC) curve analysis, with calculation of accuracy, sensitivity, and specificity.
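The performance metrics named here can be computed directly from predicted risks and true labels. The Python sketch below is illustrative only: it computes the AUC as the probability that a randomly chosen LMS sample receives a higher predicted risk than a randomly chosen LM sample, with ties counted as one half, and it computes sensitivity and specificity at a fixed cutoff. The function names and the default cutoff of 0.5 are assumptions, not the settings used in the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels are 1 (LMS) or 0 (LM); both classes must be present.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when score >= threshold predicts LMS."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

The AUC does not depend on any cutoff, whereas sensitivity and specificity do, so reported values of the latter are only interpretable together with the cutoff that produced them.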

Statistical analysis

The statistical analysis was performed using the R language (version 4.3.1) and Microsoft Excel 2016 (Microsoft Corp., Redmond, WA, USA). Batch effects were corrected using the “sva” package. WGCNA identified co-expression modules, while “limma” detected DEGs (|log₂FC| >1, P<0.05). PPI networks were analyzed via STRING and Cytoscape (version 3.10.2), with hub genes screened using cytoHubba. Functional enrichment analysis was performed in DAVID, and biomarker selection used RF (Gini >3) and SVM-RFE. Immune infiltration was assessed via TIMER, and enrichment analysis was performed with the GSEA software (version 3.0). The nomogram was constructed using “rms” and “rmda”. We constructed the calibration curve, decision curve, and ROC curve in R and computed the corresponding performance metrics, including the area under the curve (AUC), sensitivity, and specificity. Higher AUC values generally indicate superior discriminative capacity of the model, reflecting more reliable predictive performance.

Results

Data aggregation and PCA

Figure 1 shows the workflow of this research. According to the screening criteria, 3 datasets (GSE68312, GSE764, and GSE64763) from the GEO database and 33 samples from the TCGA database were included for the analysis of DEGs. The final sample size was 111 patients, including 74 patients with LMS and 37 patients with LM. The main characteristics of the GEO datasets, including the GEO accession ID, sample information, platform ID, and platform name, are shown in Table 1. Due to the small number of datasets, we combined the data from the two databases, GEO and TCGA, for analysis. PCA was performed to reduce the dimensionality and visualize the data before and after adjusting for batch effects (Figure 2). The plots indicate that the batch effect among datasets was properly addressed.

Figure 1 Study workflow. DEGs, differentially expressed genes; GEO, Gene Expression Omnibus; GSE, Gene Expression Omnibus Series; GSEA, gene set enrichment analysis; GS, gene significance; LM, leiomyoma; LMS, leiomyosarcoma; MM, module membership; PCA, principal component analysis; PPI, protein-protein interaction; RF, random forest; SVM, support vector machine; TCGA, The Cancer Genome Atlas; TIMER, Tumor Immune Estimation Resource; WGCNA, weighted gene co-expression network analysis.

Table 1

Characteristics of the included datasets

Dataset ID	Number of samples	GPL ID	Platform name
GSE764	9 LM; 13 LMS	GPL80	[Hu6800] Affymetrix Human Full Length HuGeneFL Array
GSE68312	3 LM; 3 LMS	GPL6480	Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version)
GSE64763	25 LM; 25 LMS	GPL571	[HG-U133A_2] Affymetrix Human Genome U133A 2.0 Array

GPL, Gene Expression Omnibus Platform; GSE, Gene Expression Omnibus Series; LM, uterine leiomyoma samples; LMS, uterine leiomyosarcoma samples.

Figure 2 Principal component analysis. Results from the PCA for microarray studies downloaded from the GEO and TCGA databases. Dimensionality reduction plots of data distributions from datasets from different sources. Each point represents a sample. (A) Before batch effect adjustment. (B) After batch effect adjustment. GEO, Gene Expression Omnibus; PCA, principal component analysis; TCGA, The Cancer Genome Atlas.

WGCNA and identification of the hub genes

To identify the key module and hub genes, we performed WGCNA using the aggregated data (Figure 3). With a soft-thresholding power of 3, we identified six key modules. We found that among these modules, the turquoise module was the most relevant to clinical traits (correlation coefficient =0.54, P=1E−09). The turquoise module contained a total of 565 genes. By setting GS >0.3 and MM >0.8, we selected 37 hub genes. Enrichment analysis revealed that these hub genes were mainly associated with the cell cycle (Figure 4).

Figure 3 WGCNA results. (A) Soft threshold selection in the WGCNA network analysis. Analysis of the scale-free fit index (left) and the mean connectivity (right) for various soft-thresholding powers. (B) The cluster dendrogram illustrates the hierarchical relationships among co-expressed genes, constructed using a dissimilarity measure defined as 1-TOM value. (C) Module-trait relationships. Each cell contains the correlation coefficient and P value. (D) Correlation analysis of turquoise module in WGCNA. 1-TOM, one minus the topological overlap matrix; uLMS, uterine leiomyosarcoma; UL, uterine leiomyoma; WGCNA, weighted gene co-expression network analysis.

Figure 4 Functional enrichment analysis. Different colors represent different functions. Color intensity reflects the P value. (A) KEGG pathway analysis. (B) GO functional enrichment analysis of biological process. (C) GO functional enrichment analysis of cellular component. (D) GO functional enrichment analysis of molecular function. GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Identification of DEGs

Based on the analysis of the aggregated data in R and the cutoff criteria, 245 DEGs were found between the LM and LMS samples. Sixty-three DEGs were upregulated, and 182 DEGs were downregulated (Figure 5). KEGG enrichment analysis of these DEGs revealed that they were mainly associated with extracellular matrix receptor interactions, the cell cycle and focal adhesion. In terms of BPs in the GO analysis, we detected enrichment in cell population proliferation, response to endogenous stimulus and positive regulation of cell population proliferation. For the GO analysis of cellular components, the DEGs were mainly associated with collagen-containing extracellular matrix, external encapsulating structure and cyclin-dependent protein kinase holoenzyme complex. For MF, extracellular matrix structural constituent, cyclin-dependent protein serine/threonine kinase regulator activity and signaling receptor binding were the most enriched terms.

Figure 5 DEGs analysis. Identification of DEGs. (A) Volcano plot of leiomyoma and LMS patients after merging and removing any batch effects. Red indicates up-regulation and blue represents down-regulation. Black dots denote genes with |log FC| <1.0 and/or non-significant differential expression (P≥0.05). (B) Heatmap showing DEGs between leiomyoma and LMS patients. Each row represents one gene. Red indicates up-regulation and blue represents down-regulation. DEGs, differentially expressed genes; FDR, false discovery rate; FC, fold change; LMS, leiomyosarcoma; uLMS, uterine leiomyosarcoma.

Hub genes selected by PPI network analysis of DEGs

We analyzed the PPI network using the STRING database and visualized the network with Cytoscape (Figure 6A). The network contains 213 nodes and 1,164 interaction pairs. We used the MCODE plug-in to analyze functional modules within the network and identified 7 functional modules in total. Here, we display the two modules with the highest MCODE scores (Figure 6B,6C), which include 42 DEGs. KEGG pathway analysis revealed that these genes are involved mainly in the cell cycle, oocyte meiosis and cancer pathways (Figure 6D). Using the MCC algorithm in the cytoHubba plug-in, we identified the top 20 hub genes.

Figure 6 PPI network and significant modules. The PPI network construction and functional module identification (A-C); the size and shading of the circles represent the weight of each gene in (A). KEGG enrichment analysis of the modular genes (D). The size of the circle represents the number of genes involved, and the x-axis represents the ratio of the genes involved in the term to the total number of genes. FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; PPI, protein-protein interaction.

Identification of predictive biomarkers by RF/SVM

We intersected the hub genes obtained from WGCNA with the hub genes obtained from PPI network analysis (Figure 7). A total of 19 genes were obtained. The resulting genes were screened using RF and SVM. Genes were ranked according to their importance in RF (Figure 8A). The node with the smallest cross-validation error was selected according to SVM (Figure 8B), and five genes were output. These five genes, CENPA, KIF2C, TTK, CDC20, and MELK, were consistent with the top 5 genes in the RF ranking, so we selected these five genes for further evaluation. In addition, these five genes were differentially expressed in the merged datasets, and all of them were upregulated.

Figure 7 Identification of overlapping genes. Venn diagram of significant genes screened via WGCNA and PPI network analysis. The five key genes selected for further evaluation are highlighted. PPI, protein-protein interaction; WGCNA, weighted gene co-expression network analysis.

Figure 8 Identification of predictive biomarkers by random forest and support vector machine. (A) Random forest gene importance ranking plot showing the mean decrease Gini index for key genes, including CENPA, KIF2C, TTK, CDC20, and MELK. (B) SVM-RFE cross-validation error plot illustrating the RMSE as a function of the number of variables, with the optimal number of variables (N=5) indicated. RMSE, root mean squared error; SVM-RFE, support vector machine recursive feature elimination.

GSEA of predictive biomarkers

GSEA was performed on each of the five hub genes identified (Figure 9). CENPA was mostly enriched in amino sugar and nucleotide sugar metabolism. KIF2C was mainly enriched in mismatch repair. Apart from these two genes, the other three genes were enriched in the cell cycle, consistent with the enrichment results of the hub genes screened by WGCNA and PPI.

Figure 9 GSEA of 5 key genes. (A) CENPA; (B) KIF2C; (C) TTK; (D) CDC20; (E) MELK. The upper part displays the ES profile, the middle part indicates the positions of gene set members within the ranked gene list, and the lower part shows the distribution of ranked gene values, with the signal-to-noise ratio for each gene represented by a gray area plot. Each plot highlights the top three enriched functions identified in the GSEA. ES, enrichment score; GSEA, gene set enrichment analysis; NP, nominal P value.

Tumor-infiltrating immune cell analysis

TIMER is an online tool that reveals the potential correlation between targeted genes and tumor purity as well as selected immune cell infiltration. Figure 10 shows scatter plots of the correlation between the expression of each gene of interest and the selected tumor-infiltrating immune cells. The first scatter plot in each row shows the correlation between target gene expression and tumor purity. CENPA (r=0.161), KIF2C (r=0.183), TTK (r=0.239), CDC20 (r=0.161), and MELK (r=0.237) all exhibit positive correlation coefficients, with TTK demonstrating the highest. This suggests a potential positive relationship between these genes and tumor purity, and TTK may have the strongest linear relationship with tumor purity among them. Conversely, we observed no or only weak relationships between the targeted genes and immune cell infiltration.

Figure 10 Tumor-infiltrating immune cell analysis of 5 key genes. Association of 5 key genes’ expression with immune infiltration. (A) CENPA; (B) KIF2C; (C) TTK; (D) CDC20; (E) MELK. Each point represents a sample, with the correlation coefficient and P value annotated in the figure. P<0.05 denotes significance. TPM, transcripts per million; SARC, sarcoma.

Bioinformatics analysis and nomogram construction

The relevant genes that interact with these five predictive genes are listed in Table 2, and the connections were identified by GeneMANIA. Functional analysis revealed that these genes were mainly associated with chromosomes (centromeric regions), mitotic nuclear division and nuclear chromosome segregation (Table 3).

Table 2

List of predictive biomarkers and their interactors predicted by GeneMANIA

Predictive biomarkers	CENPA, KIF2C, TTK, CDC20, MELK
Predictive biomarker interactors identified by GeneMANIA	MELK, TTK, CENPA, KIF2C, CDC20, HJURP, BUB1B, MAD2L1, BUB1, NDC80, CDC25B, SPC25, CENPH, CDC27, ANAPC11, KIF18B, TGM1, KIF4A, FOXM1, AURKA, PLK4, CENPE, BIRC5, CENPF, TOP2A

Table 3

Top 10 functions in gene functional analysis by GeneMANIA

Function	FDR	Genes in network	Genes in genome
Chromosome, centromeric region	7.52E−21	14	126
Mitotic nuclear division	2.18E−18	15	270
Nuclear chromosome segregation	2.21E−17	14	236
Chromosome separation	2.51E−17	11	78
Chromosome segregation	2.51E−17	14	248
Sister chromatid segregation	2.51E−17	13	176
Metaphase/anaphase transition of mitotic cell cycle	3.79E−17	10	50
Chromosomal region	4.05E−17	14	266
Regulation of metaphase/anaphase transition of cell cycle	4.05E−17	10	51
Metaphase/anaphase transition of cell cycle	5.03E−17	10	53

FDR, false discovery rate.

We constructed a nomogram based on the five potential predictive biomarkers (Figure 11A). A decision curve analysis (DCA) plot showed that the nomogram produced net benefits (Figure 11B). The calibration curve showed that the diagnostic results of the model were basically consistent with the actual diagnostic results (Figure 11C). The ROC curve demonstrates that our model achieves an AUC of 0.87, with a sensitivity of 0.734 and a specificity of 0.937, indicating its potential for clinical application (Figure 11D).

Figure 11 A nomogram was created and verified based on the 5 potential predictive genes. (A) A nomogram for predicting the risk of LMS generated by integrating these 5 genes. (B) Decision curve analysis curves for the nomogram. (C) Calibration plot for predictive value of the nomogram. (D) ROC curve of the model. AUC, area under the curve; LMS, leiomyosarcoma; ROC, receiver operating characteristic.

Discussion

LMS is a malignant and complex disease, and its pathogenesis is unknown (30). The diagnosis of LMS and early detection of the malignant tendency of LM are difficult because of the similarities in signs and symptoms between LMS and LM (4). Given that uterine fibroids are the most common benign tumors in women, with a high incidence rate that can significantly impact women’s health and quality of life (31), there is a growing interest in developing an early warning system to potentially detect the malignant transformation of uterine fibroids. Furthermore, improving the accuracy of preoperative diagnosis could help guide more appropriate treatment strategies and potentially enhance patient outcomes. Therefore, this study seeks to explore potential biomarkers and diagnostic tools that may contribute to the early identification of malignant transformation and support more precise preoperative assessments.

However, identifying reliable biomarkers for LMS has proven to be a challenging task. Although several studies have employed microarray data to investigate LMS in an effort to identify novel biomarkers and therapeutic targets (32-37), these studies have yet to yield consistent conclusions. For instance, a recent study utilizing histologically confirmed LM and LMS samples for differential exome and transcriptome-wide analyses identified 19 significant DEGs, such as AURKA, SPAG5, NUF2, BUB1B, and KIF14, and further modeled them using machine learning techniques (34). However, other studies have reported different sets of key DEGs, highlighting the variability in findings across research efforts. To the best of our knowledge, our study represents the first attempt to integrate data from two major databases, TCGA and GEO, to systematically explore and validate potential biomarkers for LMS. This approach aims to address the inconsistencies in previous research and provide a more comprehensive understanding of the molecular landscape of LMS.

To analyze the DEGs between LMS and LM patients as comprehensively as possible, we integrated data from both the GEO and TCGA databases. We used WGCNA and PPI to analyze the DEGs between LMS and LM and used machine learning to screen the key genes. Based on these key genes, a nomogram was constructed, which not only analyzed the relationship between LM and LMS at the molecular level but also provided a possible method for predicting LMS. We identified five key genes, namely, CENPA, KIF2C, MELK, TTK, and CDC20. In line with prior research, MELK has been identified as a potential biomarker in LMS (32,38). Maternal embryonic leucine zipper kinase (MELK) regulates the cell cycle and apoptosis. It is associated with higher tumor grade and decreased survival in brain and breast cancers, respectively. In addition, MELK has also been proposed to be associated with cancer progression in endometrial cancer patients (39).

Although the other key genes identified in this study have not been reported in previous LMS studies, they have been studied extensively in other cancers. CENPA has been implicated in hepatocellular carcinoma, glioma, and renal cell carcinoma (40-42). Overexpression of CENPA affects genomic integrity (43), which is consistent with our finding that CENPA expression was upregulated in LMS. In a recent systematic review and meta-analysis, Kreis et al. (44) demonstrated that KIF2C is associated with poorer prognosis across multiple cancer types. Studies of TTK and CDC20 have indicated that these two genes are potential biomarkers for the diagnosis and treatment of cancer (45,46). Consistent with existing research, our results suggest that these genes have potential clinical value as biomarkers for predicting LMS. Previous studies have explored not only the role of these genes in cancer development but also their potential role in cancer treatment (47-51), which illustrates their potential as predictors of cancer development or as therapeutic targets.

To further explore the biological functions of the hub genes, we also examined the relationships between the key DEGs and the tumor immune environment. Tumor immune infiltration analysis revealed no significant correlation between expression of the key genes and immune cell infiltration, whereas expression of these genes was positively correlated with tumor purity. This suggests that the functions of CENPA, MELK, KIF2C, TTK, and CDC20 may not be closely associated with regulation of the tumor immune microenvironment. In the functional enrichment analysis, several cell cycle-related pathways were enriched in the high-expression groups of these key genes, indicating a potential link between these genes and tumor proliferation.
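The kind of comparison described above can be sketched with a rank correlation between gene expression and an infiltration or purity estimate. The example below uses Spearman's coefficient as an assumption, since the statistic is not specified in this section, and all numbers are made-up toy values rather than study data.

```python
import math


def rank(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Toy values for five hypothetical tumor samples (not real data).
gene_expression = [1.2, 2.5, 3.1, 4.8, 5.0]
tumor_purity = [0.55, 0.60, 0.72, 0.80, 0.91]
immune_score = [0.30, 0.10, 0.25, 0.12, 0.28]

print("expression vs purity:", round(spearman(gene_expression, tumor_purity), 3))
print("expression vs immune score:", round(spearman(gene_expression, immune_score), 3))
```

In this toy example, expression rises monotonically with purity, giving a coefficient of 1, while the coefficient with the immune score is close to zero; a real analysis would also report a significance test for each coefficient.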

While our study provides valuable insights into the role of key genes in LMS, it has certain limitations. First, our research did not include experimental validation, which could have further confirmed the biological relevance of our findings. Second, our analysis was based on publicly available datasets from TCGA and GEO, which may carry inherent biases related to sample size, population diversity, or technical variability. Future studies should address these limitations by incorporating experimental validation and additional datasets from diverse populations, and could apply further methods, such as genome-wide association studies (GWAS), RNA sequencing (RNA-seq), or single-cell sequencing, to validate and expand upon our findings. Despite these limitations, we believe that our results are robust and contribute meaningfully to the field, and the study has methodological strengths that offset them. As a multi-cohort integrated study, it makes comprehensive use of the available data; this strategy makes it easier to obtain a larger sample size and to discover new potential biomarkers (9). In addition, we used machine learning to screen and rank biomarkers.

In summary, using analytical tools such as WGCNA and machine learning, we performed a comprehensive bioinformatics analysis of LM and LMS. We identified five key genes, namely, CENPA, MELK, KIF2C, TTK, and CDC20, as candidate biomarkers for predicting LMS. The expression of these genes was upregulated in LMS. Based on the functional enrichment and immune analyses, these genes may be linked to tumor proliferation in LMS. More work is needed to fully uncover their contribution to the pathogenesis of LMS and to validate their usefulness as diagnostic and/or prognostic markers.

Conclusions

In this study, we identified CENPA, MELK, KIF2C, TTK, and CDC20 as key genes significantly associated with LMS, providing new insights into its molecular pathogenesis and potential predictive biomarkers. Through integrated bioinformatics analysis, including WGCNA, PPI networks, and machine learning, we developed a nomogram for LMS prediction, offering a potential tool for clinical application. Functional enrichment analysis revealed a link between these genes and tumor proliferation, while immune analysis suggested their limited role in the tumor immune microenvironment. Despite the lack of experimental validation and potential biases from public datasets, our multi-cohort approach and machine learning strategies strengthen the robustness of our findings. Future studies should focus on experimental validation, therapeutic exploration of these biomarkers, and integration of multi-omics data to further advance the diagnosis and treatment of LMS.

Acknowledgments

We thank the contributors of the GSE764, GSE68312, and GSE64763 datasets and of the TCGA sample data for their contributions.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2024-2465/rc

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2024-2465/prf

Funding: This work was supported by the National Key R&D Program of China (No. 2022YFC2704103).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2024-2465/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Croce S, Devouassoux-Shisheboran M, Pautier P, et al. Uterine sarcomas and rare uterine mesenchymal tumors with malignant potential. Diagnostic guidelines of the French Sarcoma Group and the Rare Gynecological Tumors Group. Gynecol Oncol 2022;167:373-89.
2. Shushkevich A, Thaker PH, Littell RD, et al. State of the science: Uterine sarcomas: From pathology to practice. Gynecol Oncol 2020;159:3-7.
3. Ricci S, Stone RL, Fader AN. Uterine leiomyosarcoma: Epidemiology, contemporary treatment strategies and the impact of uterine morcellation. Gynecol Oncol 2017;145:208-16.
4. Guo J, Zheng J, Tong J. Potential Markers to Differentiate Uterine Leiomyosarcomas from Leiomyomas. Int J Med Sci 2024;21:1227-40.
5. Fischer JV, Mejia-Bautista M, Vadasz B, et al. Uterine Leiomyosarcoma Associated With Leiomyoma With Bizarre Nuclei: Histology and Genomic Analysis of 2 Cases. Int J Gynecol Pathol 2022;41:552-65.
6. Yamamoto A, Tateishi Y, Aikou S, et al. The first case of gastric leiomyosarcoma developed through malignant transformation of leiomyoma. Pathol Int 2021;71:837-43.
7. Ghorbani H, Ranaee M, Vosough Z. Two Rare Cases of Uterine Leiomyosarcomas Originating from Submucosal Leiomyomas Proved by Their Immunohistochemistry Profiles. Int J Fertil Steril 2020;14:256-9.
8. Felicelli C, Lu X, Coty-Fattal Z, et al. Genomic characterization and histologic analysis of uterine leiomyosarcoma arising from leiomyoma with bizarre nuclei. J Pathol 2025;265:211-25.
9. Jiang P, Sinha S, Aldape K, et al. Big data in basic and translational cancer research. Nat Rev Cancer 2022;22:625-39.
10. Greener JG, Kandathil SM, Moffat L, et al. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40-55.
11. Choi RY, Coyner AS, Kalpathy-Cramer J, et al. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol 2020;9:14.
12. Miyata T, Sonoda K, Tomikawa J, et al. Genomic, Epigenomic, and Transcriptomic Profiling towards Identifying Omics Features and Specific Biomarkers That Distinguish Uterine Leiomyosarcoma and Leiomyoma at Molecular Levels. Sarcoma 2015;2015:412068.
13. Barlin JN, Zhou QC, Leitao MM, et al. Molecular subtypes of uterine leiomyosarcoma and correlation with clinical outcome. Neoplasia 2015;17:183-9.
14. Barrett T, Wilhite SE, Ledoux P, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 2013;41:D991-5.
15. Leek JT, Johnson WE, Parker HS, et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28:882-3.
16. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.
17. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004;5:R80.
18. Smyth G. LIMMA: Linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor 2005:397-420.
19. Szklarczyk D, Kirsch R, Koutrouli M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 2023;51:D638-46.
20. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2011;27:431-2.
21. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003;4:2.
22. Chin CH, Chen SH, Wu HH, et al. cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst Biol 2014;8:S11.
23. Shen W, Song Z, Zhong X, et al. Sangerbox: A comprehensive, interaction-friendly clinical bioinformatics analysis platform. Imeta 2022;1:e36.
24. Sherman BT, Hao M, Qiu J, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 2022;50:W216.
25. Kuhn M, Wing J, Weston S, et al. caret: Classification and Regression Training [Internet]. 2023 [cited 2024 May 28]. Available online: https://cran.r-project.org/web/packages/caret/index.html
26. Warde-Farley D, Donaldson SL, Comes O, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 2010;38:W214.
27. Li T, Fan J, Wang B, et al. TIMER: A Web Server for Comprehensive Analysis of Tumor-Infiltrating Immune Cells. Cancer Res 2017;77:e108-10.
28. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005;102:15545-50.
29. Liberzon A, Subramanian A, Pinchback R, et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011;27:1739-40.
30. Chudasama P, Mughal SS, Sanders MA, et al. Integrative genomic and transcriptomic analysis of leiomyosarcoma. Nat Commun 2018;9:144.
31. Lou Z, Huang Y, Li S, et al. Global, regional, and national time trends in incidence, prevalence, years lived with disability for uterine fibroids, 1990-2019: an age-period-cohort analysis for the global burden of disease 2019 study. BMC Public Health 2023;23:916.
32. Adams CL, Dimitrova I, Post MD, et al. Identification of a novel diagnostic gene expression signature to discriminate uterine leiomyoma from leiomyosarcoma. Exp Mol Pathol 2019;110:104284.
33. Zang Y, Gu L, Zhang Y, et al. Identification of key genes and pathways in uterine leiomyosarcoma through bioinformatics analysis. Oncol Lett 2018;15:9361-8.
34. Machado-Lopez A, Alonso R, Lago V, et al. Integrative Genomic and Transcriptomic Profiling Reveals a Differential Molecular Signature in Uterine Leiomyoma versus Leiomyosarcoma. Int J Mol Sci 2022;23:2190.
35. de Almeida TG, Ricci AR, Dos Anjos LG, et al. FOXO3a deregulation in uterine smooth muscle tumors. Clinics (Sao Paulo) 2024;79:100350.
36. Hu X, Zhang H, Zheng X, et al. STMN1 and MKI67 Are Upregulated in Uterine Leiomyosarcoma and Are Potential Biomarkers for its Diagnosis. Med Sci Monit 2020;26:e923749.
37. Zhang Q, Kanis MJ, Ubago J, et al. The selected biomarker analysis in 5 types of uterine smooth muscle tumors. Hum Pathol 2018;76:17-27.
38. Sparić R, Andjić M, Babović I, et al. Molecular Insights in Uterine Leiomyosarcoma: A Systematic Review. Int J Mol Sci 2022;23:9728.
39. McDonald IM, Graves LM. Enigmatic MELK: The controversy surrounding its complex role in cancer. J Biol Chem 2020;295:8195-203.
40. Liao J, Chen Z, Chang R, et al. CENPA functions as a transcriptional regulator to promote hepatocellular carcinoma progression via cooperating with YY1. Int J Biol Sci 2023;19:5218-32.
41. Wang B, Wei W, Long S, et al. CENPA acts as a prognostic factor that relates to immune infiltrates in gliomas. Front Neurol 2022;13:1015221.
42. Wang Q, Xu J, Xiong Z, et al. CENPA promotes clear cell renal cell carcinoma progression and metastasis via Wnt/β-catenin signaling pathway. J Transl Med 2021;19:417.
43. Renaud-Pageot C, Quivy JP, Lochhead M, et al. CENP-A Regulation and Cancer. Front Cell Dev Biol 2022;10:907120.
44. Kreis NN, Moon HH, Wordeman L, et al. KIF2C/MCAK a prognostic biomarker and its oncogenic potential in malignant progression, and prognosis of cancer patients: a systematic review and meta-analysis as biomarker. Crit Rev Clin Lab Sci 2024;61:404-34.
45. Fuentes-Antrás J, Bedard PL, Cescon DW. Seize the engine: Emerging cell cycle targets in breast cancer. Clin Transl Med 2024;14:e1544.
46. Xian F, Zhao C, Huang C, et al. The potential role of CDC20 in tumorigenesis, cancer progression and therapy: A narrative review. Medicine (Baltimore) 2023;102:e35038.
47. Wu G, Fan Z, Li X. CENPA knockdown restrains cell progression and tumor growth in breast cancer by reducing PLA2R1 promoter methylation and modulating PLA2R1/HHEX axis. Cell Mol Life Sci 2024;81:27.
48. Su P, Lu Q, Wang Y, et al. Targeting MELK in tumor cells and tumor microenvironment: from function and mechanism to therapeutic application. Clin Transl Oncol 2025;27:887-900.
49. Zhang P, Gao H, Ye C, et al. Large-Scale Transcriptome Data Analysis Identifies KIF2C as a Potential Therapeutic Target Associated With Immune Infiltration in Prostate Cancer. Front Immunol 2022;13:905259.
50. Bharti V, Kumar A, Wang Y, et al. TTK inhibitor OSU13 promotes immunotherapy responses by activating tumor STING. JCI Insight 2024;9:e177523.
51. Wu F, Wang M, Zhong T, et al. Inhibition of CDC20 potentiates anti-tumor immunity through facilitating GSDME-mediated pyroptosis in prostate cancer. Exp Hematol Oncol 2023;12:67.

Cite this article as: Yang Z, Yang F, Li F, Zheng Y. Integrated analysis of uterine leiomyosarcoma and leiomyoma utilizing TCGA and GEO data: a WGCNA and machine learning approach. Transl Cancer Res 2025;14(5):2999-3016. doi: 10.21037/tcr-2024-2465
