A panel of machine learning approaches for diagnostic model development and validation in ovarian cancer

Shengyan Shen; Lingling Zhang; Qiran Sun; Yating Huang; Ying Yang; Ningning Hu; Susu Jiang; Liwen Zhang; Xiaoqin Wang; Rujun Chen

doi:10.21037/tcr-2025-1-2580

Original Article

Shengyan Shen^1,2#, Lingling Zhang^1,2#, Qiran Sun^1,2, Yating Huang^1,2, Ying Yang^1,2, Ningning Hu^1,2, Susu Jiang^1,2, Liwen Zhang^1,2, Xiaoqin Wang^1,2, Rujun Chen^1,2

¹Department of Gynecology and Obstetrics, Shanghai Fifth People’s Hospital, Fudan University, Shanghai, China; ²Center of Community-Based Health Research, Fudan University, Shanghai, China

Contributions: (I) Conception and design: S Shen, X Wang, R Chen; (II) Administrative support: Liwen Zhang; (III) Provision of study materials or patients: N Hu, Y Yang; (IV) Collection and assembly of data: Lingling Zhang, Q Sun, Y Huang; (V) Data analysis and interpretation: S Shen, Lingling Zhang, R Chen, X Wang, S Jiang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work.

Correspondence to: Rujun Chen, MD; Xiaoqin Wang, MD. Department of Gynecology and Obstetrics, Shanghai Fifth People’s Hospital, Fudan University, Shanghai, China; Center of Community-Based Health Research, Fudan University, 801 Heqing Road, Minhang District, Shanghai 200240, China. Email: chenrujun@fudan.edu.cn; wangxiaoqin@fudan.edu.cn.

Background: Ovarian cancer is a lethal gynecological malignancy, with ~70% of patients diagnosed at advanced stages (5-year survival <30%) due to suboptimal early detection tools. Current modalities [e.g., carbohydrate antigen 125 (CA125), imaging] lack adequate sensitivity and specificity, while existing machine learning (ML)-based diagnostic models are constrained by small sample sizes and insufficient validation. This study aims to develop and validate a robust ML-based diagnostic model for ovarian cancer using large-scale gene expression data, identify core diagnostic biomarkers, and explore their functional and immune-related mechanisms.

Methods: We analyzed five Gene Expression Omnibus (GEO) datasets, with GSE26712, GSE29156, and GSE40595 assigned as the training cohort, and GSE66957 and GSE119054 as independent validation cohorts. Following batch effect correction, 113 ML algorithms were systematically compared. The optimal model was further validated, and its core genes [model gene (Mgenes)] were subjected to functional enrichment and immune cell correlation analyses.

Results: The least absolute shrinkage and selection operator (Lasso) + NaiveBayes model demonstrated the strongest diagnostic performance, with area under the receiver operating characteristic curve (AUC) values of 0.991 in the training set, 0.889 in GSE119054, and 0.936 in GSE66957, and achieved 100% recall in both validation cohorts. Twelve Mgenes were identified, among which CP exhibited the highest diagnostic value (AUC =0.966). Functional enrichment revealed that Mgenes were predominantly involved in cell cycle regulation and DNA replication pathways, and correlation analyses confirmed their associations with key immune subsets (e.g., MAOB with regulatory T cells, STAR with CD8⁺ T cells).

Conclusions: The Lasso + NaiveBayes model enables robust ovarian cancer diagnosis, with high recall prioritizing the identification of all potential cases. The identified Mgenes act as both high-performance diagnostic biomarkers and functional mediators of tumorigenesis, laying a foundation for early detection strategies and mechanistic research into ovarian cancer.

Keywords: Ovarian cancer; machine learning (ML); diagnostic model; tumor immune microenvironment

Submitted Nov 21, 2025. Accepted for publication Jan 13, 2026. Published online Feb 13, 2026.

Highlight box

Key findings

• The least absolute shrinkage and selection operator (Lasso) + NaiveBayes model achieves robust ovarian cancer diagnosis [area under the receiver operating characteristic curve (AUC) =0.991 in training, 0.889–0.936 in validation] with 100% recall; 12 core genes [model gene (Mgenes)] are identified, among which CP has the highest diagnostic value (AUC =0.966).

What is known and what is new?

• Current ovarian cancer detection tools (e.g., carbohydrate antigen 125, imaging) lack sensitivity/specificity; existing machine learning (ML)-based models are limited by small samples and insufficient validation.

• This study systematically compares 113 ML algorithms using five large-scale Gene Expression Omnibus datasets, develops a well-validated Lasso + NaiveBayes model, and uncovers Mgenes’ roles in cell cycle/DNA replication and immune correlations.

What is the implication, and what should change now?

• The model and Mgenes (e.g., CP) offer promising tools for ovarian cancer early detection; further translational research and clinical trials are needed to validate their utility in clinical practice.

Introduction

Ovarian cancer remains one of the most lethal gynecological malignancies worldwide, with an estimated 313,959 new cases and 207,252 deaths reported globally in 2020 (1). Its dismal prognosis is largely attributed to the lack of effective early detection strategies: approximately 70% of patients present with advanced-stage disease [International Federation of Gynecology and Obstetrics (FIGO) stages III–IV] at diagnosis, for whom the 5-year survival rate falls below 30% (2). In contrast, patients diagnosed at early stages (FIGO I–II) exhibit a 5-year survival rate exceeding 90%, underscoring the critical need for precise diagnostic tools to enable timely intervention (3).

Current diagnostic modalities for ovarian cancer face significant limitations. Serum carbohydrate antigen 125 (CA125), the most widely used biomarker, lacks sufficient sensitivity and specificity—its elevation is observed in only 50% of early-stage cases and can be triggered by benign conditions such as endometriosis or pelvic inflammation (4). Imaging techniques, including transvaginal ultrasound and contrast-enhanced computed tomography, often fail to distinguish early ovarian lesions from benign adnexal masses, leading to high rates of unnecessary surgical interventions (5). Moreover, histopathological examination, the gold standard for confirmation, is invasive and reliant on tissue sampling, which may not be feasible in pre-surgical settings.

The emergence of machine learning (ML) has revolutionized medical diagnostics by enabling the extraction of complex patterns from high-dimensional datasets, particularly genomic profiles—a resource abundant in ovarian cancer research due to public repositories like the Gene Expression Omnibus (GEO). In oncology, ML models have demonstrated promising performance in breast cancer screening (6) and lung nodule classification (7), outperforming traditional approaches in some scenarios. For ovarian cancer, preliminary studies have explored ML-based prediction using gene expression data (8) or CA125 combined with clinical variables (9). However, these efforts are limited by small sample sizes, inadequate external validation, and reliance on narrow algorithm panels (typically 5–10 methods), hindering their translation to clinical practice.

A critical gap exists in the systematic evaluation of diverse traditional ML algorithms for leveraging genomic data to develop robust diagnostic models for ovarian cancer. Unlike single-algorithm approaches, a broad panel of traditional ML methods (e.g., tree-based models, regularization-based regression, Bayesian classifiers) can leverage complementary strengths [for example, random forests (RFs) excel at handling non-linear relationships, while the least absolute shrinkage and selection operator (Lasso) regression aids feature selection], thereby enhancing diagnostic accuracy and generalizability without the large sample requirements of deep learning (10). Furthermore, rigorous validation across independent cohorts is essential to ensure model stability, yet few studies have implemented such protocols for genomic data-driven ovarian cancer ML models.

In this study, we aim to address these limitations by: (I) developing a comprehensive panel of traditional ML models using genomic expression data from public GEO cohorts; (II) comparing the diagnostic performance of 113 distinct algorithms, including RFs, gradient-boosted machines, support vector machines, and regularization-based methods; (III) validating the optimal model in two independent external cohorts to assess its generalizability; and (IV) identifying key genes [model gene (Mgenes)] driving the optimal model’s diagnostic decisions, with subsequent analysis of their expression patterns and biological relevance to enhance model interpretability. By establishing a robust genomic data-driven framework for precision diagnosis, this work seeks to improve early detection of ovarian cancer, reduce misdiagnosis rates, and ultimately inform clinical decision-making to optimize patient outcomes. Figure 1 presents an outline of the workflow. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2580/rc).

Figure 1 Research flow chart. GO, Gene Ontology; GSEA, gene set enrichment analysis; GSVA, gene set variation analysis; GTEx, Genotype-Tissue Expression; GWAS, genome-wide association study; InteGenes, Intersection Genes; KEGG, Kyoto Encyclopedia of Genes and Genomes; Lasso, least absolute shrinkage and selection operator; ROC, receiver operating characteristic curve; TCGA-OV, The Cancer Genome Atlas-Ovarian Cancer; WGCNA, weighted gene co-expression network analysis.

Methods

Datasets and data preprocessing

This study included five ovarian cancer datasets from the GEO, each containing both tumor (T) and normal (N) samples: GSE26712 (N=10, T=185), GSE29156 (N=4, T=8), GSE40595 (N=6, T=32), GSE66957 (N=12, T=57), and GSE119054 (N=3, T=6). The first three datasets were merged to form the training set, and the latter two served as the test set. Batch effect correction was applied to the merged training data using the sva package (11), and the batch-corrected expression data were exported. Boxplots and principal component analysis (PCA) plots generated before and after correction (Figure 2) indicate that batch effects among the datasets were largely removed. Furthermore, to evaluate the model’s performance in detecting early-stage ovarian cancer, we incorporated an early-stage ovarian cancer dataset restricted to FIGO stages I and II, specifically The Cancer Genome Atlas-Ovarian Cancer (TCGA-OV) dataset (tumor cohort, n=426), alongside a normal ovarian tissue dataset, the Genotype-Tissue Expression (GTEx) dataset (normal cohort, n=193). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
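To give a rough intuition for what batch adjustment does, the following minimal sketch standardizes one gene within each batch. It is a simplified location-scale adjustment and not the procedure implemented in the sva package used in this study; the function name and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def center_scale_by_batch(values, batches):
    """Standardize one gene's expression values within each batch.

    Simplified stand-in for batch correction (NOT the sva method):
    each batch is shifted to mean 0 and scaled to unit standard deviation.
    """
    adjusted = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_vals = [values[i] for i in idx]
        mu = mean(batch_vals)
        sd = pstdev(batch_vals) or 1.0  # guard against zero variance
        for i in idx:
            adjusted[i] = (values[i] - mu) / sd
    return adjusted


# Hypothetical expression values for one gene measured in two batches
expr = [5.0, 6.0, 7.0, 10.0, 11.0, 12.0]
batch = ["A", "A", "A", "B", "B", "B"]
adjusted = center_scale_by_batch(expr, batch)
```

After adjustment the two batches share the same center and spread. Note that such naive centering would also remove genuine biological differences if tumor and normal samples were unevenly distributed across batches, which is one reason model-based methods are generally preferred.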

Figure 2 Data preparation. (A) Box plots showing expression profiles of each dataset before batch effect correction. (B) Box plots showing expression profiles of each dataset after batch effect correction. (C) PCA plots of each dataset before batch effect correction. (D) PCA plots of each dataset after batch effect correction. PC, principal component; PCA, principal component analysis.

Analysis of differentially expressed genes (DEGs)

DEGs between the tumor and normal groups in the training set were identified using the limma package, with thresholds set at an absolute log fold change (FC) of 1.5 and an adjusted P value of 0.05. A heatmap of these DEGs was plotted using the pheatmap package (12), and a volcano plot was generated using the ggplot2 package (13).
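The filtering step can be sketched as follows. This is an illustration only: it assumes Benjamini-Hochberg adjustment (the default in limma, although the adjustment method is not stated here) and strict inequalities at both thresholds; the gene names and statistics are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_degs(genes, log_fc, pvalues, fc_cutoff=1.5, p_cutoff=0.05):
    """Keep genes with |logFC| above the cutoff and adjusted P below the cutoff."""
    adj = benjamini_hochberg(pvalues)
    return [g for g, fc, p in zip(genes, log_fc, adj)
            if abs(fc) > fc_cutoff and p < p_cutoff]


# Hypothetical per-gene statistics
genes = ["G1", "G2", "G3", "G4"]
log_fc = [2.1, -1.8, 0.4, 1.6]
pvals = [0.001, 0.004, 0.03, 0.20]
degs = filter_degs(genes, log_fc, pvals)
```

In this toy input, G3 fails the fold-change threshold and G4 fails the adjusted P value threshold, so only G1 and G2 are retained.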

Weighted gene co-expression network analysis (WGCNA)

For WGCNA, we incorporated the gene expression data from the integrated training set, which had undergone preprocessing and integration procedures to ensure data consistency and quality. The WGCNA package (14) was employed to perform the comprehensive network analysis. Specifically, sample clustering was first conducted to evaluate the relationships among samples and identify potential outliers, which helps in ensuring the reliability of subsequent network construction. Following sample clustering, module identification was performed based on the co-expression patterns of genes, where genes with similar expression profiles across samples were grouped into distinct modules. Finally, the gene list for each identified module was generated and exported, laying a foundation for further functional enrichment analysis and module-trait association studies.

Functional enrichment analysis

To further explore the biological functions and potential molecular mechanisms of the key genes, we first identified the overlapping genes between the DEGs and the genes from the designated modules obtained from WGCNA. This intersection analysis helps to focus on the most biologically relevant genes that are not only differentially expressed between groups but also exhibit coordinated expression patterns within specific co-expression modules.
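The intersection step amounts to a set operation, as in the minimal sketch below. The gene identifiers are placeholders and not genes reported in this study.

```python
def intersect_gene_lists(deg_list, module_list):
    """Return the sorted genes present in both the DEG list and the module."""
    return sorted(set(deg_list) & set(module_list))


# Placeholder identifiers for illustration only
deg_list = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
module_list = ["GENE_C", "GENE_D", "GENE_E"]
overlap = intersect_gene_lists(deg_list, module_list)
```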

Subsequently, the R package “clusterProfiler” (15) was utilized to perform systematic Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses on these overlapping genes. The GO analysis aimed to annotate the genes into three main categories: biological processes (BPs), cellular components (CCs), and molecular functions (MFs), providing insights into their roles in various biological activities. Meanwhile, the KEGG pathway analysis was conducted to identify the significantly enriched signaling pathways, facilitating the understanding of how these genes collectively participate in specific physiological or pathological processes. These enrichment analyses collectively lay a foundation for deciphering the core biological functions and regulatory networks underlying the studied phenotype.

Development of diagnostic models via 113 distinct ML approaches

A total of 113 distinct ML approaches were employed to develop diagnostic models in this study. A detailed inventory of these ML models, including their key characteristics and theoretical foundations, is presented in Supplementary Material 1 (available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-1.xlsx). The code implementing all 113 ML methods, with their respective parameter configurations and computational workflows, is accessible in Appendix 1 (available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-2.pdf). Each constructed model included at least five variables, ensuring a baseline level of complexity to capture meaningful patterns in the diagnostic context. The model with the highest mean area under the receiver operating characteristic curve (AUC) across the training and validation sets was selected and named “sModel” for subsequent analyses. Receiver operating characteristic (ROC) curves for the training and validation sets were generated using the pROC package (16). The genes included in sModel were extracted and designated as model genes (Mgenes), and volcano plots were subsequently constructed for these Mgenes using the ggplot2 package.
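The selection rule can be illustrated with a minimal sketch that computes the AUC from predicted scores and then picks the model with the highest mean AUC across cohorts. The model names, labels, scores, and AUC values below are hypothetical and are not results from this study.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a tumor sample (label 1) receives a
    higher score than a normal sample (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def select_best_model(auc_table):
    """Return the model name with the highest mean AUC across cohorts."""
    return max(auc_table, key=lambda name: sum(auc_table[name]) / len(auc_table[name]))


# Hypothetical labels and predicted scores for five samples
toy_auc = auc_from_scores([1, 1, 1, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.1])

# Hypothetical AUCs per model: [training, validation 1, validation 2]
auc_table = {
    "model_a": [0.95, 0.80, 0.85],
    "model_b": [0.99, 0.89, 0.93],
}
best = select_best_model(auc_table)
```

Averaging over the training set and the validation sets rewards models that perform consistently, but because the training AUC enters the average, a model that overfits the training data can still rank highly.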

Construction of the confusion matrix

Given that sModel was previously identified as the optimal model—owing to its highest mean AUC performance across both the training and validation sets—it was selected to construct the confusion matrix. This step aims to further quantify the model’s classification performance by detailing true positive, true negative, false positive, and false negative counts, thereby providing a comprehensive breakdown of its predictive accuracy.
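As a minimal sketch of how the confusion matrix counts translate into summary metrics, consider the following example. The labels and predictions are hypothetical (1 = tumor, 0 = normal) and do not reproduce the study data.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = tumor, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def recall(c):
    """Sensitivity: fraction of true tumors that are detected."""
    return c["TP"] / (c["TP"] + c["FN"])


def specificity(c):
    """Fraction of true normals that are correctly classified."""
    return c["TN"] / (c["TN"] + c["FP"])


def precision(c):
    """Fraction of predicted tumors that are true tumors."""
    return c["TP"] / (c["TP"] + c["FP"])


# Hypothetical labels and predictions for seven samples
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
```

In this toy example recall is 100% while specificity is only 2/3, which illustrates why recall should be read alongside the false positive count, particularly when normal samples are few.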

Correlation analysis

Within the integrated gene expression dataset, we performed correlation analysis to investigate the expression relationships among Mgenes, aiming to unravel potential co-expression patterns and interdependencies. To visually illustrate these correlation patterns, we generated comprehensive correlation plots using the PerformanceAnalytics package (16), which effectively captures the strength and direction of associations between each pair of Mgenes.
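A pairwise correlation matrix of this kind can be sketched as below. The sketch assumes the Pearson coefficient (the correlation method is not stated here), and the gene names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(expression):
    """Pairwise Pearson correlations for a dict of gene -> expression vector."""
    names = list(expression)
    return {a: {b: pearson(expression[a], expression[b]) for b in names}
            for a in names}


# Hypothetical expression vectors across four samples
expression = {
    "GENE_X": [1.0, 2.0, 3.0, 4.0],
    "GENE_Y": [2.0, 4.0, 6.0, 8.0],
    "GENE_Z": [8.0, 6.0, 4.0, 2.0],
}
corr = correlation_matrix(expression)
```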

GeneMANIA analysis

To dissect the functional associations and interaction landscapes among Mgenes, we conducted a GeneMANIA (17) analysis via the dedicated online platform (https://genemania.org/). This computational approach integrates diverse biological datasets—encompassing gene co-expression patterns, protein-protein interactions (PPIs), pathway enrichments, genetic interactions, and shared functional annotations—to construct a comprehensive network. The analysis aims to unravel potential functional synergies, regulatory relationships, and interconnected BPs among Mgenes, thereby providing insights into their coordinated roles in the biological system under investigation.

Gene set enrichment analysis (GSEA)

To systematically explore the biological functions and pathway associations of Mgenes, we performed GSEA using the clusterProfiler package—a powerful toolset for functional enrichment studies in R. For this analysis, we utilized the “c2.cp.kegg.Hs.symbols.gmt” reference gene set, which is a curated collection from the Molecular Signatures Database (MSigDB). This specific gene set encompasses canonical pathways (cp) derived from the KEGG, annotated with human (Hs) gene symbols, making it well-suited for identifying enriched signaling and metabolic pathways among Mgenes. By leveraging this pathway-focused reference set, the GSEA not only quantifies the enrichment strength of relevant biological pathways but also reveals potential coordinated mechanisms underlying the functional roles of Mgenes, providing valuable insights into their collective involvement in the biological system under investigation.
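For readers unfamiliar with the statistic underlying GSEA, the sketch below computes the classic weighted running-sum enrichment score for one gene set over a ranked gene list. It omits the permutation-based significance testing and normalization performed by clusterProfiler, and the gene names and ranking statistics are hypothetical.

```python
def enrichment_score(ranked_genes, ranked_stats, gene_set, weight=1.0):
    """Running-sum enrichment score over a list ranked from high to low.

    Hits step up in proportion to |statistic|**weight; misses step down by
    a constant. The score is the maximum deviation of the running sum from 0.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, h in zip(ranked_stats, hits) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for s, h in zip(ranked_stats, hits):
        if h:
            running += abs(s) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list (e.g., ordered by a correlation or fold-change statistic)
ranked_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
ranked_stats = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(ranked_genes, ranked_stats, {"G1", "G2"})
es_bottom = enrichment_score(ranked_genes, ranked_stats, {"G5", "G6"})
```

A gene set concentrated at the top of the ranking yields a positive score, and one concentrated at the bottom yields a negative score.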

Gene set variation analysis (GSVA)

To characterize the pathway activity patterns associated with Mgenes across samples and explore their functional relevance, we performed GSVA using the GSVA package (18) in R. Unlike conventional gene set enrichment methods, which test for enrichment between predefined sample groups, GSVA transforms gene-level expression data into sample-wise gene set enrichment scores, enabling quantitative assessment of pathway variation across individual samples. For this analysis, we utilized the “c2.cp.kegg.Hs.symbols.gmt” reference gene set from the MSigDB, which comprises curated canonical pathways derived from the KEGG and annotated with human gene symbols. By leveraging this pathway-focused resource, our GSVA aimed to quantify the enrichment dynamics of KEGG pathways associated with Mgenes across samples, thereby uncovering potential variations in biological processes linked to Mgenes and providing insights into their context-dependent functional roles in the studied system.

CIBERSORT analysis

To characterize the immune cell composition within the studied samples, we performed CIBERSORT (19) analysis on the integrated gene expression data. CIBERSORT is a widely used computational algorithm that estimates the relative proportions of 22 immune cell types from bulk gene expression profiles by deconvolving the mixed transcriptional signals. For this deconvolution process, the reference set of immune cell expression signatures—comprising canonical transcriptional profiles of distinct immune cell subsets—was adopted as specified in Supplementary Material 2 (available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-3.xlsx). This analysis aimed to quantify the abundance of various immune cell populations (e.g., T cell subsets, B cells, macrophages, and neutrophils) across samples, thereby revealing potential associations between the immune microenvironment and the biological context of interest, as well as providing insights into the interplay between Mgenes and immune regulation.

LinkET analysis

To explore potential associations between the Mgenes and the immune cell landscape, we performed linkET (20) analysis, leveraging the results derived from our prior CIBERSORT analysis. LinkET is a computational approach designed to dissect complex relationships between biological features—here, focusing on quantifying and visualizing the interplay between Mgene expression patterns and the relative proportions of immune cell subsets estimated by CIBERSORT. This analysis aimed to unravel specific correlations (e.g., positive or negative associations) between individual Mgenes and distinct immune cell populations, as well as to identify broader co-regulation patterns. By integrating Mgene expression data with immune cell composition profiles, the linkET analysis sought to illuminate how Mgenes might influence or be influenced by the immune microenvironment, providing a more comprehensive understanding of their functional roles within the biological system under investigation.

Genome-wide association study (GWAS) data curation and Mendelian randomization (MR) analysis

To investigate potential causal relationships between Mgenes and ovarian cancer pathogenesis, we first curated relevant genomic datasets from two authoritative sources: the FinnGen database (21) (https://www.finngen.fi/) and the IEU Open GWAS Project (22) (https://opengwas.io/). Specifically, we retrieved ovarian cancer-associated GWAS datasets—capturing genetic variants linked to ovarian cancer susceptibility across large cohorts—and human gene expression quantitative trait locus (eQTL) datasets, which quantify associations between genetic variants and Mgene expression levels. GWAS datasets included in this study are listed in Table 1.

Table 1

Characteristics of GWAS datasets included in the study

ID	Description	Cases (n)	Controls (n)	Population	URL
finngen_R12_C3_OVARY_ENDOMETROID_EXALLC	Endometroid carcinoma of ovary, excluding all cancers (controls excluding all cancers)	303	222,078	European	https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_C3_OVARY_ENDOMETROID_EXALLC.gz
finngen_R12_C3_OVARY_EXALLC	Malignant neoplasm of ovary, excluding all cancers (controls excluding all cancers)	2,339	222,078	European	https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_C3_OVARY_EXALLC.gz
finngen_R12_C3_OVARY_GRANULOSA_EXALLC	Granulosa cell carcinoma of ovary, excluding all cancers (controls excluding all cancers)	96	222,078	European	https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_C3_OVARY_GRANULOSA_EXALLC.gz
finngen_R12_C3_OVARY_MUCINO_EXALLC	Mucinous and mucinous cystic tumor, excluding all cancers (controls excluding all cancers)	230	222,078	European	https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_C3_OVARY_MUCINO_EXALLC.gz
finngen_R12_C3_OVARY_SEROUS_EXALLC	Serous carcinoma of ovary, excluding all cancers (controls excluding all cancers)	1,135	222,078	European	https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_C3_OVARY_SEROUS_EXALLC.gz
bbj-a-139	Ovarian cancer	720	89,731	East Asian	https://opengwas.io/datasets/bbj-a-139
ebi-a-GCST90018888	Ovarian cancer	1,588	244,932	European	https://opengwas.io/datasets/ebi-a-GCST90018888
ieu-b-4963	Ovarian cancer	1,218	198,523	European	https://opengwas.io/datasets/ieu-b-4963

GWAS, genome-wide association study; URL, Uniform Resource Locator.

Subsequent MR analysis was performed using the TwoSampleMR package in R, with Mgenes defined as exposure factors (i.e., their expression levels) and ovarian cancer incidence as the clinical outcome. To ensure methodological rigor, genetic instrumental variables (IVs) [single nucleotide polymorphisms (SNPs)] were selected from eQTL datasets based on stringent criteria: they were required to show genome-wide significance with Mgene expression (P<5×10⁻⁸) to ensure strong associations, exhibit low linkage disequilibrium (r²<0.001 within a 10,000 kb window) to avoid correlated instruments, and be excluded if located within 500 kb of known ovarian cancer susceptibility loci to minimize pleiotropy. The ovarian cancer GWAS dataset—used to quantify associations with the outcome (Y, ovarian cancer)—included cases and controls, ensuring robust power to detect genetic links to disease susceptibility. Meanwhile, the eQTL dataset—focused on capturing associations between IVs (Z, SNPs) and the exposure (X, Mgene expression)—encompassed individuals, providing reliable estimates of SNP-Mgene expression correlations. Together, these datasets furnished the statistical rigor required to detect modest causal effects in subsequent MR analyses, where Z serves as the genetic proxy for X to infer its causal relationship with Y.
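Two of the three instrument selection criteria can be sketched as a simple filter, shown below. Linkage disequilibrium clumping is deliberately omitted because it requires a reference panel. The function name, field names, SNP identifiers, and positions are all hypothetical.

```python
def select_instruments(snps, known_loci, p_threshold=5e-8, window_bp=500_000):
    """Keep SNPs with genome-wide significant eQTL P values that lie more
    than window_bp away from every known susceptibility locus.

    LD clumping (r2 < 0.001 within 10,000 kb) is NOT performed here.
    """
    selected = []
    for snp in snps:
        if snp["pval"] >= p_threshold:
            continue  # not genome-wide significant for the exposure
        near_known = any(
            snp["chrom"] == chrom and abs(snp["pos"] - pos) <= window_bp
            for chrom, pos in known_loci
        )
        if not near_known:
            selected.append(snp["id"])
    return selected


# Hypothetical eQTL records and one hypothetical known susceptibility locus
snps = [
    {"id": "snp_1", "chrom": "1", "pos": 1_000_000, "pval": 1e-10},
    {"id": "snp_2", "chrom": "1", "pos": 5_200_000, "pval": 2e-9},
    {"id": "snp_3", "chrom": "2", "pos": 3_000_000, "pval": 1e-5},
]
known_loci = [("1", 5_000_000)]
instruments = select_instruments(snps, known_loci)
```

Here snp_2 is excluded because it lies within 500 kb of the known locus, and snp_3 is excluded because it does not reach genome-wide significance.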

Multiple MR analytical approaches were employed for cross-validation: inverse variance weighted (IVW) as the primary method to estimate overall causal effects using all valid IVs; MR-Egger regression to account for potential horizontal pleiotropy and provide a pleiotropy-adjusted effect estimate; weighted median estimator, robust to up to 50% invalid IVs; and weighted mode estimator, which prioritizes IVs with consistent effect directions. Heterogeneity among IVs was assessed using Cochran’s Q statistic (significance threshold P<0.05), with the I² statistic quantifying the proportion of variance attributable to heterogeneity. For pleiotropy testing, the MR-Egger intercept test (significance threshold P<0.05) was applied to detect directional horizontal pleiotropy, and the MR Pleiotropy RESidual Sum and Outlier (MR-PRESSO) method was used to identify and correct for outlier SNPs introducing pleiotropic bias. A leave-one-out sensitivity analysis was also performed to assess whether individual SNPs unduly influenced the overall effect estimate, ensuring result robustness.
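The primary estimator and the heterogeneity statistics can be illustrated from summary statistics as follows. This is a minimal fixed-effect IVW sketch with first-order weights; the TwoSampleMR implementation may differ in detail (for example, in how standard errors are handled), and the input values are hypothetical.

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effect inverse variance weighted MR estimate.

    Each SNP gives a Wald ratio beta_out / beta_exp, weighted by
    beta_exp**2 / se_out**2. Returns (estimate, standard error,
    Cochran's Q, I-squared).
    """
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_exp, se_out)]
    total_w = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total_w
    se = (1.0 / total_w) ** 0.5
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return beta, se, q, i2


# Hypothetical SNP-exposure effects, SNP-outcome effects, and outcome standard errors
beta_exp = [0.2, 0.4, 0.5]
beta_out = [0.1, 0.2, 0.25]
se_out = [0.05, 0.05, 0.05]
beta_ivw, se_ivw, q_stat, i2 = ivw_estimate(beta_exp, beta_out, se_out)
```

Because every toy SNP implies the same Wald ratio, the estimate equals that ratio and both Q and I² are zero; disagreement among the per-SNP ratios would inflate Q and signal heterogeneity.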

This comprehensive framework aimed to rigorously evaluate whether genetically predicted Mgene expression levels exert causal effects on ovarian cancer risk, leveraging two-sample MR to minimize confounding and reverse causation biases inherent in observational studies.

Statistical analysis

All statistical analyses were performed using R software (version 4.3.2). Statistical significance was defined as P<0.05. Unless otherwise stated, figures (including scatter plots, heatmaps, and volcano plots) were generated using the ggplot2 package.

Results

Identification of DEGs

In this study, three datasets were included as training sets, namely GSE26712, GSE29156, and GSE40595, comprising a total of 225 ovarian cancer samples and 20 normal control samples. Additionally, two datasets were incorporated as testing sets (GSE66957 and GSE119054), consisting of 63 ovarian cancer samples and 15 normal control samples. Using the limma package, we identified DEGs between the tumor group and the control group, resulting in 251 DEGs, which are detailed in Supplementary Material 3 (available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-4.xlsx).
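The cohort totals above follow directly from the per-dataset counts given in the Methods, as the short tally below confirms. The dictionary layout and function name are illustrative choices.

```python
# Per-dataset sample counts as reported in the Methods
cohorts = {
    "GSE26712": {"normal": 10, "tumor": 185, "role": "train"},
    "GSE29156": {"normal": 4, "tumor": 8, "role": "train"},
    "GSE40595": {"normal": 6, "tumor": 32, "role": "train"},
    "GSE66957": {"normal": 12, "tumor": 57, "role": "test"},
    "GSE119054": {"normal": 3, "tumor": 6, "role": "test"},
}


def tally(role):
    """Sum tumor and normal samples over all datasets with the given role."""
    selected = [c for c in cohorts.values() if c["role"] == role]
    return {
        "tumor": sum(c["tumor"] for c in selected),
        "normal": sum(c["normal"] for c in selected),
    }


train = tally("train")
test = tally("test")
# Share of tumor samples in the training set (class imbalance)
train_tumor_fraction = train["tumor"] / (train["tumor"] + train["normal"])
```

The tally also makes the class imbalance explicit: roughly 92% of training samples are tumors, which is worth keeping in mind when interpreting AUC and recall.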

WGCNA for identifying key gene modules

To identify co-expression modules associated with ovarian cancer, we performed WGCNA on the training set data, which yielded 8 distinct gene modules, designated by color-based labels as follows: MEblack, MEblue, MEbrown, MEgrey, MEpink, MEred, MEturquoise, and MEyellow (Figure 3A,3B). Among these modules, 4 exhibited significant associations with the ovarian cancer phenotype. Specifically, one module (MEred) showed a significant positive correlation with the ovarian cancer phenotype (cor =0.29, P=4e−6), indicating that the collective expression pattern of genes within MEred may be upregulated in association with ovarian cancer development. In contrast, three modules displayed significant negative correlations: MEpink (cor =−0.19, P=0.003), MEbrown (cor =−0.48, P=2e−15), and MEyellow (cor =−0.30, P=1e−6), suggesting their gene expression profiles may be downregulated in ovarian cancer contexts (Figure 3C).

Figure 3 WGCNA analysis. (A) Gene dendrogram with corresponding module colors. (B) Distribution of GS across different modules. (C) Module-trait relationship heatmap. Blue indicates a negative correlation, while red indicates a positive correlation. (D) Scatter plot of MM (MEred) vs. GS. cor, correlation; GS, gene significance; MM, module membership; WGCNA, weighted gene co-expression network analysis.

Given its significant positive correlation with the ovarian cancer phenotype, MEred was selected for further analysis. Figure 3D presents a scatter plot illustrating the correlation between module membership (MM) and gene significance (GS) within MEred. MM quantifies the importance of individual genes within the MEred module, while GS reflects the strength of association between each gene and the ovarian cancer phenotype. The results revealed a significant positive correlation between GS and MM (cor =0.26, P=6.7e−6), indicating that genes highly associated with the ovarian cancer phenotype (high GS) are also among the most central and influential genes within the MEred module (high MM). This finding supports the biological relevance of MEred, as it highlights that the module’s core genes are closely linked to the disease phenotype, reinforcing MEred as a key module for further investigation into ovarian cancer mechanisms.

Identification of intersection genes and functional enrichment analysis

To further refine candidate genes with both differential expression patterns and close association with the ovarian cancer-related MEred module, we computed the intersection between DEGs and the 292 genes within the MEred module (detailed in Supplementary Material 4, available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-5.xlsx). This analysis yielded 21 overlapping genes, which were designated as Intersection Genes (InteGenes) (Figure 4A). These InteGenes represent a prioritized set of candidates, as they simultaneously exhibit differential expression between tumor and control groups and play central roles in the MEred module—strengthening their potential relevance to ovarian cancer pathogenesis.

Figure 4 InteGenes functional enrichment analysis. (A) Venn diagram showing overlapping genes between WGCNA MEred module genes and DEGs. (B) GO enrichment analysis based on InteGenes. (C) KEGG enrichment analysis based on InteGenes. BP, biological process; CC, cellular component; DEG, differentially expressed gene; GO, Gene Ontology; InteGenes, Intersection Genes; KEGG, Kyoto Encyclopedia of Genes and Genomes; MF, molecular function; TCA, tricarboxylic acid; WGCNA, weighted gene co-expression network analysis.

Subsequently, we performed GO and KEGG enrichment analyses to explore the biological functions and pathways associated with InteGenes. For GO enrichment, results were categorized into three domains: BP, CC, and MF. In the BP domain, the top 3 enriched terms were “mitotic spindle assembly checkpoint signaling”, “spindle assembly checkpoint signaling”, and “mitotic spindle checkpoint signaling”, highlighting a prominent role in regulating mitotic checkpoints—critical for maintaining genomic stability during cell division. Within the CC domain, the top 3 terms included “cullin-RING ubiquitin ligase complex”, “lateral plasma membrane”, and “ubiquitin ligase complex”, pointing to involvement in ubiquitin-mediated protein degradation and membrane-associated processes. For the MF domain, the top 3 enriched terms were “NAD binding”, “flavin adenine dinucleotide binding”, and “ubiquitin binding”, indicating associations with coenzyme binding and ubiquitin-dependent molecular interactions (Figure 4B).

In parallel, KEGG pathway enrichment analysis identified 10 primary pathways associated with InteGenes: “drug metabolism-cytochrome P450”, “tyrosine metabolism”, “cell cycle”, “tryptophan metabolism”, “virion-hepatitis viruses”, “glutathione metabolism”, “leukocyte transendothelial migration”, “phenylalanine metabolism”, “Wnt signaling pathway”, and “histidine metabolism”. These pathways collectively implicate InteGenes in processes ranging from metabolic regulation and cell cycle control to immune cell trafficking and signaling, all of which are known to be dysregulated in ovarian cancer (Figure 4C).

Together, these enrichment results provide insights into the potential biological roles of InteGenes, linking them to key cellular and molecular mechanisms underlying ovarian cancer development.
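Over-representation analyses of this kind are commonly based on a hypergeometric test. The sketch below assumes that test (the text here does not state which statistic the enrichment software used), and every number in the usage example except the 21 InteGenes is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe  : number of background genes
    annotated : background genes annotated to the term
    selected  : size of the query gene list
    overlap   : query genes annotated to the term
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical illustration: 21 query genes, 3 of which fall in a term that
# annotates 100 of 20,000 background genes (both figures are assumptions).
p_value = hypergeom_enrichment_p(20000, 100, 21, 3)
print(p_value)
```

In practice the resulting p-values would still need multiple-testing correction across all tested terms.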

Construction and validation of ML models

We constructed ovarian cancer diagnostic models using 113 distinct ML methods; the gene parameters corresponding to each model are detailed in Supplementary Material 5 (available at https://cdn.amegroups.cn/static/public/tcr-2025-1-2580-6.xlsx). These models were subsequently validated in two independent validation sets (GSE119054 and GSE66957), and the AUC values with 95% confidence intervals (CIs) [calculated via the DeLong method with stratified bootstrap resampling (1,000 iterations)] were computed for each, as presented in Figure 5A. The average AUC of each model was computed as the mean of its AUC values in the training set, GSE119054, and GSE66957. When ranked in descending order of average AUC, the top 10 models were RF, Lasso regression combined with Gradient Boosting Machine (Lasso + GBM), GBM, glmBoost integrated with plsRglm (glmBoost + plsRglm), Lasso regression (Lasso), Lasso regression combined with Naive Bayes (Lasso + NaiveBayes), Elastic Net with α=0.8 [Enet (alpha =0.8)], RF combined with Naive Bayes (RF + NaiveBayes), glmBoost integrated with Stepglm (forward selection) [glmBoost + Stepglm (forward)], and Lasso regression combined with Extreme Gradient Boosting (Lasso + XGBoost), with average AUC values of 0.943, 0.942, 0.942, 0.939, 0.939, 0.938, 0.938, 0.937, 0.937, and 0.937, respectively.
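The ranking rule (mean AUC over the training set and the two validation sets) can be illustrated as follows. The Lasso + NaiveBayes values are the three AUCs reported in this article; `model_A` and `model_B` are hypothetical entries added only to make the ranking visible, and `rank_by_mean_auc` is an illustrative helper, not the authors' code.

```python
def rank_by_mean_auc(auc_table):
    """Rank models by their mean AUC across cohorts, highest first.

    auc_table maps a model name to a list of AUC values, one per cohort.
    """
    means = {name: sum(aucs) / len(aucs) for name, aucs in auc_table.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


auc_table = {
    # Training set, GSE119054, GSE66957 (values reported in the article).
    "Lasso + NaiveBayes": [0.991, 0.889, 0.936],
    # Hypothetical models for illustration only.
    "model_A": [0.95, 0.95, 0.95],
    "model_B": [0.90, 0.80, 0.85],
}

ranking = rank_by_mean_auc(auc_table)
for name, mean_auc in ranking:
    print(f"{name}: {mean_auc:.4f}")
```

With the rounded per-cohort values, the Lasso + NaiveBayes mean is about 0.9387, which agrees with the reported 0.938 up to rounding of the inputs.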

Figure 5 Construction and validation of diagnostic models via multiple ML methods. (A) Heatmap of AUC values for 113 ML methods in the training set and test set. (B) ROC curve for the training set based on the Lasso + NaiveBayes model. (C) ROC curve for GSE66957 based on the Lasso + NaiveBayes model. (D) ROC curve for GSE119054 based on the Lasso + NaiveBayes model. (E) ROC curve for the early-stage ovarian cancer dataset (FIGO stages I and II, TCGA-OV and GTEx-Ovary) based on the Lasso + NaiveBayes model. (F) Confusion matrix plot for the training set. (G) Confusion matrix plot for GSE66957. (H) Confusion matrix plot for GSE119054. AUC, area under the curve; CI, confidence interval; FIGO, International Federation of Gynecology and Obstetrics; GTEx, Genotype-Tissue Expression; Lasso, least absolute shrinkage and selection operator; ML, machine learning; ROC, receiver operating characteristic curve; TCGA-OV, The Cancer Genome Atlas-Ovarian Cancer.

To further evaluate model performance, we selected the top 10 models with the highest average AUC values for confusion matrix analysis, and based on this comprehensive assessment, the Lasso + NaiveBayes model was ultimately chosen for subsequent analyses. This model exhibited AUC values (95% CI) of 0.991 (0.980-0.998) in the training set, 0.889 (0.667-1.000) in GSE119054, and 0.936 (0.831-0.999) in GSE66957 (Figure 5B-5D). To assess the model's diagnostic performance in early-stage ovarian cancer, we additionally evaluated it on a dedicated dataset consisting of FIGO stage I and II samples (TCGA-OV and GTEx-Ovary), in which it achieved a considerably lower AUC of 0.658 (95% CI: 0.620–0.695) (Figure 5E).
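A percentile confidence interval from stratified bootstrap resampling can be sketched as below. This is a simplified illustration of the resampling part only: it does not implement the DeLong variance estimate, the scores are toy values, and the function names are hypothetical.

```python
import random


def auc_score(pos_scores, neg_scores):
    """Mann-Whitney AUC estimate: P(pos > neg), with ties counted as 0.5."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def stratified_bootstrap_ci(pos_scores, neg_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the AUC, resampling cases and controls separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos_scores, k=len(pos_scores))
        boot_neg = rng.choices(neg_scores, k=len(neg_scores))
        stats.append(auc_score(boot_pos, boot_neg))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc_score(pos_scores, neg_scores), lower, upper


# Toy predicted probabilities (hypothetical, not study data).
tumor_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55]
normal_scores = [0.70, 0.40, 0.30, 0.20]

auc, ci_low, ci_high = stratified_bootstrap_ci(tumor_scores, normal_scores)
print(auc, ci_low, ci_high)
```

Resampling tumor and normal samples separately keeps the class ratio fixed in every replicate, which matters when one class is very small, as with the three normal controls in GSE119054.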

Confusion matrices for the Lasso + NaiveBayes model yielded the following results. In the training set, the model achieved a recall of 95.1% [214/(214+11)], a precision of approximately 99.1% [214/(214+2)], and a specificity of 90% [18/(18+2)], so all three metrics were high (Figure 5F). In the GSE66957 dataset, it achieved a recall of 100% [57/(57+0)], a precision of approximately 91.9% [57/(57+5)], and a specificity of approximately 58.3% [7/(7+5)] (Figure 5G). Its primary advantage in this dataset was exceptional recall (capturing all actual positive cases) coupled with relatively high precision (most predicted positives were true positives); its limitation was insufficient specificity, with approximately 41.7% of actual negatives misclassified as positive. In the GSE119054 dataset, the model showed a recall of 100% [6/(6+0)], a precision of 75% [6/(6+2)], and a specificity of approximately 33.3% [1/(1+2)] (Figure 5H). It again captured all actual positive cases but suffered from inadequate precision (25% of predicted positives were false) and low specificity (approximately 66.7% of actual negatives were misclassified as positive). The comprehensive diagnostic performance metrics (including AUC with 95% CIs) of the Lasso + NaiveBayes model are presented in Table S1, and the calibration curve evaluating the consistency between predicted probabilities and actual cancer rates is shown in Figure S1.
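The recall, precision, and specificity values follow directly from the confusion matrix counts quoted in the bracketed fractions. A minimal check, using those counts and an illustrative helper named `diagnostic_metrics`:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Compute recall, precision, and specificity from confusion matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


# Counts taken from the bracketed fractions reported in the text.
cohorts = {
    "training": {"tp": 214, "fn": 11, "fp": 2, "tn": 18},
    "GSE66957": {"tp": 57, "fn": 0, "fp": 5, "tn": 7},
    "GSE119054": {"tp": 6, "fn": 0, "fp": 2, "tn": 1},
}

results = {name: diagnostic_metrics(**counts) for name, counts in cohorts.items()}
for name, metrics in results.items():
    summary = ", ".join(f"{key}={value:.1%}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Note how few negatives drive the validation specificities: a single additional correctly classified control in GSE119054 would move specificity from 33.3% to 66.7%.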

Expression patterns of Mgenes in ovarian cancer

A total of 12 genes, namely GSTP1, PEG3, MAOB, CP, SOX17, CXCR4, PRSS8, EPCAM, TRIP13, CLDN15, STAR, and MPZL2, were incorporated into the Lasso + NaiveBayes model for subsequent analytical processes. Among these 12 candidate genes, CLDN15, PEG3, MAOB, and STAR exhibited a statistically significant downregulated expression pattern in the ovarian cancer tissue group when compared with the non-tumor control group. In contrast, the remaining eight genes, including GSTP1, CP, SOX17, CXCR4, PRSS8, EPCAM, TRIP13, and MPZL2, showed a marked upregulation in their expression levels in the tumor group (as illustrated in Figure 6A,6B).

Figure 6 Expression patterns and expression correlation analysis of Mgenes. (A) Volcano plot of differential expression for Mgenes. (B) Expression differences of Mgenes between the tumor group and normal group. (C) Expression correlation plot among Mgenes. (D) ROC curves for single Mgenes. *, P<0.05; **, P<0.01; ***, P<0.001. AUC, area under the curve; FC, fold change; ROC, receiver operating characteristic curve.

To further explore the potential regulatory relationships and co-expression characteristics among the Mgenes, a correlation analysis of their expression levels was conducted. The results of this analysis revealed distinct and meaningful correlations between the expression profiles of the Mgenes, suggesting the existence of potential synergistic or antagonistic regulatory networks among these genes in the context of ovarian cancer progression (Figure 6C).

Subsequently, to evaluate the diagnostic value of each individual Mgene in distinguishing ovarian cancer tissues from non-tumor tissues, ROC curves were generated for each gene, and the corresponding AUC values were calculated (Figure 6D). The AUC values, which reflect the diagnostic accuracy of each gene, were as follows: GSTP1 (AUC =0.944), PEG3 (AUC =0.914), MAOB (AUC =0.905), CP (AUC =0.966), SOX17 (AUC =0.927), CXCR4 (AUC =0.818), PRSS8 (AUC =0.925), EPCAM (AUC =0.858), TRIP13 (AUC =0.923), CLDN15 (AUC =0.941), STAR (AUC =0.871), and MPZL2 (AUC =0.910). Notably, CP demonstrated the highest diagnostic efficacy (AUC =0.966), while CXCR4 showed the lowest, but still clinically meaningful, diagnostic accuracy (AUC =0.818) among the 12 Mgenes. These findings collectively indicate that the Mgenes hold considerable potential as diagnostic biomarkers for ovarian cancer.
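The single-gene AUC values listed above can be tabulated to identify the best and worst performers and to count genes above a chosen threshold. The 0.85 cut-off is used here only because it appears later in the Discussion; the variable names are illustrative.

```python
# Single-gene AUC values as reported in the text (Figure 6D).
single_gene_auc = {
    "GSTP1": 0.944, "PEG3": 0.914, "MAOB": 0.905, "CP": 0.966,
    "SOX17": 0.927, "CXCR4": 0.818, "PRSS8": 0.925, "EPCAM": 0.858,
    "TRIP13": 0.923, "CLDN15": 0.941, "STAR": 0.871, "MPZL2": 0.910,
}

best_gene = max(single_gene_auc, key=single_gene_auc.get)
worst_gene = min(single_gene_auc, key=single_gene_auc.get)
above_threshold = sorted(g for g, a in single_gene_auc.items() if a > 0.85)

print(best_gene, worst_gene, len(above_threshold))
```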

Functional enrichment analysis based on Mgenes

To investigate the potential molecular interaction mechanisms and functional relevance of Mgenes in ovarian cancer, we first performed a PPI network analysis using the GeneMANIA bioinformatics tool (Figure 7A). This analysis identified 20 genes that exhibit significant co-expression correlations with Mgenes, including ST14, F11R, IL13RA2, SPINT1, SPINT2, RBM46, CLDN7, MST1R, FGB, MMD, MPPED2, RPL39L, TMEM131, KLK11, ARHGEF5, FGG, ST6GAL1, DUSP2, PUM3, and METAP2. Notably, these co-expressed genes are known to play crucial roles in BPs such as cell adhesion, signal transduction, and protease activity—pathways closely associated with tumor cell proliferation, invasion, and metastasis in ovarian cancer.

Figure 7 Protein-protein interaction network and GSEA analysis of Mgenes. (A) Protein-protein interaction network diagram of Mgenes. (B) GSEA plot for CLDN15. (C) GSEA plot for CP. (D) GSEA plot for CXCR4. (E) GSEA plot for EPCAM. (F) GSEA plot for GSTP1. (G) GSEA plot for MAOB. (H) GSEA plot for MPZL2. (I) GSEA plot for PEG3. (J) GSEA plot for PRSS8. (K) GSEA plot for SOX17. (L) GSEA plot for STAR. (M) GSEA plot for TRIP13. GSEA, gene set enrichment analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Subsequently, to further delineate the functional pathways enriched by individual Mgenes, we conducted GSEA for each of the 12 Mgenes separately. The results of this analysis are presented in Figure 7B-7M (with each subfigure corresponding to one Mgene), and the enriched KEGG pathways were primarily concentrated in four key functional categories. Firstly, KEGG_CELL_CYCLE highlights the potential involvement of Mgenes in regulating cell cycle progression—a core process driving uncontrolled tumor growth; secondly, KEGG_DNA_REPLICATION suggests that Mgenes may modulate DNA replication fidelity, which is critical for maintaining genomic stability in cancer cells; additionally, KEGG_FOCAL_ADHESION indicates associations between Mgenes and focal adhesion signaling, a pathway that mediates tumor cell migration and extracellular matrix interaction; furthermore, KEGG_DRUG_METABOLISM implies that Mgenes could influence drug metabolism processes, potentially affecting the response of ovarian cancer cells to chemotherapeutic agents.

To validate the robustness of these enrichment results, we additionally performed GSVA on the Mgene set. As shown in Figure 8, the GSVA results were highly consistent with those obtained from GSEA—reinforcing the reliability of our findings and confirming that Mgenes are indeed functionally enriched in the aforementioned cancer-relevant pathways.

Figure 8 GSVA analysis based on Mgenes. (A) GSVA plot for CLDN15. (B) GSVA plot for CP. (C) GSVA plot for CXCR4. (D) GSVA plot for EPCAM. (E) GSVA plot for GSTP1. (F) GSVA plot for MAOB. (G) GSVA plot for MPZL2. (H) GSVA plot for PEG3. (I) GSVA plot for PRSS8. (J) GSVA plot for SOX17. (K) GSVA plot for STAR. (L) GSVA plot for TRIP13. GSVA, gene set variation analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Correlation analysis of Mgenes with various immune cells in the tumor immune microenvironment

To explore the potential crosstalk between Mgenes and the immune regulatory network in the ovarian cancer microenvironment—an essential aspect for understanding tumor-immune interactions and developing immunotherapeutic strategies—we first investigated the proportion of various immune cell subsets in the tumor group and the normal control group (Figure 9A,9B). The results revealed distinct differences in immune cell distribution between the two groups: specifically, plasma cells, follicular helper T cells, M0 macrophages, activated dendritic cells, and activated mast cells exhibited a significantly increased proportion in the tumor group compared to the normal group, which may reflect the tumor-induced activation or recruitment of these immune cell populations. In contrast, resting CD4 memory T cells, monocytes, M2 macrophages, resting dendritic cells, and neutrophils showed a notably decreased proportion in the tumor group, potentially indicating impaired immune surveillance or a shift toward an immunosuppressive microenvironment. Additionally, the analysis uncovered heterogeneity in the proportion pattern of immune cells within the tumor group itself, suggesting that the immune microenvironment may vary across different regions or pathological stages of ovarian cancer lesions, which could influence local tumor progression and treatment response.

Figure 9 Proportion and correlation analysis of immune cells in the tumor immune microenvironment. (A) Proportions of various immune cells in the tumor immune microenvironment between the tumor group and normal group. (B) Box plots comparing differences in immune cell proportions between the tumor group and normal group. (C) Correlation analysis among various immune cells. *, P<0.05; **, P<0.01; ***, P<0.001.

Furthermore, to characterize the interplay between immune cell populations in the tumor microenvironment, we analyzed the correlations between different immune cell subsets. The results demonstrated considerable differences in the correlation strengths and directions between various immune cell types (Figure 9C)—for instance, some cell subsets showed positive co-occurrence patterns (implying potential synergistic effects) while others exhibited negative associations (suggesting mutual regulatory inhibition). This observation further highlights the complexity of the immune cell interaction network in the ovarian cancer microenvironment, laying a critical foundation for subsequent analyses of the correlation between Mgenes and these immune cell subsets.

Following the characterization of immune cell distribution and intercellular correlations, we further analyzed the correlations between individual Mgenes and each immune cell subset to decipher the potential regulatory roles of Mgenes in shaping the tumor immune microenvironment. As presented in Figures 10,11, the correlation profiles varied substantially across different Mgenes. CLDN15 exhibited significant positive correlations with resting NK cells, activated mast cells, naive B cells, regulatory T cells (Tregs), resting CD4 memory T cells, monocytes, activated CD4 memory T cells, and resting dendritic cells—suggesting it may promote the accumulation or functional maintenance of these immune subsets. Conversely, it showed significant negative correlations with resting mast cells, M1 macrophages, plasma cells, M0 macrophages, follicular helper T cells, and activated NK cells, indicating potential inhibitory effects on these cell populations.

Figure 10 Correlation analysis between Mgenes and various immune cells. (A) Lollipop plot for CLDN15. (B) Lollipop plot for CP. (C) Lollipop plot for CXCR4. (D) Lollipop plot for EPCAM. (E) Lollipop plot for GSTP1. (F) Lollipop plot for MAOB. (G) Lollipop plot for MPZL2. (H) Lollipop plot for PEG3. (I) Lollipop plot for PRSS8. (J) Lollipop plot for SOX17. (K) Lollipop plot for STAR. (L) Lollipop plot for TRIP13. Red P values indicate statistical significance. abs (cor), absolute value of the correlation coefficient.

Figure 11 Panoramic correlation heatmap of Mgenes and various immune cells in the tumor immune microenvironment. abs (cor), absolute value of the correlation coefficient.

CP displayed a specific significant positive correlation with M1 macrophages, implying a potential role in regulating the polarization or function of pro-inflammatory macrophages. Meanwhile, CXCR4 showed significant positive correlations with M0 macrophages, follicular helper T cells, activated NK cells, plasma cells, and M1 macrophages, while exhibiting significant negative correlations with resting CD4 memory T cells, activated dendritic cells, activated CD4 memory T cells, Tregs, and resting NK cells—suggesting it may modulate the balance between pro-tumor and anti-tumor immune cells.

EPCAM presented a significant negative correlation exclusively with CD8⁺ T cells, hinting at a potential role in suppressing the cytotoxic activity of this key anti-tumor immune subset. In contrast, GSTP1 demonstrated a significant positive correlation with M0 macrophages, indicating it may be involved in the regulation of macrophage activation status.

MAOB exhibited a more complex correlation pattern: it showed significant positive correlations with resting dendritic cells, monocytes, Tregs, naive B cells, resting CD4 memory T cells, and M2 macrophages, while displaying significant negative correlations with memory B cells, activated NK cells, M1 macrophages, and M0 macrophages. This profile reflects its potential role in shaping an immunosuppressive microenvironment via promoting Tregs and M2 macrophages.

MPZL2 showed significant positive correlations with activated dendritic cells, follicular helper T cells, gamma-delta T cells, and resting mast cells, alongside a significant negative correlation with monocytes—suggesting it may influence immune cell activation and recruitment. PEG3, on the other hand, displayed significant positive correlations with resting dendritic cells, naive B cells, and activated dendritic cells, while exhibiting significant negative correlations with plasma cells, memory B cells, and M0 macrophages—implying it may regulate B cell differentiation and macrophage function.

PRSS8 demonstrated significant positive correlations with M0 macrophages and memory B cells, and significant negative correlations with resting CD4 memory T cells, neutrophils, resting dendritic cells, and naive B cells—indicating potential involvement in modulating both myeloid and lymphoid cell populations. SOX17 showed a similar dual-correlation trend: it had significant positive correlations with M0 macrophages and memory B cells, but significant negative correlations with neutrophils, CD8⁺ T cells, and activated CD4 memory T cells—suggesting it may suppress anti-tumor immune responses by inhibiting cytotoxic T cells and neutrophils.

Finally, STAR exhibited significant positive correlations with CD8⁺ T cells and resting dendritic cells, alongside a significant negative correlation with M0 macrophages—hinting at a potential role in enhancing anti-tumor immunity via promoting CD8⁺ T cell accumulation. TRIP13 displayed a specific significant positive correlation with activated dendritic cells, implying it may regulate dendritic cell activation and subsequent T cell priming.
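Gene-to-immune-cell associations of this kind are typically rank correlations between a gene's expression and the estimated fraction of a cell type across samples. The sketch below assumes Spearman correlation (the statistic is not specified in this section) and uses toy values; all names and numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: expression of one gene and the fraction of one immune cell type
# across six samples (hypothetical values).
expression = [2.1, 3.4, 1.2, 5.0, 4.4, 2.8]
cell_fraction = [0.12, 0.10, 0.20, 0.03, 0.05, 0.15]

rho = spearman(expression, cell_fraction)
print(round(rho, 3))
```

A correlation of this type describes co-variation across samples only; it does not by itself show that the gene promotes or inhibits the cell population.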

Investigation of Mgenes-ovarian cancer genetic causality via MR

To investigate whether there is a genetic causal association between Mgenes and the pathogenesis of ovarian cancer, we conducted an MR analysis leveraging GWAS data and eQTL data.

For this study, we integrated ovarian cancer-related datasets from two authoritative databases to ensure the robustness and generalizability of our findings. Specifically, the included datasets were from the FinnGen database, encompassing five distinct ovarian cancer subtypes: finngen_R12_C3_OVARY_ENDOMETROID_EXALLC, finngen_R12_C3_OVARY_EXALLC, finngen_R12_C3_OVARY_GRANULOSA_EXALLC, finngen_R12_C3_OVARY_MUCINO_EXALLC, and finngen_R12_C3_OVARY_SEROUS_EXALLC; in addition, three independent datasets (bbj-a-139, ebi-a-GCST90018888, and ieu-b-4963) were retrieved from the IEU Open GWAS database. Comprehensive details of these datasets, including sample size, population characteristics, and genotyping platforms, are summarized in Table 1.

Regrettably, our MR analysis results failed to provide evidence supporting a genetic causal relationship between Mgenes and the risk of ovarian cancer across all the included datasets and subtypes. This finding suggests that genetic variations influencing Mgenes expression may not play a direct causal role in driving ovarian cancer development, at least within the studied populations and under the analytical framework employed.
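For readers unfamiliar with MR, a common estimator is the fixed-effect inverse-variance weighted (IVW) combination of per-variant effects. The sketch below assumes that estimator, which this section does not explicitly name, and uses hypothetical summary statistics; `ivw_estimate` is an illustrative helper.

```python
from math import sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate and its standard error.

    beta_exposure : per-variant effects on gene expression (eQTL betas)
    beta_outcome  : per-variant effects on the outcome (GWAS betas)
    se_outcome    : standard errors of the outcome effects
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, sqrt(1 / denominator)


# Hypothetical summary statistics for three instruments.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.10, 0.20, 0.15],
    beta_outcome=[0.01, -0.02, 0.00],
    se_outcome=[0.02, 0.03, 0.025],
)
z_score = estimate / std_error
print(estimate, std_error, z_score)
```

A small absolute z-score, as in this toy case, corresponds to the kind of null result reported above; with few or weak instruments, such a result may reflect limited power rather than a true absence of effect.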

Discussion

Ovarian cancer’s status as a lethal gynecological malignancy stems primarily from delayed diagnosis, underscoring the urgent need for robust diagnostic tools. This study addressed critical gaps in existing research by systematically evaluating 113 ML approaches, integrating genomic data, and validating findings across independent cohorts—yielding novel insights into precision diagnostic model development and the biological role of candidate genes (Mgenes) in ovarian cancer.

A key strength of this study lies in its comprehensive comparison of ML algorithms, a departure from prior work that often relied on 5–10 single-modal models (23-27). The Lasso + NaiveBayes model was selected as the most robust, achieving AUC values of 0.991 (training set), 0.889 (GSE119054), and 0.936 (GSE66957). Its external validation AUCs are in line with those of previously reported ML-based ovarian cancer diagnostic models, which typically fall between 0.87 and 0.94 (24,27). The model’s exceptional recall (100% in GSE66957 and GSE119054) is clinically critical, as minimizing false negatives reduces the risk of missed early-stage diagnoses, for which 5-year survival exceeds 90%. Notably, the model’s reduced specificity in GSE119054 (33.3%) and GSE66957 (58.3%) warrants consideration. This limitation likely stems from the small number of normal control samples (e.g., 3 in GSE119054), which may have biased the model toward classifying samples as tumorous to maximize recall. Additionally, batch effects could persist between GEO datasets even after sva correction, as technical variations (e.g., microarray platforms, sample processing) are rarely fully eliminated. Future studies should incorporate larger, prospectively collected cohorts (including diverse ethnicities and clinical stages) to improve specificity and generalizability.

Beyond model performance, the 12 Mgenes identified in the Lasso + NaiveBayes model exhibit strong diagnostic potential, with 11 achieving AUC >0.85 and CP reaching an exceptional AUC =0.966. This aligns with prior evidence linking individual Mgenes to ovarian cancer pathogenesis: for example, GSTP1 hypermethylation is a well-documented epigenetic marker in ovarian cancer (28), while CXCR4 overexpression promotes tumor cell migration and peritoneal metastasis (29). The downregulation of CLDN15, PEG3, MAOB, and STAR in tumors further supports their role as tumor suppressors—consistent with PEG3’s known function in regulating p53-mediated apoptosis (30) and CLDN15’s role in maintaining epithelial barrier integrity (31). Correlation analysis between Mgenes revealed coordinated expression patterns, suggesting synergistic regulatory networks. For instance, the positive correlation between EPCAM (an epithelial marker) and TRIP13 (a cell cycle regulator) may reflect their joint role in promoting epithelial-to-mesenchymal transition (EMT)—a process critical for ovarian cancer invasion (32). Such co-expression highlights the advantage of using gene panels over single biomarkers, as they capture broader biological complexity and reduce false-positive rates.

Functional analyses further confirmed that Mgenes are enriched in pathways central to ovarian cancer progression. The overrepresentation of KEGG_CELL_CYCLE and KEGG_DNA_REPLICATION aligns with the uncontrolled proliferation hallmark of cancer: for example, TRIP13’s role in spindle assembly checkpoint regulation (as implied by GO enrichment) directly contributes to genomic instability—a driver of ovarian cancer development (33,34). Meanwhile, enrichment in KEGG_FOCAL_ADHESION supports Mgenes’ involvement in tumor cell-extracellular matrix interactions, which facilitate metastasis (e.g., CXCR4-mediated adhesion to stromal cells) (35). The PPI network analysis also validated Mgenes’ biological relevance by identifying co-expressed genes with established cancer roles. For example, ST14 (a serine protease) promotes EMT via upregulation of N-cadherin (36), while CLDN7 (a tight junction protein) is frequently dysregulated in high-grade serous ovarian cancer (HGSC)—the most aggressive subtype (37). These interactions suggest Mgenes are part of larger regulatory networks that drive ovarian cancer pathogenesis, making them potential therapeutic targets (e.g., inhibiting CXCR4 or ST14 to suppress metastasis).

The correlation between Mgenes and immune cell subsets provides additional critical insights into ovarian cancer’s immunosuppressive microenvironment—an obstacle to immunotherapy efficacy. For example, MAOB’s positive correlation with Tregs and M2 macrophages (immunosuppressive populations) and negative correlation with M1 macrophages (pro-inflammatory) suggests that it may promote immune evasion—a finding supported by MAOB’s role in metabolizing catecholamines to suppress T cell activation (38). Conversely, STAR’s positive correlation with CD8⁺ T cells (cytotoxic anti-tumor cells) implies that it could enhance anti-tumor immunity, highlighting its potential as an immunotherapeutic adjuvant. These observations have clinical implications: Mgenes could be used to stratify patients for immunotherapy. For instance, high MAOB expression may predict resistance to programmed death-1 (PD-1) inhibitors (due to increased Tregs), while high STAR expression may identify patients likely to respond. Future studies should explore whether targeting Mgenes (e.g., MAOB inhibition) can reverse immune suppression and improve immunotherapy outcomes.

This study has several limitations that merit consideration. First, reliance on GEO datasets introduces selection bias, as these samples are often retrospective and lack detailed clinical metadata (e.g., treatment history, comorbidities). Second, the small number of normal control samples (n=20 in training, n=15 in testing) limits the model’s ability to distinguish benign from malignant lesions, which is a critical clinical need. Third, while MR analysis explored causal links between Mgenes and ovarian cancer, the use of eQTL data from non-ovarian tissues may reduce the accuracy of causal estimates, and the statistical power of the MR analyses was not fully evaluated, so the null results may reflect weak instruments or insufficient power. Fourth, subgroup information (e.g., tumor subtype, stage, age, treatment status) was incomplete for some samples from the five GEO datasets, which precludes a comprehensive stratified analysis of patient heterogeneity and may affect the model’s clinical applicability. Fifth, the assumptions underlying CIBERSORT for immune deconvolution may not be fully satisfied in ovarian tumor microenvironments, and validation using orthogonal approaches is needed. Sixth, the generalizability of the model to real-world clinical samples (e.g., blood, ascites, biopsy-limited specimens) was not assessed, which limits its immediate translational utility. Finally, no in vitro or in vivo experiments were conducted to validate the functional roles of candidate Mgenes, leaving their mechanistic contributions speculative.

To address these gaps, future research should: validate the Lasso + NaiveBayes model using prospectively collected, fully annotated multi-center clinical samples (e.g., serum, tissue microarrays, blood, ascites, biopsy specimens) to evaluate performance across patient subgroups and pre-surgical utility; integrate multi-modal data (e.g., proteomics, imaging, clinical variables) to improve specificity; perform experimental validation of Mgenes’ biological functions (e.g., CRISPR knockout) to elucidate roles in cell cycle regulation and immune modulation; validate CIBERSORT-derived immune deconvolution findings (e.g., via multiplex immunofluorescence); optimize MR study designs with larger sample sizes and more robust IVs to enhance statistical power and reduce the impact of weak instruments; and explore Mgene-based combination therapies [e.g., CXCR4 inhibitors + programmed death ligand-1 (PD-L1) antibodies] to enhance utility in precision medicine.

Conclusions

This study systematically evaluated 113 ML approaches to develop a robust ovarian cancer diagnostic model, identifying Lasso + NaiveBayes as the optimal algorithm with strong performance across training and external validation cohorts. The 12 Mgenes (GSTP1, PEG3, MAOB, CP, SOX17, CXCR4, PRSS8, EPCAM, TRIP13, CLDN15, STAR, MPZL2) identified herein are high-performance diagnostic biomarkers (notably CP, with an AUC of 0.966) and are also candidate functional mediators of ovarian cancer pathogenesis, being associated with cell cycle regulation, DNA replication, and immune microenvironment modulation.

Oriented toward early detection and clinical utility, this model addresses two key clinical needs: leveraging its 100% recall in validation sets to compensate for the suboptimal sensitivity/specificity of traditional tools like CA125 (alleviating the issue of ~70% of patients being diagnosed at advanced stages), and using Mgenes to link diagnostic performance with functional pathways and immune correlations for potential patient stratification. Given its variable specificity (33.3% in GSE119054, 58.3% in GSE66957), the model is positioned as a complementary diagnostic tool (paired with confirmatory tests such as imaging or CA125 assays) rather than a standalone precision screening tool. Its value lies in its high recall and strong discriminative ability (AUC 0.889–0.936 in external validation), which minimize false negatives, a critical priority in ovarian cancer diagnosis.

By establishing a framework for integrating genomic data and ML to improve early detection, this work addresses a critical unmet need in ovarian cancer care. While limitations (e.g., small normal sample sizes, retrospective data, variable specificity) must be addressed, the findings lay the groundwork for translating Mgene-based diagnostics and therapeutics into clinical practice. Future multi-modal data integration will enhance specificity, with the ultimate goal of reducing misdiagnosis rates, improving survival, and advancing ovarian cancer care.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2580/rc

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2580/prf

Funding: This work was supported by the Talent Development Plan of Shanghai Fifth People’s Hospital, Fudan University (grant No. 2024WYRCJY03 to R.C.) and the High-level Professional Physician Training Program administered by Minhang District (grant No. 2024MZYS15 to R.C.).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2580/coif). R.C. received funding support from the Talent Development Plan of Shanghai Fifth People’s Hospital, Fudan University (grant No. 2024WYRCJY03) and the High-level Professional Physician Training Program administered by Minhang District (grant No. 2024MZYS15). The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Sung H, Ferlay J, Siegel RL, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49.
Alizadeh H, Akbarabadi P, Dadfar A, et al. A comprehensive overview of ovarian cancer stem cells: correlation with high recurrence rate, underlying mechanisms, and therapeutic opportunities. Mol Cancer 2025;24:135.
Wu HH, Chou HT, Lin SY, et al. FIGO 2023 staging system predicts not only survival outcome but also recurrence pattern in corpus-confined endometrial cancer patients. Taiwan J Obstet Gynecol 2025;64:76-81.
Charkhchi P, Cybulski C, Gronwald J, et al. CA125 and Ovarian Cancer: A Comprehensive Review. Cancers (Basel) 2020;12:3730.
Sahu SA, Shrivastava D. A Comprehensive Review of Screening Methods for Ovarian Masses: Towards Earlier Detection. Cureus 2023;15:e48534.
Arravalli T, Chadaga K, Muralikrishna H, et al. Detection of breast cancer using machine learning and explainable artificial intelligence. Sci Rep 2025;15:26931.
UrRehman Z. Effective lung nodule detection using deep CNN with dual attention mechanisms. Sci Rep 2024;14:3934.
Liu J, Liu L, Antwi PA, et al. Identification and Validation of the Diagnostic Characteristic Genes of Ovarian Cancer by Bioinformatics and Machine Learning. Front Genet 2022;13:858466.
Loizzi V, Comes MC, Arezzo F, et al. Validation of machine learning-based models to predict and explain the risk of ovarian cancer: a multicentric study on BRCA-mutated patients undergoing risk-reducing salpingo-oophorectomy. Front Oncol 2025;15:1574037.
Guha S. Feature Selection Using Lasso Regression Enhances Deep Learning Model Performance For Diagnosis Of Lung Cancer from Transcriptomic Data. bioRxiv 2024. Available online: https://www.biorxiv.org/content/10.1101/2024.05.01.592076v1.full.pdf
Leek JT, Johnson WE, Parker HS, et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28:882-3.
Kolde R. Pheatmap: Pretty Heatmaps. 2025. Available online: https://github.com/raivokolde/pheatmap
Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer Cham; 2016.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.
Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb) 2021;2:100141.
Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12:77.
Warde-Farley D, Donaldson SL, Comes O, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 2010;38:W214-20.
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 2013;14:7.
Newman AM, Liu CL, Green MR, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods 2015;12:453-7.
Huang H. linkET: Everything is Linkable. 2021. Available online: https://github.com/Hy4m/linkET
Kurki MI, Karjalainen J, Palta P, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 2023;613:508-18.
Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J. The MRC IEU OpenGWAS data infrastructure. bioRxiv 2020. Available online: https://www.biorxiv.org/content/10.1101/2020.08.10.244293v1
Li J, Zhang T, Ma J, et al. Machine-learning-based contrast-enhanced computed tomography radiomic analysis for categorization of ovarian tumors. Front Oncol 2022;12:934735.
Feng Y. An integrated machine learning-based model for joint diagnosis of ovarian cancer with multiple test indicators. J Ovarian Res 2024;17:45.
Wu M, Gu S, Yang J, et al. Comprehensive machine learning-based preoperative blood features predict the prognosis for ovarian cancer. BMC Cancer 2024;24:267.
Liu Z, Han L, Ji X, et al. Multi-omics analysis and experiments uncover the function of cancer stemness in ovarian cancer and establish a machine learning-based model for predicting immunotherapy responses. Front Immunol 2024;15:1486652.
Chao X, Wang S, Lang J, et al. The application of risk models based on machine learning to predict endometriosis-associated ovarian cancer in patients with endometriosis. Acta Obstet Gynecol Scand 2022;101:1440-9.
Bol GM, Suijkerbuijk KP, Bart J, et al. Methylation profiles of hereditary and sporadic ovarian cancer. Histopathology 2010;57:363-70.
Wu X, Zhang H, Sui Z, et al. CXCR4 promotes the growth and metastasis of esophageal squamous cell carcinoma as a critical downstream mediator of HIF-1α. Cancer Sci 2022;113:926-39.
Deng Y, Wu X. Peg3/Pw1 promotes p53-mediated apoptosis by inducing Bax translocation from cytosol to mitochondria. Proc Natl Acad Sci U S A 2000;97:12050-5.
Overgaard CE, Daugherty BL, Mitchell LA, et al. Claudins: control of barrier function and regulation in response to oxidant stress. Antioxid Redox Signal 2011;15:1179-93.
Oliver S, Williams M, Jolly MK, et al. Exploring the role of EMT in ovarian cancer progression using a multiscale mathematical model. NPJ Syst Biol Appl 2025;11:36.
Chen C, Li P, Fan G, et al. Role of TRIP13 in human cancer development. Mol Biol Rep 2024;51:1088.
Lu S, Qian J, Guo M, et al. Insights into a Crucial Role of TRIP13 in Human Cancer. Comput Struct Biotechnol J 2019;17:854-61.
Zhao M, Discipio RG, Wimmer AG, et al. Regulation of CXCR4-mediated nuclear translocation of extracellular signal-related kinases 1 and 2. Mol Pharmacol 2006;69:66-75.
Nie X, Gao L, Zheng M, et al. ST14 interacts with TMEFF1 and is a predictor of poor prognosis in ovarian cancer. BMC Cancer 2024;24:330.
Romani C, Zizioli V, Silvestri M, et al. Low Expression of Claudin-7 as Potential Predictor of Distant Metastases in High-Grade Serous Ovarian Carcinoma Patients. Front Oncol 2020;10:1287.
Beucher L, Gabillard-Lefort C, Baris OR, et al. Monoamine oxidases: A missing link between mitochondria and inflammation in chronic diseases? Redox Biol 2024;77:103393.

Cite this article as: Shen S, Zhang L, Sun Q, Huang Y, Yang Y, Hu N, Jiang S, Zhang L, Wang X, Chen R. A panel of machine learning approaches for diagnostic model development and validation in ovarian cancer. Transl Cancer Res 2026;15(2):123. doi: 10.21037/tcr-2025-1-2580
