Development and validation of a multi-task deep learning model integrating PET-CT radiomics, clinical variables, and EBV DNA for prognostic prediction in locally advanced nasopharyngeal carcinoma
Highlight box
Key findings
• The multi-task deep learning (MTDL) model integrating positron emission tomography-computed tomography (PET-CT) radiomics, clinical variables, and Epstein-Barr virus (EBV) DNA outperformed tumor-node-metastasis (TNM) staging and single-task models in predicting prognosis for locally advanced nasopharyngeal carcinoma (LA-NPC), achieving concordance indices (C-indices) of 0.834 for overall survival (OS) and 0.821 for progression-free survival (PFS).
What is known and what is new?
• TNM staging is the standard prognostic tool for NPC but has notable limitations, including overlapping survival curves between T2 and T3 categories and inadequate differentiation of prognostic subgroups. Existing single-modality radiomics and machine learning models show variable performance and fail to capture complex non-linear interactions among metabolic, anatomical, and molecular biomarkers.
• This study develops and validates a multi-task deep learning model that integrates PET-CT radiomics, clinical variables, and EBV DNA for simultaneous prediction of OS and PFS in LA-NPC. The model significantly outperforms TNM staging and single-task approaches, and incorporates gradient-weighted class activation mapping (Grad-CAM) visualization for interpretable, clinically actionable risk stratification.
What is the implication, and what should change now?
• This model has the potential to enhance personalized treatment planning by identifying high-risk patients who may benefit from treatment intensification. Future work should prioritize prospective external validation in independent multi-center cohorts and clinical trials to confirm its real-world effectiveness before clinical implementation.
Introduction
Nasopharyngeal carcinoma (NPC) represents a distinct malignancy with striking geographic variation, predominantly affecting populations in Southeast Asia and southern China (1). According to GLOBOCAN 2020 data, approximately 133,354 new cases and 80,008 deaths occurred worldwide, with endemic regions bearing a disproportionate burden (2). The age-standardized incidence rate in Eastern Asia reaches 2.70 per 100,000, with projections indicating a 34.60% increase in global cases by 2040 (2). Despite substantial improvements in treatment outcomes following the introduction of intensity-modulated radiation therapy (IMRT) and concurrent chemoradiotherapy (CCRT), locally advanced NPC (LA-NPC) continues to present significant therapeutic challenges, with 5-year overall survival (OS) rates ranging from 75.00% to 88.00% and distant metastasis occurring in 15.00% to 30.00% of patients (3-5).
Current prognostic assessment of LA-NPC relies predominantly on the tumor-node-metastasis (TNM) staging system, which stratifies patients based on anatomical tumor extent. However, the 8th edition of the American Joint Committee on Cancer (AJCC) staging system demonstrates notable limitations, including overlapping survival curves between T2 and T3 categories and between adjacent overall stages, as well as inadequate differentiation of prognostic subgroups. Specifically, studies have demonstrated that patients with T2 and T3 nasopharyngeal lesions exhibit similar 5-year OS rates, challenging the discriminative utility of the current staging framework (6-8). Recent multi-institutional studies involving over 8,800 patients have revealed significant heterogeneity in outcomes within the same stage, with concordance indices (C-indices) ranging from 0.60 to 0.72, underscoring the need for more refined prognostic tools. Existing radiomics and machine learning models, including nomograms, random forests, and support vector machines, have shown variable performance (C-indices: 0.65–0.78) but remain limited by their reliance on handcrafted features and inability to capture deep non-linear feature interactions (7-9). The inability of anatomical staging alone to capture tumor biology, treatment response heterogeneity, and individual patient characteristics has catalyzed the search for integrative prognostic models.
18F-fluorodeoxyglucose positron emission tomography-computed tomography (18F-FDG PET-CT) has emerged as a powerful imaging modality that combines metabolic and anatomical information, offering superior diagnostic accuracy compared to conventional imaging (10-12). Meta-analyses demonstrate that PET-derived parameters, including metabolic tumor volume (MTV), maximum standardized uptake value (SUVmax), and total lesion glycolysis (TLG), serve as independent prognostic factors with hazard ratios (HRs) ranging from 2.50 to 3.30 for survival outcomes (13-15). Additionally, plasma Epstein-Barr virus (EBV) DNA levels have been validated as robust biomarkers, with pre-treatment levels and early clearance patterns showing strong prognostic associations (HR 2.81–3.01) (16-18). However, traditional prognostic models incorporating these biomarkers typically employ linear Cox regression or machine learning approaches that may inadequately capture complex, non-linear relationships among multiple prognostic factors.
Recent advances in artificial intelligence (AI), particularly deep learning, have revolutionized medical image analysis and prognostic prediction (19-21). Deep learning-based radiomics models have demonstrated C-indices of 0.726 to 0.884 in NPC prognosis prediction, significantly surpassing traditional TNM staging (22-24). Among various multimodal fusion strategies, multi-task learning (MTL) frameworks offer particular promise for NPC prognosis. Unlike simple feature concatenation or late fusion approaches, MTL simultaneously optimizes multiple related prediction tasks through shared representations, allowing the model to leverage commonalities between OS and progression-free survival (PFS) prediction. This shared representation learning can improve generalization and reduce overfitting, which is particularly advantageous when working with relatively small clinical datasets (25-27). A foundational multi-task model recently demonstrated a 50.00% reduction in training data requirements while achieving 99.70% external validation accuracy across diverse medical imaging tasks (25). However, to date, no study has developed a multi-task deep learning (MTDL) model that integrates multi-modal PET-CT imaging with clinical and molecular biomarkers for simultaneous prediction of multiple survival endpoints in LA-NPC.
In this study, we developed and validated an MTDL model that simultaneously predicts 3-year OS and PFS by integrating pre-treatment 18F-FDG PET-CT imaging with clinical variables and EBV DNA levels. Furthermore, we employed gradient-weighted class activation mapping (Grad-CAM) to provide interpretable visualization of the model’s decision-making process, addressing a critical need for explainable AI in clinical oncology (28-30). We hypothesized that this integrative approach would substantially outperform conventional TNM staging and single-modality models, providing clinically actionable risk stratification for personalized treatment planning. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2705/rc).
Methods
Study population and design
This retrospective cohort study enrolled 200 patients with histologically confirmed, previously untreated LA-NPC (AJCC 8th edition stages III–IVA) who underwent pre-treatment 18F-FDG PET-CT imaging and received definitive CCRT at Meizhou People’s Hospital between January 2018 and December 2022. Inclusion criteria comprised: (I) age 18–75 years; (II) World Health Organization (WHO) type II or III histology; (III) Eastern Cooperative Oncology Group (ECOG) performance status 0–1; (IV) adequate organ function; and (V) complete pre-treatment PET-CT imaging and clinical data. Exclusion criteria included prior malignancy, distant metastasis at diagnosis, or incomplete follow-up data. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Meizhou People’s Hospital (No. 2024-C-20). Given the retrospective nature of the study utilizing de-identified data, the requirement for informed consent was waived by the ethics committee. Patients were randomly allocated to training (n=140, 70.00%) and internal validation (n=60, 30.00%) cohorts using computer-generated random numbers with stratification by TNM stage.
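The stratified random allocation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the record layout, and the seed are assumptions, and the cohort is a placeholder that only mirrors the reported stage counts.

```python
import random
from collections import defaultdict

def stratified_split(patients, stratum_key, train_fraction=0.70, seed=42):
    """Randomly split records into training/validation sets within each stratum."""
    rng = random.Random(seed)  # the seed is an illustrative assumption
    by_stratum = defaultdict(list)
    for patient in patients:
        by_stratum[patient[stratum_key]].append(patient)

    train, valid = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum][:]
        rng.shuffle(members)
        # Apply the training fraction separately inside each stage group.
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        valid.extend(members[n_train:])
    return train, valid

# Placeholder cohort mirroring the reported stage counts (85 stage III, 115 stage IVA).
cohort = [{"id": i, "stage": "III" if i < 85 else "IVA"} for i in range(200)]
train_set, valid_set = stratified_split(cohort, "stage")
```

Because the fraction is applied per stratum, rounding can shift the totals by a patient or so relative to an exact 140/60 split.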
The sample size of 200 patients was determined based on the available patient cohort meeting all inclusion criteria during the study period. To mitigate the risk of overfitting inherent to deep learning models trained on relatively small datasets, we employed multiple strategies: (I) MTL to leverage shared representations across OS and PFS tasks, effectively reducing the number of independent parameters; (II) extensive data augmentation during training; (III) dropout regularization (P=0.20–0.30); and (IV) five-fold cross-validation for hyperparameter optimization with early stopping.
Treatment protocol
All patients received IMRT with a prescribed dose of 69.96 Gy (range, 68.00–70.00 Gy) to the primary tumor and involved lymph nodes in 33 fractions, with 60.00–63.00 Gy to high-risk clinical target volumes and 54.00–57.00 Gy to low-risk volumes. Concurrent chemotherapy consisted of cisplatin 100 mg/m² on days 1, 22, and 43 of radiotherapy or weekly cisplatin 40 mg/m². Selected high-risk patients (T4 or N3 disease) received induction chemotherapy (2–3 cycles of docetaxel 60 mg/m², cisplatin 60 mg/m², and 5-fluorouracil 600 mg/m² on days 1–5) prior to CCRT at the discretion of the treating oncologist.
PET-CT image acquisition and preprocessing
18F-FDG PET-CT scans were performed using a combined PET-CT scanner (Discovery MI, GE Healthcare, Waukesha, Wisconsin, USA) within 2 weeks before treatment initiation. Following at least 6 hours of fasting and confirmation of serum glucose <8.10 mmol/L, patients received intravenous injection of 5.00–5.55 MBq/kg 18F-FDG. After 60-minute uptake in a quiet, dimly lit room, imaging was performed from the skull base to the clavicles. CT acquisition parameters included 140 kV, 30–170 mA with automatic tube current modulation, 3.00 mm slice thickness, and 512×512 matrix. PET acquisition utilized two-dimensional (2D) mode with 2.50 minutes per bed position and 128×128 matrix, reconstructed with three-dimensional (3D) ordered subset expectation maximization (20 subsets, 2 iterations).
Image preprocessing encompassed: (I) rigid registration of PET and CT images to a standard space using Advanced Normalization Tools (ANTs); (II) resampling to isotropic 2.00 mm × 2.00 mm × 2.00 mm voxels using trilinear interpolation; (III) SUV normalization for PET images [SUV = (tissue activity × body weight)/injected dose]; (IV) intensity normalization for CT images using z-score transformation; and (V) Gaussian smoothing (σ=1.00 mm) for noise reduction. Primary tumors and involved lymph nodes were semi-automatically segmented using a threshold of SUV ≥2.50 with manual refinement by two experienced nuclear medicine physicians blinded to clinical outcomes.
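Steps (III) and (IV) and the SUV ≥2.50 threshold can be illustrated with a short sketch. The function names, units (Bq/mL, grams, Bq), and all numeric inputs below are assumptions for illustration; decay correction is omitted, and the manual refinement step is not modeled.

```python
from statistics import mean, pstdev

def suv(activity_bq_per_ml, body_weight_g, injected_dose_bq):
    """SUV = (tissue activity x body weight) / injected dose.

    Decay correction is omitted (an assumption of this sketch).
    """
    return activity_bq_per_ml * body_weight_g / injected_dose_bq

def zscore(values):
    """Z-score transformation, as applied to CT intensities."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def tumor_mask(suv_values, threshold=2.5):
    """Initial threshold mask (SUV >= 2.50), before any manual refinement."""
    return [v >= threshold for v in suv_values]

# Hypothetical 70 kg patient injected with 5.18 MBq/kg (within the stated range).
dose_bq = 5.18e6 * 70
voxel_activity = [3000.0, 12950.0, 40000.0]  # Bq/mL, made-up values
suv_values = [suv(a, 70_000.0, dose_bq) for a in voxel_activity]
mask = tumor_mask(suv_values)
ct_normalized = zscore([-50.0, 20.0, 60.0, 110.0])  # made-up Hounsfield units
```

In this example the middle voxel sits exactly at the threshold and is therefore included in the initial mask.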
Clinical variables and EBV DNA quantification
Clinical variables included age, sex, WHO histological type, T stage, N stage, overall TNM stage, ECOG performance status, smoking history, and body mass index. Pre-treatment plasma EBV DNA levels were quantified using real-time quantitative polymerase chain reaction (qPCR) targeting the BamHI-W region of the EBV genome, with results expressed as copies/mL. The lower detection limit was 500 copies/mL, and undetectable levels were assigned a value of 250 copies/mL for statistical analysis. EBV DNA measurements were performed in a central laboratory certified by the College of American Pathologists.
Feature selection and preprocessing: prior to model training, clinical variables were screened using univariate Cox regression analysis for both OS and PFS endpoints, and variables with P<0.10 were retained for inclusion in the deep learning model. All pre-specified clinical variables (age, sex, WHO histological type, T stage, N stage, overall TNM stage, ECOG performance status, smoking history, and body mass index) met this threshold and were included in the final model. EBV DNA levels were log-transformed due to their right-skewed distribution. No additional feature selection was applied to imaging features, as the 3D ResNet-34 architecture inherently performs hierarchical feature extraction and selection through its convolutional layers, and the attention-based fusion mechanism automatically weights the relative importance of different input modalities.
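The handling of undetectable EBV DNA values and the log transformation can be sketched as below. The text does not state the logarithm base, so base 10 is an assumption here, as are the function and constant names.

```python
import math

DETECTION_LIMIT = 500.0  # copies/mL, lower detection limit stated in the text
IMPUTED_VALUE = 250.0    # copies/mL, value assigned to undetectable samples

def preprocess_ebv(copies_per_ml):
    """Impute undetectable results, then log-transform (base 10 is an assumption)."""
    if copies_per_ml is None or copies_per_ml < DETECTION_LIMIT:
        copies_per_ml = IMPUTED_VALUE
    return math.log10(copies_per_ml)

# Made-up measurements; None stands for an undetectable result.
raw_levels = [None, 850.0, 3250.0, 12500.0]
log_levels = [preprocess_ebv(v) for v in raw_levels]
```

The transform compresses the long right tail, so a sample at 12,500 copies/mL is roughly 1.7 log units above an imputed undetectable value rather than 50 times larger.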
MTDL model architecture
The MTDL model architecture comprised three main components: (I) dual-pathway 3D convolutional neural networks for feature extraction; (II) attention-based multi-modal fusion; and (III) multi-task prediction heads (Figure 1). For imaging feature extraction, we employed two parallel 3D ResNet-34 architectures to independently process PET and CT volumes. Each ResNet-34 consisted of an initial convolutional layer (7×7×7 kernel, 64 filters, stride 2), followed by four residual blocks with 3, 4, 6, and 3 residual units, respectively. Max pooling (3×3×3, stride 2) was applied after the initial convolution, and adaptive average pooling reduced spatial dimensions to 1×1×1 before the fully connected layer, yielding 512-dimensional feature vectors for both PET and CT pathways.
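To make the layer layout concrete, the sketch below traces feature-map size and channel count through a ResNet-34-style backbone. The 96-voxel input edge, the padding values, and the per-stage strides (1, 2, 2, 2) follow the usual ResNet convention and are assumptions; the text specifies only the kernel sizes, the stem stride, and the unit counts.

```python
def conv_out(size, kernel, stride, padding):
    """Output edge length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet34_trace(input_size=96):
    """Return (residual units, channels, edge length) after each of the four stages."""
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # initial 7x7x7 conv
    size = conv_out(size, kernel=3, stride=2, padding=1)        # 3x3x3 max pooling
    trace = []
    # (units, channels, stride of the first unit); strides are an assumed convention.
    for units, channels, stride in [(3, 64, 1), (4, 128, 2), (6, 256, 2), (3, 512, 2)]:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        trace.append((units, channels, size))
    return trace

stages = resnet34_trace(96)
# Stem conv + two convs per residual unit + final fully connected layer.
weighted_layers = 1 + 2 * sum(units for units, _, _ in stages) + 1
```

Under these assumptions the last stage yields 512 channels on a 3×3×3 grid, which adaptive average pooling then reduces to the 512-dimensional vector described above, and the layer count comes to 34.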
Clinical variables and log-transformed EBV DNA levels were processed through separate dense neural networks. Categorical variables were one-hot encoded, while continuous variables were standardized to zero mean and unit variance. The clinical pathway consisted of two fully connected layers (256 and 128 neurons) with batch normalization and dropout (P=0.30) after each layer, producing a 128-dimensional clinical feature vector.
Multi-modal feature fusion was achieved through a scaled dot-product attention mechanism. The concatenated imaging features served as queries (Q), while clinical features functioned as keys (K) and values (V). Attention weights were computed as: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$, where $d_k$ represents the dimensionality of keys. The fused representation underwent two additional fully connected layers (512 and 256 neurons) with rectified linear unit (ReLU) activation and dropout (P=0.20), forming a shared representation layer.
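Scaled dot-product attention can be sketched in plain Python as follows. This is a generic illustration with toy two-dimensional vectors; it assumes queries and keys have already been projected to a common dimension, a step the text does not describe.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        fused = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(fused)
    return outputs

# Toy example: one imaging-derived query attending over two clinical tokens.
imaging_query = [[1.0, 0.0]]
clinical_keys = [[1.0, 0.0], [0.0, 1.0]]
clinical_values = [[1.0, 0.0], [0.0, 1.0]]
fused = scaled_dot_product_attention(imaging_query, clinical_keys, clinical_values)
```

The query aligns with the first key, so the first value receives the larger weight in the fused output.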
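Task-specific prediction heads branched from the shared representation. Each head comprised two fully connected layers (128 and 64 neurons) followed by a DeepSurv layer implementing Cox proportional hazards loss. The DeepSurv layer computed a risk score $r_i = \exp(\theta^{T} h_i)$, where $\theta$ represents learned parameters and $h_i$ denotes the final hidden representation for patient $i$. The negative log partial likelihood loss was optimized: $\mathcal{L} = -\sum_{i} \delta_i \left[\theta^{T} h_i - \log \sum_{j \in R_i} \exp(\theta^{T} h_j)\right]$, where $\delta_i$ is the event indicator and $R_i$ is the risk set at time $t_i$. MTL employed weighted loss: $\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{OS}} \mathcal{L}_{\mathrm{OS}} + \lambda_{\mathrm{PFS}} \mathcal{L}_{\mathrm{PFS}}$, with task weights $\lambda_{\mathrm{OS}} = \lambda_{\mathrm{PFS}} = 0.50$ determined through grid search.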
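A minimal sketch of the Cox negative log partial likelihood and the weighted multi-task combination is given below. It operates on log-risk values (the exponent of the risk score), uses an unnormalized sum with a Breslow-style risk set for tied times, and takes made-up inputs; these choices and all names are assumptions rather than the authors' implementation.

```python
import math

def cox_neg_log_partial_likelihood(log_risks, times, events):
    """Sum over observed events of -(log-risk - log-sum-exp of log-risks in the risk set)."""
    loss = 0.0
    for i, (t_i, event) in enumerate(zip(times, events)):
        if not event:
            continue  # censored patients contribute only through risk sets
        risk_set = [log_risks[j] for j, t_j in enumerate(times) if t_j >= t_i]
        peak = max(risk_set)
        log_sum = peak + math.log(sum(math.exp(r - peak) for r in risk_set))
        loss -= log_risks[i] - log_sum
    return loss

def multitask_loss(loss_os, loss_pfs, lambda_os=0.5, lambda_pfs=0.5):
    """Weighted sum of the two task losses with the reported weights of 0.50 each."""
    return lambda_os * loss_os + lambda_pfs * loss_pfs

# Made-up outputs for three patients (times in months, 1 = event, 0 = censored).
loss_os = cox_neg_log_partial_likelihood([0.4, -0.2, 0.1], [12, 30, 36], [1, 0, 0])
loss_pfs = cox_neg_log_partial_likelihood([0.5, 0.0, -0.3], [8, 20, 36], [1, 1, 0])
total_loss = multitask_loss(loss_os, loss_pfs)
```

With identical log-risks and three events at distinct times, the loss reduces to log 3 + log 2 + log 1, which is a convenient sanity check for any implementation.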
Model training and hyperparameter optimization
The model was implemented in PyTorch 1.13.0 and trained using Adam optimizer with an initial learning rate of 0.0001, β1=0.90, β2=0.999, and weight decay of 0.0001. A cosine annealing learning rate schedule with warm restarts was employed, reducing the learning rate from 0.0001 to 0.00001 over 100 epochs. The batch size was set to 16, balanced across risk groups using inverse class frequency weighting to address event rate imbalance. Five-fold cross-validation within the training cohort guided hyperparameter selection, evaluating combinations of learning rates (0.0001, 0.0005, 0.001), dropout rates (0.20, 0.30, 0.40), and network depths. Early stopping with patience of 20 epochs based on validation C-index prevented overfitting. Data augmentation included random rotation (±15°), flipping (50.00% probability), scaling (0.90–1.10), and elastic deformation (α=4, σ=0.05) applied with 70.00% probability during training.
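The cosine annealing schedule can be written out directly. The sketch assumes one restart cycle spans the 100 epochs mentioned above and that restarts reset to the initial rate; the function name and defaults are illustrative.

```python
import math

def cosine_annealing_lr(epoch, lr_max=1e-4, lr_min=1e-5, cycle_length=100):
    """Learning rate at a given epoch under cosine annealing with warm restarts."""
    position = epoch % cycle_length  # restart at the beginning of each cycle
    cosine = math.cos(math.pi * position / cycle_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)

# Rate at the start, midpoint, and end of the first cycle, then after the restart.
schedule = [cosine_annealing_lr(e) for e in (0, 50, 99, 100)]
```

The rate starts at 0.0001, passes through the midpoint of the range halfway through the cycle, approaches 0.00001 at the end, and jumps back to 0.0001 at the restart.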
Model interpretability using Grad-CAM
To enhance model interpretability, we implemented Grad-CAM to visualize spatial attention patterns on PET-CT images (28). For a given input image and target class (high-risk vs. low-risk), Grad-CAM computes the gradient of the risk score with respect to feature maps in the final convolutional layer. The importance weights for feature map $k$ are calculated as: $\alpha_k^c = \frac{1}{Z}\sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$, where $y^c$ is the risk score, $A_{ij}^k$ represents activation at spatial location $(i,j)$ in feature map $k$, and $Z$ is the normalization factor. The class activation map is then generated by weighted combination: $L^c_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$, highlighting regions most influential for risk prediction. We generated Grad-CAM visualizations for representative high-risk and low-risk cases and validated regions of interest with two board-certified radiation oncologists.
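The Grad-CAM computation described above can be sketched on small arrays. The example is two-dimensional to match the (i,j) indexing in the text (the model itself works on volumes), and the activations and gradients are made-up numbers rather than outputs of a trained network.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps using spatially averaged gradients, then apply ReLU.

    activations, gradients: lists of K maps, each a height x width nested list.
    """
    height, width = len(activations[0]), len(activations[0][0])
    normalizer = height * width  # Z, the number of spatial locations
    cam = [[0.0] * width for _ in range(height)]
    for a_k, g_k in zip(activations, gradients):
        alpha_k = sum(sum(row) for row in g_k) / normalizer  # importance weight
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha_k * a_k[i][j]
    return [[max(0.0, value) for value in row] for row in cam]

# Two made-up 2x2 feature maps: one with positive gradients, one with negative.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
risk_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, risk_gradients)
```

The ReLU keeps only locations whose weighted activation pushes the risk score upward, which is why the top row of this toy heatmap is zero.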
Statistical analysis
Model performance was evaluated using Harrell’s concordance index (C-index), time-dependent area under the curve (AUC) at 3 years, and calibration curves. The C-index measures the probability of correct risk ordering between patient pairs, with values >0.80 indicating excellent discrimination. Bootstrap resampling (1,000 iterations) generated 95% confidence intervals (CIs) for performance metrics. The MTDL model was compared against TNM staging and single-task models using likelihood ratio tests, with statistical significance defined as two-sided P<0.050. Kaplan-Meier survival curves were constructed for risk groups defined by median predicted risk, with log-rank tests assessing differences. Calibration was evaluated by plotting predicted versus observed survival probabilities with 95.00% confidence bands. Feature importance was quantified by computing the absolute gradient of the risk score with respect to each input feature, averaged across all patients. Statistical analyses employed R 4.2.0 (survival, survivalROC, rms packages) and Python 3.9.0 (scikit-survival, lifelines).
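Harrell's C-index can be sketched in a few lines. This simplified version counts a pair as comparable only when the patient with the shorter time had an observed event, gives tied risk scores half credit, and skips tied times; the packages named above provide tested implementations, and the data here are made up.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs whose risk ordering matches outcome ordering."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Made-up follow-up times (months), event indicators, and predicted risks.
follow_up = [10, 20, 30, 40]
observed = [1, 1, 0, 1]
predicted = [0.9, 0.7, 0.4, 0.1]
c_index = harrell_c_index(follow_up, observed, predicted)
```

A perfectly ordered prediction gives 1.0, a reversed one gives 0.0, and a constant prediction gives 0.5.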
Results
Patient characteristics and outcomes
Among 247 patients assessed for eligibility, 200 met inclusion criteria and were enrolled (Figure 2). Baseline characteristics were well balanced between training (n=140) and validation (n=60) cohorts, with no statistically significant differences observed for any variable (all P>0.05; Table 1). The median age was 47.00 years (range 24–72 years), with male predominance (72.50%). The median pre-treatment EBV DNA level was 3,250 copies/mL [interquartile range (IQR), 850–12,500 copies/mL]. Most patients had stage III (42.50%) or IVA (57.50%) disease. During a median follow-up of 36.00 months (range, 6–60 months), 34 deaths (17.00%) and 48 progression events (24.00%) occurred. The 3-year OS and PFS rates were 78.30% (95% CI: 70.40–84.40%) and 70.30% (95% CI: 61.90–77.20%), respectively. Distant metastasis was the most common failure pattern (n=32, 16.00%), followed by locoregional recurrence (n=16, 8.00%).
Table 1
| Characteristic | Overall (n=200) | Training (n=140) | Validation (n=60) |
|---|---|---|---|
| Age, years | 47.00 (24–72) | 47.00 (24–72) | 48.00 (26–71) |
| Sex | | | |
| Male | 145 (72.50) | 102 (72.86) | 43 (71.67) |
| Female | 55 (27.50) | 38 (27.14) | 17 (28.33) |
| T stage | | | |
| T3 | 112 (56.00) | 78 (55.71) | 34 (56.67) |
| T4 | 88 (44.00) | 62 (44.29) | 26 (43.33) |
| N stage | | | |
| N0–1 | 67 (33.50) | 47 (33.57) | 20 (33.33) |
| N2–3 | 133 (66.50) | 93 (66.43) | 40 (66.67) |
| TNM stage | | | |
| III | 85 (42.50) | 59 (42.14) | 26 (43.33) |
| IVA | 115 (57.50) | 81 (57.86) | 34 (56.67) |
| EBV DNA, copies/mL | 4,850 [1,240–15,600] | 4,720 [1,180–15,200] | 5,010 [1,350–16,100] |
| MTV, cm³ | 32.50 [18.20–58.40] | 31.80 [17.60–57.90] | 33.70 [19.10–59.20] |
| SUVmax | 11.20 [8.40–15.30] | 11.10 [8.30–15.10] | 11.40 [8.70–15.60] |
Data are presented as median (range), n (%), or median [IQR]. EBV, Epstein-Barr virus; IQR, interquartile range; MTV, metabolic tumor volume; N, node; SUVmax, maximum standardized uptake value; T, tumor; TNM, tumor-node-metastasis.
MTDL model performance
The MTDL model demonstrated excellent discriminative ability in the internal validation cohort, achieving C-indices of 0.834 (95% CI: 0.805–0.863) for OS and 0.821 (95% CI: 0.790–0.852) for PFS (Figure 1, Table 2). These results significantly surpassed both TNM staging (C-indices: 0.648 for OS and 0.631 for PFS; both P<0.001) and single-task models (OS: C-index 0.771, P=0.007; PFS: C-index 0.750, P=0.003). The time-dependent AUCs at 3 years were 0.89 for OS and 0.86 for PFS (Figure 3), confirming strong predictive accuracy at the clinically relevant timepoint. Calibration curves demonstrated good agreement between predicted and observed survival probabilities, with minimal deviation from the 45-degree reference line (data not shown).
Table 2
| Model | OS: C-index (95% CI) | OS: 3-year AUC | OS: P value | PFS: C-index (95% CI) | PFS: 3-year AUC | PFS: P value |
|---|---|---|---|---|---|---|
| MTDL model | 0.834 (0.805–0.863) | 0.89 | Ref. | 0.821 (0.790–0.852) | 0.86 | Ref. |
| Single-task OS model | 0.771 (0.738–0.804) | 0.83 | 0.007 | – | – | – |
| Single-task PFS model | – | – | – | 0.750 (0.715–0.785) | 0.79 | 0.003 |
| TNM staging (8th ed.) | 0.648 (0.608–0.688) | 0.72 | <0.001 | 0.631 (0.590–0.672) | 0.69 | <0.001 |
AUC, area under the curve; CI, confidence interval; MTDL, multi-task deep learning; OS, overall survival; PFS, progression-free survival; TNM, tumor-node-metastasis.
Risk stratification and survival analysis
Using the median predicted risk score as the cutoff, patients were stratified into high-risk (n=67, 33.50%) and low-risk (n=133, 66.50%) groups. Kaplan-Meier analysis revealed significant survival differences between risk groups (Figure 4). The 3-year OS was 54.20% (95% CI: 42.10–66.30%) in the high-risk group versus 91.30% (95% CI: 85.80–96.80%) in the low-risk group (log-rank P<0.001; HR 4.52, 95% CI: 2.81–7.26). Similarly, the 3-year PFS was 47.10% (95% CI: 35.50–58.70%) in the high-risk group versus 82.40% (95% CI: 75.90–88.90%) in the low-risk group (log-rank P<0.001; HR 3.87, 95% CI: 2.45–6.12). The MTDL-based risk stratification significantly improved patient categorization compared to TNM staging, which showed substantial overlap in survival curves between stages III and IVA (log-rank P=0.18 for OS).
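The Kaplan-Meier estimate underlying these survival curves can be sketched as a product-limit calculation. The data below are made up for illustration and the function name is an assumption; the survival packages listed in the statistical analysis section would normally be used instead.

```python
def kaplan_meier(times, events):
    """Return [(event time, estimated survival probability)] in time order."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        # Multiply by the conditional probability of surviving past time t.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up follow-up times in months; 1 = event observed, 0 = censored.
curve = kaplan_meier([12, 18, 24, 36], [1, 0, 1, 1])
```

The censored patient at 18 months leaves the risk set without lowering the curve, which is what distinguishes this estimate from a simple proportion of survivors.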
Feature importance and model interpretability
Feature importance analysis revealed that PET-derived metabolic features contributed most substantially to model predictions, accounting for 42.00% of the total predictive power (Figure 5). The top three features were MTV (18.50%), SUVmax (12.30%), and TLG (11.20%). CT anatomical features contributed 31.00%, with tumor infiltration depth (8.70%), lymph node size (7.40%), and tumor volume (6.30%) being the most influential. Clinical variables, including T stage (5.90%), N stage (5.20%), and age (4.10%), accounted for 23.20% of predictive capability, while EBV DNA levels contributed an additional 3.80%. This hierarchical feature contribution pattern underscores the complementary value of integrating multi-modal data for comprehensive prognostic assessment.
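Gradient-based importance of the kind reported here can be illustrated by averaging absolute gradients across patients and normalizing them to percentages. The sketch substitutes central finite differences for automatic differentiation and uses a hypothetical linear risk function, so both the technique and the numbers are stand-ins rather than the authors' procedure.

```python
def gradient_importance(risk_fn, patients, eps=1e-4):
    """Mean absolute gradient per feature, expressed as a percentage of the total."""
    n_features = len(patients[0])
    totals = [0.0] * n_features
    for features in patients:
        for k in range(n_features):
            up = list(features)
            down = list(features)
            up[k] += eps
            down[k] -= eps
            # Central finite difference approximates d(risk)/d(feature k).
            totals[k] += abs((risk_fn(up) - risk_fn(down)) / (2 * eps))
    mean_abs = [t / len(patients) for t in totals]
    scale = sum(mean_abs)
    return [100.0 * m / scale for m in mean_abs]

def toy_risk(features):
    """Hypothetical risk function standing in for the trained model."""
    return 2.0 * features[0] - 0.5 * features[1]

importance = gradient_importance(toy_risk, [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
```

Because absolute values are taken before averaging, a feature that lowers risk (the second one here) still receives a positive share of importance.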
Grad-CAM visualization provided intuitive interpretation of model attention patterns (Figure 6). For high-risk patients, the model consistently focused on regions with high metabolic activity in the primary tumor and retropharyngeal lymph nodes on PET images, as well as areas of deep infiltration and large tumor volumes on CT images. Conversely, for low-risk patients, attention was diffusely distributed with lower intensity, reflecting less aggressive tumor characteristics. Visual inspection by radiation oncologists confirmed that Grad-CAM-highlighted regions aligned with clinical assessment of high-risk features, validating the model’s clinical relevance and biological plausibility.
Discussion
In this study, we developed and internally validated an MTDL model that integrates multi-modal PET-CT imaging with clinical and molecular biomarkers for simultaneous prediction of OS and PFS in LA-NPC patients. The MTDL model achieved C-indices of 0.834 and 0.821 for OS and PFS, respectively, significantly outperforming both conventional TNM staging (improvements of 0.186 and 0.190) and single-task deep learning approaches (improvements of 0.063 and 0.071). To our knowledge, this represents the first application of MTDL to prognostic prediction in NPC, demonstrating the value of jointly optimizing related survival endpoints through shared representations.
Our findings align with and extend recent advances in AI-based prognostic modeling for NPC. Qiang et al. reported a deep learning system achieving C-indices of 0.830 for OS in 3,764 patients, comparable to our MTDL model’s performance (23). Similarly, a recent multi-center study of an automated AI model (NPC-SurvAI) demonstrated integrated AUCs of 0.838–0.894 across four institutions, validating the robustness of deep learning approaches (24). However, these prior studies employed single-task learning and did not leverage the synergistic benefits of multi-task optimization. Our results demonstrate that MTL confers meaningful advantages, with absolute C-index improvements of 0.063 and 0.071 (6.30 and 7.10 percentage points) over single-task counterparts, consistent with foundational work showing 3.00% to 5.00% performance gains from multi-task architectures (25-27). Moreover, the MTDL model’s ability to simultaneously predict multiple endpoints addresses a critical clinical need, as treatment decisions often require comprehensive assessment of both survival and progression risks.
The substantial contribution of PET-derived metabolic features (42.00% of predictive power) underscores the unique prognostic value of functional imaging beyond anatomical staging. Our feature importance analysis identified MTV, SUVmax, and TLG as the top three predictors, corroborating meta-analytic evidence demonstrating HRs of 2.50 to 3.30 for these parameters (13-15). Notably, retropharyngeal lymph node uptake patterns emerged as a critical prognostic feature (9.80% contribution), consistent with studies showing that metabolically active retropharyngeal nodes predict distant metastasis (31). The complementary role of CT anatomical features (31.00% contribution), particularly tumor infiltration depth and lymph node size, highlights the importance of integrating multiple imaging modalities. This multi-modal integration capitalizes on the distinct yet complementary information provided by metabolic and anatomical imaging, enabling more comprehensive tumor characterization than either modality alone (32-34).
The incorporation of EBV DNA, while contributing modestly to overall predictive power (3.80%), reflects its established role as a circulating biomarker of tumor burden and treatment response (16-18). Recent studies have demonstrated that early EBV DNA clearance predicts superior survival outcomes (5-year OS 84.80% vs. 72.90%), suggesting potential utility for treatment adaptation (17). Our model’s integration of EBV DNA with imaging features creates a comprehensive prognostic framework that captures tumor biology at multiple scales, from molecular markers to tissue-level imaging characteristics. Future iterations could incorporate longitudinal EBV DNA measurements to enable dynamic risk assessment during treatment, potentially triggering early treatment intensification for high-risk patients with persistent elevations.
The interpretability provided by Grad-CAM visualization represents a critical advancement toward clinically acceptable AI systems. Radiation oncologists confirmed that model attention patterns aligned with established high-risk features, including metabolically active primary tumors, retropharyngeal lymphadenopathy, and deep soft tissue infiltration. This concordance between algorithmic focus and clinical expertise enhances trust and facilitates clinical translation, addressing a major barrier to AI adoption in oncology (28-30). The European Union’s AI Act and recent joint guidelines from the European Society for Radiotherapy and Oncology (ESTRO) and American Association of Physicists in Medicine (AAPM) mandate explainability for high-risk medical AI systems, making Grad-CAM visualization not only scientifically valuable but also relevant to regulatory compliance (35-37).
The clinical implications of accurate risk stratification are substantial. Our MTDL model identified 33.50% of patients as high-risk with markedly inferior outcomes (3-year OS 54.20% vs. 91.30% for low-risk), suggesting this subgroup may benefit from treatment intensification. Recent risk-adapted strategies, including induction chemotherapy for high-risk patients, have demonstrated improved outcomes when appropriately targeted (38-40). Conversely, low-risk patients (66.50% of the cohort) achieving 91.30% 3-year OS might be candidates for treatment de-intensification to reduce long-term toxicities while maintaining oncologic control. The Treatment Response-Adapted Risk Index (RAIRI) model recently demonstrated 35.90% absolute benefit from adjuvant chemotherapy in high-risk patients versus no benefit in low-risk groups, validating the clinical utility of precise risk stratification (41). Integration of our MTDL model into prospective clinical trials could enable personalized treatment selection, optimizing the balance between efficacy and toxicity.
While these results are encouraging, it is essential to acknowledge that the clinical readiness of the MTDL model requires further validation. Current literature consistently demonstrates that deep learning models showing high accuracy in internal validation cohorts often experience significant performance degradation when applied to external datasets from different institutions or populations. The model’s ability to change treatment decisions for the identified high-risk group remains theoretical at this stage; prospective randomized trials comparing outcomes between model-guided and standard treatment selection are needed to establish definitive clinical benefit. Additionally, the practical implementation of PET-CT-based deep learning models requires standardized imaging protocols and computational infrastructure that may not be universally available, potentially limiting the model’s accessibility in resource-limited settings.
Several limitations warrant consideration. First, and most critically, the absence of an independent external validation cohort represents a major limitation of this study. According to the TRIPOD reporting guidelines, our model was validated using a randomly split sample from a single institution (type 2a internal validation), which provides limited evidence of generalizability compared to validation on geographically or temporally distinct datasets. External validation in independent, multi-center cohorts is essential before any clinical implementation can be considered. Performance degradation of 5.00% to 10.00% is commonly observed during external validation of deep learning models, and our reported C-indices should be interpreted with this caveat (42-44). Second, the retrospective single-center design and modest sample size (n=200) raise concerns about overfitting, particularly given the complexity of the 3D ResNet-34 architecture with millions of trainable parameters. Although we employed multiple regularization strategies including MTL, data augmentation, dropout, and five-fold cross-validation, a formal sample size estimation or power analysis was not performed a priori. Third, imaging heterogeneity, including variations in PET-CT scanners, reconstruction algorithms, and segmentation approaches, may impact feature reliability and model performance. Standardized imaging protocols and federated learning approaches could mitigate these concerns while preserving data privacy (45-47). Fourth, the median follow-up of 36 months, while adequate for 3-year survival analysis, limits assessment of long-term prognostic accuracy beyond 5 years. Fifth, the feature selection process relied on the deep learning architecture’s inherent feature learning capabilities rather than a priori statistical screening [e.g., least absolute shrinkage and selection operator (LASSO) regression], which may introduce potential bias in smaller datasets.
Conclusions
In conclusion, we developed and internally validated an MTDL model integrating PET-CT imaging, clinical variables, and EBV DNA for prognostic prediction in LA-NPC. The model demonstrated superior performance compared to TNM staging and single-task approaches in an internal validation setting, with clinically meaningful risk stratification and interpretable predictions through Grad-CAM visualization. However, the absence of external validation limits the current evidence for clinical applicability, and these findings should be considered preliminary. Within that limit, they support the potential of multi-modal MTDL to enhance personalized treatment planning. Future work should prioritize prospective external validation in independent multi-center cohorts, followed by integration into clinical decision support systems and evaluation in randomized trials of risk-adapted therapy to definitively establish clinical utility and patient benefit.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2705/rc
Data Sharing Statement: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2705/dss
Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2705/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1-2705/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Meizhou People’s Hospital (No. 2024-C-20). Given the retrospective nature of the study utilizing de-identified data, the requirement for informed consent was waived by the ethics committee.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Chen YP, Chan ATC, Le QT, et al. Nasopharyngeal carcinoma. Lancet 2019;394:64-80. [Crossref] [PubMed]
- Zhang Y, Rumgay H, Li M, et al. Nasopharyngeal Cancer Incidence and Mortality in 185 Countries in 2020 and the Projected Burden in 2040: Population-Based Global Epidemiological Profiling. JMIR Public Health Surveill 2023;9:e49968. [Crossref] [PubMed]
- Sun Y, Li WF, Chen NY, et al. Induction chemotherapy plus concurrent chemoradiotherapy versus concurrent chemoradiotherapy alone in locoregionally advanced nasopharyngeal carcinoma: a phase 3, multicentre, randomised controlled trial. Lancet Oncol 2016;17:1509-20. [Crossref] [PubMed]
- Yang Q, Cao SM, Guo L, et al. Induction chemotherapy followed by concurrent chemoradiotherapy versus concurrent chemoradiotherapy alone in locoregionally advanced nasopharyngeal carcinoma: long-term results of a phase III multicentre randomised controlled trial. Eur J Cancer 2019;119:87-96. [Crossref] [PubMed]
- Liu F, Jin T, Liu L, et al. The role of concurrent chemotherapy for stage II nasopharyngeal carcinoma in the intensity-modulated radiotherapy era: A systematic review and meta-analysis. PLoS One 2018;13:e0194733. [Crossref] [PubMed]
- Pan JJ, Ng WT, Zong JF, et al. Proposal for the 8th edition of the AJCC/UICC staging system for nasopharyngeal cancer in the era of intensity-modulated radiotherapy. Cancer 2016;122:546-58.
- Du XJ, Wang GY, Zhu XD, et al. Refining the 8th edition TNM classification for EBV related nasopharyngeal carcinoma. Cancer Cell 2024;42:464-473.e3.
- Pan JJ, Mai HQ, Ng WT, et al. Ninth Version of the AJCC and UICC Nasopharyngeal Cancer TNM Staging Classification. JAMA Oncol 2024;10:1627-35. [Crossref] [PubMed]
- Jiang Y, Qu S, Pan X, et al. Prognostic Nomogram For Locoregionally Advanced Nasopharyngeal Carcinoma. Sci Rep 2020;10:861. [Crossref] [PubMed]
- Hsu CL, Chang KP, Lin CY, et al. Plasma Epstein-Barr virus DNA concentration and clearance rate as novel prognostic factors for metastatic nasopharyngeal carcinoma. Head Neck 2012;34:1064-70. [Crossref] [PubMed]
- Yang SS, Wu YS, Chen WC, et al. Benefit of [18F]-FDG PET/CT for treatment-naïve nasopharyngeal carcinoma. Eur J Nucl Med Mol Imaging 2022;49:980-91. [Crossref] [PubMed]
- Chan SC, Chang JT, Lin CY, et al. Clinical utility of 18F-FDG PET parameters in patients with advanced nasopharyngeal carcinoma: predictive role for different survival endpoints and impact on prognostic stratification. Nucl Med Commun 2011;32:989-96. [Crossref] [PubMed]
- Lin J, Xie G, Liao G, et al. Prognostic value of 18F-FDG-PET/CT in patients with nasopharyngeal carcinoma: a systematic review and meta-analysis. Oncotarget 2017;8:33884-96. [Crossref] [PubMed]
- Li Q, Zhang J, Cheng W, et al. Prognostic value of maximum standard uptake value, metabolic tumor volume, and total lesion glycolysis of positron emission tomography/computed tomography in patients with nasopharyngeal carcinoma: A systematic review and meta-analysis. Medicine (Baltimore) 2017;96:e8084. [Crossref] [PubMed]
- Moon SH, Hyun SH, Choi JY. Prognostic significance of volume-based PET parameters in cancer patients. Korean J Radiol 2013;14:1-12. [Crossref] [PubMed]
- Peng H, Guo R, Chen L, et al. Prognostic Impact of Plasma Epstein-Barr Virus DNA in Patients with Nasopharyngeal Carcinoma Treated using Intensity-Modulated Radiation Therapy. Sci Rep 2016;6:22000. [Crossref] [PubMed]
- Li W, Chen J, Liang B, et al. Long-term monitoring of dynamic changes in plasma EBV DNA for improved prognosis prediction of nasopharyngeal carcinoma. Cancer Med 2021;10:883-94. [Crossref] [PubMed]
- Zhu L, Ouyang T, Xiong Y, et al. Prognostic Value of Plasma Epstein-Barr Virus DNA Levels Pre- and Post-Neoadjuvant Chemotherapy in Patients With Nasopharyngeal Carcinoma. Front Oncol 2021;11:714433. [Crossref] [PubMed]
- Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115-8. [Crossref] [PubMed]
- Poplin R, Varadarajan AV, Blumer K, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng 2018;2:158-64. [Crossref] [PubMed]
- Ardila D, Kiraly AP, Bharadwaj S, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:954-61. [Crossref] [PubMed]
- Li S, Deng YQ, Hua HL, et al. Deep learning for locally advanced nasopharyngeal carcinoma prognostication based on pre- and post-treatment MRI. Comput Methods Programs Biomed 2022;219:106785. [Crossref] [PubMed]
- Qiang M, Li C, Sun Y, et al. A Prognostic Predictive System Based on Deep Learning for Locoregionally Advanced Nasopharyngeal Carcinoma. J Natl Cancer Inst 2021;113:606-15. [Crossref] [PubMed]
- Zhong L, Dong D, Fang X, et al. A deep learning-based radiomic nomogram for prognosis and treatment decision in advanced nasopharyngeal carcinoma: A multicentre study. EBioMedicine 2021;70:103522. [Crossref] [PubMed]
- Schäfer R, Nicke T, Höfener H, et al. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nat Comput Sci 2024;4:495-509. [Crossref] [PubMed]
- Ruder S. An overview of multi-task learning in deep neural networks. arXiv:1706.05098. 2017. Available online: https://doi.org/10.48550/arXiv.1706.05098
- Kim S, Thomas GP, Chris M. Cross-Task Attention Network: Improving Multi-Task Learning for Medical Imaging Applications. arXiv:2309.03837. 2023. Available online: https://www.semanticscholar.org/paper/Cross-Task-Attention-Network%3A-Improving-Multi-Task-Kim-Purdie/2d2747360dda1ffadb82050891da0c1bae144983
- Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis 2020;128:336-59.
- Adebayo J, Gilmer J, Muelly M, et al. Sanity checks for saliency maps. arXiv:1810.03292. 2018. Available online: https://doi.org/10.48550/arXiv.1810.03292
- Holzinger A, Biemann C, Pattichis CS, et al. What do we need to build explainable AI systems for the medical domain? arXiv:1712.09923. 2017. Available online: https://doi.org/10.48550/arXiv.1712.09923
- Ng WT, Chow JCH, Beitler JJ, et al. Current Radiotherapy Considerations for Nasopharyngeal Carcinoma. Cancers (Basel) 2022;14:5773. [Crossref] [PubMed]
- Zheng XK, Chen LH, Wang QS, et al. Influence of FDG-PET on computed tomography-based radiotherapy planning for locally recurrent nasopharyngeal carcinoma. Int J Radiat Oncol Biol Phys 2007;69:1381-8. [Crossref] [PubMed]
- Li H, Kong Z, Xiang Y, et al. The role of PET/CT in radiotherapy for nasopharyngeal carcinoma. Front Oncol 2022;12:1017758. [Crossref] [PubMed]
- Gu B, Meng M, Bi L, et al. Prediction of 5-year progression-free survival in advanced nasopharyngeal carcinoma with pretreatment PET/CT using multi-modality deep learning-based radiomics. Front Oncol 2022;12:899351. [Crossref] [PubMed]
- European Commission. Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain union legislative acts. COM/2021/206 final. 2021. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:52021PC0206
- Hurkmans C, Bibault JE, Brock KK, et al. A joint ESTRO and AAPM guideline for development, clinical validation and reporting of artificial intelligence models in radiation therapy. Radiother Oncol 2024;197:110345. [Crossref] [PubMed]
- Tejani AS, Klontzas ME, Gatti AA, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update. Radiol Artif Intell 2024;6:e240300. [Crossref] [PubMed]
- Zhang Y, Chen L, Hu GQ, et al. Gemcitabine and Cisplatin Induction Chemotherapy in Nasopharyngeal Carcinoma. N Engl J Med 2019;381:1124-35. [Crossref] [PubMed]
- Li WF, Chen L, Sun Y, et al. Induction chemotherapy for locoregionally advanced nasopharyngeal carcinoma. Chin J Cancer 2016;35:94. [Crossref] [PubMed]
- Bossi P, Chan AT, Licitra L, et al. Nasopharyngeal carcinoma: ESMO-EURACAN Clinical Practice Guidelines for diagnosis, treatment and follow-up†. Ann Oncol 2021;32:452-65. [Crossref] [PubMed]
- Liu Y, Yan W, Chen Y, et al. Treatment response-adapted risk index model for survival prediction and adjuvant chemotherapy selection in nonmetastatic nasopharyngeal carcinoma. NPJ Digit Med 2025;8:564. [Crossref] [PubMed]
- Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 2016;69:245-7. [Crossref] [PubMed]
- Bradshaw TJ, Huemann Z, Hu J, et al. A Guide to Cross-Validation for Artificial Intelligence in Medical Imaging. Radiol Artif Intell 2023;5:e220232. [Crossref] [PubMed]
- Park SH, Han K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 2018;286:800-9. [Crossref] [PubMed]
- Kaissis GA, Makowski MR, Rückert D, et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2020;2:305-11.
- Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. NPJ Digit Med 2020;3:119. [Crossref] [PubMed]
- Xu J, Glicksberg BS, Su C, et al. Federated Learning for Healthcare Informatics. J Healthc Inform Res 2021;5:1-19. [Crossref] [PubMed]

