Original Article

A multimodal fusion model for bone tumor benign and malignant diagnosis: development and validation with clinical text and radiographs

Ju Zeng1, Qiuchi Chen2, Tao Zhang1, Decui Liang2, Dongming Li1

1Department of Medical Imaging, Sichuan Orthopedics Hospital, Chengdu, China; 2School of Economics and Management, University of Electronic Science and Technology of China, Chengdu, China

Contributions: (I) Conception and design: J Zeng, Q Chen, D Liang; (II) Administrative support: J Zeng, T Zhang, D Li; (III) Provision of study materials or patients: J Zeng, T Zhang, D Li; (IV) Collection and assembly of data: J Zeng, T Zhang, D Li; (V) Data analysis and interpretation: Q Chen, D Liang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Ju Zeng, Master of Medicine. Department of Medical Imaging, Sichuan Orthopedics Hospital, No. 132, West Section 1, First Ring Road, Chengdu 610072, China. Email: 326210689@qq.com.

Background: Bone tumors have diverse clinical and imaging features, rendering preoperative differentiation of benign and intermediate/malignant types challenging. Unimodal methods (medical records or X-rays alone) are prone to misdiagnosis or missed diagnosis due to incomplete information. While postoperative histopathology is the gold standard, there is an urgent clinical demand for a precise preoperative diagnostic tool. This study aims to develop and validate a multimodal model integrating deep learning with Dempster-Shafer (DS) evidence theory for the differential diagnosis of benign and intermediate/malignant bone tumors. Using postoperative histopathology as the reference standard, the model performs diagnosis by integrating preoperative clinical text and radiographs.

Methods: This single-center retrospective study included 319 pathologically confirmed bone tumor patients admitted between 2020 and 2025 following selection criteria. Utilizing the patients’ X-ray images and medical record text data, we constructed a fusion model based on deep learning and DS evidence theory to classify tumors into benign and intermediate/malignant categories. The performance of the model was evaluated using the receiver operating characteristic (ROC) curve along with its 95% confidence interval (CI).

Results: The dataset comprised text data and radiographs from a total of 319 patients and it was stratified by time into a training set, an internal validation set, and an external validation set. On the internal validation set, the fusion model achieved an area under the curve (AUC) of 0.821 (95% CI: 0.713–0.916), with an accuracy of 81.6%, precision of 81.3%, recall of 76.5% and an F1 score of 78.8%, outperforming both the unimodal text model with an AUC of 0.814 and accuracy of 77.6% and the image model with an AUC of 0.782 and accuracy of 72.4%. On the external validation set, the fusion model maintained robust performance: AUC reached 0.808 (95% CI: 0.667–0.928), accuracy 77.3%, and F1 score 70.6%. Compared to the proposed fusion approach, most baseline models underperformed across all metrics, with their accuracy ranging from 59.1% to 77.3% and F1 score ranging from 47.1% to 70.6%. Furthermore, the model's diagnostic performance rivals that of senior radiologists and significantly outperforms junior radiologists. McNemar’s test results confirmed no significant difference in diagnostic performance between the model and senior radiologists, while a statistically significant performance gap existed between junior and senior radiologists.

Conclusions: We have developed and validated a fusion model that integrated deep learning and DS evidence theory. In the task of distinguishing between benign and intermediate/malignant bone tumors, this fusion model demonstrated encouraging performance compared to models that utilize unimodal data and other baseline fusion models.

Keywords: Bone neoplasms; deep learning; Dempster-Shafer evidence theory (DS evidence theory); multimodal fusion; radiography


Submitted Sep 02, 2025. Accepted for publication Dec 16, 2025. Published online Feb 02, 2026.

doi: 10.21037/tcr-2025-1832


Highlight box

Key findings

• Preoperative differentiation of benign and intermediate/malignant bone tumors remains clinically challenging due to diverse clinical and imaging features, with unimodal diagnostic methods prone to misjudgment. A multimodal fusion model integrating deep learning and Dempster-Shafer (DS) evidence theory demonstrates robust diagnostic performance, which is statistically indistinguishable from that of senior radiologists and superior to unimodal models and baseline fusion approaches, providing reliable preoperative support.

What is known and what is new?

• Early identification of bone tumor malignancy is critical for treatment and prognosis; unimodal diagnosis is limited by incomplete information, and multimodal fusion is promising but lacks effective cross-modal conflict resolution.

• This study innovatively applies DS evidence theory to fuse clinical text and X-ray data, resolving inter-modal uncertainty, offering a practical multimodal diagnostic framework for bone tumors.

What is the implication, and what should change now?

• The model’s performance comparable to senior radiologists confirms that synthesizing clinical text and imaging data via DS evidence theory effectively mitigates single-modal limitations. It addresses the unmet need for consistent, reliable preoperative bone tumor classification.

• Clinical practice could incorporate this multimodal model into routine preoperative bone tumor evaluation workflows, particularly to assist junior radiologists in decision-making. It should be used as an auxiliary tool to reduce misdiagnosis risks without replacing physician judgment, as it enhances clinical decision-making efficiency.


Introduction

Bone tumors impair patients’ quality of life, can be life-threatening, and represent a major public health problem worldwide (1). According to the World Health Organization (WHO) soft tissue and bone tumor classification (2020), bone tumors comprise 68 entities in 8 categories and are graded as benign, intermediate, or malignant according to their invasiveness (2). Because benign and malignant bone tumors differ biologically, their treatments also differ. The treatment of benign bone tumors mainly relies on monitoring, local resection, or regional curettage (3,4), whereas malignant bone tumors usually require combination treatment, including radical surgery, immunotherapy, chemotherapy, and radiotherapy (5). Malignant bone tumors progress relatively rapidly and seriously threaten patients’ lives (6). Therefore, early judgment of whether a bone tumor is benign or malignant is crucial for clinical decision-making and patient prognosis.

Although pathology is the gold standard for bone tumor diagnosis, imaging-pathological correlation remains an important part of tumor evaluation and is critical to minimizing diagnostic errors and achieving optimal clinical outcomes (7). Imaging examinations play an irreplaceable role in lesion discovery, evaluation, and differentiation between benign and malignant tumors. A comprehensive diagnostic workflow typically includes X-ray, computed tomography (CT), and magnetic resonance imaging (MRI) of the affected region. X-ray examination has emerged as the preferred imaging method for bone tumor diagnosis due to its simplicity, cost-effectiveness, and widespread availability; it can display the location, pattern of bone destruction, and periosteal reaction of bone lesions (8). Nevertheless, the imaging manifestations of bone tumors are diverse and complex, and X-ray examination provides only a two-dimensional, overlapping projection, which limits its ability to display detailed lesion characteristics. Consequently, radiologists with limited diagnostic experience may be prone to misjudgment. In recent years, machine learning and deep learning have shown promising results in assisted imaging diagnosis and hold significant development potential, although they also face certain challenges (9,10). Notably, current artificial intelligence (AI) models for bone tumor classification primarily focus on a single data modality, overlooking the broader clinical context, which inevitably diminishes their potential. The integration of diverse data modalities presents opportunities to enhance the robustness and accuracy of diagnostic and prognostic models, thereby bringing AI closer to practical clinical application (11). Therefore, our research aimed to utilize clinical text data and X-ray image data from bone tumor patients to construct a multimodal fusion model based on deep learning and Dempster-Shafer evidence theory (abbreviated as DS evidence theory or DS theory) for distinguishing benign from intermediate/malignant tumors. DS evidence theory accounts for decision uncertainty and synthesizes knowledge from different data sources. We compared this fusion model with single-modal models and other baseline models, and we also explored the impact of model hyperparameters on performance. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1832/rc).


Methods

Study patients and data sets

This retrospective study was approved by the institutional review board of Sichuan Orthopedics Hospital (No. KY2025-031-01). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments and individual consent for this analysis was waived due to the retrospective nature.

This study retrospectively retrieved data from patients who underwent postoperative pathological examinations and were diagnosed with bone tumors at Sichuan Orthopedics Hospital from 2020 to 2025, totaling 452 patients. The inclusion criteria were as follows: (I) patients with benign, intermediate, or malignant bone tumors confirmed by postoperative pathological examination; (II) availability of high-quality X-ray images appropriate for diagnosis; (III) no needle biopsy or invasive treatment before the imaging examinations; (IV) complete admission record text data. The exclusion criteria were: (I) poor-quality X-ray images unsuitable for diagnosis; (II) an interval exceeding three months between imaging examinations and surgery; (III) patients with recurrent bone tumors. After screening against these criteria, a total of 319 cases were enrolled in this study. The dataset consists of radiographs of the relevant bone tumor regions and admission records for all patients. The admission records contain sequential text data such as chief complaints, history of present illness, past medical history, personal history, and family history, along with demographic information including gender, age, and vital signs. The clinical text used for modeling consisted of the chief complaint, history of present illness, past medical history, and demographic information.

Prior to study initiation, we performed a sample size calculation based on the expected diagnostic performance of the multimodal model. Referring to previous studies on AI-assisted bone tumor diagnosis (12,13) and consulting with clinicians, we set the clinically acceptable threshold p0 at 70% and, based on anticipated experimental results, the expected sensitivity p1 at 80%. The significance level α was set at 0.05 for a one-tailed test with a power of 0.8. The formula for the pre-specified sample size (one-tailed test) is as follows:

n = \frac{\left( Z_{1-\alpha}\sqrt{p_0(1-p_0)} + Z_{1-\beta}\sqrt{p_1(1-p_1)} \right)^2}{(p_1 - p_0)^2}

where n denotes the required sample size, Z_{1−α} and Z_{1−β} are the standard normal quantiles corresponding to the significance level and the test power, p0 is the clinically acceptable threshold, and p1 is the expected sensitivity.

By substituting the preset parameters into the calculation, the results indicate that the sample size for validation is approximately 118 samples.
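As a quick check, the calculation above can be reproduced in a few lines of Python. The snippet below is a minimal sketch using the stated parameters (p0 = 0.70, p1 = 0.80, α = 0.05, power = 0.8); it is not the authors' analysis code, and the exact result depends on how the normal quantiles are rounded.

```python
# Minimal sketch of the pre-specified sample size calculation (one-tailed,
# single-proportion test). Parameter values follow the study description.
from math import sqrt
from scipy.stats import norm

p0, p1 = 0.70, 0.80        # clinically acceptable threshold and expected sensitivity
alpha, power = 0.05, 0.80  # one-tailed significance level and test power

z_alpha = norm.ppf(1 - alpha)  # ~1.645
z_beta = norm.ppf(power)       # ~0.842

n = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) ** 2 / (p1 - p0) ** 2
print(n)  # ~118.9, i.e., roughly 118-119 validation cases depending on rounding
```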

To assess the model’s robustness and generalization capability, the dataset was divided by time period to simulate real clinical scenarios: 199 patients diagnosed between 2020 and 2023 formed the training set, while 76 patients diagnosed in 2024 constituted the internal validation set. Finally, 44 patients diagnosed in 2025 served as the external validation set for the final model performance evaluation. The combined internal and external validation sets comprise 120 cases in total, exceeding the previously calculated sample size of 118 cases, indicating that the sample size for model evaluation was adequate. Subsequently, the data underwent appropriate preprocessing, followed by training and testing of the classification models. Figure 1 illustrates the workflow diagram of this study.

Figure 1 Flowchart used in this study.

Model establishment

Establishment of text classification model

In the text classification model, we employed the pre-trained model “RoBERTa-wwm-ext” (robustly optimized BERT approach with whole word masking and extended data) (14), which was pre-trained on a Chinese corpus. The learning rate (lr) was set to 0.001. Additionally, considering the limited number of training samples, we implemented dropout and early stopping to prevent overfitting, with the dropout rate configured to 0.2. The adaptive moment estimation with weight decay (AdamW) optimizer was utilized, with a batch size of 32 and a hidden layer size of 128 in the fully connected layer. The model was run in an environment featuring PyTorch version 2.3.1, Python version 3.10, and an NVIDIA GeForce RTX 3050 4GB laptop GPU with Compute Unified Device Architecture (CUDA) version 12.0.

First, for individual characteristics such as gender and age, we adopted one-hot encoding to obtain the corresponding feature vectors, succinctly preserving their categorical information. Second, for long-form text such as the history of present illness and past medical history, we performed regular expression-based cleaning operations to eliminate redundant information and retain only the pivotal content. Subsequently, the cleaned text was used to generate text vectors via RoBERTa-wwm-ext. The one-hot encoded vectors were then concatenated with the long-text vectors and fed into the fully connected layer. Finally, the probability distribution over benign and malignant classes was output through the Softmax function, realizing the binary classification of benign and intermediate/malignant types from the medical record texts.

Figure 2 Framework of text classification model. BERT, bidirectional encoder representations from transformers; CLS, classification token; ESP, end of sentence piece; RoBERTa-wwm-ext, robustly optimized BERT approach with whole word masking and extended data; Trm, transformer.
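For readers who wish to reproduce the structure described above, the following sketch outlines the text branch in PyTorch. The Hugging Face checkpoint name ("hfl/chinese-roberta-wwm-ext"), the class name, and the feature dimensions are our assumptions for illustration; only the design (one-hot demographics concatenated with the RoBERTa [CLS] vector, a 128-unit fully connected layer, dropout of 0.2, and a Softmax output) follows the description above.

```python
# Illustrative sketch of the text branch: one-hot demographics concatenated with
# a RoBERTa-wwm-ext [CLS] embedding, followed by a fully connected layer and Softmax.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class TextClassifier(nn.Module):
    def __init__(self, n_demo_features: int, hidden_size: int = 128, dropout: float = 0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")  # assumed checkpoint
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size + n_demo_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),   # benign vs. intermediate/malignant
        )

    def forward(self, input_ids, attention_mask, demo_onehot):
        # [CLS] vector of the report text, concatenated with the one-hot demographics
        cls_vec = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state[:, 0]
        fused = torch.cat([self.dropout(cls_vec), demo_onehot], dim=1)
        return torch.softmax(self.fc(fused), dim=1)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
```

Training of such a branch would then use the AdamW optimizer with a learning rate of 0.001, a batch size of 32, and early stopping, as stated above.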

Establishment of image classification model

In the image classification model, we utilized the pre-trained weights of “YOLOv11-cls-m” (15), the classification variant of a popular object detection and image segmentation framework. The learning rate was set to 0.0001 and, as in the text classification model, the dropout rate was set to 0.2 and early stopping was used to mitigate overfitting. The optimizer was AdamW, with a batch size of 8. The model was executed within the default environment (Python 3.10.14, torch 2.4.0) of the Kaggle server on a Tesla T4 GPU (15095 MiB).

Initially, all preoperative X-ray images were converted from DICOM to PNG format and uniformly resized to a fixed input dimension of 640×640 pixels. The You Only Look Once (YOLO) framework automatically normalized pixel values to the range of 0 to 1 by dividing them by 255. It also adopted letterboxing, which maintains the aspect ratio of images by padding with gray pixels, to avoid feature distortion during resizing. Due to the limited amount of image data, we then applied data augmentation to the training set samples, including rotation, flipping, brightness adjustment, and contrast adjustment, which substantially expanded the training set. These pre-processed images were fed into the backbone of YOLOv11, located in the left region of Figure 3. The backbone consists of multiple convolutional layers, cross stage partial kernel-2 (C3k2) modules for feature fusion, a spatial pyramid pooling fast (SPPF) module with an enhanced pooling layer for aggregating multi-scale features of bone tumor lesions, and a cross stage partial with pyramid squeeze attention (C2PSA) module. The C2PSA attention module specifically enhances the model’s focus on diagnostically critical regions of bone tumors. As the core component of the architecture, the backbone generates high-dimensional image feature maps through the sequential operation of these components and transmits them to the neck component, shown in the middle region of Figure 3. In the neck, the input feature maps undergo upsampling and concatenation operations to generate multi-scale feature maps that retain high-fidelity semantic information of both small and large bone tumor lesions. Finally, the head component, located in the right region, outputs the probability distribution over benign and intermediate/malignant categories and presents the corresponding binary classification results. For clarity, the layout and functional logic of the entire image classification model are fully illustrated in Figure 3.

Figure 3 Framework of image classification model. C2PSA, cross stage partial with pyramid squeeze attention; C3k2, cross stage partial kernel-2; MR, magnetic resonance; SPPF, spatial pyramid pooling fast; YOLO, You Only Look Once.
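A minimal sketch of the image branch with the Ultralytics YOLO API is given below. The weight file name, dataset folder layout, and epoch count are illustrative assumptions; the 640×640 input size, batch size of 8, dropout of 0.2, AdamW optimizer, learning rate of 0.0001, and early stopping follow the description above, and letterboxing plus 0-1 normalization are handled internally by the framework.

```python
# Sketch of the image classification branch using the Ultralytics YOLO API.
from ultralytics import YOLO

model = YOLO("yolo11m-cls.pt")            # pre-trained YOLOv11 classification weights (assumed file name)
model.train(
    data="bone_tumor_cls/",               # assumed layout: train/{benign,malignant}, val/{benign,malignant}
    imgsz=640, batch=8, epochs=100,       # epoch count is illustrative
    optimizer="AdamW", lr0=1e-4, dropout=0.2,
    patience=20,                          # early stopping on the validation metric
)

result = model.predict("case_001.png")[0]  # letterboxing and normalization handled internally
print(result.names, result.probs.data)     # class names and predicted class probabilities
```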

Establishment of fusion model

In this section, we employed DS evidence theory for multimodal fusion. DS evidence theory (16) is a method for reasoning under uncertainty: it can effectively model the uncertainty of evidence and provides a flexible reasoning framework. DS theory is built on a frame of discernment Θ = {A_1, A_2, ..., A_j, ..., A_n}, where A_j is an element or event of Θ. The basic probability assignment (BPA), denoted m(A), satisfies: (I) m: 2^Θ → [0,1]; (II) Σ_{A⊆Θ} m(A) = 1; (III) m(∅) = 0. The belief function Bel(A) = Σ_{B⊆A} m(B) represents the degree of certainty supported by the evidence, and the plausibility function Pl(A) = Σ_{B∩A≠∅} m(B) indicates the degree of belief that is not refuted. Different independent evidence sources have different basic probability assignment functions. The DS combination rule uses the orthogonal sum to combine them into a new one, defined as follows:

m(A) = \frac{1}{1-k} \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = A} m_1(A_1)\, m_2(A_2) \cdots m_n(A_n)

k = \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = \emptyset} m_1(A_1)\, m_2(A_2) \cdots m_n(A_n)

where k is the conflict coefficient, with k closer to 1 indicating a higher level of conflict between evidence sources, and closer to 0 indicating that the evidence sources agree with each other.

After obtaining the predicted probabilities of malignancy and benignancy from the text classification model and the image classification model respectively, this study used DS theory to fuse the results of the two evidence sources. The two models were regarded as two evidence sources, and the frame of discernment consisted of two mutually exclusive events: benign and intermediate/malignant. The probability distributions output by the models were used to calculate the conflict coefficient k and the fused probabilities of malignancy P_FM and benignancy P_FB. Since the conflict coefficient reflected the inconsistency between the two evidence sources and affected the credibility of the fused probability, we introduced the idea of three-way decisions and set a threshold ε on the conflict value. For samples with a conflict value greater than ε, judgment was deferred, and the DS theory based on Deng entropy (17) was adopted to reduce the conflict between the two evidence sources until k<ε. The complete framework of the proposed multimodal classification model for distinguishing benign from intermediate/malignant bone tumors is illustrated in Figure 4. The left region displays the probability outputs of the two unimodal classification models. The top right region presents the evidence source inputs, while the middle region shows the conflict calculation and threshold determination. The bottom region presents the final fused probability output, which is used for the classification decision between benign and intermediate/malignant bone tumors.

Figure 4 Flowchart of multimodal fusion classification model for bone tumors. C2PSA, cross stage partial with pyramid squeeze attention; C3k2, cross stage partial kernel-2; DR, digital radiography; SPPF, spatial pyramid pooling fast; YOLO, You Only Look Once.
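The late-fusion step itself reduces to a few lines once the two unimodal probability distributions are available. The sketch below illustrates the combination rule and the conflict threshold ε = 0.5 on the two-class frame {benign, intermediate/malignant}; the Deng-entropy-based adjustment applied to high-conflict samples is indicated only as a placeholder, and all function and variable names are ours.

```python
# Illustrative DS fusion of the two unimodal outputs over the frame
# {benign, malignant}; probabilities are used directly as BPAs.
def ds_fuse(p_text, p_img, epsilon=0.5):
    """p_text / p_img: dicts {'benign': float, 'malignant': float} summing to 1."""
    # Conflict coefficient k: total mass assigned to contradictory pairs
    k = (p_text['benign'] * p_img['malignant'] +
         p_text['malignant'] * p_img['benign'])
    # High conflict: the study defers judgment and adjusts the BPAs with a
    # Deng-entropy-based scheme before re-fusing (not implemented here).
    defer = k > epsilon
    norm = 1.0 - k
    fused = {'benign': p_text['benign'] * p_img['benign'] / norm,
             'malignant': p_text['malignant'] * p_img['malignant'] / norm}
    return fused, k, defer

# Numbers from the case study reported later in the article:
# text model 0.95 malignant, image model 0.69 benign.
fused, k, defer = ds_fuse({'benign': 0.05, 'malignant': 0.95},
                          {'benign': 0.69, 'malignant': 0.31})
print(round(k, 3), defer, {c: round(v, 3) for c, v in fused.items()})
# -> 0.671 True {'benign': 0.105, 'malignant': 0.895}
```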

Other baseline models

For the comparison, the baseline methods in this study are summarized as follows:

  • Mid-term fusion model: the mid-term fusion model involves decomposing modalities into feature vectors, performing vector fusion, and then training the classifier. The text vectors encoded by Bidirectional Encoder Representations from Transformers Base Chinese (BERT-Base-Chinese) and the image vectors encoded by YOLO are first mapped into the same space and then concatenated. Subsequently, a fully-connected layer is added to conduct the classification of the malignancy and benignancy of bone tumors.
  • ConVIRT (18): ConVIRT is a method for learning joint representations from paired medical images and texts through contrastive learning. It employs a two-tower structure to encode images and texts separately. By means of a multimodal contrastive loss, it maximizes the similarity between matched image-text pairs while minimizing the similarity of mismatched pairs. The method is self-supervised and does not rely on annotated data, thus enhancing the model’s ability to understand the semantic relationships between images and texts.
  • GLoRIA (19): this method proposes a novel framework to enhance the performance of medical image recognition by integrating global and local features. The GLoRIA framework combines global image features with local medical text descriptions through a contrastive learning mechanism and label-efficiency optimization. In this way, it can effectively learn more abundant image-text correlations, thereby improving the accuracy and robustness of tasks such as medical image classification and retrieval.
  • CLIP (20): CLIP is a multimodal learning model that maps images and texts into the same shared representation space through contrastive learning. It is trained with a vast number of image-text pairs, enabling the model to understand the semantic relationships between images and texts. As a result, CLIP demonstrates remarkable versatility and flexibility, making it applicable to a wide range of tasks in the field of multimodal data processing.
  • Random forest (21) + YOLO: random forest is an ensemble learning algorithm that aggregates classification results from multiple decision trees and outputs the final category through voting. First, discrete text fields are merged into unified text, and high-dimensional text features are extracted via a binary bag-of-words model to reduce model complexity and accommodate small samples. An ensemble model comprising 100 decision trees is constructed, with a maximum depth of 10 to prevent overfitting. This leverages ensemble learning’s robustness to capture nonlinear relationships between features and labels. The resulting text classification performance is then fused with image classification results using DS fusion.
  • Support vector machines (SVMs) (22) + YOLO: SVMs are a classification method based on the maximum-margin hyperplane. Their linear kernel adapts well to high-dimensional sparse clinical text features. After the processed text features are input into the SVM, text classification results are obtained and then combined with the image classification results through DS fusion. A minimal sketch of these two classical text baselines is given after this list.
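The two classical text baselines can be sketched with scikit-learn as follows. The toy, pre-segmented example texts and variable names are placeholders (real Chinese clinical text would first be word-segmented, e.g., with jieba); the binary bag-of-words features, the 100-tree random forest with a maximum depth of 10, and the linear-kernel SVM follow the descriptions above, and the resulting probabilities would then enter the DS fusion step together with the image model's outputs.

```python
# Sketch of the Random forest + YOLO and SVM + YOLO text baselines.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy, pre-segmented admission-record snippets (placeholders; 1 = intermediate/malignant)
train_texts = ["左肱骨 中段 疼痛 肿胀", "右股骨 远端 无痛 包块", "胫骨 夜间 疼痛 加重", "腓骨 偶发 隐痛"]
y_train = [1, 0, 1, 0]
val_texts = ["肱骨 皮质 破坏 疼痛"]

vectorizer = CountVectorizer(binary=True)        # binary bag-of-words features
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X_train, y_train)
svm = SVC(kernel="linear", probability=True, random_state=0).fit(X_train, y_train)

p_rf = rf.predict_proba(X_val)    # class order given by rf.classes_
p_svm = svm.predict_proba(X_val)  # these probabilities feed the DS fusion with the image model
```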

Sensitivity analysis

To evaluate the sensitivity of the model to key hyperparameters, this study employed a single-factor sensitivity analysis, adjusting one parameter at a time while keeping all other parameters fixed, thereby exploring the impact of each parameter on model performance. The specific procedure was as follows. First, the lr sensitivity analysis was conducted: for the image classification model, the lr was set to 10^−3 and 10^−5, while the other parameters remained in their initial configurations; for the text classification model, the lr was set to 10^−4 and 10^−5, with the other parameters retained in their original settings. Next, the sensitivity analysis of the dropout parameter was performed: for the image classification model, the dropout rate was adjusted to 0.5, and a similar adjustment was made for the text classification model.

Model evaluation

For each prediction model, the receiver operating characteristic (ROC) curve was plotted and the area under the curve (AUC) along with its 95% confidence interval (CI) were calculated. We utilized the metrics of accuracy, precision, recall, specificity and F1 score to assess the performance of the model. Accuracy is defined as the ratio of the number of correctly classified samples in the dataset to the total number of samples. Precision reflects the model’s ability to distinguish positive samples in its predictions: a higher precision indicates that, among all samples predicted as positive, a larger proportion truly belongs to the positive class. Recall reflects the model's ability to identify actual positive samples; a higher recall means that the model identifies as many of the actual positive samples in the dataset as possible. Specificity denotes the model’s capacity to accurately identify true negative samples. A higher specificity indicates that the model can effectively exclude the majority of true negative cases, thereby avoiding the misdiagnosis of benign tumors as malignant and alleviating unnecessary psychological distress and economic burden for patients with benign lesions. The F1 score is the harmonic mean of precision and recall; a higher F1 score indicates a better balance between the two. All metrics were averages obtained from five runs of the model on the test set. In the sensitivity analysis, the ROC and precision-recall (PR) curves of the fusion model under different hyperparameters were plotted to visualize the impact of parameter changes on model performance.
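The metrics listed above can be computed directly with scikit-learn; the snippet below is a small self-contained sketch with placeholder labels and probabilities (in practice the 95% CI of the AUC would additionally be estimated, e.g., by bootstrapping, which is omitted here).

```python
# Sketch of the evaluation metrics used in this study.
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                 # 1 = intermediate/malignant
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])  # fused malignancy probabilities
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC        ", roc_auc_score(y_true, y_prob))
print("Accuracy   ", accuracy_score(y_true, y_pred))
print("Precision  ", precision_score(y_true, y_pred))
print("Recall     ", recall_score(y_true, y_pred))
print("Specificity", tn / (tn + fp))
print("F1 score   ", f1_score(y_true, y_pred))
```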

Statistical analysis

Statistical analyses were conducted using Python software (version 3.9). For continuous data that conformed to a normal distribution, descriptive statistics were presented as the mean ± standard deviation, and inter-group comparisons were performed using the Z-test. In contrast, continuous data that did not follow a normal distribution were expressed as the median [interquartile range (IQR)] and analyzed using the Mann-Whitney U test for between-group comparisons. Categorical data were reported as frequencies (percentages). For categorical variables, the chi-square test was utilized to compare proportions. All tests were two-sided. A value of P<0.05 was considered statistically significant.
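For reference, the between-group comparisons described above map onto standard SciPy calls; the sketch below uses placeholder ages and, for the categorical example, the sex-by-group counts reported in Table 1.

```python
# Sketch of the non-parametric and categorical tests used for Table 1.
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

# Placeholder ages for the two groups (the study uses the full patient data)
age_benign = np.array([17, 25, 32, 40, 48])
age_malignant = np.array([23, 35, 42, 55, 60])
u_stat, p_age = mannwhitneyu(age_benign, age_malignant, alternative="two-sided")

# 2x2 contingency table: rows = male/female, columns = benign / intermediate-malignant
sex_table = np.array([[94, 84],
                      [94, 47]])
chi2, p_sex, dof, _ = chi2_contingency(sex_table)  # continuity correction applied by default
print(u_stat, p_age, chi2, p_sex)
```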


Results

Patient clinical characteristics

This study included a total of 319 lesions from 319 patients with bone tumors. Among them, the benign group consisted of 188 patients (94 males and 94 females) with a median age of 32.8 (IQR, 17.0–48.0) years; the intermediate/malignant group consisted of 131 patients (84 males and 47 females) with a median age of 42.2 (IQR, 23.0–60.0) years. There were significant differences in age and gender ratio between the two groups (P<0.001 and P=0.02, respectively). Table 1 presents more details on clinical characteristics.

Table 1

Patient clinical characteristics in each group

Characteristics Benign group Intermediate/malignant group U/χ2 P value
Age (years) 32.8 (17.0–48.0) 42.2 (23.0–60.0) 9,161.5 <0.001
Sex 5.7 0.02
   Male 94 (50.0) 84 (64.1)
   Female 94 (50.0) 47 (35.9)
Pain 30.4 <0.001
   Yes 118 (62.8) 119 (90.8)
   No 70 (37.2) 12 (9.2)
Swelling 16.8 <0.001
   Yes 55 (29.3) 69 (52.7)
   No 133 (70.7) 62 (47.3)
Redness and hyperemia 27.4 <0.001
   Yes 10 (5.3) 35 (26.7)
   No 178 (94.7) 96 (73.3)
Warmth 16.7 <0.001
   Yes 5 (2.7) 21 (16.0)
   No 183 (97.3) 110 (84.0)
Dyskinesia 0.0 >0.99
   Yes 35 (18.6) 24 (18.3)
   No 153 (81.4) 107 (81.7)
Palpable mass 106.7 <0.001
   Yes 18 (9.6) 86 (65.6)
   No 170 (90.4) 45 (34.4)
Sites 9.0 0.01
   Long tubular bone 108 (57.5) 71 (54.2)
   Short tubular bone 42 (22.3) 46 (35.1)
   Irregular bone 38 (20.2) 14 (10.7)

Data are presented as median (interquartile range) or n (%).

Model performance on the internal validation set and external validation set

The evaluation metrics for each classification model are detailed in Table 2. The text classification model achieved an average accuracy of 77.6%, precision of 68.9%, recall of 91.2%, specificity of 66.7%, F1 score of 78.5%, and an AUC of 0.814. The image classification model attained an average accuracy of 72.4%, precision of 74.1%, recall of 58.8%, specificity of 83.3%, F1 score of 65.6%, and an AUC of 0.782. Following the integration of DS theory, the accuracy of the classification model increased to 81.6%, with a precision of 81.3%, recall of 76.5%, specificity of 85.7%, F1 score of 78.8%, and an AUC of 0.821.

Table 2

Performance of various classification models on the binary classification task of benign and intermediate/malignant bone tumors on internal validation set

Parameters Text classification model Image classification model Fusion classification model
AUC (95% CI) 0.814 (0.722–0.907) 0.782 (0.662–0.882) 0.821 (0.713–0.916)
Accuracy, % 77.6 72.4 81.6
Precision, % 68.9 74.1 81.3
Recall, % 91.2 58.8 76.5
Specificity, % 66.7 83.3 85.7
F1 score, % 78.5 65.6 78.8

, the best indicator of each line. AUC, area under the curve; CI, confidence interval.

Besides, Figure 5 illustrates the ROC curves and the 95% CIs of the three models in the internal validation set. It also shows respectively the confusion matrices in the test of the text classification model, the image classification model and the fusion model.

Figure 5 ROC curves and confusion matrices of three classification models. (A) The ROC curves of the three models on the internal validation set; (B) the confusion matrix of the text classification model on the internal validation set; (C) the confusion matrix of the image classification model on the internal validation set; (D) the confusion matrix of the fusion model on the internal validation set. AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic.

To further validate the generalization capability of the proposed model, we employed an external validation set comprising data from 44 patients collected between January and December 2025 at collaborating hospitals. The results (detailed in Table 3) show that despite a moderate decline in absolute metrics compared to the internal validation set, the fusion model maintained robust diagnostic performance, achieving an AUC of 0.808 (95% CI: 0.667–0.928), accuracy of 77.3%, precision of 75.0%, recall of 66.7%, specificity of 84.6% and an F1 score of 70.6%. In contrast, the single-modal text and image models exhibited more pronounced performance declines in certain metrics on the external dataset. For instance, the text model’s recall decreased by 13.4%, while the image model’s specificity dropped by 10.2%.

Table 3

Performance of various classification models on the binary classification task of benign and intermediate/malignant bone tumors on external validation set

Models Text classification model Image classification model Fusion classification model
AUC (95% CI) 0.784 (0.631–0.911) 0.731 (0.556–0.885) 0.808 (0.667–0.928)
Accuracy, % 75.0 68.2 77.3
Precision, % 66.7 61.1 75.0
Recall, % 77.8 61.1 66.7
Specificity, % 73.1 73.1 84.6
F1 score, % 71.8 61.1 70.6

, the best indicator of each line. AUC, area under the curve; CI, confidence interval.

Besides, Figure 6 illustrates the ROC curves and the 95% CIs of the three models in the external validation set. It also shows respectively the confusion matrices in the test of the text classification model, the image classification model and the fusion model.

Figure 6 ROC curves and confusion matrices of three classification models. (A) The ROC curves of the three models in the external validation set; (B) the confusion matrix of the text classification model in the external validation set; (C) the confusion matrix of the image classification model in the external validation set; (D) the confusion matrix of the fusion model in the external validation set. AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic.

Performance comparison analysis of our model with the baseline models

Table 4 and Figure 7 provide a detailed comparison of the performance of our fusion method and the six baseline models described in section “Other baseline models”. The accuracy of the baseline models ranged from 60.5% to 73.7%, precision from 56.3% to 79.2%, recall from 52.9% to 67.6%, and F1 score from 54.6% to 67.7%. All of these metrics were lower than those of our proposed fusion method.

Table 4

Performance comparison of our model with the baseline models on internal validation set

Models Accuracy, % Precision, % Recall, % F1 score, %
Mid-term fusion model 69.7 65.7 67.6 67.7
ConVIRT 60.5 56.3 52.9 54.6
GLoRIA 72.4 70.9 64.7 67.7
CLIP 73.7 79.2 55.9 65.5
Random forest + YOLO 72.4 76.0 55.9 64.4
SVM + YOLO 72.4 78.3 52.9 63.2
Our fusion model 81.6 81.3 76.5 78.8

, the best indicator in each column. SVM, support vector machine; YOLO, You Only Look Once.

Figure 7 Performance of our model and other baseline models on internal validation set. SVM, support vector machine; YOLO, You Only Look Once.

The performance comparison based on the internal validation set has preliminarily demonstrated the advantages of our proposed multimodal fusion model over traditional baseline models, particularly in terms of classification accuracy and F1 score. To further evaluate the model’s adaptability to different data distributions and verify whether it learns universal features for distinguishing benign from malignant bone tumors rather than specific noise, this study conducted additional performance assessments on an external validation set to comprehensively measure the model’s generalization capability. The specific results are shown in Table 5 and Figure 8.

Table 5

Performance comparison of our model with the baseline models on external validation set

Models Accuracy, % Precision, % Recall, % F1 score, %
Mid-term fusion model 70.5 60.9 77.8 68.3
ConVIRT 59.1 50.0 44.4 47.1
GLoRIA 63.6 56.3 50.0 52.9
CLIP 61.4 52.6 55.6 54.1
Random forest + YOLO 70.5 66.7 55.6 60.6
SVM + YOLO 77.3 75.0 66.7 70.6
Our fusion model 77.3 75.0 66.7 70.6

, the best indicator in each column. SVM, support vector machine; YOLO, You Only Look Once.

Figure 8 Performance of our model and other baseline models on external validation set. SVM, support vector machine; YOLO, You Only Look Once.

The results of sensitivity analysis

Figure 9 illustrates the ROC curves and PR curves of the fusion model on the internal validation set under varying hyperparameter settings for the text and image models. When the lr of the text classification model was set to 10^−5, the AUC of the fusion model reached 0.822, surpassing the original parameter model’s AUC of 0.821; however, the average precision (AP) value was 0.774, slightly lower than the original model’s AP of 0.775. Setting the lr to 10^−4 resulted in a decrease in the AUC value to 0.811 and a drop in the AP value to 0.765. Additionally, when the dropout rate was adjusted to 0.5, both the AUC and AP values of the fusion model were markedly lower than those under the original parameter settings, with an AUC of 0.704 and an AP of 0.666. When the lr of the image classification model was set to 10^−3, the AUC of the fusion model dropped to 0.786 and the AP value was 0.747. When the lr was set to 10^−5, the AUC was 0.770 and the AP value was 0.755. As the dropout rate of the image classification model increased from 0.2 to 0.5, the AUC value decreased markedly to 0.721, while the AP value dropped to 0.699. In general, although the fusion model is more sensitive to the hyperparameters of the image classification model, particularly its learning rate and dropout rate, the influence of the other hyperparameters is within an acceptable range, indicating a certain robustness.

Figure 9 ROC curves and PR curves for the hyperparameter sensitivity analysis of the fusion model on the internal validation set. (A) The ROC curves of the fusion model under different text hyperparameter settings; (B) the PR curves of the fusion model under different text hyperparameter settings; (C) the ROC curves of the fusion model under different image hyperparameter settings; (D) the PR curves of the fusion model under different image hyperparameter settings. AP, average precision; AUC, area under the curve; PR, precision-recall; ROC, receiver operating characteristic.

We also investigated parameter sensitivity on the external validation set. Figure 10 shows the ROC curves and PR curves of the fusion model on the external dataset under the same text and image model hyperparameter settings as above. When the lr of the text classification model was set to 10^−4, the AUC of the fusion model reached 0.784, lower than the AUC of 0.808 for the original parameter model; likewise, the AP value was 0.740, slightly lower than the original model’s AP of 0.774. Decreasing the lr to 10^−5 resulted in a decrease in the AUC value to 0.761 and a drop in the AP value to 0.675. Additionally, when the dropout rate was adjusted to 0.5, both the AUC and AP values of the fusion model were lower than those under the original parameter settings, with an AUC of 0.793 and an AP of 0.737. When the lr of the image classification model was set to 10^−3, the AUC of the fusion model decreased to 0.806, very close to that of the original model, while the AP value was 0.683, a notable difference from the original model. When the lr was set to 10^−5, the AUC was 0.788 and the AP value was 0.718. As the dropout rate of the image classification model increased from 0.2 to 0.5, the AUC value decreased to 0.767, while the AP value dropped to 0.733. Compared with the internal validation set, the fusion model exhibited greater sensitivity to the text classification model’s learning rate, though the impact of the other hyperparameters remained within acceptable bounds. The range of metric variations observed across both internal and external validation sets indicates a degree of robustness in the fusion model.

Figure 10 ROC curves and PR curves for the hyperparameter sensitivity analysis of the fusion model on the external validation set. (A) The ROC curves of the fusion model under different text hyperparameter settings; (B) the PR curves of the fusion model under different text hyperparameter settings; (C) the ROC curves of the fusion model under different image hyperparameter settings; (D) the PR curves of the fusion model under different image hyperparameter settings. AP, average precision; AUC, area under the curve; PR, precision-recall; ROC, receiver operating characteristic.

Performance comparison analysis of our model with radiologists

In the field of medical imaging diagnosis, systematically comparing auxiliary diagnostic models with clinicians of varying experience levels is crucial for demonstrating their clinical value. We invited two radiologists, one junior and one senior, to distinguish benign from malignant lesions in both the internal and external validation datasets without prior exposure to the diagnoses. The comparison results are shown in Figures 11,12.

Figure 11 Performance comparison analysis of our model with radiologists on internal validation set. (A) The differences between the model and physicians in key classification metrics; (B) Kappa coefficient for diagnosis between senior doctor and junior doctor; (C) the McNemar test forest diagram illustrating the diagnostic performance differences between junior physicians and the model, respectively, compared to senior physicians; (D-F) diagnostic confusion matrices for junior physicians, our model, and senior physicians respectively.
Figure 12 Performance comparison analysis of our model with radiologists on external validation set. (A) The differences between our model and physicians in key classification metrics; (B) Kappa coefficient for diagnosis between senior doctor and junior doctor; (C) the McNemar test forest diagram illustrating the diagnostic performance differences between junior physicians and our model, respectively, compared to senior physicians; (D-F) diagnostic confusion matrices for junior doctor, our model, and senior doctor respectively. *, P<0.05; ns, P>0.05.

The bar chart clearly shows that the research model’s core metrics—including accuracy and sensitivity—approach those of senior physicians while outperforming junior physicians. Meanwhile, the Kappa coefficient in Figure 11B indicates that diagnostic agreement between junior and senior physicians is 0.4726, falling within the moderate range. This reflects the impact of varying clinical experience on diagnostic outcomes. The McNemar test forest diagram in Figure 11C further supports the aforementioned conclusion: there is no significant difference in diagnostic performance between the model and senior physicians, while the performance gap between junior and senior physicians is statistically significant. This demonstrates that our model can assist junior physicians in diagnosing benign versus malignant bone tumors to a certain extent.
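The agreement and paired comparisons reported here correspond to Cohen's kappa and McNemar's test; the following sketch shows one way to compute them with scikit-learn and statsmodels, using placeholder reader decisions and pathology labels rather than the study data.

```python
# Sketch of the reader-agreement and paired-comparison statistics.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

junior = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # placeholder junior-reader calls
senior = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # placeholder senior-reader calls
truth  = np.array([1, 0, 0, 1, 0, 1, 1, 1])   # placeholder pathology labels

print("kappa:", cohen_kappa_score(junior, senior))

# McNemar's test on the 2x2 table of correct/incorrect decisions for two readers on the same cases
a_correct, b_correct = junior == truth, senior == truth
table = [[np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
         [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
print(mcnemar(table, exact=True))
```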

On the external validation dataset, the model’s classification metrics declined, yet it still outperformed the junior radiologist, demonstrating that our model possesses certain clinical value.

Case study

While logistic regression and weighted averaging are widely adopted for model output fusion due to their simplicity and interpretability, these methods have inherent limitations in handling the uncertainty and heterogeneous conflict of medical multi-modal data, which is the core rationale for our selection of DS evidence theory. Specifically, logistic regression relies on the linear fitting of model outputs and assumes that text and image features follow a fixed statistical distribution, which is difficult to satisfy in small-sample clinical datasets. For instance, when the text model outputs a high malignancy probability but the image model suggests benign features, logistic regression would force a compromise via coefficient weighting, ignoring the essential conflict between modalities and leading to ambiguous decision boundaries.

In contrast, DS theory introduces the concepts of BPA and conflict coefficients, enabling a quantitative description of inconsistencies between the textual and image models. For clinical scenarios involving high cross-modal conflict, DS theory does not simply average or linearly combine outputs. Instead, it redistributes the conflicting mass to credible propositions based on each modality’s reliability. To investigate DS theory’s role in reconciling high cross-modal conflicts, this study selected a high-conflict case for qualitative analysis.

We used the SHapley Additive exPlanations (SHAP) tool to visualize the reasoning of the text classification model. First, a prediction function receives patient text such as the chief complaint, current medical history, and past medical history and returns the model’s predicted probabilities. Next, the SHAP explainer combines this prediction function with the tokenizer so that it understands the token segmentation logic of the text. Subsequently, it calculates token-level SHAP values for the sample text, quantifying each token’s impact on the model’s prediction. Finally, we generate a visualization displaying the influence weight of each token. Because YOLO classification models are difficult to visualize directly, we leveraged YOLO’s detection framework and employed partially annotated data for detection training. The digital radiography (DR) images in this case were segmented to highlight lesion areas, facilitating visual presentation. Note that the prediction probabilities of this detection model do not fully align with those of the classification model.
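A sketch of the SHAP workflow described above is shown below. It assumes a Hugging Face tokenizer and a wrapper `predict_proba` around the trained text classifier (the function `model_predict`, the checkpoint name, and the sample sentence are placeholders); the explainer, text masker, and text plot are standard SHAP components.

```python
# Token-level SHAP explanation of the text classifier (illustrative sketch).
import numpy as np
import shap
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")  # assumed checkpoint

def predict_proba(texts):
    # Wraps the trained text model: returns an (n, 2) array of
    # [benign, malignant] probabilities for a list of raw report strings.
    return np.array([model_predict(t) for t in texts])  # model_predict: placeholder for the trained model

explainer = shap.Explainer(predict_proba, shap.maskers.Text(tokenizer))
shap_values = explainer(["左肱骨中段外侧皮质缺损，有慢性髓系白血病病史……"])  # placeholder sample text
shap.plots.text(shap_values[:, :, 1])  # contribution of each token to the malignant class
```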

Figure 13 presents a case of a patient with a bone tumor in the upper left humerus. The text modality classifier classified it as malignant with a probability of 94.29%, while the image classifier classified it as benign with a probability of 69.07%. Converting these diagnostic probabilities into BPAs, we obtain:

m_1(M) = 0.95, \; m_1(B) = 0.05; \quad m_2(B) = 0.69, \; m_2(M) = 0.31

Figure 13 Case study of integrating modal evidence using DS theory. DS, Dempster-Shafer.

Using the DS rule, the degree of evidence conflict between the two modalities was calculated as K = m_1(M)·m_2(B) + m_1(B)·m_2(M). Substituting the BPAs yields K = 0.95×0.69 + 0.05×0.31 = 0.671, exceeding 0.5 and thus indicating a high-conflict fusion.

In the text, the red-highlighted segments provide evidence supporting malignancy, including ‘defect of the lateral cortex in the middle segment of the left humerus’ and ‘history of chronic myeloid leukemia’; these constitute the core basis for the physician’s malignant diagnosis. Blue text fragments provide evidence against malignancy, such as ‘no local soft tissue swelling’, which has low relevance to bone tumor malignancy. Although ‘marked local tenderness’ aids in diagnosing malignant bone tumors, the text model treats it as neutral evidence. Based on the strong association of the red fragments supporting malignancy, combined with the limited influence of the blue fragments weakly opposing it, the text model ultimately outputs a malignancy probability of 95%. From the YOLO detection model, we can roughly infer that the image classification model localized the lesion fairly accurately; however, the atypical radiographic presentation led it to misclassify the lesion as benign with a probability of 69%.

Fusion is performed using the normalization factor 1−K = 1−0.671 = 0.329, and the final support is calculated. The support for malignancy is:

m(M) = \frac{1}{1-K}\, m_1(M)\, m_2(M) = \frac{0.95 \times 0.31}{0.329} \approx 0.89

Similarly, the support calculation for benign cases is as follows:

m(B) = \frac{m_1(B)\, m_2(B)}{1-K} = \frac{0.05 \times 0.69}{0.329} \approx 0.11

After fusion, m(M) ≈ 0.89 > m(B) ≈ 0.11, so the final diagnosis was malignant, consistent with the actual pathological label. DS theory fully captures the strong malignant support from the red text segments while attenuating the weak opposing information from the blue segments and counteracting the misclassification from the image modality. This approach preserves the core clinical rationale of the text modality while mitigating interference from single-modality errors, achieving trustworthy fusion of diverse evidence sources.


Discussion

Although AI still has many limitations and faces considerable challenges in the auxiliary diagnosis of bone tumors at present, there is no denying its great potential (23,24). A meta-analysis indicates that radiologists can enhance the accuracy of distinguishing between benign and malignant bone tumors with the support of deep learning algorithms, achieving results comparable to preoperative biopsy (25). Currently, deep learning models based on radiomics are widely used in research related to the auxiliary diagnosis of bone tumors (12,26). Among them, convolutional neural network models are the most extensively applied (27). However, most of these studies have primarily focused on unimodal data, leading to potential risks of misdiagnosis and missed diagnoses due to incomplete information. In contrast, multimodal data can offer a comprehensive understanding of disease characteristics, thereby facilitating advancements in precision oncology (28). For example, Hinterwimmer et al. (29) proposed a multimodal deep learning model that integrates the NesT image classification model with a multi-layer perceptron. This approach, which combined clinical metadata with X-ray imaging data, significantly improved the classification accuracy of primary bone tumors. Nonetheless, few studies have merged clinical text data with imaging data to develop multimodal fusion models. Therefore, we employed deep learning methods and evidence theory to develop a fusion model that integrates clinical text and X-ray images. This approach offers a novel perspective for differentiating between benign and malignant bone tumors.

The multimodal data of this research were the admission records and DR images of patients. It was noteworthy that only a minor portion of the textual data was pertinent to the DR images, resulting in limited interaction between the modalities. Consequently, early and middle fusion methods were deemed unsuitable for this study. To address this issue, we developed a post-fusion model based on DS theory. The results indicated that the AUC of the fusion model on the internal validation set reached 0.821, higher than those of the text classification model and the image classification model. The accuracy of the fusion model was 81.6%, with a precision of 81.3%, a specificity of 85.7% and an F1 score of 78.8%. In terms of accuracy, the fusion classification model demonstrated an improvement of 4.0% compared to the text classification model and an enhancement of 9.2% relative to the image classification model. Regarding precision, the fused classification model showed a 12.4% increase compared to the text classification model and a 7.2% improvement compared to the image classification model. In terms of specificity, the fusion model achieved a 19.0% improvement over the text model and a 2.4% improvement over the image model. For the F1 score, the fused model exhibited a 0.3% enhancement over the text model and a 13.2% increase compared to the image model. Overall, the improvement of the fusion model could be attributed to the fusion method based on DS theory. During the fusion process, it became evident that when the classification results from the text evidence source conflicted with those from the image evidence source, and the trust degrees of the two sources were equal, DS theory tended to favor the classification with higher judgment confidence. For instance, if the text classification model was 90% confident that the case was a benign bone tumor, while the image classification model was 85% confident that it was a malignant bone tumor, DS theory would conclude that the case was benign by calculating the probabilities of benign and malignant outcomes after fusion. We set the trust degrees of the two evidence sources to be equal because we believed that images and texts assessed cases from different informational perspectives, despite a weak interactivity between them. In addition, in the above situation, the conflict coefficient was 0.78, which is relatively close to 1 and indicates a strong conflict between the two evidence sources. Considering this, we established a threshold of 0.5, meaning that only when the conflict value exceeded this threshold (i.e., when the two models produced strongly conflicting judgments) did we employ Deng entropy to adjust the original probability matrix and reduce the degree of conflict.

In order to explore the interplay between text information and image information, we incorporated a mid-term fusion method utilizing YOLO and BERT-Base-Chinese in the baseline model experiments. By mapping the encoded text vectors and image vectors to the same space and subsequently concatenating them, we introduced a fully connected layer for the classification of benign and intermediate/malignant cases, thereby investigating the mid-term fusion effect. The results indicated that the performance of the mid-term fusion model was moderate. Specifically, among the four deep learning baselines, the accuracy and precision of the CLIP model were notably high, reaching 73.7% and 79.2%, respectively; however, its recall was inferior to that of the other deep learning baselines. The GLoRIA model performed comparably to the CLIP model when the accuracy metric was excluded. Conversely, the ConVIRT model exhibited relatively low performance across all indicators. The root cause may lie in ConVIRT’s design, which relies heavily on strong semantic alignment between image-text pairs; the low matching quality between the text and images used in this study directly resulted in poor representation learning for ConVIRT and, in turn, poorer classification outcomes than those of the other baseline models. Our fusion model outperformed the other baseline models on most metrics, particularly achieving accuracy and F1 scores markedly higher than those of the baselines.

In the field of bone tumor imaging diagnosis, previous literature indicates that studies involving multimodal fusion often utilize structured clinical features as inputs for models. For instance, Liu et al. (30) proposed a fusion model that combines deep learning and machine learning techniques to classify benign, malignant, and intermediate bone tumors based on patients’ clinical features and X-ray images of lesions. The findings demonstrated that this fusion model achieved performance levels comparable to those of experienced radiologists. Similarly, Ye et al. (13) developed an integrated multi-task deep learning framework that leverages multi-parameter MRI for the automatic detection, segmentation, and classification of primary bone tumors and bone infections. Their results revealed that this framework performs well in multi-classification tasks. In contrast to structured features, clinical texts offer a more comprehensive representation of patient information. By employing natural language processing models, the contextual relationships within the text can be captured, enabling the extraction of more complex semantic features, which may enhance model performance and broaden its application scenarios. Additionally, several studies have explored multimodal fusion using text data alongside imaging data, yielding promising results. For example, Yang et al. (31) introduced a multi-modal deep learning network that combines chest X-ray images with clinical history texts, thereby improving the multi-label disease diagnosis capability of chest X-ray films. Collectively, these findings suggest that multimodal fusion can offer innovative research avenues for AI-assisted disease diagnosis in clinical practice and has undoubtedly become a current research hotspot.

In addition, we conducted a sensitivity analysis of the model to explore the impact of the hyperparameters. The results showed that the performance of the fusion model was relatively stable under different text hyperparameters: apart from the case in which the dropout rate was raised to 0.5, where the AP value decreased markedly to 0.666, changes in the other parameters had little impact on the fusion model, with AUC and AP values varying by less than 0.1 from the original model. The performance of the fusion model varied more under different image hyperparameters. When the lr was 10^−3, the AUC decreased to 0.786 and the AP value dropped to 0.747; when the lr was 10^−5, the AUC decreased further to 0.770, with an AP of 0.755. When the dropout rate increased, both the AUC and AP values decreased markedly. These findings indicated that the learning rate and dropout rate of the image classification model had a pronounced impact on the performance of the fusion model, and that proper parameter settings were essential for model convergence and optimization.

There are some limitations in this study. First, this is a retrospective study with a limited number of patients, which may introduce selection bias and random variation. Second, although the time-based split was designed to mimic real-world deployment and was used to generate the internal and external validation sets, the data came from a single center, which to some extent limits the generalizability of the results. Third, the clinical information covered by the study is incomplete. On the one hand, the clinical text data included only admission records and the imaging data only DR images; the information they provide is limited, so the results apply mainly to the preliminary diagnosis of bone tumors and fall short of the diverse diagnostic needs encountered in clinical practice. On the other hand, structured biomarkers such as alkaline phosphatase (ALP) and lactate dehydrogenase (LDH), which are clinically recognized as important auxiliary indicators for bone tumor diagnosis, were not included in the model. This is because a portion of patients in this study did not undergo surgical treatment at Sichuan Orthopedics Hospital; their pathological results were confirmed through long-term follow-up, and the relevant laboratory examinations (including ALP and LDH) were not performed during admission, leaving the biomarker data too incomplete to be reliably integrated into model training. Finally, given the research focus, data from patients with developmental variations, osteomyelitis, and similar conditions were not included. Such cases are relatively common in clinical practice and are important for the differential diagnosis of bone tumors, so their exclusion may limit the broader clinical applicability of the findings.


Conclusions

In summary, this research built a multimodal fusion model based on deep learning and DS evidence theory that integrates the clinical text and imaging data of bone tumor patients. The results showed that, in the task of differentiating benign from intermediate/malignant bone tumors, the fusion model outperformed the unimodal models and the other baseline fusion models, demonstrating strong potential for the preliminary differentiation of bone tumor benignity and malignancy. The approach also offers a template for applying multimodal fusion to clinical decision support in other diseases. In future work, we plan to conduct large-sample, multi-center studies to enrich dataset diversity and to build models that generalize to a wider range of clinical scenarios, laying the foundation for more powerful and accurate decision-support tools for clinical medicine.
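For readers unfamiliar with how DS evidence theory combines two unimodal outputs, the following minimal sketch shows Dempster's rule over the frame {benign, intermediate/malignant}. Treating the softmax probabilities of the text and image branches directly as basic probability assignments is a simplification for illustration and is not necessarily the exact scheme used in this study; the numbers are made up.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two basic probability assignments (BPAs) over the same frame of
    discernment using Dempster's rule; each BPA maps a frozenset of hypotheses
    to a mass in [0, 1]."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("Total conflict: the two sources cannot be combined.")
    # Normalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

BENIGN = frozenset({"benign"})
MALIG = frozenset({"intermediate/malignant"})

# Softmax outputs of the text and image branches, used here directly as BPAs.
m_text = {BENIGN: 0.30, MALIG: 0.70}
m_image = {BENIGN: 0.45, MALIG: 0.55}

print(dempster_combine(m_text, m_image))
# -> benign ~0.26, intermediate/malignant ~0.74
```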


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1832/rc

Data Sharing Statement: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1832/dss

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1832/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1832/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by institutional review board of Sichuan Orthopedics Hospital (No. KY2025-031-01) and individual consent for this analysis was waived due to the retrospective nature.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Xu Y, Shi F, Zhang Y, et al. Twenty-year outcome of prevalence, incidence, mortality and survival rate in patients with malignant bone tumors. Int J Cancer 2024;154:226-40. [Crossref] [PubMed]
  2. Choi JH, Ro JY. The 2020 WHO Classification of Tumors of Bone: An Updated Review. Adv Anat Pathol 2021;28:119-38. [Crossref] [PubMed]
  3. Deckers C, Rooy JWJ, Flucke U, et al. Midterm MRI Follow-Up of Untreated Enchondroma and Atypical Cartilaginous Tumors in the Long Bones. Cancers (Basel) 2021;13:4093. [Crossref] [PubMed]
  4. Gao C, Qiu ZY, Hou JW, et al. Clinical observation of mineralized collagen bone grafting after curettage of benign bone tumors. Regen Biomater 2020;7:567-75. [Crossref] [PubMed]
  5. Wang Y, Wang C, Xia M, et al. Engineering small-molecule and protein drugs for targeting bone tumors. Mol Ther 2024;32:1219-37. [Crossref] [PubMed]
  6. Xu Y, Shi F, Zhang Y, et al. Twenty-year outcome of prevalence, incidence, mortality and survival rate in patients with malignant bone tumors. Int J Cancer 2024;154:226-40. [Crossref] [PubMed]
  7. Hwang S, Hameed M, Kransdorf M. The 2020 World Health Organization classification of bone tumors: what radiologists should know. Skeletal Radiol 2023;52:329-48. [Crossref] [PubMed]
  8. von Schacky CE, Wilhelm NJ, Schäfer VS, et al. Multitask Deep Learning for Segmentation and Classification of Primary Bone Tumors on Radiographs. Radiology 2021;301:398-406. [Crossref] [PubMed]
  9. Hosny A, Parmar C, Quackenbush J, et al. Artificial intelligence in radiology. Nat Rev Cancer 2018;18:500-10. [Crossref] [PubMed]
  10. Law M, Seah J, Shih G. Artificial intelligence and medical imaging: applications, challenges and solutions. Med J Aust 2021;214:450-452.e1. [Crossref] [PubMed]
  11. Lipkova J, Chen RJ, Chen B, et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 2022;40:1095-110. [Crossref] [PubMed]
  12. Wang H, He Y, Wan L, et al. Deep learning models in classifying primary bone tumors and bone infections based on radiographs. NPJ Precis Oncol 2025;9:72. [Crossref] [PubMed]
  13. Ye Q, Yang H, Lin B, et al. Automatic detection, segmentation, and classification of primary bone tumors and bone infections using an ensemble multi-task deep learning framework on multi-parametric MRIs: a multi-center study. Eur Radiol 2024;34:4287-99. [Crossref] [PubMed]
  14. Cui YM, Che WX, Liu T, et al. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021;29:3504-14.
  15. Jocher G, Qiu J. Ultralytics YOLO11. [Version: 11.0.0]. Ultralytics, 2024. Accessed 12, 2024. Available online: https://github.com/ultralytics/ultralytics
  16. Shafer G. A mathematical theory of evidence. Princeton: Princeton University Press; 1976.
  17. Deng Y. Deng entropy. Chaos, Solitons & Fractals 2016;91:549-53.
  18. Zhang YH, Jiang H, Miura Y, et al. Contrastive Learning of Medical Visual Representations from Paired Images and Text. Proceedings of the 7th Machine Learning for Healthcare Conference 2022;182:2-25.
  19. Huang SC, Shen LY, Lungren MP, et al. GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:3942-3951.
  20. Radford A, Kim JW, Hallacy C, et al. Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning 2021;139:8748-63.
  21. Breiman L. Random forests. Mach Learn 2001;45:5-32.
  22. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273-97.
  23. Meng Y, Yang Y, Hu M, et al. Artificial intelligence-based radiomics in bone tumors: Technical advances and clinical application. Semin Cancer Biol 2023;95:75-87. [Crossref] [PubMed]
  24. Lacroix M, Aouad T, Feydy J, et al. Artificial intelligence in musculoskeletal oncology imaging: A critical review of current applications. Diagn Interv Imaging 2023;104:18-23. [Crossref] [PubMed]
  25. Zhao K, Zhu X, Zhang M, et al. Radiologists with assistance of deep learning can achieve overall accuracy of benign-malignant differentiation of musculoskeletal tumors comparable with that of pre-surgical biopsies in the literature. Int J Comput Assist Radiol Surg 2023;18:1451-8. [Crossref] [PubMed]
  26. Gitto S, Annovazzi A, Nulle K, et al. X-rays radiomics-based machine learning classification of atypical cartilaginous tumour and high-grade chondrosarcoma of long bones. EBioMedicine 2024;101:105018. [Crossref] [PubMed]
  27. Li Y, Dong B, Yuan P. The diagnostic value of machine learning for the classification of malignant bone tumor: a systematic evaluation and meta-analysis. Front Oncol 2023;13:1207175. [Crossref] [PubMed]
  28. Li Y, Pan LR, Peng YJ, et al. Application of deep learning-based multimodal fusion technology in cancer diagnosis: A survey. Eng Appl Artif Intell 2025;143:109972.
  29. Hinterwimmer F, Guenther M, Consalvo S, et al. Impact of metadata in multimodal classification of bone tumours. BMC Musculoskelet Disord 2024;25:822. [Crossref] [PubMed]
  30. Liu R, Pan D, Xu Y, et al. A deep learning-machine learning fusion approach for the classification of benign, malignant, and intermediate bone tumors. Eur Radiol 2022;32:1371-83. [Crossref] [PubMed]
  31. Yang L, Wan Y, Pan F. Enhancing Chest X-ray Diagnosis with a Multimodal Deep Learning Network by Integrating Clinical History to Refine Attention. J Imaging Inform Med 2025;38:3568-83. [Crossref] [PubMed]
Cite this article as: Zeng J, Chen Q, Zhang T, Liang D, Li D. A multimodal fusion model for bone tumor benign and malignant diagnosis: development and validation with clinical text and radiographs. Transl Cancer Res 2026;15(2):91. doi: 10.21037/tcr-2025-1832