Research on the error pattern recognition of dosimetric quality assurance by Bayesian optimization
Original Article


Yewei Wang1#, Xueying Pang2#, Helong Wang1, Yanling Bai1

1Department of Radiation Physics, Harbin Medical University Cancer Hospital, Harbin, China; 2Department of Oncology, First Affiliated Hospital, Heilongjiang University of Chinese Medicine, Harbin, China

Contributions: (I) Conception and design: Y Wang; (II) Administrative support: Y Bai; (III) Provision of study materials or patients: Y Wang; (IV) Collection and assembly of data: H Wang; (V) Data analysis and interpretation: X Pang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Yanling Bai, PhD. Department of Radiation Physics, Harbin Medical University Cancer Hospital, Baojian Road 6, Nangang District, Harbin 150081, China. Email: baiyl2126@126.com.

Background: The clinical benefits of recognizing errors from dosimetric quality assurance (DQA) can be realized by improving the dose delivery accuracy. However, an efficient error detection method for data with multiple types of errors is still needed. This study sought to develop an algorithm for quantitatively analyzing multiple errors in DQA data by leveraging Bayesian optimization (BO) and statistical methods.

Methods: The analysis included 79 treatment plans, randomly divided into a training subset (comprising 60 plans) and a testing subset (comprising 19 plans), delivered using an Infinity linear accelerator (LINAC). The analysis examined errors stemming from bilateral multi-leaf collimator (MLC) leaf-banks, jaws, and collimator rotation. A Gaussian process (GP) model functioned as the surrogate for BO, which aimed to adjust the error matrix to minimize failure rates in the DQA. The algorithm’s performance was evaluated using simulated and real-world data. To evaluate the efficacy of the algorithm in detecting errors, error matrices of two magnitudes were introduced into the simulations: [−0.5 mm, 0.5 mm, 0.5 mm, −0.5 mm, −0.5 degrees], and [−1 mm, 1 mm, 1 mm, −1 mm, −1 degrees]. In the analysis of the real-world data, inherent systematic errors in the training subset were identified by statistically analyzing the coefficient of variation in the solution sets produced through BO, and corrections were subsequently applied to the original plans. The precision of the error identification was measured by comparing the adjustments to the failure rates for both the training and testing subsets.

Results: Systematic biases were identified, and the detected error matrices of [−0.46±0.466 mm, 0.47±0.477 mm, 0.23±1.589 mm, −0.01±1.786 mm, −0.54±0.408 degrees], and [−0.92±0.553 mm, 0.83±0.453 mm, 0.95±1.924 mm, −0.55±1.719 mm, −0.91±0.435 degrees] closely mirrored the introduced magnitudes. The analysis of inherent errors revealed substantial improvements in the failure rates following correction, including reductions from 6.06%±4.783% to 1.78%±1.033% in the training subset and from 4.15%±2.643% to 2.02%±1.261% in the testing subset.

Conclusions: The error pattern recognition algorithm can quantitatively detect errors in data with multiple types of errors and analyze the inherent systematic errors in plans that have already passed gamma analysis. The method can enhance the overall performance of plan implementation on specific equipment. Additionally, the algorithm can analyze inherent systematic deviations in clinical DQA data and provide well-labeled datasets for deep-learning methods.

Keywords: Dosimetric quality assurance (DQA); gamma analysis; Bayesian optimization (BO); machine learning


Submitted Feb 12, 2025. Accepted for publication Mar 20, 2025. Published online Mar 27, 2025.

doi: 10.21037/tcr-2025-337


Highlight box

Key findings

• The proposed method can identify errors in data that have passed gamma analysis, and quantify errors in datasets with multiple coexisting errors.

What is known, and what is new?

• Errors may exist in dosimetric quality assurance (DQA) results that fall within acceptable tolerance thresholds. Currently, no effective method exists to detect such errors.

• This study developed an error detection method that combines Bayesian optimization and statistical approaches.

What is the implication, and what should change now?

• This method can detect errors in batch measurement data.

• It can label errors in DQA data, providing annotated datasets for deep-learning applications.


Introduction

Intensity-modulated radiotherapy (IMRT) plans employ numerous irregularly shaped segments with small aperture areas and monitor units to achieve desired target dose gradients and minimize doses to organs at risk. Consequently, these plans can be susceptible to errors that cause deviations between the delivered and planned dose. To ensure the accuracy and reliability of delivery, pretreatment dosimetric quality assurance (DQA) is commonly used to assess whether the precision of a linear accelerator (LINAC) can satisfy the requirements of the complex plans (1).

The gamma analysis algorithm is the gold standard for assessing the agreement between measured and calculated values (2,3). The resulting gamma passing rate (GPR) has a binary value of pass/fail that indicates whether the discrepancy falls within the clinically acceptable dose and distance-to-agreement tolerance. When the GPR falls below the site-specific threshold, the plan is considered unsafe, as it may result in inaccurate delivery on the specific device, leading to potential treatment risks (4). The American Association of Physicists in Medicine (AAPM) Task Group report 218 noted that the main systematic errors contributing to DQA failure include uncertainties in the treatment planning system (TPS), such as multi-leaf collimator (MLC) leaf ends and tongue-and-groove effects modeling, and uncertainties in the delivery system, such as MLC leaf position errors and gantry rotational instability (5). It is common practice to delay treatment and modify the plan until the DQA requirements are met (6).

To minimize treatment delays caused by DQA failures, research is being conducted on GPR prediction methods. These methods can estimate the probability of plan delivery failure during the planning stage, and provide guidance for replanning in the clinic. Moreover, GPR prediction models can analyze the results of DQA to capture and correct errors in the LINAC to ensure the safe delivery of complex treatment plans, which can improve the overall performance of plans on a specific device. Depending on the machine-learning methods employed, GPR prediction models may be categorized as model- or deep learning-based methods.

Model-based methods extract plan parameters or intensity features designed with expert knowledge to explore the correlation between sources of error and DQA results (7-9). Model-based methods are interpretable, and their features have a self-consistent physical meaning. However, individual features are only sensitive to specific errors, and the combination of features reduces interpretability. Conversely, deep learning-based methods extract high-dimensional features from labeled datasets to predict GPR outcomes and analyze sources of error (10-17). Compared to model-based approaches, deep learning-based methods have superior prediction accuracy, but their interpretability requires further enhancement.

Even with the application of various training techniques, such as data augmentation, transfer learning, and fine-tuning, deep learning-based methods require large datasets to prevent overfitting and enhance generalizability (18). However, unlike natural image datasets, the construction of sufficiently large DQA datasets is constrained by patient privacy concerns and high acquisition costs, which hinders the widespread application of deep learning in error detection. Despite rapid advances in deep-learning architectures and increased model depth, the limited scale of available datasets remains the primary bottleneck for application to DQA (19). To maximize the clinical utility of GPR predictive models, these models need to identify and correct the root causes of failed plans, thereby enhancing the overall performance of specific devices. However, because no methodology exists for identifying errors in DQA outcomes, current datasets depend on the artificial introduction of potential errors into plans to create simulated data (20-22). This practice is problematic: although discrepancies between measured and theoretical doses are already present, it treats validation results that merely meet clinical acceptance pass-rate thresholds as unbiased baselines for error scenario simulations, and datasets are then constructed by overlaying simulated errors onto these baseline outcomes. In reality, discrepancies between measured and theoretical results can arise from multiple error sources that may not be individually significant enough to be detected through routine quality control but can collectively cause dosimetric deviations. Because the GPR is insensitive to delivery errors, no method currently exists for identifying such error sources (23). As a result, models trained on such databases inherit the biases and cannot analyze the inherent errors within the data, leading to biased GPR predictive models. Moreover, introducing simulated errors into measurements that already contain errors produces deviations significantly greater than those the nominal error magnitudes would cause alone; thus, models trained on such data are less effective.

To address the lack of label data for GPR predictive models, this study proposed a novel method that combines Bayesian optimization (BO) with statistics to detect the quantitative magnitude of various errors under the coexistence of multiple error types. This approach can identify inconsistencies between the treatment equipment and plan system models from DQA outcomes, even in datasets of limited sample size and when the DQA results already meet the quality control threshold in clinical settings. It enables the retrospective analysis of existing plan measurement results and data annotation to train machine-learning models, thereby fostering the rapid development of artificial intelligence in quality control. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-337/rc).


Methods

Dataset

The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by the Ethics Committee of Harbin Medical University Cancer Hospital (No. KY2023-83, 2023-12-18). The image and radiotherapy data were anonymized in the preprocessing stage. Individual consent for this retrospective analysis was waived.

A total of 79 treatment plans delivered on an Infinity LINAC (Elekta AB, Stockholm, Sweden) between February 1 and 15, 2023, were selected for inclusion in the study. These plans were measured during the period between two calibrations of the LINAC parameters to ensure the consistency and reliability of the machine’s performance. The plans included 3 head-and-neck plans, 48 thorax plans, 3 abdominopelvic plans, 18 gynecological plans, and 7 bone/soft-tissue metastasis plans (Table 1). All the plans were designed using a 6 MV beam with Monaco (version 5.11, Elekta Inc.). Except for 10 thorax plans that were delivered using the Step&Shoot technique, all the other plans were delivered using volumetric-modulated arc therapy (VMAT). The prescription and optimization parameters for the plans, including the beam angles, number of segments, minimum segment width, plan optimization level, and dose rate smoothing level, were set according to specific site protocols for the respective diseases.

Table 1

Summary of the prescription and delivery technology used to obtain the dataset

Site                        | Prescription (cGy) | Total cases | Boost cases | Delivery
Head and neck               | 200                | 3           | –           | VMAT
Lung                        | 200–700            | 41          | 3           | IMRT/VMAT
Esophagus                   | 180–200            | 7           | 1           | VMAT
Gynecology                  | 180–230            | 18          | 6           | VMAT
Rectum                      | 200                | 2           | 2           | VMAT
Prostate                    | 190                | 1           | –           | VMAT
Bone/soft-tissue metastasis | 180–600            | 7           | 4           | VMAT

VMAT, volumetric-modulated arc therapy; IMRT, intensity-modulated radiotherapy.

Dose measurements were conducted with a PTW Seven29 chamber array (PTW, Freiburg, Germany) sandwiched between 10 cm (top) and 6 cm (bottom) of solid water; the chamber’s effective measurement point was aligned with the LINAC’s isocenter (ISO). During the measurement of all the plans, the beam angles were overridden to 0°. The detector readings, Digital Imaging and Communications in Medicine (DICOM) images, structure sets, and plan files were exported for further processing.

Error types and dose calculation

Discrepancies between measurements and calculations arise from systematic errors in the TPS and LINAC parameters, along with random errors during measurement. Systematic mechanical deviations of the LINAC are common in clinical practice; potential causes include the mechanical fatigue of various connecting components, insufficient maintenance, and incorrect calibration protocols. In this study, we considered systematic positional errors of the bilateral MLC leaf-banks (X1, X2) and bilateral jaws (Y1, Y2), as well as collimator rotation errors.

The dose in the BO process was calculated by a research version of the dose calculation engine (Manteia Tech Engine, Manteia Tech Inc., Xiamen, China), which was based on graphics processing unit-optimized Monte Carlo software (24), and had a grid size of 2 mm and a simulation particle count of 1e9. The final deposited dose was smoothed using an adaptive anisotropic diffusion filter. The GPR in this study was calculated using an in-house script, with the criteria of 2%/2 mm and relative normalization.
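As a concrete illustration, a 2%/2 mm gamma calculation with relative (global) normalization can be sketched as follows. This is a minimal brute-force version operating on flattened point lists, not the authors’ in-house script; the 10% low-dose exclusion threshold is an assumed convention.

```python
import numpy as np

def gamma_passing_rate(dose_ref, dose_eval, coords,
                       dose_crit=0.02, dta_crit=2.0, threshold=0.1):
    """Minimal global gamma analysis (2%/2 mm) on flat point lists.

    dose_ref, dose_eval: 1-D arrays of reference (measured) and evaluated
    (calculated) doses; coords: (n, 2) array of point positions in mm.
    Points below `threshold` of the reference maximum are excluded
    (an assumed convention, not taken from the authors' script).
    """
    norm = dose_ref.max()                  # relative (global) normalization
    mask = dose_ref >= threshold * norm
    gammas = []
    for i in np.flatnonzero(mask):
        # dose-difference term against every evaluated point
        dd = (dose_eval - dose_ref[i]) / (dose_crit * norm)
        # distance-to-agreement term against every evaluated point
        dr = np.linalg.norm(coords - coords[i], axis=1) / dta_crit
        gammas.append(np.sqrt(dd**2 + dr**2).min())  # gamma = min over eval points
    gammas = np.asarray(gammas)
    return float((gammas <= 1.0).mean())   # fraction of points passing
```

The failure rate minimized by BO is then simply `1 - gamma_passing_rate(...)`.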

The framework of the error pattern recognition algorithm based on BO

BO

We sought an optimal error vector β (constructed from the errors described in the section “Error types and dose calculation”) that improves the agreement between the measured dose (reference dose, y) and the calculated dose (target dose, x). The objective of the optimization is to determine the parameter β that minimizes the failure rate (1 − GPR) of the DQA; this optimization problem was expressed as:

\beta^{*} = \underset{\beta}{\arg\min}\left(1 - f(x_{\beta}, y) \;\middle|\; \beta \in \mathbb{R}^{d} : a \le \beta \le b\right) \qquad [1]

where x refers to the calculated dose, y refers to the measured dose, β refers to the error vector, d refers to the dimensionality of β, and a and b refer to the lower and upper bounds of the error domain.

If the analytical expression of the objective function is explicit, the optimal solution can be found by gradient descent or a quasi-Newton method. In the field of dose calculation, the exact analytical expression is usually unknown, so methods based on partial derivatives of the independent variables cannot be used. Similarly, exhaustive search is infeasible because the solution space grows exponentially with the number of variables. BO is a machine-learning optimization method grounded in Bayesian statistics that has gained popularity in deep learning and signal processing for its ability to optimize expensive derivative-free functions (25,26). The BO process consists of two main steps: (I) training a surrogate model of the objective function; and (II) using an acquisition function, informed by the surrogate model, to determine which point to sample next. It can be expressed as:

p(\beta \mid S) = \frac{p(S \mid \beta)\, p(\beta)}{p(S)}

where S = [(x₁,β, y₁), …, (xₙ,β, yₙ)] denotes a set of paired data s, p(S) is the distribution of the finite collection of data S, p(β) is the prior distribution of the surrogate model, p(s|β) is the distribution of data s under model p(β), and p(β|S) is the posterior probability distribution of the model updated on the data S.

The Gaussian process (GP) was adopted as the surrogate model due to its efficiency for continuous objective functions with a limited number of variables. When a finite collection of data and the corresponding function values (failure rates, 1 − GPR) are combined into a vector [f(s₁), …, f(sₙ)], the GP takes the prior distribution to be multivariate normal, with the mean vector μ₀ and covariance matrix k₀ constructed by evaluating the mean function and covariance kernel at each point:

f(s_{1:n}) \sim \mathrm{Normal}\left(\mu_0(s_{1:n}),\, k_0(s_{1:n}, s_{1:n})\right)

Then, the conditional distribution of f(s) on a new observed point s can be computed using Bayes’ rule:

f(s) \mid f(s_{1:n}) \sim \mathrm{Normal}\left(\mu_n(s),\, \sigma_n^2(s)\right)

\mu_n(s) = k_0(s, s_{1:n})\, k_0(s_{1:n}, s_{1:n})^{-1} \left(f(s_{1:n}) - \mu_0(s_{1:n})\right) + \mu_0(s)

\sigma_n^2(s) = k_0(s, s) - k_0(s, s_{1:n})\, k_0(s_{1:n}, s_{1:n})^{-1}\, k_0(s_{1:n}, s)

The mean function with a constant value and the Matern kernel for covariance were adopted, and they can be expressed as:

\mu_0(s) = \mu

k_0(s, s') = \alpha_0^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\lVert s - s' \rVert\right)^{\nu} K_{\nu}\left(\sqrt{2\nu}\,\lVert s - s' \rVert\right)

where Kν is the modified Bessel function of the second kind, and ν and α₀ are hyperparameters estimated by maximum likelihood during the optimization process.

The expected improvement (EI) method was used as the acquisition function in this study. The EI method seeks the new sample point s′ that most improves on the incumbent optimum s*. For our minimization objective, the improvement is f(s*) − f(s′) if this quantity is positive and 0 otherwise; it can be expressed as:

\mathrm{EI}(s') = \max\left(f(s^{*}) - f(s'),\, 0\right)

The optimization problem in Eq. [1] can thus be converted into maximizing EI(s′). The expectation of EI(s′) can then be obtained from the definition of GP regression:

\mathrm{EI}(s') = \left(f(s^{*}) - \mu_n(s')\right) \Phi\!\left(\frac{f(s^{*}) - \mu_n(s')}{\sigma_n(s')}\right) + \sigma_n(s')\, \phi\!\left(\frac{f(s^{*}) - \mu_n(s')}{\sigma_n(s')}\right)

where Φ and φ are the cumulative distribution function and probability density function of the standard normal distribution, respectively.
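The surrogate-plus-acquisition cycle described above can be sketched in a self-contained form. This is an illustrative re-implementation with a fixed Matérn ν = 5/2 kernel and zero prior mean (the study estimates ν and α₀ by maximum likelihood and uses a constant mean), not the OPENBOX code used in the study.

```python
import numpy as np
from scipy.stats import norm

def matern52(A, B, length=1.0, amp=1.0):
    # Matern kernel with nu = 5/2 (closed form); hyperparameters are fixed
    # here for readability rather than fitted by maximum likelihood.
    d = np.sqrt(((A[:, None, :] - B[None, :, :])**2).sum(-1)) / length
    return amp * (1 + np.sqrt(5)*d + 5*d**2/3) * np.exp(-np.sqrt(5)*d)

def gp_posterior(S, f, Snew, noise=1e-6):
    # Posterior mean/std at candidate points (mu_0 = 0 for this sketch)
    K = matern52(S, S) + noise * np.eye(len(S))
    Ks = matern52(Snew, S)
    mu = Ks @ np.linalg.solve(K, f)
    v = np.linalg.solve(K, Ks.T)
    var = matern52(Snew, Snew).diagonal() - np.einsum('ij,ji->i', Ks, v)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    # Minimization form of EI, matching the text
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=5, n_iter=30, n_cand=500, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    S = rng.uniform(lo, hi, size=(n_init, len(lo)))
    f = np.array([objective(s) for s in S])
    for _ in range(n_iter):
        cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))   # random candidates
        mu, sigma = gp_posterior(S, f, cand)
        s_next = cand[np.argmax(expected_improvement(mu, sigma, f.min()))]
        S = np.vstack([S, s_next])
        f = np.append(f, objective(s_next))
    return S[np.argmin(f)], f.min()
```

In the study the `objective` would be the DQA failure rate (1 − GPR) of the plan recalculated with the candidate error vector; here any expensive black-box function can stand in.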

In this study, the core objective of BO was to quantify mechanical positioning errors in the MLC, jaws, and collimator rotation. For MLC leaf positions in particular, candidate solutions generated by the surrogate model could reduce the geometric spacing between opposing leaves below safety thresholds, posing mechanical collision risks that would render a plan undeliverable. To resolve this limitation, we incorporated an automated collision detection mechanism into the BO framework that excludes collision-prone candidate points in real time during the optimization while maintaining the integrity of the search space for error pattern recognition.
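A minimal version of such a collision check might look like the following; the function name, the per-control-point array representation, and the 0.5 mm minimum-gap threshold are illustrative assumptions rather than the study’s actual safety limits.

```python
import numpy as np

def violates_collision(x1_pos, x2_pos, shift_x1, shift_x2, min_gap=0.5):
    """Reject a candidate error vector if shifting the opposing leaf banks
    would bring any leaf pair closer than `min_gap` mm (assumed threshold).

    x1_pos, x2_pos: per-leaf positions (mm) of the two banks at one control
    point, with x2_pos >= x1_pos for an open pair.
    """
    gaps = (x2_pos + shift_x2) - (x1_pos + shift_x1)
    return bool((gaps < min_gap).any())
```

Candidate points for which `violates_collision` returns True would simply be discarded before the objective is evaluated, so the surrogate never samples mechanically infeasible error vectors.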

Systematic error analysis based on statistics

The solution set was constructed from the collection of solutions derived from individual plans using the BO algorithm. The solution values for individual plans may vary because the BO algorithm approximates the optimal solution within a constrained number of iterations, which can result in local optima. These variations can also be attributed to the diverse responses of plans of varying complexity to systematic errors. We hypothesized that a systematic error causes the corresponding solutions to converge toward a common value, whereas variables without systematic errors show randomly distributed solutions. To validate this hypothesis, we calculated the coefficient of variation for each error type derived from the BO algorithm across all solutions. The coefficient of variation assesses the degree of variation relative to the mean value. A non-significant degree of variation (≤1) implies a consistent pattern, that is, the presence of a systematic error; in such instances, the mean value of the solution set is taken as the representative value for that variable. Conversely, a significant variation (>1) indicates the absence of a systematic error for that variable.
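The variation rule above can be sketched as follows; the helper function and the small guard against zero means are illustrative, while the coefficient-of-variation cutoff of 1 follows the text.

```python
import numpy as np

def systematic_error_matrix(solutions, cv_cutoff=1.0):
    """Keep the mean of each error variable whose coefficient of variation
    indicates a systematic deviation; zero the rest (a sketch of the text's
    rule, not the authors' implementation).

    solutions: (n_plans, n_variables) array of per-plan BO solutions.
    """
    solutions = np.asarray(solutions, dtype=float)
    mean = solutions.mean(axis=0)
    # guard against zero means so variables centered on 0 get a huge CV
    cv = solutions.std(axis=0, ddof=1) / (np.abs(mean) + 1e-12)
    return np.where(cv <= cv_cutoff, mean, 0.0), cv
```

Applied to the five-variable solution set of the training subset, the first return value plays the role of the error matrix that is re-introduced into the original plans.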

Training details

The BO algorithm was implemented using code derived from the open-source package OPENBOX (27). The surrogate model is a GP, with the acquisition function configured as expected improvement with constraints. The acquisition-function optimizer uses a hybrid random-SciPy strategy, and the optimization process runs for 300 iterations. The solution boundaries for the MLC/jaws and the collimator were set to [−3, 3] mm and [−3, 3] degrees, respectively.

The final solution of BO may yield multiple values due to the piecewise nature of the GPR function. In that case, a unique solution is selected by choosing the value that maximizes the structural similarity index measure (SSIM) between the optimized and original doses.
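One way to realize this tie-breaking, using a simplified single-window SSIM rather than the usual windowed formulation, is sketched below; the helper names are hypothetical.

```python
import numpy as np

def global_ssim(a, b, c1=1e-4, c2=9e-4):
    # Simplified single-window SSIM over the whole dose plane; the constants
    # follow the usual k1 = 0.01, k2 = 0.03 with a unit dynamic range.
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2*mu_a*mu_b + c1) * (2*cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def pick_solution(candidates, original_dose, optimized_doses):
    """Among equally good BO solutions, keep the one whose optimized dose
    is most similar to the original dose (hypothetical helper, per the text)."""
    scores = [global_ssim(original_dose, d) for d in optimized_doses]
    return candidates[int(np.argmax(scores))]
```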

Experimental design

Performance evaluation of the error pattern recognition algorithm in simulated error scenarios

Simulated error scenarios were generated by modifying the values of the five parameters described in section “Error types and dose calculation” within the DICOM-RT files. To illustrate the detection capability of the error detection algorithm, two degrees of error scenarios were configured: [−0.5 mm, 0.5 mm, 0.5 mm, −0.5 mm, −0.5 degrees] and [−1 mm, 1 mm, 1 mm, −1 mm, −1 degrees]. The magnitude of these errors aligned with the action-level tolerances of the LINAC quality control and typical settings for error scenario simulations (28). Modifications excluded segments under two conditions: (I) the leaves occluded by the jaws remained unchanged; and (II) the leaf pairs closed at that control point were not adjusted.
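The per-control-point modification rule, with the two exclusions, can be sketched as follows; the array-based representation, the jaw/leaf-center convention, and the 0.1 mm closed-pair tolerance are assumptions for illustration rather than details taken from the study’s scripts.

```python
import numpy as np

def inject_bank_error(x1, x2, jaw_y1, jaw_y2, leaf_centers, shift_x1, shift_x2):
    """Apply leaf-bank shifts to one control point, honoring the two
    exclusions: leaves fully occluded by the Y jaws are left unchanged,
    and closed leaf pairs are not adjusted.

    x1, x2: per-leaf positions (mm) of the two banks; leaf_centers: leaf
    center coordinates (mm) along the Y axis; jaw_y1 < jaw_y2 define the
    jaw opening.
    """
    x1, x2 = x1.copy(), x2.copy()
    occluded = (leaf_centers < jaw_y1) | (leaf_centers > jaw_y2)  # outside jaw opening
    closed = np.isclose(x1, x2, atol=0.1)                          # closed leaf pair
    movable = ~(occluded | closed)
    x1[movable] += shift_x1
    x2[movable] += shift_x2
    return x1, x2
```

In practice these shifted positions would be written back into the DICOM-RT plan (e.g., the leaf/jaw position sequences) before recalculation.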

The convergence speed of the algorithm was tested on the original dataset. The mean difference between the minimum objective function values at fixed intervals of 50 iterations and the optimal solution obtained after 300 iterations was calculated and plotted against the number of iterations.
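The convergence metric can be computed from a per-plan history of objective values as below (an illustrative helper, with the checkpoint interval as a parameter):

```python
import numpy as np

def convergence_gaps(history, interval=50):
    """Difference between the running-minimum objective value sampled every
    `interval` iterations and the best value of the whole run."""
    history = np.asarray(history, dtype=float)
    running_min = np.minimum.accumulate(history)
    checkpoints = running_min[::interval]      # iterations 0, 50, 100, ...
    return checkpoints - running_min[-1]
```

Averaging these gaps over all plans yields the curve plotted against the iteration count.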

Validation of systematic errors based on statistics in real-world data

Figure 1 presents a diagram of the systematic error analysis process. In this test, 60 plans were randomly selected from the dataset as a training subset, while the remaining 19 plans served as the testing subset. The solution set was compiled from individual solutions obtained by applying the BO algorithm to the training subset plans. The coefficient of variation for the solution set corresponding to each variable was then analyzed. An error matrix was established based on the degree of variation in the solution sets for each variable. This matrix was then introduced into the original plans to establish an optimized group of plans. To validate the generalization capability of the algorithm, the error matrix derived from the training subset was introduced into the plans of the testing subset to calculate the improvement in the failure rate.

Figure 1 Diagram of the systematic error analysis experiment. BO, Bayesian optimization.

Additionally, 18 common plan complexity indices were calculated to analyze the different responses of plans to systematic errors [see the supplementary materials for a complete list of the plan complexity indices and statistical results (Appendix 1; Table S1)].

Statistical analysis

In simulated error scenarios, solution values from both the original and error-simulated plans were analyzed using the Bland-Altman statistical method to identify systematic biases between the two groups, indicating the resolution capability of the detection. In real-world data, the differences in the failure rates between the optimized and the original plans in both the training and testing subsets were assessed using the Wilcoxon matched-pairs signed-rank test. The independent sample T-test or the Mann-Whitney U-test was used to evaluate the differences in various indices between the improved and non-optimized plans (termed the negative plans) according to the results of the Shapiro-Wilk test.
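The test-selection logic can be sketched with `scipy.stats`; the 0.05 normality cutoff is an assumed convention, and the helper functions are illustrative.

```python
import numpy as np
from scipy import stats

def compare_paired(before, after):
    """Wilcoxon matched-pairs signed-rank test on failure rates, as used
    for the training and testing subsets."""
    return stats.wilcoxon(before, after)

def compare_groups(improved, negative):
    """Choose the independent-samples t-test or Mann-Whitney U-test based
    on a Shapiro-Wilk normality check, mirroring the analysis plan."""
    normal = (stats.shapiro(improved).pvalue > 0.05 and
              stats.shapiro(negative).pvalue > 0.05)
    if normal:
        return stats.ttest_ind(improved, negative)
    return stats.mannwhitneyu(improved, negative)
```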


Results

Performance results of the error pattern recognition algorithm in simulated error scenarios

The performance of the error pattern recognition algorithm was evaluated across scenarios involving multiple error types by analyzing differences between the simulated and original plans. For the [−0.5 mm, 0.5 mm, 0.5 mm, −0.5 mm, −0.5 degrees] error scenario, the algorithm calculated the mean differences with standard deviations between the error-introduced plans and the original plans for MLC leaf-banks X1 and X2, jaws Y1 and Y2, and collimator rotation as −0.46±0.466 mm [95% limits of agreement (LoA): −1.38 to 0.45 mm], 0.47±0.477 mm (LoA: −0.46 to 1.41 mm), 0.23±1.589 mm (LoA: −2.88 to 3.35 mm), −0.01±1.786 mm (LoA: −3.51 to 3.49 mm), and −0.54±0.408 degrees (LoA: −1.34 to 0.26 degrees), respectively, as shown in Figure 2. For the [−1 mm, 1 mm, 1 mm, −1 mm, −1 degrees] error scenario, the mean differences with standard deviations between the two plan types were −0.92±0.553 mm (LoA: −2.01 to 0.16 mm), 0.83±0.453 mm (LoA: −0.06 to 1.72 mm), 0.95±1.924 mm (LoA: −2.82 to 4.72 mm), −0.55±1.719 mm (LoA: −3.92 to 2.82 mm), and −0.91±0.435 degrees (LoA: −1.76 to −0.05 degrees), respectively, as presented in Figure 3. The high slopes of the regression line for the collimator error detection in both error scenarios (0.86 and 0.95, respectively) indicated that the differences increased as the detected errors increased. The elevated variance observed in the results originates from high-dose distributions extending to the detector periphery or exceeding its measurable range in error-perturbed plans. This phenomenon induces sampling bias during individual optimization iterations, propagating substantial noise within the solution space. Bland-Altman analyses revealed that the systematic biases between the simulated and original plans closely matched the introduced error magnitudes (P<0.01), except for jaws Y1 and Y2 in the 0.5 magnitude error scenario, where no significant difference was observed (P=0.19 and P=0.95). 
Thus, the algorithm demonstrated effective performance across various error magnitudes.
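The bias and limits of agreement reported above follow the standard Bland-Altman construction, which can be sketched as follows (illustrative helper; the paired inputs are the solution values from the error-introduced and original plans):

```python
import numpy as np

def bland_altman(sol_error, sol_original):
    """Systematic bias and 95% limits of agreement between paired solution
    values (LoA = bias +/- 1.96 SD of the paired differences)."""
    diff = np.asarray(sol_error) - np.asarray(sol_original)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```

A detected systematic bias close to the introduced error magnitude, with the introduced value inside the limits of agreement, is what the figures above report.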

Figure 2 Results of the Bland-Altman analysis for various types of errors between the error introduced and original plans in the simulation error scenario of [−0.5 mm, 0.5 mm, 0.5 mm, −0.5 mm, and −0.5 degrees]. The solid line represents the systematic bias, the long-dashed line represents the 95% confidence interval, the short-dashed line represents the regression line of differences with the 95% confidence interval. (A) MLC leaf-bank X1. (B) MLC leaf-bank X2. (C) Jaw Y1. (D) Jaw Y2. (E) Collimator rotation. MLC, multi-leaf collimator; SD, standard deviation.
Figure 3 Results of the Bland-Altman analysis for various types of errors between the error introduced and original plans in the simulation error scenario of [−1 mm, 1 mm, 1 mm, −1 mm, and −1 degrees]. The solid line represents the systematic bias, the long-dashed line represents the 95% confidence interval, the short-dashed line represents the regression line of differences with the 95% confidence interval. (A) MLC leaf-bank X1. (B) MLC leaf-bank X2. (C) Jaw Y1. (D) Jaw Y2. (E) Collimator rotation. MLC, multi-leaf collimator; SD, standard deviation.

Figure 4 illustrates the mean difference between the minimum objective function values at fixed intervals of 50 iterations, and the optimal solution obtained after 300 iterations. The value at the 0th iteration represents a difference in the average failure rate of 5.60%±4.426% compared to the optimal solution, indicating the initial failure rate of the original plans. The results revealed that the algorithm converges within 50 iterations, reducing the average failure rate difference compared to the optimal solution to 0.60%±0.613%. Subsequent interval points show further minimized differences with average values of 0.30%±0.339%, 0.18%±0.271%, 0.11%±0.216%, and 0.05%±0.134%, respectively, indicating the effective optimization process of the algorithm over time.

Figure 4 Mean and standard deviation of the difference between the minimum objective function values across iterations.

Plan correction ability test results of the error pattern recognition algorithm using real-world data

A statistical analysis was conducted on a solution set created from outcomes derived from applying the BO algorithm to training subset plans. The solution sets for MLC leaf-banks X1 and X2, jaws Y1 and Y2, and collimator rotation had coefficients of variation of 0.45, 3.08, 11.98, 6.90, and 3.04, respectively (Figure 5A). The low variation in the MLC leaf-bank X1 solution set revealed a systematic error. Conversely, no such errors were apparent in the other parameters. Using the mean value from the MLC leaf-bank X1 solution set, an error matrix [1.04, 0, 0, 0, 0] was formulated and implemented in the training subset plans to recalculate the failure rates. As a result of the recalibration, 51 of 60 plans (85.0%) were successfully optimized, significantly reducing the failure rate from 6.06%±4.783% to 1.78%±1.033% (P<0.0001) (Figure 5B). This error matrix was further applied to the test subset plans to test the algorithm’s generalizability, and the results showed that 16 of 19 plans (84.2%) were effectively optimized, and the failure rate decreased from 4.15%±2.643% to 2.02%±1.261% (P<0.01) (Figure 5C).

Figure 5 Plan correction ability test results using the error pattern recognition algorithm in real-world data. (A) Statistical results of the solution set for various variables, with coefficients of variation labeled on each bin. (B) Failure rate difference from the Wilcoxon matched-pairs signed-rank test between the original and optimized plans for the training subset. ****, P<0.0001. (C) Failure rate difference from the paired T-test between the original and optimized plans for the testing subset. **, P<0.01. MLC, multi-leaf collimator.

To explore the differences between the optimized plans and negative plans in the training and testing subsets, 18 common plan complexity indices were analyzed to identify distinctions between the two plan categories. Table 2 presents eight complexity indices that showed statistically significant differences between the optimized and negative plans. The disparities in various characteristics between the two groups had large effect sizes (d>0.894), except for the Small Aperture Score 5 mm index, which had a small effect size (d=0.234).

Table 2

Differences in various complexity indices between the improved and negative plans

Complexity index          | Group    | Median/mean | Standard deviation | Z/T#   | P      | Difference of median/mean | Cohen’s d
Plan irregularity         | Improved | 5.12        | 2.704              | −4.064 | <0.001 | 2.735                     | 1.206
                          | Negative | 2.39        | 0.872              |        |        |                           |
Plan modulation           | Improved | 0.51        | 0.131              | −2.508 | 0.01   | 0.100                     | 0.91
                          | Negative | 0.41        | 0.139              |        |        |                           |
Small aperture score 5 mm | Improved | 0           | 0.036              | −2.225 | 0.03   | 0                         | 0.234
                          | Negative | 0           | 0.038              |        |        |                           |
Mean field area           | Improved | 5,231.51    | 2,277.684          | −3.299 | 0.001  | 2,501.455                 | 1.061
                          | Negative | 2,730.06    | 1,759.252          |        |        |                           |
Mean asymmetry distance*  | Improved | 62.02       | 19.575             | 2.853  | 0.006  | 17.965                    | 0.894
                          | Negative | 44.06       | 22.936             |        |        |                           |
Aperture subregions       | Improved | 2.24        | 1.198              | −4.706 | <0.001 | 1.145                     | 1.233
                          | Negative | 1.10        | 0.118              |        |        |                           |
Aperture Y jaw            | Improved | 144.11      | 75.681             | −3.039 | 0.002  | 67.095                    | 0.979
                          | Negative | 77.02       | 46.436             |        |        |                           |
Leaf gap std              | Improved | 24.72       | 7.480              | −2.998 | 0.003  | 6.375                     | 1.002
                          | Negative | 18.35       | 7.177              |        |        |                           |

*, the index was evaluated by the independent samples T-test; #, Z/T refers to the z score and t value for the Wilcoxon-Mann-Whitney U-test and independent samples T-test, respectively. std, standard.


Discussion

In this study, we developed an algorithm framework for error pattern recognition based on BO and established a statistical method for systematic error analysis. The results showed that the proposed method could quantitatively detect errors in datasets with multiple types of errors within a limited number of sampling times. The systematic error analysis method can effectively extract true values from noisy solution sets and significantly improve the overall gamma analysis performance on a dataset. To our knowledge, this is the first report on a methodology that enables the quantitative analysis of errors in datasets with multiple types of errors.

In this study, we introduced five common mechanical errors associated with LINAC machines. In real-world measurements, various errors frequently coexist and interact, degrading the fidelity of the measurement outcomes. Because no methods exist for directly detecting errors from measurement outcomes, datasets for GPR predictive model training are currently constructed by simulating errors on top of measurements that already carry inherent biases. This practice exaggerates the simulated error magnitudes beyond their true values, which degrades the performance of machine-learning models in scenarios with multiple concurrent errors. Kimura et al. developed a convolutional neural network model that excelled in single-error scenarios with a prediction accuracy of 0.92 but whose performance decreased to 0.44 on datasets with two concurrent errors (29). Similarly, Sheen et al. developed a logistic regression model leveraging features extracted from dose distribution maps, whose prediction accuracy dropped to 0.76 in certain error scenarios (22).

Conversely, our algorithm was designed to quantitatively analyze errors in environments where multiple errors are present simultaneously. This holds even for the smaller error simulations of 0.5 mm/0.5-degree magnitude, which correspond to random offsets encountered during the calibration process and meet the equipment precision requirements specified in the AAPM TG-142 report (30). Given the insensitivity of the gamma analysis to delivery errors, errors on this scale are commonly believed not to substantially affect GPR outcomes. Nevertheless, under such conditions the proposed algorithm was still capable of discerning and analyzing the introduced errors, demonstrating its ability to resolve nuanced errors.

Another highlight of this study is the application of statistical systematic error analysis to explore inherent errors in plans that have passed gamma analysis. Our findings (Figure 5) indicated that the proposed error pattern recognition algorithm can effectively detect intrinsic errors in clinical DQA data. This provides an efficient method for labeling data that overcomes the aforementioned shortcomings, facilitating the potential application of deep-learning algorithms in DQA. Notably, the data selected for this method should come from a consistent LINAC state, as our approach uses statistical analysis to identify systematic deviations that tend toward uniformity in the solution set. If the data originate from different accelerator states (e.g., are collected across multiple calibration processes), random errors inherent to calibration are inevitable despite the high precision of current calibration methods. Should the random errors generated by two calibrations tend in opposing directions, the method would fail to identify systematic deviations in a noisy solution set compiled from the individual outcomes.
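The systematic-deviation screening described above can be sketched as a coefficient-of-variation filter over the per-plan BO solutions. The array layout and the 0.5 threshold below are illustrative assumptions, not the study's exact procedure:

```python
import numpy as np

def systematic_components(solutions, cv_threshold=0.5):
    """Flag error variables whose BO solutions cluster tightly across plans.

    solutions: (n_plans, n_vars) array of per-plan error estimates
    (e.g. columns for the MLC leaf-banks, jaws, and collimator rotation).
    A low coefficient of variation (std / |mean|) suggests a systematic
    deviation shared by the whole dataset; a near-zero mean or a high CV
    suggests random, plan-specific noise. The threshold is illustrative.
    """
    mean = solutions.mean(axis=0)
    std = solutions.std(axis=0, ddof=1)
    # Guard against division by ~0 means: treat those columns as random.
    cv = np.where(np.abs(mean) > 1e-9, std / np.abs(mean), np.inf)
    return mean, cv, cv < cv_threshold
```

The flagged column means would then serve as the systematic corrections applied back to the original plans.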

BO is a sample-efficient optimization technique that can approach the optimal solution with limited sampling. It stands out from neighborhood search algorithms, such as simulated annealing and variable neighborhood search, due to its principled theoretical framework and ability to escape local optima, as demonstrated in hyperparameter tuning for machine learning and in the experimental selection of materials and drug designs (31). In scenarios involving multiple coexisting errors, the combination of different variables results in a vast solution space. The multi-objective optimization algorithm based on BO can efficiently explore this space, thereby reducing redundant tests and computational time. In this study, each variable ranged over [−3, 3] with an increment of 0.1, yielding a solution space of size 61^5 (approximately 8.4×10^8 combinations). The search-space boundary was deliberately set wide to explore the convergence performance of the error pattern recognition algorithm. When the possible margin of error is known, the algorithm can converge in fewer iterations because of the reduced variable space. If the accelerator passes rigorous routine quality control tests, it is advisable to restrict the search-space boundaries to the ranges recommended by the AAPM TG-142 report. Notably, the choice of surrogate model in the BO framework depends on the objective function and the number of variables. In this experiment, the GP model was selected for its effectiveness when the number of variables is less than 20, and the results confirmed its suitability for analyzing errors in DQA. If the number of variables exceeds the limitations of the GP model, surrogate models designed for high-dimensional problems can be employed (32).
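As an illustration of the BO loop described above, the following sketch pairs a GP surrogate with an expected-improvement acquisition over the discrete [−3, 3] grid (step 0.1), including the 10% random-sampling escape mentioned later. It is a minimal sketch, not the authors' implementation: the `failure_rate` objective is a hypothetical stand-in for the gamma failure-rate computation, and all settings are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def failure_rate(x):
    """Hypothetical stand-in for the DQA failure-rate objective:
    minimized at an assumed true error vector."""
    true_error = np.array([-0.5, 0.5, 0.5, 0.5, -0.5])
    return float(np.sum((x - true_error) ** 2))

def bo_minimize(n_init=10, n_iter=40, n_cand=2000):
    # 61 grid points per variable in [-3, 3], as in the study.
    grid = np.round(np.arange(-3.0, 3.0 + 1e-9, 0.1), 1)
    X = rng.choice(grid, size=(n_init, 5))
    y = np.array([failure_rate(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  normalize_y=True, alpha=1e-6)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.choice(grid, size=(n_cand, 5))
        if rng.random() < 0.1:
            # 10% pure random sampling to help escape local optima.
            nxt = cand[rng.integers(n_cand)]
        else:
            mu, sigma = gp.predict(cand, return_std=True)
            # Expected improvement over the current best (minimization).
            imp = y.min() - mu
            z = imp / np.maximum(sigma, 1e-9)
            ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
            nxt = cand[np.argmax(ei)]
        X = np.vstack([X, nxt])
        y = np.append(y, failure_rate(nxt))
    return X[np.argmin(y)], float(y.min())
```

Narrowing `grid` to a known error margin shrinks the candidate space and, as noted above, lets the loop converge in fewer iterations.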

During the analysis of inherent errors, systematic errors were statistically selected from the noisy solution set. The perturbation observed in the solution sets can be attributed to the piecewise nature of the GPR function, whereby different combinations of variables induce the same function value. Mechanical inaccuracies in the MLC and jaw speed parameters may also contribute noise within the optimization framework: the radiation dose is dynamically modulated by the real-time positioning of the MLC/jaws during delivery, and without explicit modeling of speed in our computational architecture, the algorithm compensates by attributing such errors to the predefined control variables, which in turn amplifies discrepancies among solutions. In addition, the quality of the unique solution obtained from the structural similarity index measure could be affected by differences in dose distribution between plans. This inference is supported by the significant difference in the plan modulation index between the optimized and negative plans (Table 2). Moreover, even with an additional 10% probability of random sampling, the solution obtained from BO may still be a local optimum or a saddle point. The perturbation could also stem from random noise in the calculations and measurements; Valdes et al. suggested that residual errors contribute random noise after the dependence of GPR on systematic errors is removed (33).

Another reason for the fluctuation in the solution set may be variation in plan complexity combined with the relative normalization method. The differences in the complexity-related indices (Table 2) suggest that negative plans exhibit lower dose modulation and smaller irradiated areas than the optimized plans. Any fluctuation around the true value can markedly affect the maximum dose, producing unfavorable changes in high-dose-gradient regions during relative normalization. Although this conclusion was reached using two-dimensional measurements, the correlation may also apply under three-dimensional conditions.

This study had several limitations. First, the selected errors did not include gantry rotation or beam parameter errors, which affect the GPR as strongly as the parameters examined here. Additionally, the experiments were conducted using planar measurements, which might conceal some imperceptible errors caused by dose-blurring effects. In future research, we plan to incorporate three-dimensional measurements and a broader range of error types, and to assess the performance of the proposed algorithm under multicollinearity among multiple variables.


Conclusions

In this study, we proposed a framework for DQA error pattern recognition based on BO and established a statistical method for analyzing systematic deviation. The performance of the algorithm was evaluated using both simulated scenarios and real-world data, and the results showed its effectiveness in analyzing inherent systematic errors in data containing multiple types of errors. These findings can guide accelerator adjustments and enhance the overall performance of plan implementation on specific equipment. Additionally, the algorithm can analyze inherent systematic deviations in clinical DQA data and provide well-labeled datasets for deep-learning methods.


Acknowledgments

The authors would like to thank Huidong Wang, PhD for providing technical support in the data analysis.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-337/rc

Data Sharing Statement: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-337/dss

Peer Review File: Available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-337/prf

Funding: The study design and data collection of this work were supported by the National Natural Science Foundation of China (No. 12375341 to Y.B.).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-337/coif). Y.W. receives a license for a research version dosimetry engine from Manteia Tech Inc. exclusively for use in this study. The dosimetry engine was used to accelerate the in-house software program, and it did not affect the conclusions or the sharing of data from this study. Y.B. reports funding from the National Natural Science Foundation of China (No. 12375341). The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by the Ethics Committee of Harbin Medical University Cancer Hospital (No. KY2023-83, 2023-12-18). The image and radiotherapy data were anonymized in the preprocessing stage. Individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Xiao Q, Li G. Application and Challenges of Statistical Process Control in Radiation Therapy Quality Assurance. Int J Radiat Oncol Biol Phys 2024;118:295-305. [Crossref] [PubMed]
  2. Mehrens H, Taylor P, Followill DS, et al. Survey results of 3D-CRT and IMRT quality assurance practice. J Appl Clin Med Phys 2020;21:70-6. [Crossref] [PubMed]
  3. Steers JM, Fraass BA. IMRT QA and gamma comparisons: The impact of detector geometry, spatial sampling, and delivery technique on gamma comparison sensitivity. Med Phys 2021;48:5367-81. [Crossref] [PubMed]
  4. O'Daniel JC, Giles W, Cui Y, et al. A structured FMEA approach to optimizing combinations of plan-specific quality assurance techniques for IMRT and VMAT QA. Med Phys 2023;50:5387-97. [Crossref] [PubMed]
  5. Miften M, Olch A, Mihailidis D, et al. Tolerance limits and methodologies for IMRT measurement-based verification QA: Recommendations of AAPM Task Group No. 218. Med Phys 2018;45:e53-83. [Crossref] [PubMed]
  6. Ono T, Nakamura M, Ono Y, et al. Development of a plan complexity mitigation algorithm based on gamma passing rate predictions for volumetric-modulated arc therapy. Med Phys 2022;49:1793-802. [Crossref] [PubMed]
  7. Zhen H, Nelms BE, Tomé WA. On the use of biomathematical models in patient-specific IMRT dose QA. Med Phys 2013;40:071702. [Crossref] [PubMed]
  8. Wootton LS, Nyflot MJ, Chaovalitwongse WA, et al. Error Detection in Intensity-Modulated Radiation Therapy Quality Assurance Using Radiomic Analysis of Gamma Distributions. Int J Radiat Oncol Biol Phys 2018;102:219-28. [Crossref] [PubMed]
  9. Granville DA, Sutherland JG, Belec JG, et al. Predicting VMAT patient-specific QA results using a support vector classifier trained on treatment plan characteristics and linac QC metrics. Phys Med Biol 2019;64:095017. [Crossref] [PubMed]
  10. Huang Y, Pi Y, Ma K, et al. Virtual Patient-Specific Quality Assurance of IMRT Using UNet++: Classification, Gamma Passing Rates Prediction, and Dose Difference Prediction. Front Oncol 2021;11:700343. [Crossref] [PubMed]
  11. Huang Y, Pi Y, Ma K, et al. Deep Learning for Patient-Specific Quality Assurance: Predicting Gamma Passing Rates for IMRT Based on Delivery Fluence Informed by log Files. Technol Cancer Res Treat 2022;21:15330338221104881. [Crossref] [PubMed]
  12. Matsuura T, Kawahara D, Saito A, et al. Predictive gamma passing rate of 3D detector array-based volumetric modulated arc therapy quality assurance for prostate cancer via deep learning. Phys Eng Sci Med 2022;45:1073-81. [Crossref] [PubMed]
  13. Matsuura T, Kawahara D, Saito A, et al. A synthesized gamma distribution-based patient-specific VMAT QA using a generative adversarial network. Med Phys 2023;50:2488-98. [Crossref] [PubMed]
  14. Moreau N, Bonnor L, Jaudet C, et al. Deep Hybrid Learning Prediction of Patient-Specific Quality Assurance in Radiotherapy: Implementation in Clinical Routine. Diagnostics (Basel) 2023;13:943. [Crossref] [PubMed]
  15. Yoganathan SA, Ahmed S, Paloor S, et al. Virtual pretreatment patient-specific quality assurance of volumetric modulated arc therapy using deep learning. Med Phys 2023;50:7891-903. [Crossref] [PubMed]
  16. Xie L, Zhang L, Hu T, et al. Neural Network Model Based on Branch Architecture for the Quality Assurance of Volumetric Modulated Arc Therapy. Bioengineering (Basel) 2024;11:362. [Crossref] [PubMed]
  17. Liu S, Ma J, Tang F, et al. Error detection for radiotherapy planning validation based on deep learning networks. J Appl Clin Med Phys 2024;25:e14372. [Crossref] [PubMed]
  18. Hao Y, Zhang X, Wang J, et al. Improvement of IMRT QA prediction using imaging-based neural architecture search. Med Phys 2022;49:5236-43. [Crossref] [PubMed]
  19. Interian Y, Rideout V, Kearney VP, et al. Deep nets vs expert designed features in medical physics: An IMRT QA case study. Med Phys 2018;45:2672-80. [Crossref] [PubMed]
  20. Potter NJ, Mund K, Andreozzi JM, et al. Error detection and classification in patient-specific IMRT QA with dual neural networks. Med Phys 2020;47:4711-20. [Crossref] [PubMed]
  21. Kimura Y, Kadoya N, Oku Y, et al. Development of a deep learning-based error detection system without error dose maps in the patient-specific quality assurance of volumetric modulated arc therapy. J Radiat Res 2023;64:728-37. [Crossref] [PubMed]
  22. Sheen H, Shin HB, Kim H, et al. Application of error classification model using indices based on dose distribution for characteristics evaluation of multileaf collimator position errors. Sci Rep 2023;13:11027. [Crossref] [PubMed]
  23. Baran M, Tabor Z, Tulik M, et al. Are gamma passing rate and dose-volume histogram QA metrics correlated? Med Phys 2021;48:4743-53. [Crossref] [PubMed]
  24. Nejahi Y, Barhaghi MS, Schwing G, et al. Update 2.70 to “GOMC: GPU Optimized Monte Carlo for the simulation of phase equilibria and physical properties of complex fluids”. SoftwareX 2021;13:100627.
  25. Parsa M, Mitchell JP, Schuman CD, et al. Bayesian Multi-objective Hyperparameter Optimization for Accurate, Fast, and Efficient Neural Network Accelerator Design. Front Neurosci 2020;14:667. [Crossref] [PubMed]
  26. Abbasimehr H, Paki R. Prediction of COVID-19 confirmed cases combining deep learning methods and Bayesian optimization. Chaos Solitons Fractals 2021;142:110511. [Crossref] [PubMed]
  27. Li Y, Shen Y, Zhang W, et al. OpenBox: A Generalized Black-box Optimization Service. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. Virtual Event, Singapore: Association for Computing Machinery; 2021:3209-19.
  28. Wang Y, Pang X, Feng L, et al. Correlation between gamma passing rate and complexity of IMRT plan due to MLC position errors. Phys Med 2018;47:112-20. [Crossref] [PubMed]
  29. Kimura Y, Kadoya N, Oku Y, et al. Error detection model developed using a multi-task convolutional neural network in patient-specific quality assurance for volumetric-modulated arc therapy. Med Phys 2021;48:4769-83. [Crossref] [PubMed]
  30. Klein EE, Hanley J, Bayouth J, et al. Task Group 142 report: quality assurance of medical accelerators. Med Phys 2009;36:4197-212. [Crossref] [PubMed]
  31. Frazier PI. A Tutorial on Bayesian Optimization. arXiv 2018. arXiv:1807.02811.
  32. Wang Z, Hutter F, Zoghi M, et al. Bayesian Optimization in a Billion Dimensions via Random Embeddings. arXiv 2013. arXiv:1301.1942.
  33. Valdes G, Scheuermann R, Hung CY, et al. A mathematical framework for virtual IMRT QA using machine learning. Med Phys 2016;43:4323. [Crossref] [PubMed]
Cite this article as: Wang Y, Pang X, Wang H, Bai Y. Research on the error pattern recognition of dosimetric quality assurance by Bayesian optimization. Transl Cancer Res 2025;14(3):2029-2042. doi: 10.21037/tcr-2025-337
