ESTRO 2023

Session Item

Saturday

May 13

10:30 - 11:30

Business Suite 3-4

Radiomics and modelling

Chair: , Italy

Overview: Poster Discussions are presented in one of the sessions scheduled at the two poster discussion theatres. Each author will present a digital poster orally for 2 minutes, followed by 2 minutes for discussion. Sessions will not be recorded.

Session Type: Poster Discussion

Track: Physics

Journey:

Comparing ML performances of toxicity predictive models on a large cohort of breast cancer patients

Maria Giulia Ubeira Gabellini, Italy

Presentation Number: PD-0170

Abstract

Abstract Title:

Comparing ML performances of toxicity predictive models on a large cohort of breast cancer patients

Authors:

Maria Giulia Ubeira Gabellini¹, Martina Mori¹, Alessandro Cicchetti¹,², Paola Mangili¹, Gabriele Palazzo¹, Andrei Fodor³, Antonella Del Vecchio¹, Nadia Gisella Di Muzio⁴,⁵, Claudio Fiorino¹

¹IRCCS San Raffaele Scientific Institute, Medical Physics, Milan, Italy; ²Progetto prostata, Fondazione IRCCS Istituto Nazionale dei Tumori, Medical Physics, Milan, Italy; ³IRCCS San Raffaele Scientific Institute, Radiotherapy, Milan, Italy; ⁴IRCCS San Raffaele Scientific Institute, Radiotherapy, Milan, Italy; ⁵Vita-Salute San Raffaele University, Radiotherapy, Milan, Italy

Purpose or Objective

Studies comparing the performance of machine learning (ML) methods for building predictive models of toxicity in radiotherapy (RT) are rare. The availability of a large cohort (n=1323) of breast cancer patients homogeneously treated with tangential fields (Fodor et al. Clin Breast Cancer 2022), whose acute toxicity was prospectively scored, made it possible to compare several ML approaches.

Material and Methods

The endpoint was RTOG G2/G3 acute toxicity, with 209 patients experiencing the event and 1114 patients without it. The dataset, comprising 25 clinical, anatomical and dosimetric features, was split into 992 patients for training and 331 for the external test.

Several ML methods were considered: Logistic Regression (LR), Lasso, Elastic Net, k-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machines (SVM), Gaussian Naive Bayes (GNB) and Multi-Layer Perceptron (MLP). First, we standardized the training dataset using Robust Scaling, which limits the impact of outliers; the test dataset was then scaled with the parameters derived from the training data. A One Hot Encoder was applied to categorical features. Variables with a high p-value at univariate LR were excluded, and features with Spearman ρ>0.8 were also dropped. The Synthetic Minority Oversampling Technique (SMOTE) was applied to create synthetic data for the minority class to compensate for class imbalance.
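Two of the pre-processing steps described above can be illustrated with a minimal sketch: robust scaling fitted on the training data only and then reused on the test data, and SMOTE-style interpolation between minority-class neighbours. This is not the authors' code. The function names, the toy values and the choice of k are assumptions made for illustration, and the standard library is used only to keep the example self-contained (packages such as scikit-learn and imbalanced-learn offer full implementations).

```python
import math
import random
import statistics


def fit_robust_scaler(train_column):
    """Estimate the median and interquartile range (IQR) on training data only."""
    q1, _, q3 = statistics.quantiles(train_column, n=4, method="inclusive")
    iqr = q3 - q1
    # Guard against a zero IQR (constant feature) to avoid division by zero.
    return statistics.median(train_column), (iqr if iqr > 0 else 1.0)


def apply_robust_scaler(column, median, iqr):
    """Scale any column (training or test) with previously fitted parameters."""
    return [(x - median) / iqr for x in column]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy feature with one outlier (hypothetical values, not study data).
train_feature = [1.0, 2.0, 3.0, 4.0, 100.0]
test_feature = [5.0, 3.0]
median, iqr = fit_robust_scaler(train_feature)
train_scaled = apply_robust_scaler(train_feature, median, iqr)
test_scaled = apply_robust_scaler(test_feature, median, iqr)

# Toy minority class in two dimensions (hypothetical values).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic_points = smote_like(minority_points, n_new=6)
```

Because the median and IQR are barely affected by the outlier at 100, the remaining values keep a sensible scale. Oversampling of this kind is normally applied to training data only, so that the test set keeps its original class balance.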

Finally, we ran each model, applying a Bayesian search to maximize the chosen metrics (balanced accuracy and AUC) under stratified K-fold cross-validation. For each model, we derived the feature importance and complementary metric scores for both the training and test datasets. A Sequential Feature Selector was applied; we chose a parsimonious number of features for which the AUC falls within the minimum between 1% of the maximum metric value and its SD error. Procedures were implemented in Python v.3.7.9.
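The parsimonious feature-number rule can be read in more than one way. The sketch below shows one possible interpretation: the tolerance is the smaller of 1% of the best mean AUC and the SD at that best point, and the smallest feature count whose AUC stays within that tolerance is chosen. The function name and the AUC and SD values are hypothetical and are not results from the study.

```python
def parsimonious_n_features(mean_auc, sd_auc):
    """Return the smallest number of features whose mean AUC is within
    min(1% of the best AUC, SD at the best AUC) of the best AUC.

    mean_auc and sd_auc map a number of features to the cross-validated
    mean AUC and its standard deviation (assumed inputs).
    """
    best_k = max(mean_auc, key=mean_auc.get)
    best = mean_auc[best_k]
    tolerance = min(0.01 * best, sd_auc[best_k])
    return min(k for k, auc in mean_auc.items() if auc >= best - tolerance)


# Hypothetical cross-validation curve, for illustration only.
mean_auc = {1: 0.550, 2: 0.600, 3: 0.628, 4: 0.630, 5: 0.629}
sd_auc = {1: 0.030, 2: 0.025, 3: 0.020, 4: 0.020, 5: 0.021}
chosen = parsimonious_n_features(mean_auc, sd_auc)
```

With these toy numbers the best AUC is reached with four features, but three features already fall inside the tolerance, so the smaller model is preferred.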

Results

As shown in Fig. 1, the models' performance on the training and test datasets was compared over different metrics. The models are listed in ascending order of training-test discrepancy, starting from the one with the smallest discrepancy. Overall, the best models were MLP and GNB. LR, which is more commonly used in the literature, showed similar performance (as shown in Fig. 2, which includes feature importance). KNN, RF and SVM instead tend to overfit.

The AUC on the test data is overall slightly above 0.6. The F1 score for patients with toxicity is around 0.25, while for patients without toxicity it depends on the model. MLP and GNB have the best Brier score, while LR is worse and shows a larger discrepancy between training and test. The test slope derived from the calibration plot is low compared with the expected one, and its R² differs considerably between training and test.
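For readers less familiar with the metrics reported above, the following minimal sketch computes AUC, Brier score and per-class F1 on a small imbalanced toy example. The labels, the predicted probabilities and the 0.5 decision threshold are hypothetical and unrelated to the study data. The functions are plain-Python illustrations, not the implementation used by the authors.

```python
def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the class of interest."""
    tp = sum(1 for y, q in zip(y_true, y_pred) if y == label and q == label)
    fp = sum(1 for y, q in zip(y_true, y_pred) if y != label and q == label)
    fn = sum(1 for y, q in zip(y_true, y_pred) if y == label and q != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy imbalanced sample: 2 events out of 8 (hypothetical values).
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_prob = [0.7, 0.6, 0.2, 0.4, 0.5, 0.1, 0.3, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed threshold of 0.5

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
f1_event = f1_for_class(y_true, y_pred, label=1)
f1_no_event = f1_for_class(y_true, y_pred, label=0)
```

In this toy case the F1 score of the minority (event) class is lower than that of the majority class even though the ranking of patients is fairly good. This is a common pattern when events are rare.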

Conclusion

We set up a pre-processing pipeline shared among different ML models and applied it to a large cohort of patients to predict acute toxicity. As shown, no model performed best on all metrics: deeper ML models perform better, although LR is not very different. For all of them, feature importance can be quantified and visualized.

This study is supported by ERAPERMED-2020-110-JTC.