A new machine learning system identifies Mycobacterium abscessus subspecies with 97% accuracy using MALDI-TOF mass spectra gathered from six geographic batches across Europe.
The study was led by E. Padial-Fuillerat and stems from a collaboration between the University of Granada, Hospital Gregorio Marañón and Clover Biosoft. It has been published as open access in the Journal of Proteome Research.
Mycobacterium abscessus is one of the most antibiotic-resistant bacteria in existence. Its three subspecies look almost identical under the microscope but respond very differently to treatment, and telling them apart has long required expensive molecular tests. This study demonstrates that the right machine learning pipeline, applied to MALDI-TOF MS data, can reach 97% accuracy on spectra from six geographic batches across Europe.
The three subspecies (M. abscessus subsp. abscessus, subsp. bolletii, and subsp. massiliense) differ significantly in resistance. M. massiliense is generally susceptible to both clarithromycin and doxycycline, while M. abscessus and M. bolletii often resist them. Yet the three subspecies share such similar ribosomal sequences that standard MALDI-TOF systems can only identify them to the complex level.
The Problem with “Eyeballing” Spectra
MALDI-TOF mass spectrometry has become ubiquitous in clinical microbiology over the last two decades because it is fast, cheap, and requires minimal sample preparation. A bacterial colony is spotted on a metal plate that is irradiated with a laser, and the resulting mass spectrum, a fingerprint of the organism’s proteins, is compared to a reference database. For species-level identification, this approach works beautifully. For subspecies discrimination among closely related organisms, the differences in the spectral fingerprint are too subtle for database-matching approaches.
Machine learning offers a way through this bottleneck, but it comes with pitfalls. The biggest is the batch effect: when spectra are collected in different hospitals, across different cities, by different technicians, systematic technical noise accumulates and can easily overwhelm the subtle biological signal the model is trying to learn. A model trained on spectra from Madrid may fail completely on spectra from Norway because of the different technical conditions under which the data is acquired.
A Three-Part Solution
The team assembled 325 spectra from eight European hospitals across six geographic batches: Madrid, Barcelona, Norway, France, the Netherlands, and Belgium. Their pipeline attacked the classification problem in three coordinated stages, each addressing a specific failure mode of naive ML approaches.
Step 1: Remove geographic differences with ComBat
A PCA of the preprocessed spectra quickly revealed a clear problem: samples grouped not by subspecies but by the hospital they came from. The signal the model needed to learn was buried under batch effects. To fix this, the authors applied ComBat, an empirical Bayes correction algorithm originally developed for transcriptomics microarray data. Applied in two variants (correcting both variance and mean, “var. + mean”, or the mean only), ComBat dramatically homogenised the spectral landscape while preserving the biological variation that distinguishes subspecies. The “var. + mean” correction proved superior for this dataset.
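The difference between the two variants can be sketched with a much simpler location/scale adjustment. This is not ComBat itself: it omits the empirical Bayes shrinkage and the covariates that protect biological differences between batches, and the function name `batch_adjust` and the toy data are assumptions made for illustration.

```python
import numpy as np

def batch_adjust(X, batches, adjust_variance=True):
    """Align the per-batch mean (and optionally spread) of every feature.

    Simplified stand-in for ComBat: no empirical Bayes shrinkage and no
    covariates protecting biological differences between batches.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)

    # Residuals around each batch's own mean give the pooled within-batch spread.
    resid = X.copy()
    for b in labels:
        rows = batches == b
        resid[rows] -= X[rows].mean(axis=0)
    pooled_sd = np.sqrt((resid ** 2).sum(axis=0) / (len(X) - len(labels)))

    out = np.empty_like(X)
    grand_mean = X.mean(axis=0)
    for b in labels:
        rows = batches == b
        centred = resid[rows]
        if adjust_variance:
            batch_sd = centred.std(axis=0, ddof=1)
            batch_sd[batch_sd == 0] = 1.0  # avoid dividing by zero on flat features
            centred = centred / batch_sd * pooled_sd
        out[rows] = centred + grand_mean  # every batch ends up on the same mean
    return out

# Hypothetical toy data: batch "B" is shifted and noisier than batch "A".
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
                   rng.normal(5.0, 3.0, size=(30, 4))])
batch_toy = np.array(["A"] * 30 + ["B"] * 30)

mean_only = batch_adjust(X_toy, batch_toy, adjust_variance=False)
var_and_mean = batch_adjust(X_toy, batch_toy, adjust_variance=True)
```

With `adjust_variance=False` the two batches share a mean but batch "B" stays noisier. With `adjust_variance=True` both the mean and the spread are aligned, which mirrors the “var. + mean” idea.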
Step 2: Reduce the feature space with Boruta
Raw MALDI-TOF spectra contain 35,493 mass peaks per sample. Training a classifier on 35,493 features from only 325 samples invites overfitting and is computationally costly. The Boruta algorithm, a feature selection method based on random forests, reduced the feature set from 35,493 peaks to just 401 after the “var. + mean” correction, a reduction of 98.87%. Importantly, because Boruta was run separately for each of the three analysis types (uncorrected, “var. + mean”, and “mean”), each dataset received its own feature set, and no features selected on uncorrected spectra leaked into the corrected analyses.
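The core idea behind Boruta is to compare each real feature with "shadow" copies whose values have been shuffled, and to keep only the features that consistently beat the best shadow. The sketch below illustrates that comparison under loud simplifications: it swaps the random forest importance for a simple between-class over within-class variance score, and it replaces Boruta's statistical test with a strict "win every round" rule. All names and the toy data are hypothetical.

```python
import numpy as np

def class_separation(X, y):
    """Between-class over within-class variance per feature.

    Stand-in for the random forest importance that Boruta actually uses.
    """
    classes = np.unique(y)
    grand = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - grand) ** 2
                  for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    return between / (within + 1e-12)

def shadow_select(X, y, n_rounds=20, seed=0):
    """Keep features whose score beats the best shuffled 'shadow' in every round."""
    rng = np.random.default_rng(seed)
    real_scores = class_separation(X, y)
    hits = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_rounds):
        # Shuffle each column independently to destroy any link with the labels.
        shadows = np.column_stack([rng.permutation(col) for col in X.T])
        best_shadow = class_separation(shadows, y).max()
        hits += real_scores > best_shadow
    return np.flatnonzero(hits == n_rounds)

# Hypothetical toy data: 3 classes, 30 features, only features 0 and 1 informative.
rng = np.random.default_rng(1)
y_toy = np.repeat([0, 1, 2], 20)
X_toy = rng.normal(size=(60, 30))
X_toy[:, 0] += 3 * y_toy
X_toy[:, 1] -= 3 * y_toy

selected = shadow_select(X_toy, y_toy)
```

On this toy data the two informative features survive while almost all noise features are rejected, the same kind of pruning that took the real dataset from 35,493 features down to 401.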
Step 3: Rebalancing the classes
The three subspecies were unevenly represented: 156 spectra for M. abscessus, 116 for M. massiliense, but only 53 for M. bolletii. Some geographic batches had no M. bolletii samples at all. Class imbalance is particularly dangerous for minority class detection, because the model can achieve high average accuracy simply by ignoring the minority class. After testing several resampling strategies, the authors found ClusterCentroids to be the most effective, particularly for the tricky M. bolletii. It undersamples the majority classes by replacing their samples with cluster centroids, which preserves their structural diversity.
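A minimal sketch of the cluster-centroid idea follows, using a small hand-written k-means. It is an illustration rather than the authors' code: the imbalanced-learn library ships a `ClusterCentroids` sampler, but whether the paper used that implementation is an assumption, and the feature values here are random placeholders. Only the class counts (156, 116, 53) come from the article.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=50, seed=0):
    """Plain k-means; returns k centroids summarising X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for j in range(k):
            members = X[nearest == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_centroid_undersample(X, y, seed=0):
    """Shrink every class to the minority-class size using k-means centroids."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    X_out, y_out = [], []
    for c, n in zip(classes, counts):
        Xc = X[y == c]
        # The minority class is kept as-is; larger classes become centroids.
        Xc_new = Xc if n == target else kmeans_centroids(Xc, target, seed=seed)
        X_out.append(Xc_new)
        y_out.extend([c] * len(Xc_new))
    return np.vstack(X_out), np.array(y_out)

# Class counts from the article; the 5 features are random placeholders.
rng = np.random.default_rng(0)
y_toy = np.array(["abscessus"] * 156 + ["massiliense"] * 116 + ["bolletii"] * 53)
X_toy = rng.normal(size=(325, 5))

X_res, y_res = cluster_centroid_undersample(X_toy, y_toy)
```

After resampling, each class contributes 53 rows. In practice this step belongs on the training split only, so that the test set keeps its natural class distribution.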
The three-stage analytical pipeline. Spectra from 8 hospitals are preprocessed, then subjected to ComBat batch effect correction, Boruta feature selection (reducing 35,493 mass peaks to ~401), and ClusterCentroids resampling to address the class imbalance in M. bolletii. The final SVM classifier achieves 97% F1 score on held-out test samples.
Removing the Batch Effect
The paper’s PCA plots clearly illustrate the story. Before ComBat correction, samples from Madrid form a distinct island, as do those from France, Norway, and Belgium. The first principal component, capturing 36% of all variance, tracks geography. This situation is a catastrophe for a classifier: the most powerful signal in the data is completely irrelevant to the question being asked.
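The effect is easy to reproduce on synthetic data. The sketch below, which uses invented data with an artificial offset between two hypothetical batches, computes the explained variance ratio of each principal component and the PC1 scores through a singular value decomposition.

```python
import numpy as np

def pca_summary(X):
    """Return the explained variance ratio per component and the PC1 scores."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / (s ** 2).sum()              # share of variance per component
    pc1_scores = Xc @ Vt[0]                      # projection onto the first component
    return ratio, pc1_scores

# Hypothetical toy data: the second batch carries a constant offset on every feature.
rng = np.random.default_rng(0)
batch_a = rng.normal(0.0, 1.0, size=(40, 10))
batch_b = rng.normal(0.0, 1.0, size=(40, 10)) + 5.0
X_toy = np.vstack([batch_a, batch_b])

ratio, pc1 = pca_summary(X_toy)
gap = abs(pc1[:40].mean() - pc1[40:].mean())     # batch separation along PC1
```

Here the first component absorbs most of the variance and its scores split cleanly by batch, even though the offset carries no biological information.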
After the “var. + mean” ComBat correction, the geographic islands dissolve. Samples intermingle, and the first principal component no longer tracks country of origin. In the 5700–5800 Da region, where batch effects were strongest, peak intensities align across all six batches after correction, while the original proteomic fingerprints are preserved.
Originally designed for genomics microarrays, ComBat uses empirical Bayes shrinkage to estimate and remove batch-specific location and scale effects while borrowing statistical strength across genes (or, here, mass peaks). Its application to MALDI-TOF data is a novel contribution of this work.
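"Borrowing strength" can be made concrete with the normal-normal posterior mean that underlies this kind of shrinkage: each feature's estimated batch shift is pulled toward the average shift across all features, and the pull is stronger when the feature is noisy or the batch is small. The sketch below is a simplification that holds the noise variance fixed, whereas ComBat also updates its scale estimates; the function name and numbers are hypothetical.

```python
from statistics import mean, variance

def shrink_batch_shifts(shift_hat, noise_var, n_samples):
    """Shrink per-feature batch shifts toward their common mean.

    shift_hat: raw estimated shift of one batch, one value per feature.
    noise_var: within-batch variance of each feature (held fixed here).
    n_samples: number of samples in the batch.
    """
    prior_mean = mean(shift_hat)      # prior centre estimated from all features
    prior_var = variance(shift_hat)   # prior spread estimated from all features
    return [
        (n_samples * prior_var * g + s2 * prior_mean)
        / (n_samples * prior_var + s2)
        for g, s2 in zip(shift_hat, noise_var)
    ]

# Hypothetical shifts for four features in one batch; the last one is an outlier.
raw = [0.8, 1.2, 1.0, 3.0]
noise = [1.0, 1.0, 1.0, 1.0]

small_batch = shrink_batch_shifts(raw, noise, n_samples=5)
large_batch = shrink_batch_shifts(raw, noise, n_samples=50)
```

The outlying shift is moved toward the common mean, and less so for the larger batch, where the raw estimate is more trustworthy.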
Schematic representation based on Figures 2 and 3 from the paper. Left: Without correction, the first principal component (explaining 36.4% of variance) separates samples by hospital of origin — a technical artefact that overwhelms biological signal. Right: After ComBat “var. + mean” correction, geographic clustering dissolves and samples distribute according to biological variation, enabling subspecies-driven classification.
Results
The best-performing combination of ML model and ComBat correction strategy was an SVM with “var. + mean” correction, plus Boruta feature selection and ClusterCentroids resampling. It achieved 97% accuracy, a 97% weighted F1 score, and a 97.17% AUC-ROC on the held-out test set. These results surpass the previous best for multi-geographical M. abscessus subspecies classification by Rodríguez-Temporal et al. (2023), who reported 88.6% accuracy on the same underlying dataset.
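For readers less familiar with these metrics, the sketch below computes accuracy and the support-weighted F1 score from a confusion matrix. The matrix is invented for illustration and is not taken from the paper.

```python
def per_class_scores(cm):
    """Precision, recall, F1 and support per class from a square confusion matrix.

    cm[t][p] counts samples of true class t predicted as class p.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # true members of class k
        predicted = sum(cm[r][k] for r in range(n))  # samples assigned to class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores

def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

def weighted_f1(cm):
    """Average of per-class F1 scores, weighted by class support."""
    scores = per_class_scores(cm)
    total = sum(s[3] for s in scores)
    return sum(f1 * support for _, _, f1, support in scores) / total

# Hypothetical test-set confusion matrix; rows/columns are
# abscessus, bolletii, massiliense.
cm_example = [
    [30, 0, 1],
    [0, 11, 0],
    [1, 0, 22],
]
acc = accuracy(cm_example)
wf1 = weighted_f1(cm_example)
```

Because the weighted F1 is dominated by the larger classes, it is usually read together with per-class figures such as those reported for M. bolletii.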
Particularly notable was the improvement for M. abscessus subsp. bolletii, historically the most difficult subspecies to identify: the SVM model with “var. + mean” correction achieved perfect precision and recall (1.00) for it. GEO (geometric mean) and IBA (index of balanced accuracy) scores above 0.90 confirm that the overall accuracy was not achieved by sacrificing performance on the minority class.
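The value of the geometric mean is easiest to see by comparison. In the sketch below, GEO is taken to be the geometric mean of the per-class recalls, which is the usual multi-class definition but an assumption about the paper's exact formula. Both confusion matrices are invented for illustration.

```python
from math import prod

def recalls(cm):
    """Per-class recall from a confusion matrix with true classes as rows."""
    return [row[k] / sum(row) if sum(row) else 0.0 for k, row in enumerate(cm)]

def geometric_mean_recall(cm):
    """Geometric mean of the per-class recalls; zero if any class is missed entirely."""
    r = recalls(cm)
    return prod(r) ** (1.0 / len(r))

def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

# Hypothetical matrices; rows/columns are abscessus, bolletii, massiliense.
ignores_minority = [
    [31, 0, 0],
    [6, 0, 5],   # every bolletii sample is misclassified
    [0, 0, 23],
]
balanced = [
    [30, 0, 1],
    [0, 11, 0],
    [1, 0, 22],
]

lazy_acc = accuracy(ignores_minority)
lazy_geo = geometric_mean_recall(ignores_minority)
good_geo = geometric_mean_recall(balanced)
```

The first classifier still scores above 80% accuracy while never detecting the minority class, yet its geometric mean collapses to zero. A high GEO therefore rules out this failure mode.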
| Condition | Model | Avg F1 | Avg AUC | GEO | IBA |
|---|---|---|---|---|---|
| No correction | RF | 0.86 | 0.97 | 0.87 | 0.76 |
| No correction | SVM | 0.80 | 0.93 | 0.83 | 0.68 |
| “var. + mean” + Boruta + ClusterCentroids | RF | 0.92 | 0.98 | 0.94 | 0.88 |
| “var. + mean” + Boruta + ClusterCentroids | SVM ★ | 0.97 | 0.97 | 0.97 | 0.95 |
| “mean” + Boruta + ClusterCentroids | SVM | 0.97 | 0.98 | 0.98 | 0.95 |
Why Did the Model Work?
A model that achieves 97% accuracy is only useful in a clinical setting if it can be explained. The authors used SHAP (SHapley Additive exPlanations) values to interpret the best-performing SVM model, revealing which mass peaks drove each prediction and in what direction.
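The quantity that SHAP estimates is the Shapley value: a feature's average marginal contribution over all orders in which features can be switched from a baseline to their observed values. The sketch below computes it exactly by brute force for a made-up three-feature scoring function. It only illustrates the definition, since exact enumeration is infeasible for hundreds of peaks and practical tools rely on approximations.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    n = len(x)
    contrib = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to observed
            new = predict(current)
            contrib[i] += new - prev   # marginal contribution in this ordering
            prev = new
    return [c / len(orders) for c in contrib]

def toy_score(v):
    """Hypothetical model: two additive terms plus one interaction."""
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[0] * v[2]

x_obs = [1.0, 2.0, 4.0]   # hypothetical peak intensities
x_ref = [0.0, 0.0, 0.0]   # hypothetical baseline spectrum

phi = shapley_values(toy_score, x_obs, x_ref)
```

The values sum to the difference between the prediction for the sample and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The sign of each value gives the direction in which that feature pushed the prediction.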
The analysis identified three primary discriminative m/z regions, around 2672 Da, 3105 Da, and 3120 Da, with multiple closely spaced features within each region. This pattern is biologically expected: in linear-mode MALDI-TOF MS, a single underlying protein signal can span several adjacent m/z bins. The fine 0.5 Da binning used in this study was intentional, designed to capture subtle within-peak variations in shape and intensity ratios that coarser binning would obscure.
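The trade-off between fine and coarse binning can be illustrated with a small sketch. The peak list, the mass window, and the choice to sum intensities within a bin are all assumptions made for illustration. Only the 0.5 Da bin width comes from the article.

```python
def bin_spectrum(mz, intensity, mz_min, mz_max, width=0.5):
    """Sum intensities into fixed-width m/z bins covering [mz_min, mz_max)."""
    n_bins = int(round((mz_max - mz_min) / width))
    bins = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            idx = min(int((m - mz_min) / width), n_bins - 1)
            bins[idx] += i
    return bins

# Hypothetical signal around 3105 Da spread over several adjacent m/z values.
mz_toy = [3104.8, 3105.1, 3105.3, 3105.7]
intensity_toy = [10.0, 40.0, 35.0, 15.0]

fine = bin_spectrum(mz_toy, intensity_toy, 3104.0, 3107.0, width=0.5)
coarse = bin_spectrum(mz_toy, intensity_toy, 3104.0, 3108.0, width=2.0)
```

With 0.5 Da bins the signal occupies three neighbouring features whose relative heights describe the peak shape. With 2 Da bins the same signal collapses into a single number and the shape information is lost.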
Notably, SHAP confirmed that M. bolletii and M. massiliense share 15 discriminative peaks, while M. abscessus and M. massiliense share up to 22. This pattern is consistent with the known phylogenetic relationships between these subspecies, where M. bolletii is the most genetically divergent of the three.
“Subspecies discrimination depends not only on peak presence/absence, but also on subtle differences in peak morphology and intensity profiles within critical mass ranges — differences that require fine-scale feature resolution to capture.”
Padial-Fuillerat, Martínez-Manjón et al., J. Proteome Res., 2026
Average weighted F1 scores on held-out test data across models and correction conditions. All corrected models use Boruta feature selection and ClusterCentroids resampling. The SVM with “var. + mean” correction (★) achieves a 17 percentage-point improvement over the uncorrected SVM baseline and surpasses the previously published state of the art (88.6%) by more than 8 points.
Is This Really About Subspecies?
A key concern with any ML model for microbial identification is whether it is inadvertently learning a confounding variable. In this case, antibiotic resistance status was available for only 13 of the 325 samples: too few to model directly, but enough to investigate as a potential confounder. PCA of those 13 samples showed that resistant and susceptible isolates clustered by subspecies identity, not by resistance phenotype, and were distributed across multiple cross-validation folds with no concentration in any single fold. This suggests that the dominant signal in the model is genuinely subspecies-specific proteomic content, not resistance status.
Limitations and Future Work
The authors are candid about what remains to be done. The 325-sample dataset, while drawn from several countries, was acquired exclusively in European hospitals. Although two institutions used the same instrument model, real-world deployment will encounter a heterogeneous mix of instrument versions and firmware. The ComBat correction worked well in this case, but the algorithm needs further testing on MALDI-TOF data, because these results may not carry over to every clinical setting.
Class imbalance for M. bolletii remains a concern, particularly in geographic batches where it was entirely absent from the training data. The generalisability of the pipeline to other NTM species or to fungal pathogens also remains untested. The authors suggest future work integrating antibiotic resistance profiles, where larger and more balanced datasets would allow this potentially predictive information to be incorporated directly into the model.
The authors envision this pipeline delivered as modular Python scripts embedded in routine laboratory workflows. With continuing improvements in automated preprocessing libraries and instrument software integration, near-real-time subspecies detection from raw spectra may eventually be feasible in well-standardised laboratory environments — without the need for expensive molecular testing.
Impact
This study has direct clinical relevance. A clinician treating a patient with an M. abscessus lung infection (for instance, a cystic fibrosis patient who may face months of combination antibiotic therapy) needs to know whether they are dealing with M. massiliense (treatable with clarithromycin) or M. abscessus/M. bolletii (resistant to it). Today, answering that question reliably requires molecular testing that is expensive, slow, and not universally available. A MALDI-TOF workflow that already exists in most clinical microbiology laboratories, plus a software pipeline that runs in minutes, could bring subspecies-level diagnosis within reach of any hospital that owns a mass spectrometer.
By tackling three core failure modes of previous ML approaches (batch effects, high dimensionality, and class imbalance), this study establishes both a new performance benchmark and a generalizable preprocessing framework. The same pipeline could be applied to other NTM species, to antibiotic resistance prediction, or more broadly to any clinical proteomics application where multi-site data heterogeneity has previously held back ML-based diagnostics.
Padial-Fuillerat E., Martínez-Manjón J.E. et al., “Toward Robust Machine Learning Models for MALDI-TOF MS: Novel Approaches for Mycobacterium abscessus Subspecies Identification”
Journal of Proteome Research, 2026
doi: 10.1021/acs.jproteome.5c00534
Dataset: zenodo.org/records/17937866
Figures in this post are illustrative diagrams based on data reported in the manuscript. PCA scatter plots (Figure 2) are schematic representations; exact point positions are for illustrative purposes. All numerical values (F1 scores, AUC-ROC etc.) are drawn directly from Tables 1–6 of the paper. This article is published open access under CC-BY 4.0.
