LightGBM, A Gradient Boosting Algorithm

Clover MS Data Analysis software was recently updated to version 6, which brings improvements and new features such as an implementation of the Light Gradient Boosting Machine (LightGBM) algorithm (Figure 1). Let’s see what it is and how it works.

Figure 1. Version 6 improvements and new features of Clover MS Data Analysis software.

Gradient Boosting is a family of Machine Learning algorithms, used for both classification and regression tasks, that combine weak predictive models (weak learners), normally decision trees. These weak learners are generated sequentially: each new tree is fitted to the errors made by the previous predictor (1).
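The sequential fitting described above can be illustrated with a minimal sketch. It assumes a squared-error loss, a single numeric feature and one-split trees (stumps) as weak learners; all function names and the toy data are hypothetical and chosen only for illustration, not taken from any library.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    # Skip the largest value so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then fit each new stump to the current errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the errors of the model so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (made up for this example): a noisy staircase.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

model = fit_gradient_boosting(xs, ys, n_rounds=20)
print(mean_squared_error(model, xs, ys))
```

Each round corrects only part of the remaining error (scaled by the learning rate), which is why many weak learners are combined rather than a single deep tree.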

Gradient Boosting usually requires little to no data pre-processing; handles missing, categorical and numerical data; is relatively robust to over-fitting; and often achieves better accuracy than other Machine Learning algorithms on tabular data (2).

However, in recent years, with the emergence of big data (in terms of both the number of features and the number of instances), conventional Gradient Boosting implementations have become inefficient, requiring too much computational power or time. To tackle these challenges, Microsoft and Peking University released LightGBM, a Gradient Boosting implementation that introduces two novel techniques to reduce training time by up to 20 times compared to other Gradient Boosting implementations, while achieving almost the same state-of-the-art performance (3). The first is Gradient-based One-Side Sampling (GOSS), which excludes a large share of the less informative instances (those with small gradients). The second is Exclusive Feature Bundling (EFB), which bundles mutually exclusive features (features that rarely take nonzero values at the same time), reducing the total number of features.
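As a rough illustration of the GOSS idea, the sketch below follows the sampling scheme described in the LightGBM paper (3): keep the instances with the largest absolute gradients, randomly sample a fraction of the remaining ones, and up-weight the sampled ones by (1 - a) / b so that the data distribution is approximately preserved. The function and parameter names are hypothetical, and this is not LightGBM's actual implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for the instances kept by GOSS.

    top_rate   (a in the paper): fraction of largest-gradient instances kept.
    other_rate (b in the paper): fraction of all instances sampled at random
                                 from the remaining small-gradient ones.
    """
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]

    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)

    # Small-gradient instances are amplified to compensate for the sampling.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Made-up gradients: instance i has gradient i, so indices 80..99 are the top 20%.
gradients = [float(i) for i in range(100)]
weights = goss_sample(gradients)
print(len(weights), "of", len(gradients), "instances are used for the next tree")
```

With these settings only 30% of the instances are used to find splits, yet the sum of the weights still matches the original number of instances.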


Furthermore, LightGBM differs from most other tree-based algorithms in the way its trees grow: while other algorithms grow trees horizontally, or level-wise, LightGBM grows them vertically, or leaf-wise, which also contributes to the faster training time (4). In addition, this algorithm has recently been used by C. Weis and colleagues (5) to discriminate resistant from susceptible bacterial strains using mass spectrometry data. The results reported in that publication demonstrate the power of this algorithm and its suitability for mass spectrometry data.
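To make the leaf-wise growth strategy mentioned above more concrete, here is a simplified sketch: at every step, the leaf whose best split gives the largest error reduction is split, until a maximum number of leaves is reached. It assumes a single numeric feature and a squared-error criterion; the function names and the toy data are hypothetical, and the real LightGBM implementation is considerably more elaborate.

```python
import heapq


def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(points):
    """Return (gain, left_points, right_points) for the best split, or None."""
    points = sorted(points)
    parent_error = sse([y for _, y in points])
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical feature values
        left, right = points[:i], points[i:]
        gain = parent_error - sse([y for _, y in left]) - sse([y for _, y in right])
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best


def grow_leaf_wise(points, num_leaves):
    """Always split the leaf with the largest gain until num_leaves is reached."""
    leaves = []   # leaves that cannot be improved by splitting
    heap = []     # splittable leaves, ordered by negative gain (max-heap)
    counter = 0   # tie-breaker so the heap never compares point lists

    def push(leaf_points):
        nonlocal counter
        split = best_split(leaf_points)
        if split is None or split[0] <= 0:
            leaves.append(leaf_points)
        else:
            heapq.heappush(heap, (-split[0], counter, split))
            counter += 1

    push(points)
    while heap and len(heap) + len(leaves) < num_leaves:
        _, _, (gain, left, right) = heapq.heappop(heap)
        push(left)
        push(right)
    # Any remaining candidates stay unsplit and become final leaves.
    leaves.extend(left + right for _, _, (_, left, right) in heap)
    return leaves


# Made-up (feature, target) pairs: the right-hand side needs one more split.
points = [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0),
          (4, 10.0), (5, 10.0), (6, 20.0), (7, 20.0)]
leaves = grow_leaf_wise(points, num_leaves=3)
print(leaves)
```

In this toy case the tree deepens only on the side where splitting still reduces the error, whereas a level-wise strategy would consider every leaf at the current depth before going deeper.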

Authors: Raúl Miñán & Manuel J. Arroyo

1. https://interactivechaos.com/es/manual/tutorial-de-machine-learning/gradient-boosting

2. https://medium.com/analytics-vidhya/introduction-to-the-gradient-boosting-algorithm-c25c653f826b

3. https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

4. https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

5. Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning https://www.nature.com/articles/s41591-021-01619-9