Published: 9 May 2024
Contributors: Eda Kavlakoglu, Erika Russi
XGBoost (eXtreme Gradient Boosting) is a distributed, open-source machine learning library that uses gradient boosted decision trees, a supervised learning boosting algorithm that makes use of gradient descent. It is known for its speed, efficiency and ability to scale well with large datasets.
Developed by Tianqi Chen from the University of Washington, XGBoost is an advanced implementation of gradient boosting with the same general framework; that is, it combines weak learner trees into a strong learner by adding trees that are fit to the residuals of the trees before them. The library is available for C++, Python, R, Java, Scala and Julia [1].
Decision trees are used for classification or regression tasks in machine learning. They use a hierarchical tree structure where each internal node represents a test on a feature, each branch represents a decision rule and each leaf node represents an outcome, such as a predicted class or value.
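As a rough illustration of that structure, the sketch below hand-writes a tiny decision tree as nested dictionaries. The task ("play outside?"), the feature names and the `predict` helper are all hypothetical and exist only to show how nodes, branches and leaves relate; a real tree would be learned from data.

```python
# A hand-written decision tree for a toy "play outside?" task.
# Internal nodes test a feature, branches encode the decision rule,
# and leaves hold the predicted outcome. All names here are illustrative.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "feature": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node


sample = {"outlook": "sunny", "humidity": "normal", "windy": "no"}
print(predict(tree, sample))  # prints "play"
```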
Because decision trees are prone to overfitting, ensemble methods, like boosting, can often be used to create more robust models. Boosting combines multiple individual weak trees (that is, models that perform slightly better than random chance) to form a strong learner. Each weak learner is trained sequentially to correct the errors made by the previous models. After many iterations, often hundreds, the combined weak learners form a strong learner.
Random forests and boosting algorithms are both popular ensemble learning techniques that use individual learner trees to improve predictive performance. Random forests are based on the concept of bagging (bootstrap aggregating): they train each tree independently and then combine the trees’ predictions. Boosting algorithms instead use an additive approach where weak learners are sequentially trained to correct the previous models’ mistakes.
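To make the contrast concrete, here is a minimal sketch that trains one model of each kind with scikit-learn. The synthetic dataset and every hyperparameter value are assumptions chosen for illustration, and the accuracies it prints say nothing general about which method is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Bagging: trees are trained independently on bootstrap samples,
# and their predictions are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Boosting: shallow trees are added one at a time,
# each one fit to the errors of the ensemble built so far.
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
boosted.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
boosted_acc = boosted.score(X_test, y_test)
print(f"random forest accuracy:     {forest_acc:.3f}")
print(f"gradient boosting accuracy: {boosted_acc:.3f}")
```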
Gradient boosted decision trees are a type of boosting algorithm that uses gradient descent. Like other boosting methodologies, gradient boosting starts with a weak learner to make predictions. The first decision tree in gradient boosting is called the base learner. Next, new trees are created in an additive manner based on the base learner’s mistakes. After each tree is added, the algorithm calculates the residuals, which are the differences between the actual values and the model’s predicted values, to determine how far off the predictions were from reality. The residuals are then aggregated to score the model with a loss function.
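The loop described above can be sketched in plain Python. This is a simplified illustration, not how XGBoost is implemented: it assumes squared-error loss, uses the mean of the targets as the base learner, and fits one-split trees ("stumps") to the residuals. The helper names, the toy data and the learning rate of 0.5 are all hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (stump) that minimizes squared error."""
    best = None
    # Candidate thresholds exclude the largest x so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy one-dimensional regression data (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

learning_rate = 0.5
base = sum(ys) / len(ys)       # simplified base learner: predict the mean
preds = [base] * len(ys)
history = [mse(ys, preds)]     # training loss after each round

for _ in range(20):
    # Residuals: actual minus predicted (the negative gradient of squared error).
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Additive update: nudge predictions toward the residuals.
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print(f"MSE with base learner only:   {history[0]:.4f}")
print(f"MSE after 20 boosting rounds: {history[-1]:.4f}")
```

Each round shrinks the training error a little, which is the sense in which many weak learners add up to a strong one. The learning rate scales each tree's contribution so that no single tree dominates.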
In machine learning, loss functions are used to measure a model’s performance. The gradient in gradient boosted decision trees refers to gradient descent, a popular optimization algorithm that is used to minimize the loss (that is, to improve the model’s performance) when new models are trained. Examples of loss functions include mean squared error or mean absolute error for regression problems and cross-entropy loss for classification problems. Custom loss functions may also be developed for a specific use case and dataset.
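For reference, the three loss functions named above can be written in a few lines of plain Python, followed by a toy gradient descent loop. The function names, the sample values and the step size are illustrative assumptions; the loop minimizes mean squared error over a single constant prediction `c`, which is far simpler than what a boosting library does.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Gradient descent on one constant prediction c, minimizing MSE.
y = [3.0, 5.0, 7.0]
c = 0.0
step = 0.1
for _ in range(200):
    grad = sum(2 * (c - t) for t in y) / len(y)  # d(MSE)/dc
    c -= step * grad                             # move against the gradient

print(f"c after gradient descent: {c:.4f}")  # approaches the mean of y, 5.0
print(f"MSE at c: {mean_squared_error(y, [c] * len(y)):.4f}")
```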
Below is a discussion of some of XGBoost’s features in Python that make it stand out compared to the standard gradient boosting implementation in scikit-learn [2]:
In this section, we will go over how to use the XGBoost package, how to select hyperparameters for the XGBoost tree booster, how XGBoost compares to other boosting implementations and some of its use cases.
Assuming you’ve already performed an exploratory data analysis on your data, continue by splitting your data into a training dataset and a testing dataset. Next, convert your data into the DMatrix format that XGBoost expects [3]. DMatrix is XGBoost’s internal data structure, optimized for memory efficiency and training speed [4].
Next, instantiate an XGBoost model and, depending on your use case, select which objective function you’d like to use via the “objective” hyperparameter. For example, if you have a multi-class classification task, you should set the objective to “multi:softmax” [5], which also requires setting the “num_class” parameter to the number of classes. Alternatively, if you have a binary classification problem, you can use the logistic regression objective “binary:logistic”. Now you can use your training set to train the model and predict classifications for the data set aside as the test set. Assess the performance of the model by comparing the predicted values with the test set’s actual values. You may use metrics such as accuracy, precision, recall or F1 score to evaluate your model. You may also want to visualize your true positives, true negatives, false positives and false negatives using a confusion matrix.
Next, you may want to iterate through a combination of hyperparameters to help improve the performance of your model. Hyperparameter tuning is the optimization process for a machine learning algorithm’s hyperparameters. A common approach is grid search with cross-validation, which evaluates every combination in a dictionary of candidate hyperparameter values and keeps the combination with the best cross-validated score.
Below is an explanation of some of the hyperparameters available to tune for gradient boosted trees in XGBoost:
XGBoost is one of many available open-source boosting algorithms. In this section, we’ll compare XGBoost to three other boosting frameworks.
AdaBoost is an early boosting algorithm invented by Yoav Freund and Robert Schapire in 1995 [7]. In AdaBoost, more emphasis is placed on incorrect predictions through a system of weights that gives harder-to-predict data points more influence. First, each data point in the dataset is assigned a specific weight. When the weak learners correctly predict an example, the example’s weight is reduced. But if learners get an example wrong, the weight for that data point increases. As new trees are created, their weights are based on the misclassifications of the previous learner trees. As the number of learners increases, the samples that are easy to predict carry less weight for future learners, while the data points that are harder to predict are weighted more prominently. Gradient boosting and XGBoost tend to be stronger alternatives to AdaBoost due to their accuracy and speed.
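A single round of this reweighting can be sketched as follows, using one common formulation of the binary AdaBoost update. The labels are assumed to be encoded as -1 and +1, and the five toy predictions are made up for illustration.

```python
import math

# Toy labels in {-1, +1} and one weak learner's predictions.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the second example is misclassified

# Start with equal weights on every data point.
n = len(y_true)
weights = [1.0 / n] * n

# Weighted error of the weak learner: total weight of its mistakes.
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's say in the final vote; lower error gives a larger alpha.
alpha = 0.5 * math.log((1 - err) / err)

# Shrink weights of correct examples (t * p = +1),
# grow weights of misclassified examples (t * p = -1), then renormalize.
new_weights = [
    w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)
]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f"weighted error: {err:.2f}")
print(f"learner weight (alpha): {alpha:.3f}")
print("updated weights:", [round(w, 3) for w in new_weights])
```

In this toy round, the misclassified example's weight grows from 0.2 to 0.5 while each correctly classified example drops to 0.125, so the next weak learner is pushed to focus on the example the first one got wrong.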
CatBoost is another gradient boosting framework. Developed by Yandex in 2017, it specializes in handling categorical features without any need for preprocessing and generally performs well out of the box without extensive hyperparameter tuning [8]. Like XGBoost, CatBoost has built-in support for handling missing data. CatBoost is especially useful for datasets with many categorical features. According to Yandex, the framework is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and other tasks.
LightGBM (Light Gradient Boosting Machine) is the final gradient boosting algorithm we will review. LightGBM was developed by Microsoft and first released in 2016 [9]. Where most decision tree learning algorithms grow trees depth-wise, LightGBM uses a leaf-wise tree growth strategy [10]. Like XGBoost, LightGBM exhibits fast model training speed and accuracy and performs well with large datasets.
XGBoost and gradient boosted decision trees are used across a variety of data science applications, including:
