Gradient boosting classifiers in Scikit-Learn and Caret

28 April 2025

 

Joshua Noble

Data Scientist

Gradient boosting classifiers

Gradient boosting is a powerful and widely used machine learning algorithm in data science for classification tasks. It's part of a family of ensemble learning methods, along with bagging, that combine the predictions of multiple simpler models to improve overall performance. A gradient boosting regressor uses gradient boosting to predict a continuous numerical output, while a gradient boosting classifier, which you'll explore in this tutorial, uses gradient boosting to classify input data as belonging to two or more different classes.

Gradient boosting generalizes the earlier AdaBoost algorithm, which boosts decision stumps rather than full trees. Decision stumps are similar to the trees in a random forest, but they have only one node and two leaves. The gradient boosting algorithm builds models sequentially, and each step tries to correct the mistakes of the previous iteration. The training process begins with a weak learner, such as a shallow decision tree, fit to the training data. After that initial fit, gradient boosting computes the error between the actual and predicted values (often called residuals) and then trains a new estimator to predict this error. That new tree is added to the ensemble to update the predictions and build toward a strong learner. Gradient boosting repeats this process until improvement stops or until a fixed number of iterations has been reached. Boosting itself is similar to gradient descent, but it "descends" the gradient by adding new models rather than by updating parameters.
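To make the residual-fitting loop concrete, the following sketch implements it by hand in Python with scikit-learn's DecisionTreeRegressor on a small synthetic regression problem. The dataset and names such as n_rounds and learning_rate are illustrative only and are not part of this tutorial's code.

# A minimal sketch of the boosting loop: fit stumps to residuals and add them up
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())   # start from a constant baseline prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                      # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1)      # a decision stump as the weak learner
    stump.fit(X, residuals)                         # train the stump to predict the residuals
    prediction += learning_rate * stump.predict(X)  # nudge the ensemble toward the target
    trees.append(stump)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))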

Boosting has several advantages: it performs well on tabular data, handles both numerical and categorical features, works well even with default parameters, and is reasonably robust to outliers in the dataset. However, it can be slow to train and is often highly sensitive to the hyperparameters chosen for the training process. On a large dataset, keeping each tree shallow, which is usually controlled through the max depth parameter, and limiting the number of trees can speed up training. Gradient boosting can also be prone to overfitting if not tuned properly; lowering the learning rate for the training process helps prevent this. The process is roughly the same for a classifier or a gradient boosting regressor and underlies the popular XGBoost library, which builds on gradient boosting by adding regularization.
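The knobs described above map directly onto arguments of scikit-learn's GradientBoostingClassifier, which you'll use later in this tutorial. The sketch below shows where the number of trees, the learning rate, and the tree depth are set; the values and the built-in iris dataset are illustrative starting points, not tuned recommendations.

# Illustrative hyperparameter settings for a gradient boosting classifier
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting rounds, that is, the number of trees
    learning_rate=0.1,  # shrinks each tree's contribution; lower values help prevent overfitting
    max_depth=3,        # keeps each tree shallow, which also speeds up training
    random_state=0,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))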

In this tutorial, you'll learn how to use two different programming languages and gradient boosting libraries to classify penguins by using the popular Palmer Penguins dataset.

You can download the notebook for this tutorial from GitHub.

Step 1 Create an R notebook

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook on watsonx®.

Log in to watsonx.ai® by using your IBM® Cloud® account.

Create a watsonx.ai project.

You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

Create a Jupyter Notebook.

Make sure to select "Runtime 24.1 on R 4.3 S (4 vCPU 16 GB RAM)" when you create the notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download the notebook from GitHub to your local system and upload it to your watsonx.ai project as an asset.

Step 2 Configure libraries and data

In R, the caret library is a powerful tool for general data preparation and for model fitting. You'll use it to prepare data and to train the model itself.

install.packages('gbm')
install.packages('caret')
install.packages('palmerpenguins')

library(gbm)
library(caret)  
library(palmerpenguins)

head(penguins) # head() returns the top 6 rows of the dataframe
summary(penguins) # prints a statistical summary of the data columns

Use the createDataPartition function from the caret package to split the original dataset into a training set (70%) and a testing set (30%).


dim(penguins)

# get rid of any NA

penguins <- na.omit(penguins)
parts = caret::createDataPartition(penguins$species, p = 0.7, list = F)

train = penguins[parts, ]
test = penguins[-parts, ]

Now you're ready to train and test.

Step 3 Train and test

The train method from the caret library uses R formulas, where the dependent variable (often also called the target) is on the left side of a tilde (~) and the independent variables (often also called features) are on the right side. For instance:

height ~ age

This formula predicts height based on age.

To caret's train function, you pass the formula, the training data, and the method to use. The caret library provides methods for many different types of models, and setting the method to "gbm" is where you specify gradient boosting. The trControl parameter configures the training process. The "repeatedcv" method performs repeated k-fold cross-validation on subsamples of the training data. Here, you specify 3 repeats of 5-fold cross-validation, using a different set of folds for each repeat.

model_gbm <- caret::train(species ~ .,
                          data = train,
                          method = "gbm", # gbm for gradient boosting machine
                          trControl = trainControl(method = "repeatedcv", 
                                                   number = 5, 
                                                   repeats = 3, 
                                                   verboseIter = FALSE),
                          verbose = 0)

Now you can use the predictive model to make predictions on test data:

pred_test = caret::confusionMatrix(
  data = predict(model_gbm, test),
  reference = test$species
)

print(pred_test)

This step prints:

Confusion Matrix and Statistics
           Reference
Prediction  Adelie Chinstrap Gentoo
  Adelie        42         0      0
  Chinstrap      0        20      0
  Gentoo         1         0     35

Overall Statistics
                                          
               Accuracy : 0.9898          
                 95% CI : (0.9445, 0.9997)
    No Information Rate : 0.4388          
    P-Value [Acc > NIR] : < 2.2e-16       
                  Kappa : 0.984           

 Mcnemar's Test P-Value : NA              

Statistics by Class:
                     Class: Adelie Class: Chinstrap Class: Gentoo
Sensitivity                 0.9767           1.0000        1.0000
Specificity                 1.0000           1.0000        0.9841
Pos Pred Value              1.0000           1.0000        0.9722
Neg Pred Value              0.9821           1.0000        1.0000
Prevalence                  0.4388           0.2041        0.3571
Detection Rate              0.4286           0.2041        0.3571
Detection Prevalence        0.4286           0.2041        0.3673
Balanced Accuracy           0.9884           1.0000        0.9921

Because the training/testing partition and the cross-validation folds are chosen randomly, the sensitivity and specificity for each class, as well as the overall accuracy, can be slightly different from what is shown here each time you run the code; you can call set.seed() before createDataPartition and train to make the results reproducible. The accuracy is quite good, even for the Chinstrap penguin, which makes up only about 20% of the training dataset.

Step 4 Create a Python notebook

Now you'll learn how to create a gradient boosting model in Python. In the same project that you created previously, create a Jupyter Notebook.

Create the notebook by using Python 3.11 in IBM Watson Studio, and make sure to select "Runtime 24.1 on Python 3.11 XXS (1 vCPU 4 GB RAM)" when you create it. You're now ready to create a gradient boosting classifier in Python.

Step 5 Configure libraries and data

This step installs the libraries that you'll use to train and test your gradient boosting classifier. The training itself is done with scikit-learn and the data comes from the palmerpenguins library.

!pip install seaborn pandas scikit-learn palmerpenguins

Now import the libraries into the notebook environment:

import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from palmerpenguins import load_penguins

As in the R code, there are some NA rows in the penguins dataset that need to be removed. This code snippet loads the dataset, removes any NA rows, and then splits the data into features and target.

# Load the penguins
penguins = load_penguins() #initialize the dataset
penguins = penguins.dropna()
X = penguins.drop("species", axis=1)
y = penguins["species"]

Now create a training and testing split of the dataset, with 70% of the data pulled for training and 30% reserved for testing.

X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.3, random_state=42
)
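Note that caret's createDataPartition samples within the levels of the outcome, so the R split is stratified by species. If you want the Python split to preserve class proportions in the same way, you can pass the optional stratify argument; this tweak is not required by the tutorial.

# Optional: a stratified split that mirrors caret::createDataPartition's behavior
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.3, random_state=42, stratify=y
)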

Next, you'll gather two lists of column names: one for the categorical features of X and another for the numerical features (for example, float64 or int64 columns). Then, use ColumnTransformer from scikit-learn to apply different preprocessing to different column types. A OneHotEncoder is applied to categorical features to convert them into binary vectors. A StandardScaler is applied to numerical features to standardize them to a mean of 0 and a variance of 1.


# Define categorical and numerical features
categorical_features = X.select_dtypes(
   include=["object"]
).columns.tolist()

numerical_features = X.select_dtypes(
   include=["float64", "int64"]
).columns.tolist()

preprocessor = ColumnTransformer(
   transformers=[
       ("cat", OneHotEncoder(), categorical_features),
       ("num", StandardScaler(), numerical_features),
   ]
)
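If you want to sanity-check the preprocessor before training, you can fit it on the training features on its own and inspect the result. This check is optional and not part of the original tutorial; it assumes a scikit-learn version recent enough to provide get_feature_names_out.

# Optional sanity check: the one-hot encoded island and sex columns expand into several binary columns
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)
print(preprocessor.get_feature_names_out())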

Step 6 Train and test

Now that you've created the feature sets and the preprocessor, you can create a pipeline to train the model. That pipeline runs the preprocessor on the input data and then passes it to the gradient boosting algorithm to build the classifier. With a larger or more complex dataset, there are other training parameters that you might want to configure: for instance, max_features, which sets the number of features to consider when looking for the best split, or max_depth, which limits the depth of each tree. This code snippet sets the criterion parameter, which measures the quality of a split during training. In this case, we're using the mean squared error with improvement score proposed by Jerome Friedman (friedman_mse).

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        random_state=42, criterion="friedman_mse", max_features=2
    )),
])
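If you later want to experiment with the other parameters mentioned above, the classifier inside the pipeline can be configured with them explicitly. The values in this sketch are illustrative starting points rather than tuned settings.

# An alternative, more explicit configuration of the classifier inside the pipeline
tuned_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,      # number of boosting stages (trees)
        learning_rate=0.05,    # smaller steps to reduce overfitting
        max_depth=3,           # limit the depth of each tree
        max_features=2,        # features considered when searching for the best split
        criterion="friedman_mse",
        random_state=42,
    )),
])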

Next, perform cross-validation to evaluate how well your machine learning pipeline performs on the training data. Calling the fit method of the pipeline you created trains the model. For the classifier, the loss function being minimized defaults to log loss; friedman_mse is the criterion used to score candidate splits within each tree.

# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Fit the model on the training data
pipeline.fit(X_train, y_train)
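If you'd like to see how stable the model is across folds rather than only the mean, you can also print the individual fold scores; this extra check is optional.

# Optional: inspect the accuracy of each of the 5 folds
for fold, score in enumerate(cv_scores, start=1):
    print(f"Fold {fold}: {score:.4f}")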

Now that the model has been trained, predict the test set and check the performance:


# Predict on the test set
y_pred = pipeline.predict(X_test)

# Generate classification report
report = classification_report(y_test, y_pred)

Print the results:

print(f"Mean Cross-Validation Accuracy: {cv_scores.mean():.4f}")
print("\nClassification Report:")
print(report)

This step prints the following results:

Mean Cross-Validation Accuracy: 0.9775
Classification Report:
              precision    recall  f1-score   support
      Adelie       1.00      1.00      1.00        31
   Chinstrap       1.00      1.00      1.00        18
      Gentoo       1.00      1.00      1.00        18
    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67

These results are very close to the accuracy reported by the R methods in the first part of this tutorial.
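To compare the two models more directly, you can also print a confusion matrix for the scikit-learn predictions, analogous to caret's confusionMatrix output in the R portion of this tutorial. This extra step is optional.

# Optional: a confusion matrix for the Python model, for comparison with the R output
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred, labels=["Adelie", "Chinstrap", "Gentoo"]))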
