Time series sales forecasting with IBM Granite

6 June 2025

Authors

Joe Sepi

Program Director, Open Technologies and Developer Advocacy

IBM

Meredith Syed

AI Developer Advocate Lead

In this tutorial, we will explore time series forecasting using an IBM Granite® Time Series Foundation Model (TSFM) to predict retail sales. In addition to forecasting, we will cover key machine learning techniques such as few-shot fine-tuning, an efficient fine-tuning strategy that uses only a portion of the designated training data. The sales data for this use case comes from the M5 datasets in the M-Competitions repository, provided as part of the long-running M-competition series that encourages research and development of forecasting methods. The aim of this tutorial is to forecast future sales aggregated by state, while showcasing how to use a pretrained TSFM for multivariate forecasting. In our analysis, we'll also use features available within the open source Granite Time Series Foundation Models toolkit, granite-tsfm.

In conjunction with the granite-tsfm toolkit, this forecasting analysis leverages a TinyTimeMixer (TTM) model from the family of Granite TSFMs (available on HuggingFace and other open-source platforms). These TTMs are compact, pretrained models open sourced by IBM Research and capable of multivariate time series forecasting. The name TTM refers to the unique architecture of the models. With less than 1 million parameters, TTM introduced the notion of the first-ever "tiny" pretrained model for time series forecasting. TTM outperforms several popular benchmark models that demand billions of parameters in zero-shot and few-shot forecasting, and it can easily be fine-tuned for multivariate forecasts. Compared to traditional time series forecasting methods used in data science, Granite TSFMs have impressive zero-shot performance and support efficient fine-tuning for even more accurate predictions.

Step 1: Set up your environment

1.1: Install the TSFM library

 

The granite-tsfm library provides utilities for working with Time Series Foundation Models (TSFM). Here we retrieve and install a pinned release of the Python library from GitHub. The currently supported Python versions are 3.9, 3.10, 3.11 and 3.12.

# Install the granite-tsfm library and gdown, a utility to help download data files from Google Drive during data preparation
! pip install "granite-tsfm[notebooks] @ git+https://github.com/ibm-granite/granite-tsfm.git@v0.2.22" gdown -q

 

1.2: Import packages

 

In addition to standard packages for data science, we'll use functionality from the tsfm_public/toolkit directory to prepare the data, fine-tune the model and generate forecasts.

To subset our data into training and test portions, we'll use the prepare_data_splits and select_by_timestamp functions. To visualize the data, we'll use the plot_predictions function. For preprocessing the data, the TimeSeriesPreprocessor class performs data transformations such as standardization and encoding categorical variables.

For fine-tuning, the toolkit uses a custom dataset type called ForecastDFDataset, which is optimized for fast training and forecasting by leveraging torch. We'll also leverage the TrackingCallback class, the count_parameters function and the optimal_lr_finder function during fine-tuning.

To interact with the model, we'll use the TinyTimeMixerForPrediction class. Lastly, for forecasting we'll use the TimeSeriesForecastingPipeline class.

In addition to the tsfm_public toolkit, we'll use functionality from transformers and torch for the fine-tuning step.

import math
import os

import numpy as np
import pandas as pd
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import Subset
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments, set_seed

from tsfm_public.toolkit.time_series_preprocessor import prepare_data_splits
from tsfm_public.toolkit.util import select_by_timestamp
from tsfm_public.toolkit.visualization import plot_predictions

from tsfm_public import (
    ForecastDFDataset,
    TimeSeriesForecastingPipeline,
    TimeSeriesPreprocessor,
    TinyTimeMixerForPrediction,
    TrackingCallback,
    count_parameters,
)
from tsfm_public.toolkit.lr_finder import optimal_lr_finder

1.3: Specify configuration variables

Granite Time Series Models are sized with defined context lengths and forecast lengths. The context length determines how many historical data points the model looks back on when making predictions. The forecast length specifies how many data points in the future to predict. Here, we define forecast_length and context_length to work with the model we will select. It's important to note that these values must be compatible with the specification of the selected model.

Next, we declare the Granite Time Series Foundation Model, including the specific revision that we are targeting. In this time series analysis example, we will be working with daily data, so we choose a model suitable for that resolution: 90 days of historical data to forecast the next 30 days. If a different forecast_length or context_length makes sense for your data, the granite-timeseries-ttm-r2 model card has several revisions of the model available for various context lengths and prediction lengths.

forecast_length = 28
context_length = 90

TTM_MODEL_PATH = "ibm-granite/granite-timeseries-ttm-r2"
REVISION = "90-30-ft-l1-r2.1"

device = "cuda" if torch.cuda.is_available() else "cpu"

Step 2: Prepare the Data

As noted earlier, this notebook uses the M5 datasets from the official M-Competitions repository. You can read more about the competition and the dataset here.

2.1: Read in the data

 

Following initial data analysis, we observed that the original time series data includes hierarchy and product information. To prepare the data for this forecasting experiment, we will aggregate the sales by state into three separate time series. We'll use the prepare_data function in the M5_retail_data_prep.py file included here to download the datasets and prepare them as described.

Let's make sure we have access to the M5_retail_data_prep.py file (in an environment like Colab, we need to download the file). Then, we simply run the prepare_data function to save the prepared dataset.

import requests


def download_file(file_url, destination):
    if os.path.exists(destination):
        return
    response = requests.get(file_url)
    if response.status_code == 200:
        with open(destination, "wb") as file:
            file.write(response.content)
        # logger.info(f"Downloaded: {destination}")
    else:
        print(f"Failed to download {file_url}. Status code: {response.status_code}")


download_file(
    file_url="https://raw.githubusercontent.com/ibm-granite-community/granite-timeseries-cookbook/refs/heads/main/recipes/Retail_Forecasting/M5_retail_data_prep.py",
    destination="./M5_retail_data_prep.py",
)
# From the file we just made sure we had access to, import the data prep method
from M5_retail_data_prep import prepare_data

prepare_data()

Following a typical data science workflow, we parse the resulting CSV file into a pandas DataFrame. We also ensure that the timestamp column is parsed as a datetime and drop two unnecessary columns.

data_path = "m5_for_state_level_forecasting.csv.gz"

data = pd.read_csv(data_path, parse_dates=["date"]).drop(columns=["d", "weekday"])
data.head()
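As a quick sanity check on the aggregation described above, we can confirm the series identifiers and the date range. This is a minimal sketch, assuming the prepared file contains one daily row per state (the state_id and date columns are used throughout the rest of this tutorial).

# Confirm the three state-level series and the covered date range
data["state_id"].unique(), data["date"].min(), data["date"].max()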

2.2: Set column identifiers

 

The next step for our time series analysis is to organize the columns in our sales data with the naming conventions required for input to our model.

In preparation for creating a TimeSeriesPreprocessor object, we'll set up the column_specifiers dictionary here, declaring the names of the timestamp_column and the target_columns to be predicted, as well as indicating categorical_columns for encoding.

By specifying a list of cols for control_columns, we include the remaining columns in the forecasting dataset, allowing for the potential interactions of these exogenous variables with our target column, the sales values.

These column designations are important for the fine-tuning workflow, which permits multivariate forecasting by using the exogenous variables.

# Keep every column except the timestamp, target and state identifiers as control columns
cols = [c for c in data.columns if c not in ["date", "sales", "state_id", "state_id_cat"]]
cols

column_specifiers = {
    "timestamp_column": "date",
    "id_columns": ["state_id"],
    "target_columns": ["sales"],
    "control_columns": cols,
    "static_categorical_columns": ["state_id_cat"],
    "categorical_columns": [
    "event_name_1",
    "event_type_1",
    "event_name_2",
    "event_type_2",
    ],
}

2.3: Train a preprocessor

 

Next, we set up a TimeSeriesPreprocessor, passing the column_specifiers dictionary we created, as well as indicating that it should scale the data by using sklearn's StandardScaler and ordinally encode the categorical_columns we defined in the previous step.

The select_by_timestamp method is used to subset the training portion of the input data.

We train the scaling algorithm using the tsp.train method. Later, we'll use the preprocess method on our tsp preprocessor object to apply the scaling algorithm.

tsp = TimeSeriesPreprocessor(
    **column_specifiers,
    context_length=context_length,
    prediction_length=forecast_length,
    scaling=True,
    encode_categorical=True,
    scaler_type="standard",
)

df_train = select_by_timestamp(
    data, timestamp_column=column_specifiers["timestamp_column"], end_timestamp="2016-05-23"
)

trained_tsp = tsp.train(df_train)

2.4: Split the data

 

For fine-tuning, we split the time series data into training, validation and test sets by using the prepare_data_splits function, this time including the extra columns as multivariate input. The training and validation sets are used in the fine-tuning loop during model training, while the test set is used to evaluate model performance after fine-tuning.

split_params = {"train": 0.5, "test": 0.25}

train_data, valid_data, test_data = prepare_data_splits(
    data, id_columns=column_specifiers["id_columns"], split_config=split_params, context_length=context_length
)

2.5: Create torch datasets

 

Next, we will construct three datasets of the custom type ForecastDFDataset by using the created data splits train_data, valid_data and test_data. We apply the preprocess method from the TimeSeriesPreprocessor class to prepare the data before creating the datasets.

In comparison to a pandas DataFrame, the ForecastDFDataset type is specifically designed for fine-tuning and forecasting, leveraging torch for faster performance. The HuggingFace Trainer API, which we'll be using for fine-tuning, can use these torch-based datasets.

frequency_token = tsp.get_frequency_token(tsp.freq)

dataset_params = column_specifiers.copy()
dataset_params["frequency_token"] = frequency_token
dataset_params["context_length"] = context_length
dataset_params["prediction_length"] = forecast_length


train_dataset = ForecastDFDataset(tsp.preprocess(train_data), **dataset_params)
valid_dataset = ForecastDFDataset(tsp.preprocess(valid_data), **dataset_params)
test_dataset = ForecastDFDataset(tsp.preprocess(test_data), **dataset_params)

2.6: Sample the data

 

Before beginning fine-tuning, we further subset the training and validation datasets to implement a few-shot fine-tuning strategy. This strategy is more efficient than fine-tuning on the whole training dataset. Here, we sample the torch datasets produced in the preceding steps, reducing each dataset to 20% of its original size.

# 20% training and validation data (few-shot finetuning)
fewshot_fraction = 0.20
n_train_all = len(train_dataset)
train_index = np.random.permutation(n_train_all)[: int(fewshot_fraction * n_train_all)]
train_dataset = Subset(train_dataset, train_index)

n_valid_all = len(valid_dataset)
valid_index = np.random.permutation(n_valid_all)[: int(fewshot_fraction * n_valid_all)]
valid_dataset = Subset(valid_dataset, valid_index)

n_train_all, len(train_dataset), n_valid_all, len(valid_dataset)

(2601, 520, 1398, 279)

Step 3: Fine-tune the model

Now we will focus on fine-tuning the pretrained model. The TinyTimeMixer architecture allows for fast fine-tuning, even on a CPU. In the following image from the IBM Research paper, the workflow for fine-tuning is illustrated in part (a).

In the diagram, part (a) depicts the TTM architecture with 4 main components:

  1. the TTM backbone composed of TSMixer blocks,
  2. the TTM decoder, similarly architected, but 10–20% the size of the backbone,
  3. the forecast head that produces forecasts and
  4. an exogenous mixer, an optional component permitting multivariate forecasts.

The TTM decoder (2) and the forecast head (3) make up the TTM head. This head is typically retrained during fine-tuning, which is more efficient than fine-tuning the larger backbone component.

In contrast to the fine-tuning workflow, the pretraining workflow—also shown in part (a)—doesn't permit multivariate input, so it cannot leverage exogenous variables in forecasting. The model is initially trained with univariate input, without considering interactions between variables, as indicated by the channel-independent methods in both the backbone and decoder components of the model in the pretraining workflow.

In the fine-tuning workflow, multivariate input is permitted and leveraged for its potential interactions with the target variable, by enabling channel mixing in the decoder and the optional exogenous mixer component.

You can read more about the steps for fine-tuning in the Granite docs.

3.1: Load the model

 

First, we use the from_pretrained method to load the TTM model (available on HuggingFace) by using the model path and revision we set earlier. As indicated in our previously defined column_specifiers dictionary, for this time series analysis we have one target channel, several exogenous channels and one static categorical input. These exogenous channels, provided as additional columns beyond the target column in the fine-tuning workflow, might demonstrate interactions with the target channel that we can leverage for forecasting.

To provide these columns as input to the model, we pass the prediction_channel_indices, exogenous_channel_indices and categorical_vocab_size_list information to the model via the TimeSeriesPreprocessor object tsp.

Note that we also enable channel mixing in the decoder by setting decoder_mode="mix_channel" and forecast channel mixing by setting enable_forecast_channel_mixing=True, which leverages the exogenous mixer component. This configuration allows the decoder to be tuned to capture interactions between the channels and to adjust the forecasts based on interactions with the exogenous variables, permitting multivariate forecasting.

set_seed(1234)

finetune_forecast_model = TinyTimeMixerForPrediction.from_pretrained(
    TTM_MODEL_PATH,
    revision=REVISION,
    context_length=context_length,
    prediction_filter_length=forecast_length,
    num_input_channels=tsp.num_input_channels,
    decoder_mode="mix_channel", # exog: set to mix_channel for mixing channels in history
    prediction_channel_indices=tsp.prediction_channel_indices,
    exogenous_channel_indices=tsp.exogenous_channel_indices,
    fcm_context_length=1, # exog: indicates lag length to use in the exog fusion. for Ex. if today sales can get affected by discount on +/- 2 days, mention 2
    fcm_use_mixer=True, # exog: Try true (1st option) or false
    fcm_mix_layers=2, # exog: Number of layers for exog mixing
    enable_forecast_channel_mixing=True, # exog: set true for exog mixing
    categorical_vocab_size_list=tsp.categorical_vocab_size_list, # sizes of the static categorical variables
    fcm_prepend_past=True, # exog: set true to include lag from history during exog infusion.
)

 

After loading the model, we see a message reminding us to fine-tune the model before using it on a forecasting task.
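Because only the decoder and forecast head are typically retrained during fine-tuning, it can be instructive to compare their size with the backbone. The following is a minimal sketch, assuming the loaded model exposes a backbone attribute (the same attribute used in the optional freezing step below).

# Sketch: compare the backbone size with the decoder and head that are retrained during fine-tuning
backbone_params = sum(p.numel() for p in finetune_forecast_model.backbone.parameters())
total_params = sum(p.numel() for p in finetune_forecast_model.parameters())
print(f"Backbone parameters: {backbone_params:,}")
print(f"Decoder and head parameters: {total_params - backbone_params:,}")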

3.2: Optional: Freeze the TTM backbone

 

Oftentimes during fine-tuning, we freeze the backbone component of the model, leaving its pretrained weights unchanged, and focus on tuning only the parameters in the decoder. This step reduces the overall number of parameters being tuned and preserves what the backbone learned during pretraining.

However, in this time series analysis, we found that performance was better when the backbone remained unfrozen; for other datasets, freezing the backbone might be preferable. We have disabled the backbone-freezing code but left it intact as an example of what might be needed for other datasets.

freeze_backbone = False
if freeze_backbone:
    print(
        "Number of params before freezing backbone",
        count_parameters(finetune_forecast_model),
    )

    # Freeze the backbone of the model
    for param in finetune_forecast_model.backbone.parameters():
        param.requires_grad = False

    # Count params
    print(
        "Number of params after freezing the backbone",
        count_parameters(finetune_forecast_model),
    )

 

3.3: Set hyperparameters

 

We'll use the Trainer API from HuggingFace for fine-tuning. To set up our fine-tuning, we need to specify values for hyperparameters. In the following code, we set values for num_epochs, batch_size and learning_rate, which we'll soon pass as TrainingArguments when we create a Trainer object. We use the optimal_lr_finder function from the tsfm_public toolkit to find the optimal learning_rate for this dataset. Note that these TrainingArguments are specific to this particular dataset.

num_epochs = 50
batch_size = 64

learning_rate, finetune_forecast_model = optimal_lr_finder(
    finetune_forecast_model,
    train_dataset,
    batch_size=batch_size,
    enable_prefix_tuning=True,
)
print("OPTIMAL SUGGESTED LEARNING RATE =", learning_rate)

OPTIMAL SUGGESTED LEARNING RATE = 0.000298364724028334

3.4: Train the model

 

Here we train the model on the historical training data by using the hyperparameters that we previously set. We create the Trainer object by using the specified TrainingArguments, an EarlyStoppingCallback criterion, an AdamW optimizer and a OneCycleLR scheduler. We've set num_epochs to 50, but the EarlyStoppingCallback implemented here will stop the training after 10 epochs with no improvement. After the Trainer object is configured, we call the train method to perform fine-tuning and the evaluate method to compute metrics on the test dataset.

OUT_DIR = "ttm_finetuned_models/"

print(f"Using learning rate = {learning_rate}")
finetune_forecast_args = TrainingArguments(
    output_dir=os.path.join(OUT_DIR, "output"),
    overwrite_output_dir=True,
    learning_rate=learning_rate,
    num_train_epochs=num_epochs,
    do_eval=True,
    eval_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=2 * batch_size,
    dataloader_num_workers=1,
    report_to="none",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    logging_dir=os.path.join(OUT_DIR, "logs"), # Make sure to specify a logging directory
    load_best_model_at_end=True, # Load the best model when training ends
    metric_for_best_model="eval_loss", # Metric to monitor for early stopping
    greater_is_better=False, # For loss
    use_cpu=device == "cpu",
)

# Create the early stopping callback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=10, # Number of epochs with no improvement after which to stop
    early_stopping_threshold=0.0, # Minimum improvement required to consider as improvement
)
tracking_callback = TrackingCallback()

# Optimizer and scheduler
optimizer = AdamW(finetune_forecast_model.parameters(), lr=learning_rate)
scheduler = OneCycleLR(
    optimizer,
    learning_rate,
    epochs=num_epochs,
    steps_per_epoch=math.ceil(len(train_dataset) / (batch_size)),
)

finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback, tracking_callback],
    optimizers=(optimizer, scheduler),
)

# Fine tune
finetune_forecast_trainer.train()

finetune_forecast_trainer.evaluate(test_dataset)

Using learning rate = 0.000298364724028334

{'eval_loss': 0.3611494302749634,
'eval_runtime': 15.3951,
'eval_samples_per_second': 90.613,
'eval_steps_per_second': 0.715,
'epoch': 37.0}
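Optionally, you can persist the fine-tuned model so that it can be reloaded later without repeating training. The following is a minimal sketch using the standard HuggingFace save_pretrained and from_pretrained methods; the directory name is arbitrary.

# Optional: save the fine-tuned model for later reuse (directory name is an example)
save_dir = os.path.join(OUT_DIR, "ttm_finetuned_m5")
finetune_forecast_model.save_pretrained(save_dir)

# Reload it later with:
# reloaded_model = TinyTimeMixerForPrediction.from_pretrained(save_dir)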

Step 4: Forecasting and evaluation

4.1: Generate forecasts

 

We'll leverage the TimeSeriesForecastingPipeline from granite-tsfm to create forecasts by using our fine-tuned model. The following is a preview of these forecasts.

# generate forecasts using the finetuned model
pipeline = TimeSeriesForecastingPipeline(
    finetune_forecast_model,
    device=device, # Specify your local GPU or CPU.
    feature_extractor=tsp,
    batch_size=batch_size,
)

# Make a forecast on the target column given the input data.
finetune_forecast = pipeline(test_data)
finetune_forecast.head()

4.2: Evaluate the model

 

To assess whether the model produces accurate predictions for our time series analysis, we'll evaluate the fine-tuned model on the original test_data (a pandas DataFrame). Then, we'll quantify the forecast errors by defining a custom_metric function to calculate the MSE (Mean Squared Error), RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error).

pd.options.display.float_format = '{:.2f}'.format

# Define some standard metrics.
def custom_metric(actual, prediction, column_header="results"):
    """Compute MSE, RMSE and MAE between actual and predicted values."""
    a = np.asarray(actual.tolist())
    p = np.asarray(prediction.tolist())
    if p.shape[1] < a.shape[1]:
        a = a[:, : p.shape[1]]

    # Ignore windows that contain NaN actuals
    mask = ~np.any(np.isnan(a), axis=1)

    mse = np.mean(np.square(a[mask, :] - p[mask, :]))
    mae = np.mean(np.abs(a[mask, :] - p[mask, :]))
    return pd.DataFrame(
        {
            column_header: {
                "mean_squared_error": mse,
                "root_mean_squared_error": np.sqrt(mse),
                "mean_absolute_error": mae,
            }
        }
    )

Then, we'll calculate error metrics using the predefined custom_metric function. The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) metrics demonstrate low error for this forecasting problem on this dataset. While the sales values that we are predicting are in the tens of thousands, the RMSE and MAE hover around USD 1,000. The MSE (Mean Squared Error) is much larger because it is expressed in squared units and penalizes large errors, such as outlier predictions, more heavily.

custom_metric(finetune_forecast["sales"], finetune_forecast["sales_prediction"], "fine-tune forecast")

4.3: Plot the predictions vs. actuals

 

Finally, we use the plot_predictions function from granite-tsfm to plot the predicted sales against the actual values for some random samples of time intervals in the test dataset, visualizing the forecast error. Using random samples of time intervals allows us to see how the model handles fluctuations in the data. We can also observe in these plots the past values that served as the context window.

plot_predictions(
    input_df=test_data[test_data.state_id == "CA"],
    predictions_df=finetune_forecast[finetune_forecast.state_id == "CA"],
    freq="d",
    timestamp_column=column_specifiers["timestamp_column"],
    channel=column_specifiers["target_columns"][0],
)
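The plot above shows the California series. To inspect the other state-level series, the same call can be repeated; here is a brief sketch, assuming the prepared M5 data uses the state_id values "TX" and "WI" for the remaining states.

# Plot the remaining state-level series (state_id values assumed from the M5 dataset)
for state in ["TX", "WI"]:
    plot_predictions(
        input_df=test_data[test_data.state_id == state],
        predictions_df=finetune_forecast[finetune_forecast.state_id == state],
        freq="d",
        timestamp_column=column_specifiers["timestamp_column"],
        channel=column_specifiers["target_columns"][0],
    )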

Summary

In this tutorial, we performed time series forecasting for a sales data use case. Our time series analysis demonstrates forecasting methods by using a Granite TSFM model. This model is a compact pretrained foundation model that has been shown to outperform other machine learning algorithms including ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) networks. While our use case focuses on sales forecasting, Granite TSFM models can aid data scientists by forecasting values for numerous variables while being robust to outliers and seasonal variations in the data. Some examples of these variables include weather, stock prices, flu cases and other temporal data. Time series predictions like these enable real-world, data-driven decision making.
