Time Series Forecasting with Lag-Llama

3 October 2024

Authors

Joshua Noble

Data Scientist

Introduction

Forecasting is an important task in time series analysis: a data scientist uses machine learning to identify patterns in historical data and then generate predictions about the future. Deep learning for forecasting is an exciting topic in artificial intelligence that is beginning to show promise against benchmarks from more traditional statistical methods such as ARIMA. Foundation models for time series data are similar to other forms of generative AI: they are trained on large-scale time series datasets and can output either deterministic or probabilistic forecasts. A time series foundation model can create forecasts for data it has never seen, much as a large language model (LLM) can output text without being trained on a specific task.

Foundation models such as Moirai, TimeGPT-1 and TimesFM have been built for time series forecasting, but these are deterministic models. Lag-Llama is a general-purpose, open source foundation model for probabilistic time series forecasting on univariate datasets that uses a transformer architecture. The paper introducing it is titled Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting, by Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Hena Ghonia and others.

To unpack this a little, a probabilistic forecast generates a probability distribution of values for each forecast step rather than a single value. This indicates how certain the model is about its predictions: a wide distribution indicates low certainty, while a narrower range of values indicates the model is fairly certain. Purely deterministic forecasts carry no such measure of certainty, which can be problematic when we need to know how much confidence to place in a prediction.
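As a concrete illustration, suppose a probabilistic model returns 20 sampled forecast paths for a 7-step horizon; summarizing them with quantiles yields both a central forecast and an interval whose width reflects the model's certainty at each step. Here's a minimal sketch with synthetic numbers:

import numpy as np

# 20 hypothetical sample paths for a 7-step forecast (synthetic data)
samples = np.random.normal(loc=12.0, scale=2.0, size=(20, 7))

median = np.quantile(samples, 0.5, axis=0)  # central forecast
lower = np.quantile(samples, 0.05, axis=0)  # 5th percentile
upper = np.quantile(samples, 0.95, axis=0)  # 95th percentile

# a wider 90% interval at a given step means the model is less certain there
print(upper - lower)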

A univariate forecast is one that doesn't use covariates. In the Granite Tiny Time Mixer tutorial, we used air temperature, wind speed and wind direction as covariate terms to forecast air pollution readings. In a univariate time series, by contrast, only a single variable is observed over time. Lag-Llama uses lag features, which are previous readings from the time series itself, as covariates; in this way, it is conceptually similar to ARIMA models.
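To make the idea of lag features concrete, here's a toy sketch of building them by hand with pandas. Lag-Llama constructs its lags internally, so this is purely illustrative:

import pandas as pd

# a toy univariate temperature series
s = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8], name="Temp")

# lag features are simply shifted copies of the series itself
lags = pd.DataFrame({
    "Temp": s,
    "lag_1": s.shift(1),  # the previous reading
    "lag_2": s.shift(2),  # the reading before that
})
print(lags)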

In this tutorial, we'll use the Lag-Llama model and see how it does in two different forecasting tasks. First is zero-shot learning, where the model is not trained on the data it's trying to predict; it's an interesting test of how well the model can detect and respond to the patterns present in the time series. After that, we'll fine-tune the model to see whether there's a performance boost. On the free tier of watsonx, fine-tuning might take up to an hour, but the model can be saved after that step for later use. The paid tier of watsonx provides access to a GPU that cuts the fine-tuning time to around 10 minutes. With that in mind, let's get started.

Steps

Step 1: Set up your environment

In this step, we’ll guide you through creating an IBM account to access Jupyter Notebooks.

1. Log in to watsonx.ai using your IBM® Cloud® account.

2. Click + to create a new project.

a. Select Create an empty project.

b. Enter a project name in the Name field.

c. Create a Cloud Object Storage instance for storing your project assets, if one is not already created.

d. Select Create.

3a. Create a Jupyter Notebook.

a. Select the Assets tab in your project environment.

b. Click New asset.

c. Select the Working with models option in the left panel.

d. Click Working with data and models using Python and R notebooks.

e. Enter a name for your notebook in the Name field. Choose Runtime 23.1 on Python (2 vCPU 8 GB RAM) to define the configuration.

f. Select Create.

3b. Upload a Jupyter Notebook.

a. Select the Assets tab in your project environment.

b. Click New asset.

c. Select the Working with models option in the left panel.

d. Click Working with data and models using Python and R notebooks.

e. Download the notebook from GitHub.

f. Click Local file in the left-hand tab and select the downloaded notebook.

g. Enter a name for your notebook in the Name field. Choose Runtime 23.1 on Python (2 vCPU 8 GB RAM) to define the configuration.

h. Select Create.

Step 2: Clone the Lag-Llama repository and install libraries

To use the Lag-Llama model, we'll clone the open source GitHub repository and install it on our watsonx.ai instance.

!git clone https://github.com/time-series-foundation-models/lag-llama/
%cd lag-llama
!pip install -r requirements.txt --quiet

Next we need to download the pretrained model weights from the Hugging Face repository where they're stored. To do this, we use the Hugging Face command-line interface (CLI) to download the trained Lag-Llama checkpoint:

!huggingface-cli download time-series-foundation-models/Lag-Llama lag-llama.ckpt --local-dir ~/work/lag-llama

Now we have the weights for a pretrained model that we can use in zero-shot forecasting and fine-tuning.

Step 3: Import libraries and data

Next, we need to import the libraries required to work with Lag-Llama. The library that the Lag-Llama team built is based on GluonTS, a PyTorch-based library for working with time series data and forecasting models.

from itertools import islice

from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from tqdm.autonotebook import tqdm

import torch
from gluonts.evaluation import make_evaluation_predictions, Evaluator
from gluonts.dataset.repository.datasets import get_dataset

from gluonts.dataset.pandas import PandasDataset
import pandas as pd

from lag_llama.gluon.estimator import LagLlamaEstimator

Next, we load the data from the GitHub repository. This time series contains the daily minimum temperatures in Melbourne, Australia. Because the dataset is univariate, each row consists simply of a date and a temperature reading in degrees Celsius.

It does have two missing dates that we need to fill in by resampling the dataset and interpolating between values so that there are no missing values in the time series.

df = pd.read_csv("https://raw.githubusercontent.com/joshuajnoble/Lag-Llama-Tutorial/main/daily-min-temperatures.csv")
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index("Date")
# resample to a strict daily frequency; mean() leaves the two inserted dates
# as NaN so that interpolate() can fill them (sum() would fill them with 0)
df = df.resample('D').mean().interpolate('linear')
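As an optional sanity check, we can confirm that the resampled index is strictly daily and gap-free (a quick sketch, assuming the temperature column is named Temp as in this dataset):

# every consecutive pair of dates should be exactly one day apart
assert df.index.to_series().diff().dropna().eq(pd.Timedelta(days=1)).all()
# and interpolation should have left no missing temperatures
assert not df["Temp"].isna().any()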

With a complete time series, we can split our pandas DataFrame into training, validation and test datasets.

# cast the numeric columns to float32, the dtype GluonTS expects
for col in df.columns:
    if not pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].astype('float32')

train_end = round(len(df) * 0.7)
valid_end = round(len(df) * 0.9)
train = PandasDataset(df[:train_end], freq="1d", target="Temp")
valid = PandasDataset(df[train_end:valid_end], freq="1d", target="Temp")
test = PandasDataset(df[valid_end:], freq="1d", target="Temp")

Now we're ready to make predictions with our dataset.

Step 4: Create a zero-shot predictor

We'll create some configuration settings to use with our model. The prediction_length is how many time steps each prediction should contain. Because our data is daily, we'll predict the next week of temperatures in each forecast. The context_length sets how many past time points the model looks at for lagged correlations. We don't want this to be too wide or too narrow; the optimal value will be different for each dataset.

prediction_length = 7                    # our data is daily, so we'll predict one week out
context_length = prediction_length * 3   # how many lags to use as context
num_samples = 20                         # how many samples to draw for each forecast distribution
device = "cpu"                           # where to run the model; options are "cuda" or "cpu"
batch_size = 64

Now we create the forecaster, which involves two steps: first, we create a LagLlamaEstimator that takes all its parameters from the downloaded Lag-Llama checkpoint; second, we create a predictor from the estimator with its create_predictor() method. The predictor lets us pass in a context_length sized window of data and get forecasts back.

ckpt = torch.load("lag-llama.ckpt", map_location=device)
estimator_args = ckpt["hyper_parameters"]["model_kwargs"]

zs_estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",
    prediction_length=prediction_length,
    context_length=context_length,
    device=torch.device(device),  # use the device variable defined above

    # estimator args
    input_size=estimator_args["input_size"],
    n_layer=estimator_args["n_layer"],
    n_embd_per_head=estimator_args["n_embd_per_head"],
    n_head=estimator_args["n_head"],
    scaling=estimator_args["scaling"],
    time_feat=estimator_args["time_feat"],

    nonnegative_pred_samples=True,

    # scale the rotary positional encodings (RoPE) linearly when our
    # context window exceeds the context length used during pretraining
    rope_scaling={
        "type": "linear",
        "factor": max(1.0, (context_length + prediction_length) / estimator_args["context_length"]),
    },

    batch_size=batch_size,
    num_parallel_samples=num_samples
)

zs_predictor = zs_estimator.create_predictor(zs_estimator.create_transformation(), zs_estimator.create_lightning_module())
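As an aside, the predictor can also be called directly on a GluonTS dataset; each resulting forecast object exposes the sampled distribution. A quick sketch using the test split:

# sketch: generate forecasts directly from the predictor; each item is a
# probabilistic forecast whose sample paths we can summarize
forecasts = list(zs_predictor.predict(test, num_samples=num_samples))
f = forecasts[0]
print(f.mean)           # mean forecast for each of the 7 days
print(f.quantile(0.9))  # 90th percentile at each step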

Now we're ready to create forecasts for one week out.

Step 5: Zero-shot forecasting

In this step, we'll ask the model to create 9 forecasts from 9 different start dates in the test dataset, spaced roughly a month apart. We can use make_evaluation_predictions from the gluonts.evaluation library to generate our forecasts.

date_list = pd.date_range(df[valid_end:].index[60], periods=9, freq="30d").tolist()

zs_forecasts = []
zs_tss = []

for d in date_list:
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=PandasDataset(df[:d], freq="1d", target="Temp"),
        predictor=zs_predictor,
        num_samples=num_samples
    )
    zs_forecasts.append(list(forecast_it))
    zs_tss.append(list(ts_it))

To evaluate the forecasts, we use an Evaluator object, also from the gluonts.evaluation library. It generates a range of statistics about our predictions; we'll judge accuracy using the mean absolute percentage error (MAPE).

evaluator = Evaluator()
zs_a_metrics = [] # aggregated forecast metrics, we'll use the MAPE metric to evaluate
zs_t_metrics = [] # information about each time series, we'll use this to graph

for (t,s) in zip(zs_tss, zs_forecasts):
    agg_metrics, ts_metrics = evaluator(t, s)
    zs_a_metrics.append(agg_metrics)
    zs_t_metrics.append(ts_metrics)

Once we have the evaluations for each prediction, we can graph each prediction:

plt.figure(figsize=(16, 9))
plt.rcParams.update({'font.size': 11})

for idx in range(len(zs_forecasts)):
    ax = plt.subplot(3, 3, idx+1)
    t = zs_tss[idx][0][zs_forecasts[idx][0].start_date.to_timestamp() - pd.Timedelta(days=7):]

    # the metric we display is MAPE, expressed as a percentage
    mape = round(zs_t_metrics[idx]["MAPE"][0] * 100.0, 2)
    ax.set_title("Start: " + str(zs_t_metrics[idx]["forecast_start"][0]) + " MAPE: " + str(mape) + "%")
    plt.plot(t.index.to_timestamp(), t[0])
    ax.set_xticklabels([])

    zs_forecasts[idx][0].plot(color='g')

plt.gcf().tight_layout()
plt.subplots_adjust(top=0.9)
plt.suptitle("Mean Absolute Percentage Error across forecasts")
plt.show()

The generated chart shows each of our 9 zero-shot forecasts (in green) and the time series data that leads up to each one (the blue line). For each forecast, the green line is the mean forecast, the dark green band marks the 50% prediction interval, and the lighter green band marks the 90% prediction interval. This is the advantage of a probabilistic model: the width of those intervals shows how certain the model is at each step of the forecast.

We can see that some of the predictions are quite accurate, such as the 5.13% error for the period beginning 1990-02-24, while others are much less so, for instance the 62.33% error for the period beginning 1990-08-23.
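The same uncertainty can be inspected numerically: each forecast object exposes its sampled distribution, so the gap between two quantiles gives the interval width at each step. A quick sketch for the first zero-shot forecast:

# width of the 90% prediction interval at each of the 7 forecast steps
f = zs_forecasts[0][0]
print(f.mean)
print(f.quantile(0.95) - f.quantile(0.05))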

Step 6: Fine-tuning

Now we'll fine-tune our model using the training and validation data splits that we set aside previously. The Lag-Llama authors have recommendations for fine-tuning:

We recommend tuning two important hyperparameters for each dataset that you finetune on: the context length (suggested values: 32, 64, 128, 256, 512, 1024) and the learning rate (suggested values: 0.01, 0.05, 0.001, 0.005, 0.0001, 0.0005). We also highly recommend using a validation split of your dataset to early stop your model, with an early stopping patience of 50 epochs.


For our dataset, a context_length of 64 and a learning rate of 0.0005 work well enough, though other values might perform better.
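If you'd like to run that search yourself, here's a hypothetical sketch: it fine-tunes one model per (context_length, learning rate) pair, reusing the constructor arguments shown in the next code block, and leaves the scoring step as a comment since any validation metric (such as the MAPE evaluation above) will do. Each iteration is a full fine-tuning run, so this is slow on a CPU.

# hypothetical hyperparameter search following the authors' advice
import itertools

candidate_context_lengths = [32, 64, 128]    # subset of the suggested values
candidate_learning_rates = [1e-3, 5e-4, 1e-4]

for ctx, lr in itertools.product(candidate_context_lengths, candidate_learning_rates):
    search_estimator = LagLlamaEstimator(
        ckpt_path="lag-llama.ckpt",
        prediction_length=prediction_length,
        context_length=ctx,
        lr=lr,
        nonnegative_pred_samples=True,
        aug_prob=0,
        device=torch.device(device),
        input_size=estimator_args["input_size"],
        n_layer=estimator_args["n_layer"],
        n_embd_per_head=estimator_args["n_embd_per_head"],
        n_head=estimator_args["n_head"],
        time_feat=estimator_args["time_feat"],
        batch_size=batch_size,
        num_parallel_samples=num_samples,
        trainer_kwargs={"max_epochs": 50},
    )
    search_predictor = search_estimator.train(train, valid, cache_data=True)
    # evaluate search_predictor on the validation split (for example, with
    # make_evaluation_predictions and Evaluator as above) and keep the
    # (ctx, lr) pair with the lowest error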

The first step is to create a new LagLlamaEstimator:

ckpt = torch.load("lag-llama.ckpt", map_location=device)
estimator_args = ckpt["hyper_parameters"]["model_kwargs"]

finetune_context_length = 64  # the context length we settled on for this dataset

finetune_estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",
    prediction_length=prediction_length,
    context_length=finetune_context_length,

    nonnegative_pred_samples=True,
    aug_prob=0,
    lr=5e-4,
    device=torch.device(device),

    # estimator args
    input_size=estimator_args["input_size"],
    n_layer=estimator_args["n_layer"],
    n_embd_per_head=estimator_args["n_embd_per_head"],
    n_head=estimator_args["n_head"],
    time_feat=estimator_args["time_feat"],

    # scale the rotary positional encodings linearly when our context
    # window exceeds the context length used during pretraining
    rope_scaling={
        "type": "linear",
        "factor": max(1.0, (finetune_context_length + prediction_length) / estimator_args["context_length"]),
    },

    batch_size=64,
    num_parallel_samples=num_samples,
    trainer_kwargs={"max_epochs": 50},  # lightning trainer arguments
)

Now we can retrain using the new estimator:

finetuned_predictor = finetune_estimator.train(train, valid, cache_data=True, shuffle_buffer_length=1000)

This fine-tuning step can take up to an hour on a CPU; with access to a GPU, it should finish in about 10 minutes. Once fine-tuning is complete, we'll have a new model for generating forecasts.

Optional step:

If you have storage associated with your watsonx instance, you can upload the fine-tuned model there. In the Manage tab of the Projects page, go to Access Control and create an access token if one does not already exist. Then copy the token string into the token field of the access_project_or_space method.

# optional step to save the fine-tuned model
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space({"token": "<YOUR_WORKSPACE_TOKEN>"})

from datetime import date
import os

# expand "~" explicitly; os.path.exists and torch.save won't do it for us
model_dir = os.path.expanduser("~/lag-llama-models/")
os.makedirs(model_dir, exist_ok=True)

model_path = os.path.join(model_dir, date.today().strftime('%m-%d'))
torch.save(finetuned_predictor, model_path)
wslib.upload_file(model_path)
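Later, you can reload the saved predictor in a new session (a sketch, assuming the lag-llama repository code is importable in that session, since torch.save pickles the full predictor object):

# reload the pickled predictor; the Lag-Llama classes must be importable
finetuned_predictor = torch.load(model_path, map_location=device)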


Step 7: Forecast with fine-tuned model

We'll use the same approach to forecasting with our fine-tuned model as we did with the zero-shot estimator, creating forecasts for one week out at 9 different dates in our test dataset.

date_list = pd.date_range(df[valid_end:].index[60], periods=9, freq="30d").tolist()

finetune_forecasts = []
finetune_tss = []

for d in date_list:
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=PandasDataset(df[:d], freq="1d", target="Temp"),
        predictor=finetuned_predictor,
        num_samples=num_samples
    )
    finetune_forecasts.append(list(forecast_it))
    finetune_tss.append(list(ts_it))

Now we can reuse the Evaluator from the zero-shot step:

finetune_a_metrics = []
finetune_t_metrics = []

for (t,s) in zip(finetune_tss, finetune_forecasts):
    agg_metrics, ts_metrics = evaluator(t, s)
    finetune_a_metrics.append(agg_metrics) # aggregated forecast metrics, we'll use the MAPE metric to evaluate
    finetune_t_metrics.append(ts_metrics) # information about each time series, we'll use this to graph

The same graph gives us the ability to compare the accuracy of the forecasts:

plt.figure(figsize=(16, 9))
plt.rcParams.update({'font.size': 11})

for idx in range(len(finetune_forecasts)):
    ax = plt.subplot(3, 3, idx+1)
    t = finetune_tss[idx][0][finetune_forecasts[idx][0].start_date.to_timestamp() - pd.Timedelta(days=7):]

    # the metric we display is MAPE, expressed as a percentage
    mape = round(finetune_t_metrics[idx]["MAPE"][0] * 100.0, 2)
    ax.set_title("Start: " + str(finetune_t_metrics[idx]["forecast_start"][0]) + " MAPE: " + str(mape) + "%")
    plt.plot(t.index.to_timestamp(), t[0])
    ax.set_xticklabels([])

    finetune_forecasts[idx][0].plot(color='g')

plt.gcf().tight_layout()
plt.subplots_adjust(top=0.9)
plt.suptitle("Mean Absolute Percentage Error across forecasts")
plt.show()

This returns the following graph, where we can see how the fine-tuned model performs on the same forecast windows:

We can see that some of the predictions are still less accurate than we might want, for instance a ~35% error for the week beginning 1990-06-24. Overall, though, the least accurate forecasts from the fine-tuned model are better than those from the zero-shot model.

Step 8: Comparisons

Now let's see where fine-tuning has improved the performance of our model. We'll subtract the fine-tuned forecast's mean absolute percentage error (MAPE) from the zero-shot MAPE, so positive values indicate an improvement from fine-tuning:

from IPython.display import HTML, display

comparison = pd.DataFrame()

for (z, f) in zip(zs_t_metrics, finetune_t_metrics):
    row = pd.DataFrame({
        "Date": f['forecast_start'],
        "Finetuned MAPE": f['MAPE'][0] * 100,
        "Zeroshot MAPE": z['MAPE'][0] * 100,
        "Finetuning Improvement": (z['MAPE'][0] - f['MAPE'][0]) * 100
    })
    comparison = pd.concat([comparison, row])

display(HTML(comparison.to_html()))

This will print out the following:

Date        Finetuned MAPE  Zeroshot MAPE   Finetuning Improvement
1990-02-24  14.074690       5.132133        -8.942557
1990-03-26  12.999803       23.437066       10.437263
1990-04-25  15.111790       18.150030       3.038239
1990-05-25  33.572088       34.048983       0.476895
1990-06-24  34.988431       60.099336       25.110906
1990-07-24  18.796240       18.077844       -0.718396
1990-08-23  33.861474       62.327957       28.466484
1990-09-22  29.453261       51.933639       22.480379
1990-10-22  18.961399       17.984154       -0.977245

We can see that in some forecasts the zero-shot forecast does slightly better, but in most of the forecasts the fine-tuned model performs roughly the same or significantly better.
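To reduce the table to a single summary, we can average each column (a quick sketch):

# average MAPE for each model and the mean gain from fine-tuning
print(comparison[["Finetuned MAPE", "Zeroshot MAPE", "Finetuning Improvement"]].mean())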

Conclusion

In this tutorial, you learned about foundation models for forecasting, an area of artificial intelligence research with important applications in healthcare, environmental monitoring, anomaly detection and many other fields. You used Lag-Llama, a foundation model trained on time series data and built specifically for forecasting: first the base model for zero-shot forecasting on a daily temperature dataset, then a version fine-tuned on training data, and you compared the fine-tuned forecasts to the zero-shot ones.

You can learn more about Lag-Llama at the Lag-Llama GitHub repository.
