Multivariate time series forecasting with sktime

9 May 2025

Authors

Meredith Syed

AI Developer Advocate Lead

Bryan Clark

Senior Technology Advocate

Time series forecasting plays a critical role in domains like finance, energy, healthcare and supply chain management, where understanding how things evolve over time is essential. In particular, multivariate time series forecasting allows us to model the interactions between multiple variables as they change over time.

This modeling is helpful when performing estimations with exogenous variables (like weather) that can influence a target variable (like energy demand). In this context, weather is an independent variable that can have an impact on our target variable. In contrast, energy demand represents an endogenous variable, dependent on its own earlier values in the time series and potentially other external variables such as the weather.

This tutorial focuses on using the sktime library, a unified framework for machine learning with time series data. Sktime aims to simplify the workflow for time series problems by providing consistent, composable tools for tasks such as forecasting, classification and transformation. Its design is inspired by the scikit-learn API, making it approachable for people familiar with Python's broader ML ecosystem.

Sktime also allows users to create pipelines tailored to time series data by combining dedicated transformers and classifiers. Forecasting pipelines allow for the sequential application of preprocessing steps and forecasting models, facilitating a streamlined workflow for time series forecasting tasks.

While this tutorial focuses solely on comparing univariate and multivariate forecasting models, it's worth noting that sktime also supports hierarchical forecasting. This feature enables the application of different forecasting models at various levels of data aggregation, and each model can be fine-tuned through its hyperparameters.

Sktime also supports time series classification, regression and clustering, and lets users switch between models without changing how the surrounding code is prepared or run. This flexibility makes it much easier to experiment and build better models faster.
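As a small illustration of that uniform interface, here is a sketch using sktime's bundled airline dataset; any other sktime forecaster could be swapped in without changing the surrounding lines.

# sktime's uniform interface: fit/predict stays the same across forecasters
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()                             # built-in example series
forecaster = NaiveForecaster(strategy="last")  # swap in another forecaster freely
forecaster.fit(y)
y_pred = forecaster.predict(fh=np.arange(1, 13))  # forecast the next 12 periods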

Univariate versus multivariate time series


In a univariate time series, we track a single variable over time (for example, daily temperature). In contrast, a multivariate time series involves multiple interdependent variables tracked over the same time interval (for example, temperature, humidity and energy consumption recorded hourly). Multivariate modeling can help capture the underlying dynamics between these variables, leading to more accurate forecasts.
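As a concrete sketch (with made-up numbers), a univariate series fits in a single pandas Series, while a multivariate series is a DataFrame with one column per variable over a shared time index:

import pandas as pd

idx = pd.period_range("2015-01-01", periods=4, freq="h")

# univariate: a single variable tracked over time
temperature = pd.Series([20.1, 19.8, 19.5, 19.9], index=idx, name="temperature")

# multivariate: several interdependent variables over the same interval
weather_energy = pd.DataFrame({
    "temperature": [20.1, 19.8, 19.5, 19.9],
    "humidity": [55, 57, 60, 58],
    "energy_consumption": [310, 295, 288, 301],
}, index=idx)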

Forecasting energy demand


To demonstrate multivariate forecasting with sktime, we can use the hourly energy demand dataset. This dataset contains four years of hourly electricity consumption and generation data in Spain (2015–2018). Because many factors influence energy demand, such as weather conditions and time of day, this dataset provides an ideal foundation for multivariate modeling.

Step 1: Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

  1. Log in to watsonx.ai by using your IBM Cloud® account.

  2. Create a watsonx.ai project.

  3. Create a Jupyter Notebook.

This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai® project as an asset. This Jupyter Notebook can be found on GitHub.

Step 2: Install and import relevant libraries

We need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can use pip or another package manager to install them.

We're using os and wget to retrieve the data file and pandas to load the dataset. To visualize the aggregated data series, we're using the plot_series function from sktime.utils.plotting, which uses functionality from seaborn. To split the data into training and test sets for our experiment, we're using the temporal_train_test_split function from sktime.split.

Before modeling, we're using the StationarityADF estimator from sktime.param_est.stationarity to test the time series for stationarity and the Differencer transformer from sktime.transformations.series.difference to transform the time series.

While preparing the data, we're using grangercausalitytests from statsmodels.tsa.stattools to evaluate whether the exogenous variables in our tabular dataset are predictive of the target variable.

For time series forecasting, we're using the NaiveForecaster for univariate forecasting. For multivariate forecasting, we're using the VAR model as the regressor. Sktime also supports Prophet, SARIMAX, ARIMA (through the pmdarima library) and some additional non-native deep learning models, but for this tutorial we're using the VAR model. To evaluate the performance of the model, we're using the mean_absolute_percentage_error (MAPE) metric from sktime.performance_metrics.forecasting.

%pip install scikit-learn sktime seaborn statsmodels wget -q

import os
import wget
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.colors import ListedColormap

from sktime.split import temporal_train_test_split
from sktime.param_est.stationarity import StationarityADF
from sktime.transformations.series.difference import Differencer
from statsmodels.tsa.stattools import grangercausalitytests
from sktime.utils.plotting import plot_series
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.var import VAR
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

Step 3: Load and prepare the dataset

As described earlier, this tutorial uses the Hourly energy demand dataset, which contains four years (2015–2018) of electricity consumption and energy generation data gathered in Spain, aggregated by hour. It is a modified version of the Hourly energy demand generation and weather dataset.

You can find more details about the dataset, including metadata, in the preceding links. You can also use this tutorial as a template to explore other well-known datasets with sktime (such as the classic Box & Jenkins airline dataset, loaded via from sktime.datasets import load_airline).

For simplicity, the tabular dataset was prepared to have no missing values and to remove irrelevant columns.

filename = 'energy_dataset.csv'
base_url = 'https://github.com/IBM/watson-machine-learning-samples/raw/refs/heads/master/cloud/data/energy/'

if not os.path.isfile(filename):
    wget.download(base_url + filename)

Let's examine the first few rows of the dataset. We can see the time column showing a timestamp for each hour. Other columns show numeric data types for energy generation from different sources, weather forecast details and actual energy usage, termed total load actual. This column is our target, the column for which we are trying to predict values.

In a univariate forecasting scenario, we'll use only the total load actual column to forecast future values of that variable. To perform multivariate forecasting, we're using more columns as input to our model to help inform its predictions for total load actual. These additional columns provide details about energy generation and weather forecasts for each hour, enhancing the ability to predict actual energy demand on an hourly basis.

df = pd.read_csv(filename)
df.head()


To use sktime for forecasting, we need to set the time index of our pandas DataFrame to be of type PeriodIndex. We also need to specify the frequency of the index as "h" for hourly.

df.set_index(pd.PeriodIndex(df['time'], freq='h'), inplace=True)
df.drop(['time'], axis=1, inplace=True)
df.head()

Split the data

For our forecasting problem, we need to split the data into two sets, using the first as historical data. We'll provide the historical data to the model and ask it to predict future values from this data.

To test the accuracy of our predictions, we also need to compare these predictions against ground truth values. To make this comparison, we use a second subset of our dataset as the ground truth and we compare the predicted values to the actual values in this ground truth subset.

There are more than 35,000 rows in our dataset, but we need to use only a subset of this data for our forecasting example. To subset the data, we need to specify a value for num_of_rows, the number of rows of historical data that we want to provide as input to the model. While there are different ways to sample the data, we're using the most recent timestamps, or the last rows of the dataset.

The second dataset that we need is the evaluation or ground truth dataset. Following the subset that we created based on the variable num_of_rows, we can use the following consecutive rows of data to create the evaluation set. We need to create a variable future_context that determines how many data points are in the evaluation set. This dataset is used to compare against our predictions.

The values of num_of_rows and future_context for this example are set identically to those in the experiment in our tutorial, Using the watsonx.ai Time Series Forecasting API to predict energy demand, so that we can compare the results of these two experiments.

df.shape

(35064, 18)

num_of_rows = 512
future_context = 96
y, y_actual = temporal_train_test_split(df, train_size=num_of_rows, test_size=future_context, anchor="end")
y.shape

(512, 18)

Visualize the time series

Let's examine the data further with this data visualization, which plots the hourly timestamps against our target column total load actual. While there are many columns of data in our dataset, this column is the one that we are predicting.

Though energy demand differs across the days in our dataset, we can also observe patterns within each day, representing how energy demand varies throughout the day. Our data shows that energy demand is typically lower at the beginning and end of a day, with a brief drop at midday.

We can hypothesize that the midday drop might reflect lunchtime behavior in Spain. Patterns that repeat within a fixed period in a time series are referred to as seasonality. In our case, the data exhibits daily seasonality, observed in the consistent peaks and valleys at regular intervals within each day.
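One quick way to confirm this daily seasonality (a sketch, relying on the hourly PeriodIndex we set earlier) is to average demand by hour of day and look for a repeating intraday profile:

# average demand for each hour of the day; a pronounced profile across
# hours 0-23 indicates daily (sp=24) seasonality
hourly_profile = y['total load actual'].groupby(y.index.hour).mean()
print(hourly_profile)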

plot_series(y['total load actual'], markers=",")

Step 4: Univariate forecasting with a seasonal naive forecaster

For this experiment, we want to create forecasts for our target variabletotal load actual , which describes daily energy demand.

Initially, we're using a naive method to create baseline forecasts for our data. Naive forecasting methods are based on the assumption that the best predictor of a future value is the current value [1].

For our use case of hourly energy demand, we can think of the simplest implementation of naive forecasting to mean that the predicted energy demand for the next hour should be equal to the previous hour's energy demand. In practice, adding a small amount of random variation to naive predictions often improves them.
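As a quick sketch (using the NaiveForecaster imported earlier), this simplest last-value strategy looks like the following; we refine it with seasonality next:

# simplest naive forecast: every future hour is predicted to equal
# the last observed value
plain_naive = NaiveForecaster(strategy="last")  # sp defaults to 1, no seasonality
plain_naive.fit(y['total load actual'])
plain_naive_pred = plain_naive.predict(fh=np.arange(1, future_context + 1))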

We can also improve naive forecasts by taking advantage of seasonal patterns observed in our data. In the case of our hourly energy demand data, we can see patterns in the time series within a one day period, indicating a seasonal periodicity of a day.

Instead of forecasting the next hour's energy demand to be similar to the previous hour's energy demand, we can forecast the next hour's energy demand to be similar to the energy demand for that same hour on the previous day.

To do this, we need to first specify the forecaster to use, in this case the NaiveForecaster. The NaiveForecaster works by generating a "last window" of historical data on which to base the forecasts. In the case of our hourly energy demand data with observed intraday seasonality, this "last window" consists of the previous 24-hour period.

We must specify a strategy for the forecaster, describing how to use the data in the "last window". We set the strategy to last, meaning the forecaster should consider the last value in the window, as opposed to taking a mean over all the data in the window. In combination, we set the seasonal periodicity, or sp, to 24 to account for the intraday seasonality that we have observed.

Together, these two settings implement our seasonal naive strategy described previously, where the value of energy demand at a specific hour is predicted by using the energy demand at that hour on the previous day.

After the forecaster is initialized with the parameters described previously, we call the fit method on the forecaster, passing our data y['total load actual'], which contains the rows of historical data that we provide to our model as input. In this case, we are making predictions based on only one column of input data, so our forecast is a univariate forecast.

Next, we call the predict method to create the forecasts, passing a forecasting horizon fh equal to our future_context to generate the correct number of forecasted data points into the future.

# specify the forecasting algorithm including strategy and seasonal periodicity
forecaster = NaiveForecaster(strategy="last", sp=24)

# specify the forecasting horizon
fh = np.arange(1, future_context + 1) # we want to predict the next x hours

# fit the forecaster
forecaster.fit(y['total load actual'])

# make predictions using the forecaster
y_pred = forecaster.predict(fh)

For an initial evaluation of our model's performance, we can use a data visualization. Here, we plot the predictions (shown in gold), along with the historical data (shown in blue) for past values. We can also see the green line, our y_actual or ground truth, against which to compare our predictions.

The gold line predicts hourly energy demand based on patterns in the historical data. We can observe the forecast pattern of our naive seasonal forecaster here. It looks like it repeated the last 24 hours of the input data across the forecast horizon.

In evaluating the green line against the gold line, we can see instances of overlap where the model's predictions are accurate. The naive forecaster has gotten very close on some of the data points.

plot_series(y['total load actual'], y_pred, y_actual['total load actual'], labels=["y", "y_pred", "y_actual"], markers=[',', ',', ','])

We can also calculate evaluation metrics to quantify the accuracy of the model. We're using the Mean Absolute Percentage Error (MAPE) as our metric. A lower percentage is better and our result, at less than 10%, is a good indication of the performance of the model for this use case.

As shown in our example, naive forecasters can often perform surprisingly well. For this reason, they are often used for baseline predictions as a first step before more advanced modeling methods.
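For reference, MAPE is simply the mean of the absolute percentage errors. A hand-rolled sketch (the helper name mape_percent is ours, not sktime's) would look like this:

# MAPE by hand: average |actual - predicted| / |actual|, as a percentage
def mape_percent(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100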

mean_absolute_percentage_error(y_actual['total load actual'], y_pred) * 100

7.991385858936119

Step 5: Multivariate forecasting

While our seasonal naive forecaster performed reasonably well on the data, we can likely improve our forecast by using a more sophisticated model. Beyond using only one variable for prediction, we can take advantage of the other columns in the dataset to provide more input for forecasting. In this case, information about weather and the amount of energy generated by various sources might contain signals that affect the target variable total load actual.

5.1 Explore the data

We shouldn't assume that all the columns in our dataset are valuable in forecasting our target variable. In forecasting, we want to provide columns to our model only if we know that they show some alignment with our target variable. We must explore the data further to decide which variables to include.

For initial data exploration, we can look at correlations between our variables. The following correlation matrix examines the pairwise relationships between our variables. Correlation can compare only two variables at a time and doesn't consider the temporal component (timestamp) of our data, but these correlations can be a useful first step of exploratory data analysis (EDA).

colors = ["#e5f6ff", "#82cfff", "#3ddbd9", "#be95ff", "#ff7eb6"]

# build a custom colormap and compute pairwise correlations
dev_brand_cmap = ListedColormap(colors, "dev_brand")
corr = df.corr()

# style the correlation matrix with a color gradient; keep this expression
# last in the cell so that the styled table is displayed
corr.style.background_gradient(cmap=dev_brand_cmap)

In interpreting this correlation plot, the strongest correlations are shown in purple and pink. If we look at the total load actual column, we can see which other features in the dataset are strongly correlated with our target column.

Based on this data, we can probably assume that the column total load forecast contains previously forecasted values for our target column. Because it's so closely correlated with our target variable and it's an existing forecast, we should exclude this column from our forecasting model.

Including it might bias the model, making it good at predicting these exact values but bad at generalizing to other unseen values. It would be a bit like studying for a test by using the answer key for the upcoming test.

Other columns, shown in purple, that exhibit correlations with our target variable include generation fossil gas, generation fossil hard coal, generation fossil oil, generation hydro water reservoir, generation solar and forecast solar day ahead. While this correlation matrix is a good first step in exploring our data, we need to perform more statistical testing before selecting the input columns for our multivariate forecast model.

5.2 Split the data

We need to split the data again for our multivariate forecast because our existing data split included only our target variable. Now, we want to include the additional data columns in our training and testing datasets.

# split the data
y, y_actual = temporal_train_test_split(df, train_size=num_of_rows, test_size=future_context, anchor="end")
y.shape

(512, 18)

5.3 Test for stationarity

Some forecasting models require that the time series data exhibit certain properties for the model to be applied correctly. The NaiveForecaster used in univariate forecasting didn't have many requirements, but the VAR model that we're using in our multivariate forecast requires that the time series be stationary [1].

A time series is stationary if its statistical properties, such as mean and variance, don't change over time; in particular, it exhibits no trend or seasonality. Ideally, a stationary time series should look like white noise, without any overall trend or patterns repeating at stable intervals. We've already observed that our time series demonstrates daily seasonality, so we likely must transform the data to meet the requirements of the VAR model.

First, we perform a statistical test from sktime called StationarityADF, an implementation of the Augmented Dickey-Fuller (ADF) unit root test, to check for stationarity. If the test rejects the presence of a unit root in the data, the time series is determined to be stationary. Otherwise, the time series is treated as nonstationary. For our multivariate analysis, we test all the columns in the dataset by using this method.

If a column doesn't show stationarity based on the ADF test, we can apply a transformation called differencing to remove the seasonality in our data.

Differencing works by calculating the difference between points in a time series and using those differences rather than the absolute values of the series. In the case of our data with daily seasonality, the differencing transformer is set to apply a first difference (lag 1) followed by a 24-hour seasonal difference (lag 24).
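To make this concrete, here is an illustrative pandas-only sketch of the two sequential differences, analogous to what Differencer(lags=[1, 24]) computes (the real transformer also tracks the values needed to invert the transformation later):

# illustrative only: sequential differencing with plain pandas
s = y['total load actual']
s_lag1 = s - s.shift(1)             # first difference removes trend
s_diff = s_lag1 - s_lag1.shift(24)  # 24-hour difference removes daily seasonality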

plot_series(y['total load actual'], markers=",")

Above, we can observe the time series with daily seasonality. Below, we set up the ADF test for stationarity and the transformer to apply differencing, which should make the time series stationary.

# set up the ADF test for stationarity
sty_est = StationarityADF()
# set up the differencer transformer to correct for stationarity
transformer = Differencer(lags=[1, 24])
# test each feature for stationarity and difference if necessary, save the results in a dict

params_dict = {}

for col in y.columns:
    sty_est.fit(y[col])
    params_dict[col] = sty_est.get_fitted_params()
    if not params_dict[col]['stationary']:
        y = y.apply(lambda x: transformer.fit_transform(x) if x.name == col else x)
        sty_est.fit(y[col])
        params_dict[col] = sty_est.get_fitted_params()

Following testing and transformation, we can observe that all the features in the dataset now exhibit stationarity. We can also visualize the differenced time series for our target variable. This version of the data looks somewhat like white noise, with no periodic patterns or obvious trends.

for key in params_dict.keys():
    print(key + ', ' + 'stationary: ' + str(params_dict[key]['stationary']))

generation biomass, stationary: True
generation fossil brown coal/lignite, stationary: True
generation fossil gas, stationary: True
generation fossil hard coal, stationary: True
generation fossil oil, stationary: True
generation hydro pumped storage consumption, stationary: True
generation hydro run-of-river and poundage, stationary: True
generation hydro water reservoir, stationary: True
generation nuclear, stationary: True
generation other, stationary: True
generation other renewable, stationary: True
generation solar, stationary: True
generation waste, stationary: True
generation wind onshore, stationary: True
forecast solar day ahead, stationary: True
forecast wind onshore day ahead, stationary: True
total load forecast, stationary: True
total load actual, stationary: True

plot_series(y['total load actual'])

5.4 Feature selection

Given all the columns in our dataset, we need a way to select the columns that can provide the most predictive input to our model. We can use a procedure called Granger causality testing to examine the features in our data and to measure the strength of their relationship to our target variable.

While we should be cautious of promises to discern causality between variables, Granger causality examines whether one time series is predictive of another, given a stated time lag. We test all of our features (except total load forecast, as discussed earlier) against our target variable total load actual to determine which variables to include as input to our model. We fit lags of up to 24 hours during the testing.

For results with a p-value < .05, we reject the null hypothesis of no Granger causality and treat the specific feature as predictive of our target variable.

# use a Granger causality test to determine which features are predictive of the target variable

import warnings
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)

    selected_features = []
    for col in y.columns[:-2]:
        if params_dict[col]['used_lag'] > 0:
            res_dict = grangercausalitytests(y[[col, 'total load actual']], 24, verbose=False)
            p_val = res_dict[1][0]['ssr_ftest'][1]
            if p_val < .05:
                selected_features.append(col)
    print(selected_features)

['generation fossil oil', 'generation hydro water reservoir']

After statistical testing, we've selected two features from the dataset to pass to our model: generation fossil oil and generation hydro water reservoir. These features were correlated with total load actual in our initial correlation testing, but Granger causality analysis also allowed us to explore the relationship between these variables while considering the temporal element.

Using this method, we were able to further narrow the list of variables from correlation testing and ultimately select these variables as input to our multivariate forecasting model.

# subset the dataframe by the selected features, plus target variable
selected_features.append('total load actual')
y_subset = pd.DataFrame(y[selected_features], index=y.index)
y_subset.tail()

5.5 Forecasting

After testing several models on this dataset, we're settling on a VAR, or vector autoregression, model for multivariate forecasting. We've chosen this model because it's fairly fast to fit to our data, and it performs better than our univariate forecasting model and several other multivariate forecasting models that we tested.

A vector autoregression model can represent the interactions between different time series and forecast values for any of the examined time series [1]. While the model can produce accurate forecasts, one of its shortcomings is that it doesn't provide a hypothesis for why the system of time series behaves the way it does. However, for our purposes, the VAR model is sufficient.
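For reference, a VAR model with p lags takes the standard textbook form [1]:

y_t = c + A_1 * y_{t-1} + ... + A_p * y_{t-p} + e_t

where y_t is the vector holding the values of all modeled series at hour t (here, our two selected features plus the target), c is a constant vector, each A_i is a matrix of learned coefficients and e_t is an error term. The ic setting described next controls how the number of lags p is chosen.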

As mentioned previously, one of the requirements for this model is that the time series be stationary, a requirement already met by testing and transforming the data in step 5.3.

To create our forecasts, we first create the forecaster, specifying the ic, or information criterion, to optimize for. We're using bic, the Bayesian information criterion, which is often used with VAR models [1].

Next, we pass our data y_subset to the fit method along with the same forecast horizon that we used for our univariate forecasting model. Finally, we call the predict method to create our forecasts.

forecaster = VAR(ic='bic')
forecaster.fit(y_subset, fh=np.arange(1, future_context + 1))

y_pred = forecaster.predict()
y_pred.head()

5.6 Evaluation

Let's evaluate our model results by using the same evaluation metrics as our univariate modeling. In order to compare against our ground truth data, we first need to remove the differencing transformation that was applied.

# remove differencing from predictions and columns

y_pred_dediff = transformer.inverse_transform(y_pred['total load actual'])

y = y.apply(lambda x: transformer.inverse_transform(x))

Then, we can plot the predictions (shown in gold), along with the historical data (shown in blue), against the ground truth data (shown in green). You might notice that the forecast tracks close to the actual data at the start of the forecast horizon, drifting further from the actual values toward the end of the horizon.

This behavior has been observed in multivariate forecasts: they are often very good in the near term, but less reliable further out [2].

plot_series(y['total load actual'], y_pred_dediff, y_actual['total load actual'], labels=["y", "y_pred", "y_actual"], markers=[',',',',','])

The MAPE for this forecast, at ~7.20%, is slightly better than the ~7.99% of our univariate forecast.

mean_absolute_percentage_error(y_actual['total load actual'], y_pred_dediff) * 100

7.195281205188215

The plot, along with our calculated MAPE, demonstrates that our multivariate forecast is more accurate than our univariate forecast. Given these results, it seems likely that the selected columns passed as input to our forecaster have improved the forecast.

Summary

In this tutorial, we used the capabilities of sktime to perform univariate and multivariate forecasting of energy demand. In our univariate forecast, we predicted energy demand by using a seasonal naive forecaster. For multivariate forecasting, we used additional exogenous variables from our dataset, selected through correlation and Granger causality testing, to make predictions by using a VAR forecaster. Overall, our multivariate forecast performed better than the univariate forecast.

References

[1] Hyndman, R.J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on April 9, 2025.

[2] Salehi, S., Kavgic, M., Bonakdari, H., & Begnoche, L. (2024). Comparative study of univariate and multivariate strategy for short-term forecasting of heat demand density: Exploring single and hybrid deep learning models. Energy and AI, Volume 16, 100343. ISSN 2666-5468. https://www.sciencedirect.com/science/article/pii/S2666546824000090
