Date: 2025-02-05

This tutorial uses the Hourly energy demand dataset. This dataset contains four years of electrical consumption and energy generation data gathered in Spain from 2015–2018, aggregated by hour. It is a modified version of the Hourly energy demand generation and weather dataset. You can find more details about the dataset, including metadata, in the preceding links.

For simplicity, the dataset was prepared to have no missing values and to remove irrelevant columns.

import os

import wget

filename = 'energy_dataset.csv'

base_url = 'https://github.com/IBM/watson-machine-learning-samples/raw/refs/heads/master/cloud/data/energy/'

# download the CSV only if a local copy does not already exist.

if not os.path.isfile(filename): wget.download(base_url + filename)

Let's examine the last few rows of the dataset. The time column holds a timestamp for each hour. The other columns hold numeric data: energy generation from different sources, weather forecast details and actual energy usage, named total load actual. This will be our target column, the column for which we are trying to predict values. Since our model is performing multivariate forecasting, we'll use all of the other columns as input to our model to help inform its predictions. These columns provide details about energy generation and weather forecasts for each hour, enabling us to predict actual energy demand on an hourly basis.

df = pd.read_csv(filename)

df.tail()
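
Because the model treats each row as one hourly step, it can be worth confirming that consecutive timestamps really are one hour apart. The sketch below is a minimal, standard-library illustration of that check. The helper name find_gaps and the sample timestamps are made up for this example and are not part of the tutorial's code; applying it to the real time column would first require parsing the values into datetime objects.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step=timedelta(hours=1)):
    """Return (previous, current) pairs whose spacing differs from `step`."""
    return [
        (prev, curr)
        for prev, curr in zip(timestamps, timestamps[1:])
        if curr - prev != step
    ]


# Made-up sample: four consecutive hourly readings.
start = datetime(2018, 12, 31, 20)
hourly = [start + timedelta(hours=i) for i in range(4)]

# Same sample with the 22:00 reading dropped to simulate a missing row.
with_gap = hourly[:2] + hourly[3:]

print(find_gaps(hourly))         # []
print(len(find_gaps(with_gap)))  # 1
```

An empty result means every pair of neighbouring rows is exactly one step apart.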

Split the data

For our forecasting problem, we'll need to split the data into two sets. The first will be used as historical data: we'll provide it to the model and ask the model to predict future values. To test the accuracy of those predictions, we also need ground truth values to compare them against. For our experiment, we'll use a second subset of our dataset as the ground truth and compare the predicted values to the actual values in this subset.
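
One common way to make this kind of comparison is with error metrics such as mean absolute error (MAE) and root mean squared error (RMSE). The sketch below is a minimal pure-Python illustration under that assumption. The function names and the sample numbers are made up for this example and are not taken from the tutorial or from any library.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Made-up values standing in for ground truth and model output.
actual = [100.0, 110.0, 120.0]
predicted = [98.0, 113.0, 120.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(round(mae, 4), round(rmse, 4))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of forecast quality.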

Granite timeseries models come in different context lengths of 512, 1024 and 1536 tokens. The context length describes the amount of information the model can consider when making a single prediction. For the Granite timeseries models, each row in a dataset counts as one token towards the context length. We'll be using the 512-token context length timeseries model, ibm/granite-ttm-512-96-r2, in our experiment. To do this, we need a dataset of 512 rows to provide as input to the model: our historical data. We'll call this input dataset data. Our dataset has many more rows than this prediction problem needs, so we'll subset it by taking the most recent timestamps, that is, rows from the end of the dataset.

The second dataset we need is our evaluation or ground truth dataset. We'll use the last 96 rows of our dataset for this purpose. We'll call this future_data, with its length set by future_context, and we'll compare our predictions against it. The 512 rows of historical data are the rows immediately before these 96.

Here, we also specify the columns to be used for prediction. The identifiers timestamp_column and target_column set these values for the model.

# how many rows and columns

df.shape

Output:

(35064, 19)

timestamp_column = "time"

target_column = "total load actual"

context_length = 512

future_context = 96

# hold out the last `future_context` rows as ground truth.

future_data = df.iloc[-future_context:]

# use the `context_length` rows just before them as historical input.

data = df.iloc[-(context_length + future_context):-future_context]
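
To make the slicing arithmetic concrete, the sketch below computes which row positions end up in each window for a dataset of 35064 rows. It is a standalone illustration: the helper name split_bounds is made up for this example, and the ranges are half-open, like Python slices.

```python
def split_bounds(n_rows, context_length, future_context):
    """Return (context_range, future_range) as half-open row ranges."""
    if context_length + future_context > n_rows:
        raise ValueError("not enough rows for the requested windows")
    # The ground truth window is the final `future_context` rows.
    future_start = n_rows - future_context
    # The historical window is the `context_length` rows just before it.
    context_start = future_start - context_length
    return range(context_start, future_start), range(future_start, n_rows)


context_rows, future_rows = split_bounds(35064, 512, 96)
print(context_rows.start, context_rows.stop)  # 34456 34968
print(future_rows.start, future_rows.stop)    # 34968 35064
```

The two windows are adjacent and do not overlap, so the model never sees the values it is later evaluated on.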

Let's examine the historical data further with a visualization that plots the hourly timestamps against our target column, total load actual.

plt.figure(figsize=(10,2))

plt.plot(np.asarray(data[timestamp_column], 'datetime64[s]'), data[target_column])

plt.title("Actual Total Load")

plt.show()

In preparing data for timeseries forecasting, models can have different requirements for preprocessing the data. The Granite TTM model card recommends that data be scaled and a preprocessing script is provided as an example. For the purposes of this tutorial, we'll use our dataset 'as-is.'
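
For readers who do want to scale, one common approach is standard (z-score) scaling, where the mean and standard deviation are computed from the historical window only and then reused to transform later data and to convert predictions back to the original units. The sketch below illustrates that general idea in pure Python. It is not the preprocessing script referenced by the model card, and the function names and sample values are made up for this example.

```python
from statistics import fmean, pstdev


def fit_standard_scaler(values):
    """Return (mean, std) computed from the historical window only."""
    mean = fmean(values)
    std = pstdev(values)
    # Guard against a constant series, which would otherwise divide by zero.
    if std == 0:
        std = 1.0
    return mean, std


def scale(values, mean, std):
    """Map values to zero mean and unit variance using fitted statistics."""
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    """Map scaled values (for example, predictions) back to original units."""
    return [v * std + mean for v in values]


history = [24000.0, 26000.0, 28000.0, 26000.0]  # made-up load values
mean, std = fit_standard_scaler(history)
scaled = scale(history, mean, std)
restored = unscale(scaled, mean, std)
```

Fitting the statistics on the historical window alone avoids leaking information from the ground truth rows into the model input.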