After developing a machine learning model, you need a place to run it and serve predictions. If your company is in the early stages of its AI journey or has budget constraints, you may struggle to find a deployment system for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance where the in-database ML feature is enabled can learn to build and use a machine learning model in the database.
In this post, I will show how to develop, deploy, and use a decision tree model in a Db2 database.
These are my major steps in this tutorial: load the flight dataset into a Db2 table, draw a random sample and split it into train and test partitions, explore and preprocess the training data, train a decision tree model, generate predictions for the test data, and evaluate the quality of those predictions.
I implemented these steps in a Db2 Warehouse on-prem database. Db2 Warehouse on Cloud also supports these ML features.
I will use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight’s origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we’ll build a classification model with the decision tree algorithm. Once trained, the model can take unseen flight data as input and predict whether the flight will arrive on time or late at its destination.
The dataset I use in this tutorial is available here (link resides outside ibm.com) as a csv file.
I use the following SQL to create a table for storing the dataset.
db2start
db2 connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following command to load the data from the CSV file (link resides outside ibm.com) into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I’ll copy a subset of the records from this table to a separate table for the ML model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
— — —
1000000
Create a table with a 10% sample of the rows from the above table. Use the RAND function of Db2 for random sampling.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — —
99879
Look into the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH FROM SYSIBM.SYSCOLUMNS WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA' ORDER BY COLNO
FLIGHTSTATUS is the response or the target column. Others are feature columns.
Find the DISTINCT values in the target column.
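A simple query along these lines lists the distinct values of the target column (a minimal sketch against the sample table created above):
SELECT DISTINCT FLIGHTSTATUS FROM FLIGHT.FLIGHTS_DATA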
From these values, I can see that it’s a binary classification task where each flight arrived either on time or late.
Find the frequencies of distinct values in the FLIGHTSTATUS column.
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DECIMAL(9,4)) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION FROM FLIGHT.FLIGHTS_DATA fdf GROUP BY FLIGHTSTATUS
From the above, I see that the classes are imbalanced. I won’t draw any further insights from the full dataset at this point, since doing so could leak information into the modeling phase.
Before collecting deeper insights into the data, I’ll divide this dataset into train and test partitions using Db2’s RANDOM_SAMPLE SP. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining records to a test PARTITION.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
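To verify that the stratified split preserved the class ratio, a quick check along these lines compares the class frequencies in the train partition (a simple sketch using the tables created above):
-- class distribution in the training partition
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY FROM FLIGHT.FLIGHTS_TRAIN GROUP BY FLIGHTSTATUS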
In this step, I’ll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.
Look into some sample records.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode the time as numbers:
— CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)
— DepTime: Departure Time (hhmm)
— CRSArrTime: Computer Reservation System (scheduled) Arrival Time (hhmm)
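For example, an hhmm value such as 1435 stands for 14:35; integer division by 100 yields the hour, and MOD yields the minutes. A quick way to see this on a few rows (a sketch for illustration):
-- decode the hhmm encoding of DEPTIME into hour and minute parts
SELECT DEPTIME, DEPTIME / 100 AS DEP_HOUR, MOD(DEPTIME, 100) AS DEP_MINUTE FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 5 ROWS ONLY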
Now, I collect summary statistics from the FLIGHTS_TRAIN using SUMMARY1000 SP to get a global view of the characteristics of the dataset.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here the intable has the name of the input table from which I want SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store gathered statistics for the entire dataset. Besides the outtable, SUMMARY1000 SP creates a few additional output tables — one table with statistics for each column type. Our dataset has two types of columns — numeric and nominal. So, SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to a specific datatype and reduces the number of columns whose statistics are viewed together, which simplifies the analysis.
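To confirm which output tables SUMMARY1000 created, a catalog query such as the following lists them (a sketch; it assumes the output tables were created in the FLIGHT schema, as in the call above):
-- list the tables generated by SUMMARY1000
SELECT TABNAME FROM SYSCAT.TABLES WHERE TABSCHEMA = 'FLIGHT' AND TABNAME LIKE 'FLIGHTS_TRAIN_SUM1000%'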
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For the numeric columns, SUMMARY1000 gathers the following statistics:
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I will create a decision tree model, I don’t need to deal with the large values or the missing values; Db2 handles both issues natively.
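As a cross-check on the reported missing values, a direct count of NULLs in those two columns (a simple sketch) should agree with the SUMMARY1000 output:
-- count(column) ignores NULLs, so the difference from count(*) is the number of missing values
SELECT count(*) - count(DEPDELAY) AS DEPDELAY_MISSING, count(*) - count(DEPDEL15) AS DEPDEL15_MISSING FROM FLIGHT.FLIGHTS_TRAIN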
Next, I investigate the summary statistics of the nominal columns.
select * from FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For nominal columns, SUMMARY1000 gathered the following statistics:
From the above data exploration, I can see that the nominal columns have no missing values. Four time-related columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I’ll leave the nominal values in all columns as-is, since the decision tree implementation in Db2 can deal with them natively.
Extract the hour part from the TIME columns — CRSARRTIME, DEPTIME, WHEELSOFF.
From looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are hhmm encodings of time values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
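A quick sanity check (a sketch) confirms that the scaled columns now hold hour values rather than hhmm values:
-- after scaling, the minimum and maximum should fall within an hour range
SELECT MIN(CRSARRTIME), MAX(CRSARRTIME), MIN(DEPTIME), MAX(DEPTIME), MIN(WHEELSOFF), MAX(WHEELSOFF) FROM FLIGHT.FLIGHTS_TRAIN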
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP with the following parameters: model, the name under which the trained model is saved (FLIGHT.flight_dectree); intable, the table containing the training data; id, the column that uniquely identifies each record; and target, the column the model learns to predict (FLIGHTSTATUS).
After completing the model training, the GROW_DECTREE SP generated several tables with metadata about the model and the training dataset. This link has the complete list of model tables and their details.
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it for generating predictions on the test records from the FLIGHTS_TEST table.
First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
I use PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP: model, the name of the trained model (FLIGHT.flight_dectree); intable, the table with the test records; outtable, the table where the SP writes its predictions; prob=true, to also output the probability of each prediction; and outtableprob, the table where those probabilities are stored.
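To spot-check a few predictions against the actual flight status, a join like the following works (a sketch; it assumes the output table carries the record ID and a prediction column named CLASS, so verify the column names PREDICT_DECTREE produced in your environment):
-- compare predicted and actual flight status for a handful of test records
SELECT P.ID, P.CLASS AS PREDICTED, T.FLIGHTSTATUS AS ACTUAL
FROM FLIGHT.FLIGHTS_TEST_PRED P JOIN FLIGHT.FLIGHTS_TEST T ON P.ID = T.ID
FETCH FIRST 10 ROWS ONLY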
Using generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model’s predictions.
I use CONFUSION_MATRIX SP to create a confusion matrix based on the model’s prediction on the TEST dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
In calling this SP, here are some of the key parameters that I passed: intable, the table with the actual flight status of the test records; resulttable, the table with the model’s predictions; id, the record ID column; target, the actual target column (FLIGHTSTATUS); and matrixTable, the output table where the SP stores the confusion matrix.
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
This table has three columns. The REAL column has the actual flight status, and the PREDICTION column has the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns, and the third column holds the count of test records for each combination.
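Viewing the matrix table directly shows the count for each of the four combinations:
SELECT * FROM FLIGHT.FLIGHTS_TEST_CMATRIX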
I use these counts to compute a few evaluation metrics for the model. For doing so, I use CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS SP generates two sets of output. The first one shows overall quality metrics of the model. The second one includes the model’s predictive performance for each class.
The first output includes overall model metrics: the number of correct predictions, the number of incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
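As a cross-check, the overall accuracy can also be recomputed directly from the confusion matrix table. This sketch assumes the count column in the matrix table is named CNT; adjust the name if your table differs:
-- fraction of test records where the actual and predicted status match
SELECT CAST(SUM(CASE WHEN "REAL" = PREDICTION THEN CNT ELSE 0 END) AS DECIMAL(9,4)) / SUM(CNT) AS ACCURACY
FROM FLIGHT.FLIGHTS_TEST_CMATRIX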
With classification tasks, it’s usually useful to view the model’s quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class level quality metrics.
For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV) or Precision, and F-measure (F1 score).
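For reference, these per-class metrics are computed from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts of that class:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
PPV (Precision) = TP / (TP + FP)
F-measure = 2 * PPV * TPR / (PPV + TPR)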
If you want to build and deploy an ML model in a Db2 database using Db2’s built-in stored procedures, I hope you’ll find this tutorial useful. The main takeaway is that, with a handful of SQL stored procedures, you can sample and split a dataset, explore it, train a decision tree model, generate predictions, and evaluate the model, all without the data ever leaving the database.
O’Reilly’s 2022 AI Adoption survey [3] (link resides outside ibm.com) identified building technical infrastructure and the skills gap as two of the top bottlenecks to AI adoption in the enterprise. Db2 addresses the first by supplying an end-to-end ML infrastructure in the database. It also lessens the second, the skills gap, by providing a simple SQL API for developing and using ML models in the database; in the enterprise, SQL is a far more common skill than ML.
Check out the following resources to learn more about the ML features in IBM Db2 and see additional examples of ML use cases implemented with these features.