Machine Learning Optimizer Technology Preview for Db2 11.5.4 and 11.5.5

News

Abstract

Machine Learning Optimizer Technology Preview for Db2 11.5.4 and 11.5.5.

Content

Changes from Db2 11.5

The Machine Learning Optimizer (also referred to as ML Optimizer) Technology Preview was initially released for Db2 11.5 (see https://www.ibm.com/support/pages/db2-machine-learning-optimizer-technology-preview for more information). In Db2 11.5.4, the following enhancements have been made to the ML Optimizer:

Python dependencies have been removed. Model training and inference are now performed directly inside the Db2 engine.
Models can now be interacted with concurrently. This means models can be retrained while queries are still using them.
A new stored procedure is provided to create models manually.
Adjustments to model discovery.

Changes in Db2 11.5.5

In Db2 11.5.5, the following enhancements have been made to the ML Optimizer:

Models can now be fully exploited for IN predicates and OR predicates with eligible OR subterms.
The ML Optimizer can now be used on any Db2 supported platform.
Model training time has been improved.

Note that models created from the Db2 11.5 ML Optimizer Technology Preview are not compatible with the Db2 11.5.4 ML Optimizer Technology Preview. If models from the first technology preview still exist, they must first be removed before using the ML Optimizer in 11.5.4 or 11.5.5. The same applies for models created in the Db2 11.5.4 ML Optimizer; they are not compatible with the Db2 11.5.5 ML Optimizer and must be removed. For more information about removing models, please see the “Removing Models” subtopic.

Introduction

Accurate cardinality estimates are fundamental to cost-based optimizers, such as the Db2 optimizer. Cardinality estimation is a process where the optimizer traditionally uses statistics to determine the size of an intermediate query result such as the number of rows of a scan in a query execution plan that applies one or more predicates. In the absence of multi-column statistics, the optimizer will default to assuming independence between multiple predicates. For example, the PROJNAME and DEPTNO columns in the PROJECT table from the SAMPLE database are not independent. If statistics are not collected on the combination of these two columns, cardinality estimates may be inaccurate for a pair of predicates such as:

WHERE PROJNAME = ‘APPLICATIONS SUPPORT’ AND DEPTNO = ‘E21’

In this example, the DEPTNO predicate value ‘E21’ corresponds to DEPTNAME = ’SOFTWARE SUPPORT’ in the DEPARTMENT table. In general, even though the predicates in this scenario may seem redundant if there is a one-to-one correspondence between DEPTNO and DEPTNAME, it may not be uncommon in practice to have two or more such predicates in a query that are strongly correlated. If these columns only have basic independent statistics collected, the estimates of the number of qualifying rows (cardinality) after applying the predicates may be very inaccurate using basic probability formulae resulting in poor performing query plans.

As part of a larger plan to develop a database optimizer founded on machine learning, cardinality estimation is a first step. Using ML for cardinality estimation will allow Db2 to automatically estimate cardinalities based on continuously learning targeted customer data and workloads. These techniques also eliminate the need for custom tuning of statistics collection options by the user.

A technology preview of the Db2 Machine Learning Optimizer for cardinality estimation was introduced in Db2 11.5 and enhanced in Db2 11.5.4. This unsupported version of the feature should only be used in non-production environments.

When the ML Optimizer feature is enabled, a machine learning model will be built for any desired tables. Either an invocation of a specific stored procedure, a RUNSTATS command, or the automatic statistics collection facility can be used to trigger the building of a machine learning model on a table. The model can then be used to predict the estimated cardinalities for operators that apply multiple predicates.

Requirements for use

Db2 11.5.4 for Linux on AMD64 or Intel EM64T systems (x64)
Db2 11.5.5 on any Db2 supported platform

Enabling the Machine Learning Optimizer Feature

Enable the ML Optimizer feature for your Db2 instance by running the following command as the Db2 instance owning userid (i.e. db2inst1):

db2set DB2_ML_OPT="ENABLE:ON;DISCOVER:ON"

After setting this registry variable, you must deactivate and re-activate your database for the new setting (and the ML Optimizer feature) to take effect.

If you are using the ML Optimizer feature with column-organized tables, real-time statistics must be disabled for your database to avoid a known issue. Real-time statistics can be disabled by switching the AUTO_STMT_STATS database configuration parameter to 'OFF' by running the following command:

db2 update database configuration for <database name> using auto_stmt_stats off

To create ML Optimizer models using the stored procedure (named “DEFINE_MODEL”), it must first be created in the database that contains the tables which models are to be created for. The stored procedure DDL is located in DFTDBPATH/sqllib/misc/ and can be invoked by running the following command:

db2 –td@ -f MLOptimizerRoutines.ddl

To utilize model discovery (explained below in the “Training Models” subtopic), the SYSTOOLS.TABLE_MODELS table must also be created. See the “SYSTOOLS.TABLE_MODELS Table” subtopic for more information on creating the SYSTOOLS.TABLE_MODELS table. Note that the SYSTOOLS.TABLE_MODELS table will automatically be created the first time the DEFINE_MODEL stored procedure is invoked.

Assuming you have enabled the ML Optimizer feature and optionally created the DEFINE_MODEL stored procedure and SYSTOOLS.TABLE_MODELS table, you are now ready to use the ML Optimizer feature. See the “Usage” topic for more information.

Disabling and Turning off the Machine Learning Optimizer Feature

There are two ways to prevent the ML Optimizer feature from being used once it has been turned on: disabling it temporarily, and turning it off entirely. Once the ML Optimizer feature has been disabled or turned off, the traditional Db2 optimizer cardinality estimates are once again used for queries.

The ML Optimizer use of the feature can be disabled immediately by running the following command:

db2set -im DB2_SELECTIVITY=”ML_PRED_SEL OFF”

Disabling the ML Optimizer feature is useful if you are planning to resume using the feature at a later time and do not want to deactivate and re-activate your database.

Turning off the ML Optimizer feature temporarily will only prevent the optimizer from using model predictions, it will not prevent the ML Optimizer feature from being able to create new models. The ML Optimizer feature can be turned off entirely by running the following command:

db2set DB2_ML_OPT="ENABLE:OFF"

After setting this registry variable, you must deactivate and re-activate your database for the new setting to take effect. If models have already been trained and you wish to remove them, see the “Removing Models” subtopic for more information.

Usage

Training Models

When a machine learning model is trained for a table, the model and metadata are stored in the SYSTOOLS.TABLE_MODELS table. Manual modifications to the SYSTOOLS.TABLE_MODELS table is not recommended.

When the ML Optimizer feature is enabled, Db2 will automatically look for suitable tables to build machine leaning models for the optimizer to exploit. The auto-RUNSTATS feature, if turned on, will periodically build models through the auto-RUNSTATS feature for tables that are recently created or changed significantly - similarly to how it collects regular statistics. This process is called model discovery.

Model discovery is a method of automatic model training that does not require invocation of the DEFINE_MODEL stored procedure. By using model discovery, the ML Optimizer feature automatically chooses training columns it thinks would best represent the “discovered” correlations present in the representative sampling of rows.

When model discovery is enabled, any manual or automatic statistics collection will automatically attempt to create a discovered model on the provided table if a manually created model doesn’t already exist. Consequently, models created via the DEFINE_MODEL stored procedure will always take precedence over discovered models, and subsequent statistics collection will attempt to retrain existing discovered models. Therefore, if a manually created model exists on a particular table that you are interested in using a discovered model for, the existing model must first be removed. See the “Removing Models” subtopic for more details on removing existing models.

To allow the ML Optimizer feature to attempt to use model discovery, the following command can be used:

db2set DB2_ML_OPT="ENABLE:ON;DISCOVER:ON"

After setting this registry variable, you must re-activate the database for the new setting to take effect. If the SYSTOOLS.TABLE_MODELS table has not yet been created, it must be created before model discovery can be used.

Note that model discovery will not occur if a statistics profile on the table is already set.

If the model needs to be built manually (this may be required if one cannot wait for auto-RUNSTATS to kick in) the only required user interaction with the Db2 ML Optimizer feature is to choose which tables should be used to train a machine learning model, and to choose which columns from those tables are typically used by query predicates from analytical queries issued against those tables. The first step is to determine which tables you think may benefit from improved cardinality estimates from the ML Optimizer feature, and then select from two up to ten columns from those tables. Machine learning model training time increases as the number of selected columns increases, so it is recommended to select only those columns that are commonly used in analytical query predicates.

Assuming your table is in the schema MYSCHEMA, and is named MYTABLE, and the columns you have selected to train the model on are named C1, C2, and C3, you can manually train a model by invoking the DEFINE_MODEL stored procedure (for instructions on creating this stored procedure, see the “Enabling the Machine Learning Optimizer feature” topic):

CALL "SYSTOOLS"."DEFINE_MODEL"(‘MYSCHEMA’, ‘MYTABLE’, ‘C1,C2,C3’, OUT_TEXT)

Once the procedure completes successfully, you should look at the data returned in OUT_TEXT. This output parameter of the DEFINE_MODEL procedure provides information to confirm that the procedure invocation and model training were successful. A successful invocation of the procedure will display the following message in OUT_TEXT:

Successfully trained model.

Please note that the column name argument of the DEFINE_MODEL procedure (in the example above, where we specify ‘C1,C2,C3’) must separate the column names using a comma. The column names are also treated as is and are case-sensitive. For example, this means that if leading or trailing spaces are present next to a column name, those spaces will be treated as part of the column name.

After a model has been trained, it will automatically be considered during Db2 optimizer access plan selections for queries issued against that table, and so can automatically benefit from improved cardinality estimates.

Retraining/Creating New Models

In some cases, a model on a table may need to be retrained because there may have been significant changes to the data in the table. When a model is retrained, it is trained using a new representative sampling of rows through Db2 RUNSTATS.

Models are retrained using the automatic statistics collection facility if data in the table has changed significantly, or can be retrained manually. There are two ways to manually retrain a model: using the DEFINE_MODEL stored procedure or using RUNSTATS. To retrain an existing model using the DEFINE_MODEL stored procedure, simply resubmit the original command used to train that model. To retrain an existing model using RUNSTATS, submit a RUNSTATS command on the associated table.

Sometimes it may be helpful to create and train a new model to capture different columns than the ones the original model was trained on (for example, if the focus for a model is on a new set of queries). To create and train a new model, resubmit the original command using the DEFINE_MODEL stored procedure with the new set of columns. Note that only one model is allowed per table as mentioned in the “Restrictions/Limitations” topic, so a previously existing model on the same table will no longer be used for predictions.

For example, to manually create and train a new model for the table named MYTABLE using columns C2, C3, and C4, you can use the following command:

CALL "SYSTOOLS"."DEFINE_MODEL"(‘MYSCHEMA’, ‘MYTABLE’, ‘C2,C3,C4’, OUT_TEXT)

This new model will replace the current model for MYTABLE which was initially trained on columns C1, C2, and C3.

Alternatively, you can ask Db2 to automatically choose up to 10 columns to train a new model with if you specify an empty column list in the command above.

Removing Models

To remove a model and its metadata entirely, it must first be removed from the SYSTOOLS.TABLE_MODELS table and then from memory. To remove the model and metadata from the SYSTOOLS.TABLE_MODELS table, a DELETE statement can be issued that specifies the table the model was trained on. To remove the model and metadata from memory, the database the model resides in needs to be re-activated. When a model no longer exists, the traditional Db2 optimizer cardinality estimates are once again used for that table.

For example, to remove the model and metadata for the table named MYTABLE, the following command can be used to remove the SYSTOOLS.TABLE_MODELS entries for that model:

DELETE FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME = 'MYTABLE';

Once all connections are terminated, the database can be re-activated using these commands to remove the model from memory:

DEACTIVATE DATABASE MYDB;

ACTIVATE DATABASE MYDB;

See the “Advanced Configuration Information” topic for more information on the SYSTOOLS.TABLE_MODELS table.

Troubleshooting

If any errors are encountered, there are two logs to check that may have more information about the error: the db2diag.log and the db2optstats log.

For example, if the DEFINE_MODEL stored procedure experiences an error, there will likely be a corresponding entry for that error in the db2diag.log (located in the directory specified by your DIAGPATH database manager configuration setting) that could provide additional information about the error. Any db2diag.log entries that contain “DB2 UDB, Machine Learning” as part of the FUNCTION header are entries created by the ML Optimizer feature.

If an error is encountered while using manual or automatic RUNSTATS to retrain models or for model discovery, db2diag.log entries may not contain enough information about the RUNSTATS command that failed. In a case like this, the failing RUNSTATS command and related model information may be found in the db2optstats log (located in your DIAGPATH/events directory). Db2optstats log entries containing “LEARNED CARDINALITY MODEL” as part of the DISCOVER event type are entries created by the ML Optimizer feature when a discovered model is attempting to be created. If such an entry is found or a model is being retrained, there should also be entries containing “TABLE AND INDEX STATS” as part of the COLLECT event type that contain the RUNSTATS command being attempted.

If you experience any error that appears to be a problem with the ML Optimizer feature (for example, experiencing SQL1224N or SQL0901N errors, hangs, etc. only while the feature is enabled), please send as much of the following information as you can to the e-mail listed in the “Feedback” topic:

Exported data from the SYSTOOLS.TABLE_MODELS table. The SYSTOOLS.TABLE_MODELS table does not contain any table data; it contains information about the tables on which models are trained on (including the schema and table name, as well as the training columns), metadata about the models, and a binary representation of the models. This table can be exported using the following Db2 command:

EXPORT TO mloptTableModels.ixf OF IXF LOBS TO mloptLobs/ LOBFILE mloptLobExport MODIFIED BY LOBSINFILE SELECT * FROM SYSTOOLS.TABLE_MODELS

The output of db2pd -db <database name> -mlopt
The db2diag.log and db2optstats logs
db2support <output_path> -d <database name> -cl 0

Example

This example uses the SAMPLE database. An access plan of the query with default statistics on the SALES table will be compared to an access plan using a learned model.

First, statistics are collected on the SALES table:

RUNSTATS ON TABLE MYSCHEMA.SALES ON ALL COLUMNS WITH DISTRIBUTION ON ALL COLUMNS AND SAMPLED DETAILED INDEXES ALL;

Consider the following query on the SALES table:

SELECT SALES_DATE, SALES_PERSON, REGION, SALES FROM MYSCHEMA.SALES WHERE SALES_PERSON = 'LUCCHESSI' AND REGION = 'Manitoba' AND SALES = 1;

This query returns the following three records:

SALES_DATE SALES_PERSON REGION SALES

---------- --------------- --------------- -----------

03/30/2006 LUCCHESSI Manitoba 1

03/31/2006 LUCCHESSI Manitoba 1

04/01/2006 LUCCHESSI Manitoba 1

3 record(s) selected.

The captured access plan for this query is shown below. The cardinality estimate of the base table is shown above the schema and table name. The output from the table scan operator (annotated as TBSCAN and is where the predicates are applied for this query) is shown above the operator and denotes the operator’s cardinality estimate:

0.706722

TBSCAN

( 3)

539.221

192

CO-TABLE: MYSCHEMA

SALES

The cardinality estimate is 0.706722 after applying the predicates which doesn’t match the actual result of 3.

A model is trained for the SALES table on the three predicate columns:

CALL "SYSTOOLS"."DEFINE_MODEL"(‘MYSCHEMA’, ‘SALES’, ‘SALES_PERSON,REGION,SALES’, OUT_TEXT)

When the query is executed again, the newly trained model will be used to predict the cardinality, producing the following access plan:

3.08214

TBSCAN

( 3)

539.221

192

CO-TABLE: MYSCHEMA

SALES

Using the model results in a cardinality estimate of 3.08214 which is much closer to the actual result of 3.

Restrictions/Limitations

The ML Optimizer feature has some restrictions and limitations that are important to be aware of while using it:

Models can only be used on equality, range, and between predicates in Db2 11.5.4 (and additionally on IN and OR predicates in Db2 11.5.5) on individual tables and derived query blocks. This also applies to OR predicate subterms that qualify these requirements. Predicates not supported by the ML optimizer will be evaluated by the traditional Db2 optimizer.
For example, the following predicates in blue are supported by the ML Optimizer in Db2 11.5.4 and 11.5.5 and the predicates in green are supported by the ML Optimizer in Db2 11.5.5 (join predicates, predicates with host variables or parameter markers that are not using REOPT, and predicates with expressions around the columns are not supported):

SELECT * FROM T1, T2 WHERE

T1.C0 = T2.C0 AND

T1.C6 IN (5, 3, 205) AND

T1.C3 = ? AND

MOD(T1.C4, 10) = 1 AND

T1.C1 = ‘abc’ AND

T1.C2 BETWEEN 5 AND 10 AND

T2.C2 <= 120 AND T2.C3 = 20 AND

(T1.C4 > 5 AND T1.C5 < 20 OR T1.C4 < 2 AND T1.C5 > 100) AND
(T2.C5 <= 12 OR T2.C6 = 20);

Models can only be trained for permanent base tables. Other table object variations including temporary tables, views, etc. are not supported.
The default maximum number of models that can exist in each database is 100. Retraining an existing model will not increase the count of existing models. See the “Overriding Default 100-Model Limit” subtopic for information on overriding this default.
Only one model can be trained per table at any given time. Any subsequent training attempts on a table with an existing model will retrain that model or replace it with a new one. See the “Retraining/Creating New Models” subtopic for more information.
The DEFINE_MODEL stored procedure only allows up to 10 columns to be specified.
Commands that modify or alter a table (such as ALTER, RENAME, and DROP COLUMN commands) will not update an existing model on that table. Similarly, dropping a table will not cascade to remove a model associated with that table. When altering a table, it is recommended to train a new model (which will replace any existing models) or to remove the model entirely if the table is dropped or renamed. For further information about how to remove a model, see the “Removing Models” subtopic.
Real-time statistics may timeout for row-organized tables with models on them.
Model discovery is not limited to the number of training columns it can use. This means that in some cases, model discovery may be too liberal in terms of the training columns chosen.
Models will only be successfully trained for tables with more than two rows.
Models cannot be trained on columns of the following types: Boolean, Graphic, Large Object, XML, Anchored, User-Defined. This restriction has been removed for Graphic columns in Db2 11.5.5.

Advanced Configuration Information

None of the information mentioned in this topic is required for basic usage of the ML Optimizer feature. For basic usage information, see the “Usage” topic.

For those looking to explore additional capabilities or seeking more information regarding the ML Optimizer feature, this section will describe additional topics of interest.

SYSTOOLS.DEFINE_MODEL Stored Procedure

The DEFINE_MODEL stored procedure is the main entry point for manually training a model on a table. It takes as input a schema and table name as well as columns from the provided table to train a model. These columns can be referred to as the “training columns”. If no training columns are provided to the stored procedure, up to ten columns from the provided table will be chosen automatically as the training columns. The DEFINE_MODEL stored procedure also provides output through an output parameter named OUT_TEXT. This parameter provides information to confirm that the procedure invocation and model training were successful.

Once the stored procedure verifies the inputs, it creates and interacts with a table called TABLE_MODELS to process the new model. The model will then be trained, and if successful, the TABLE_MODELS table will be updated with the entry for the newly trained model. For more information about the TABLE_MODELS table, see the “SYSTOOLS.TABLE_MODELS Table” subtopic.

If at any point model training fails to complete, specific messages will be stored in the OUT_TEXT parameter that pertains to the error encountered. It is also advisable to check the db2diag.log file for additional information on any failures.

For more information about encountered errors, see the “Troubleshooting” subtopic.

SYSTOOLS.TABLE_MODELS Table

The SYSTOOLS.TABLE_MODELS table is used to store and manage model metadata whenever a model is trained or retrained. Each entry in this table corresponds to information for a unique model that was created via the DEFINE_MODEL stored procedure or model discovery. Each entry contains the columns SCHEMANAME and TABLENAME to indicate the schema and name of the table that particular model was trained on. Other than what is mentioned in the “Removing Models” subtopic, it is not recommended to attempt to make manual modifications to the SYSTOOLS.TABLE_MODELS table as it can cause changes to the model configuration or database errors to occur.

The following is the table definition for SYSTOOLS.TABLE_MODELS:

CREATE TABLESPACE SYSTOOLSPACE IN IBMCATGROUP MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 4;

CREATE TABLE SYSTOOLS.TABLE_MODELS ( VERSIONID INTEGER NOT NULL,

MODELID INTEGER,

SCHEMANAME VARCHAR(128 OCTETS) NOT NULL,

TABLENAME VARCHAR(128 OCTETS) NOT NULL,

COLIDS VARCHAR(256 OCTETS),
MODEL BLOB(2G),

TIMEOFCREATION TIMESTAMP WITH DEFAULT CURRENT TIMESTAMP,

ISTRAINED BOOLEAN WITH DEFAULT FALSE,

METAMODEL BLOB(2G),

PRIMARY KEY (VERSIONID, SCHEMANAME, TABLENAME) ENFORCED )

IN SYSTOOLSPACE ORGANIZE BY ROW;

You may notice that there are sometimes two entries per table in SYSTOOLS.TABLE_MODELS, one with VERSIONID = 1 and one with VERSIONID = 0. You can assume that entries with VERSIONID = 1 contain the active and current model metadata for those particular tables.

In some cases, it may be helpful to verify that training has successfully completed on a particular model (for example, when the output of the DEFINE_MODEL stored procedure was not kept). To verify this for a particular table, the ISTRAINED column can be queried. If the value is ‘1’, the model was successfully trained. If the value is ‘0’, the model was not successfully trained (in which case, refer to the “Troubleshooting” subtopic). For example, to verify that the model for the MYSCHEMA.MYTABLE table has successfully been trained, the following query can be used:

SELECT ISTRAINED FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME='MYTABLE' AND VERSIONID=1;

The DEFINE_MODEL stored procedure executes a similar query to verify that the current model being created was successfully trained.

Overriding Default 100-Model Limit

By default, up to 100 trained models may exist for a database at any given time. To override this limit, the DB2_ML_OPT registry variable can be updated to specify a new limit using the MAX_MODELS setting. For example:

db2set DB2_ML_OPT="ENABLE:ON;MAX_MODELS:200"

This will limit the maximum number of trained models to 200. The database must be re-activated for the new setting to take effect.

Registry variables

The ML Optimizer feature introduces a new registry variable called DB2_ML_OPT as well a new setting for the DB2_SELECTIVITY registry variable. An overview is provided below.

DB2_ML_OPT:

Allows the usage and configuration of model training and inference (predictions) for tables. Accepted settings for DB2_ML_OPT are of the form: '<Option> <colon> <value>' with multiple settings delimited with a semi-colon ';'. The entire registry setting should be surrounded in double quotes to ensure the shell does not interpret the semi-colon as the end-of-statement token.
Available settings:
- ENABLE:<ON | OFF>
  - ON: Enable the creation and training of models as well as the usage of models for predictions.
  - OFF: Disable the creation, training, and usage of models (default).
- DISCOVER:<ON | OFF>
  - ON: Enable the ML Optimizer feature to attempt to automatically discover and train a model for tables during manual or automatic statistics collection. For more information, see the “Training Models” subtopic.
  - OFF: Disable the automatic discovery and training of models during manual or automatic statistics collection (default). Note that this only applies to tables with no model. Existing models will still continue to be re-trained via RUNSTATS according to the “Retraining/Creating New Models” subtopic.
- MAX_MODELS:<integer>
  - integer: The maximum number of trained models that may exist for a single database. The default setting is 100.

DB2_SELECTIVITY:

A new setting has been added to this registry variable to disable the use of model predictions.
Available setting:
- ML_PRED_SEL OFF
  - OFF: If set, the optimizer will not use model predictions.

Feedback

Although the ML Optimizer feature is not supported (since it is a Technology Preview), we would like to hear how your experience has been while using it! Please feel free to send any feedback including questions and concerns to calisto@ca.ibm.com.

[{"Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"ARM Category":[{"code":"a8m500000008PkyAAE","label":"Compiler->Optimizer"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"11.5.4;11.5.5"}]

Tips

Machine Learning Optimizer Technology Preview for Db2 11.5.4 and 11.5.5

News

Abstract

Content

Was this topic helpful?

Document Information

UID

Share your feedback

Need support?