Machine Learning Optimizer Technology Preview for Db2 11.5.6

News

Abstract

Machine Learning Optimizer Technology Preview for Db2 11.5.6.

Content

Highlights

The Machine Learning Optimizer (also referred to as ML Optimizer) Technology Preview was initially released for Db2 11.5 with subsequent enhancements made in Db2 11.5.4 and Db2 11.5.5.

In Db2 11.5.6, the following enhancements have been made to the ML Optimizer:

Models can now be exploited for cardinality estimation over single-column equality pairwise joins between two base tables
Model training time has significantly improved allowing for more columns to be trained on
Models are significantly smaller in size (approximately 100 times smaller)
Model discovery is enabled by default when the Machine Learning Optimizer is enabled

The Machine Learning Optimizer feature is designed to fully automate the building and training of models via a process called model discovery. This automation eliminates the need for users to create models manually (which can still be done, optionally, if a model is needed as soon as the associated table is populated).

This unsupported version of the feature should only be used in non-production environments.

Note that the TABLE_MODELS table definition from previous versions is not compatible with 11.5.6 and must be dropped and recreated. For more information about the creation of the TABLE_MODELS table, please see the SYSTOOLS.TABLE_MODELS Table subtopic.

Introduction

Accurate cardinality estimates are fundamental to cost-based optimizers, such as the Db2 optimizer. Cardinality estimation is a process where the optimizer traditionally uses statistics to determine the size of an intermediate query result such as the number of rows of a scan in a query execution plan that applies one or more predicates. In the absence of statistics that capture the correlations, the optimizer will default to assuming independence between multiple predicates. For example, the DEPTNO and PROJNO columns in the PROJECT table from the SAMPLE database are not independent. If correlation statistics are not collected on the combination of these two columns, cardinality estimates may be inaccurate for a query with a pair of predicates such as:

SELECT * FROM PROJECT WHERE DEPTNO = 'E21' AND PROJNO BETWEEN 'OP2010' AND 'OP2013'

PROJNO PROJNAME                 DEPTNO RESPEMP PRSTAFF PRSTDATE   PRENDATE   MAJPROJ
------ ------------------------ ------ ------- ------- ---------- ---------- -------
OP2010 SYSTEMS SUPPORT          E21    000100     4.00 01/01/2002 02/01/2003 OP2000
OP2011 SCP SYSTEMS SUPPORT      E21    000320     1.00 01/01/2002 02/01/2003 OP2010
OP2012 APPLICATIONS SUPPORT     E21    000330     1.00 01/01/2002 02/01/2003 OP2010
OP2013 DB/DC SUPPORT            E21    000340     1.00 01/01/2002 02/01/2003 OP2010
  4 record(s) selected.

In this example, the DEPTNO value 'E21' is strongly correlated to the PROJNO values between 'OP2010' and 'OP2013'. In general, even though the predicates in this scenario may seem redundant if there is correlation between DEPTNO and PROJNO, it may not be uncommon in practice to have two or more such predicates in a query that are strongly correlated. If these columns only have basic independent statistics collected, the estimates of the number of qualifying rows (cardinality) after applying the predicates may be very inaccurate using basic probability formulae resulting in poor performing query plans.

As part of a larger plan to develop a database optimizer founded on machine learning, cardinality estimation is a first step. Using ML for cardinality estimation will allow Db2 to automatically estimate cardinalities based on continuously learning targeted customer data and workloads. These techniques will significantly reduce (and possibly eliminate) the need for custom tuning of the optimizer by relying primarily on the automatic building of machine learning models.

When the ML Optimizer feature is enabled, a machine learning model can be built on any table. Either the automatic statistics collection facility or an invocation of a specific stored procedure can be used to trigger the building of a machine learning model on a table. The model can then be used to predict the estimated cardinalities for operators that apply multiple predicates.

Enabling the Machine Learning Optimizer Feature

Enable the ML Optimizer feature for your Db2 instance by running the following command as the Db2 instance owning userid (i.e. db2inst1):

db2set DB2_ML_OPT=ENABLE:ON

After setting this registry variable, you must deactivate and re-activate your database for the new setting (and the ML Optimizer feature) to take effect.

To utilize model discovery (explained below in the Training Models subtopic), the SYSTOOLS.TABLE_MODELS table must also be created in the database. The SYSTOOLS.TABLE_MODELS DDL is located in DFTDBPATH/sqllib/misc/ and can be created by running the following command:

db2 –tf MLOptimizerCreateTables.ddl

Note that the SYSTOOLSPACE table space must exist before creating the TABLE_MODELS DDL. See the SYSTOOLS.TABLE_MODELS Table subtopic for more information about the TABLE_MODELS table.

To manually create ML Optimizer models using the stored procedure (named SYSTOOLS.DEFINE_MODEL), the stored procedure must also be created in the database. The stored procedure DDL is located in DFTDBPATH/sqllib/misc/ and can be created by running the following command:

db2 –td@ -f MLOptimizerRoutines.ddl

Note that the SYSTOOLS.TABLE_MODELS table will automatically be created the first time the DEFINE_MODEL stored procedure is invoked.

Assuming you have enabled the ML Optimizer feature, created the SYSTOOLS.TABLE_MODELS table, and optionally created the DEFINE_MODEL stored procedure, you are now ready to use the ML Optimizer feature.

Usage

Training Models

When a machine learning model is trained for a table, the model and metadata are stored in the SYSTOOLS.TABLE_MODELS table. Manual modifications to the SYSTOOLS.TABLE_MODELS table is not recommended.

When the ML Optimizer feature is enabled, Db2 will automatically look for suitable tables to build machine leaning models for the optimizer to exploit. The auto-RUNSTATS feature, if turned on, will periodically build models for tables that have been recently created or changed significantly. This process is called model discovery.

Model discovery is a method of automatic model training that does not require invocation of the DEFINE_MODEL stored procedure. By using model discovery, the ML Optimizer feature automatically chooses columns it thinks would best represent the correlations present in the representative sampling of rows.

When model discovery is enabled, the automatic statistics collection facility will automatically attempt to create a discovered model on the provided table if a manually created model doesn’t already exist. Consequently, models created via the DEFINE_MODEL stored procedure will always take precedence over discovered models, and subsequent statistics collection will attempt to retrain existing discovered models. Therefore, if a manually created model exists on a particular table that you are interested in using a discovered model for, the existing model must first be removed. See the Removing Models subtopic for more details on removing existing models.

Note that model discovery will not occur if a statistics profile on the table is already set.

If the model needs to be built manually (this may be required if one cannot wait for auto-RUNSTATS to kick in) the only required user interaction with the Db2 ML Optimizer feature is to choose which tables should be used to train a machine learning model, and to choose which columns from those tables are typically used by query predicates from analytical queries issued against those tables. The first step is to determine which tables you think may benefit from improved cardinality estimates from the ML Optimizer feature, and then select from two up to twenty columns from those tables. Machine learning model training time increases as the number of selected columns increases, so it is recommended to select only those columns that are commonly used in query predicates.

Assuming your table is in the schema MYSCHEMA, and is named MYTABLE, and the columns you have selected to train the model on are named C1, C2, and C3, you can manually train a model by invoking the DEFINE_MODEL stored procedure:

CALL "SYSTOOLS"."DEFINE_MODEL"('MYSCHEMA', 'MYTABLE', 'C1,C2,C3', OUT_TEXT)

Once the procedure completes successfully, you should look at the data returned in OUT_TEXT. This output parameter of the DEFINE_MODEL procedure provides information to confirm that the procedure invocation and model training were successful. A successful invocation of the procedure will display the following message in OUT_TEXT:

Successfully trained model.

Please note that the column name argument of the DEFINE_MODEL procedure (in the example above, where we specify 'C1,C2,C3') must separate the column names using a comma. The column names are also treated as is and are case-sensitive. For example, this means that if leading or trailing spaces are present next to a column name, those spaces will be treated as part of the column name.

After a model has been trained, it will automatically be considered during Db2 optimizer access plan selections for queries issued against that table, and so can automatically benefit from improved cardinality estimates.

Retraining/Creating New Models

In some cases, a model on a table may need to be retrained because there may have been significant changes to the data in the table. When a model is retrained, it is trained using a new representative sampling of rows through Db2 RUNSTATS.

Models are retrained automatically using the automatic statistics collection facility if data in the table has changed significantly, or they can be retrained manually. There are two ways to manually retrain a model: using the DEFINE_MODEL stored procedure or using RUNSTATS. To retrain an existing model using the DEFINE_MODEL stored procedure, simply resubmit the original command used to train that model. To retrain an existing model using RUNSTATS, submit a RUNSTATS command on the associated table.

Sometimes it may be helpful to create and train a new model to capture different columns than the ones the original model was trained on (for example, if the focus for a model is on a new set of queries). To create and train a new model, resubmit the original command using the DEFINE_MODEL stored procedure with the new set of columns. Note that only one model is allowed per table, so a previously existing model on the same table will no longer be used for predictions.

For example, to manually create and train a new model for the table named MYTABLE using columns C2, C3, and C4, you can use the following command:

CALL "SYSTOOLS"."DEFINE_MODEL"('MYSCHEMA', 'MYTABLE', 'C2,C3,C4', OUT_TEXT)

This new model will replace the current model for MYTABLE which was initially trained on columns C1, C2, and C3.

Alternatively, you can ask Db2 to automatically choose up to twenty columns to train a new model with if you specify an empty column list in the command above.

Removing Models

To remove a model and its metadata entirely, it must first be removed from the SYSTOOLS.TABLE_MODELS table and then from memory. To remove the model and metadata from the SYSTOOLS.TABLE_MODELS table, a DELETE statement can be issued that specifies the table the model was trained on. To remove the model and metadata from memory, the database the model resides in needs to be re-activated. When a model no longer exists, the traditional Db2 optimizer cardinality estimates are once again used for that table.

For example, to remove the model and metadata for the table named MYTABLE, the following command can be used to remove the SYSTOOLS.TABLE_MODELS entries for that model:

DELETE FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME = 'MYTABLE'

Once all connections are terminated, the database can be re-activated using these commands to remove the model from memory:

DEACTIVATE DATABASE MYDB
ACTIVATE DATABASE MYDB

See the Advanced Configuration Information topic for more information on the SYSTOOLS.TABLE_MODELS table.

Troubleshooting

If any errors are encountered, there are two logs to check that may have more information about the error: the db2diag.log and the db2optstats log.

For example, if the DEFINE_MODEL stored procedure experiences an error, there will likely be a corresponding entry for that error in the db2diag.log (located in the directory specified by your DIAGPATH database manager configuration setting) that could provide additional information about the error. Any db2diag.log entries that contain "DB2 UDB, Machine Learning" as part of the FUNCTION header are entries created by the ML Optimizer feature.

If an error is encountered while using manual or automatic RUNSTATS to retrain models or for model discovery, db2diag.log entries may not contain enough information about the RUNSTATS command that failed. In a case like this, the failing RUNSTATS command and related model information may be found in the db2optstats log (located in your DIAGPATH/events directory). Db2optstats log entries containing "LEARNED CARDINALITY MODEL" as part of the DISCOVER event type are entries created by the ML Optimizer feature when a discovered model is attempting to be created. If such an entry is found or a model is being retrained, there should also be entries containing "TABLE AND INDEX STATS" as part of the COLLECT event type that contain the RUNSTATS command being attempted.

If you experience any error that appears to be a problem with the ML Optimizer feature (for example, experiencing SQL1224N or SQL0901N errors, hangs, etc. only while the feature is enabled), please send as much of the following information as you can to the e-mail listed in the Feedback topic:

Exported data from the SYSTOOLS.TABLE_MODELS table. The SYSTOOLS.TABLE_MODELS table does not contain any table data; it contains information about the tables on which models are trained on (including the schema and table name, as well as the training columns), metadata about the models, and a binary representation of the models. This table can be exported using the following Db2 command:

EXPORT TO mloptTableModels.ixf OF IXF LOBS TO mloptLobs/ LOBFILE mloptLobExport MODIFIED BY LOBSINFILE SELECT * FROM SYSTOOLS.TABLE_MODELS

The output of db2pd -db <database name> -mlopt
The db2diag.log and db2optstats logs
db2support <output_path> -d <database name> -cl 0

Example

This example uses the SAMPLE database. An access plan of the query with default statistics on the STAFF table will be compared to an access plan using a learned model.

First, statistics are collected on the STAFF table:

RUNSTATS ON TABLE MYSCHEMA.STAFF ON ALL COLUMNS WITH DISTRIBUTION ON ALL COLUMNS AND SAMPLED DETAILED INDEXES ALL

Consider the following query on the STAFF table:

SELECT NAME, DEPT, YEARS, SALARY, COMM FROM MYSCHEMA.STAFF WHERE JOB = 'Sales' AND DEPT IN (15,20,38,51,66,84) AND YEARS >= 5 AND SALARY > 60000 AND COMM BETWEEN 200 AND 1300

This query returns the following nine records:

NAME      DEPT   YEARS  SALARY    COMM
--------- ------ ------ --------- ---------
Pernal        20      8  78171.25    612.45
O'Brien       38      6  78006.00    846.55
Rothman       15      7  76502.83   1152.00
Williams      51      6  79456.50    637.65
Smith         51      7  87654.50    992.80
Wilson        66      9  78674.50    811.50
Davis         84      5  65454.50    806.10
Graham        66     13  71000.00    200.30
Edwards       84      7  67844.00   1285.00

  9 record(s) selected.

The captured access plan for this query is shown below. The cardinality estimate of the base table is shown above the schema and table name. The output from the table scan operator (annotated as TBSCAN and is where the predicates are applied for this query) is shown above the operator and denotes the operator's cardinality estimate:

     1.7513
     TBSCAN
     (   2)
     13.7662
       1
       |
       35
TABLE: MYSCHEMA
      STAFF
       Q3

The optimizer underestimates the cardinality, estimating 1.7513 compared to the actual result of 9.

A model is trained for the STAFF table on the five predicate columns:

CALL "SYSTOOLS"."DEFINE_MODEL"('MYSCHEMA', 'STAFF', 'JOB,DEPT,YEARS,SALARY,COMM', OUT_TEXT)

When the query is executed again, the newly trained model will be used to predict the cardinality, producing the following access plan:

     11.2483
     TBSCAN
     (   2)
     14.429
       1
       |
       35
TABLE: MYSCHEMA
      STAFF
       Q3

Using the model results in a cardinality estimate of 11.2483 which is much closer to the actual result of 9.

Restrictions/Limitations

The ML Optimizer feature has some restrictions and limitations that are important to be aware of while using it:

Models can only be used on equality, range and between predicates and on IN and OR predicates on individual tables and derived query blocks. This also applies to OR predicate subterms that qualify these requirements. Models can also be used on single-column equality pairwise join predicates over base tables. Predicates not supported by the ML optimizer will be evaluated by the traditional Db2 optimizer.

For example, all predicates in the following query are supported by the ML Optimizer:
```
SELECT * FROM T1, T2 WHERE
T1.C0 = T2.C0 AND
T1.C6 IN (5, 3, 205) AND
T1.C1 = 'abc' AND
T1.C2 BETWEEN 5 AND 10 AND
T2.C2 <= 120 AND T2.C3 = 20 AND
(T1.C4 > 5 AND T1.C5 < 20 OR T1.C4 < 2 AND T1.C5 > 100) AND
(T2.C5 <= 12 OR T2.C6 = 20)
```
The query below contains predicates that are not supported (multi-column and non-equality join predicates, predicates with host variables or parameter markers that are not using REOPT, and predicates with expressions around the columns):
```
SELECT * FROM T1, T2 WHERE
T1.C0 <= T2.C0 AND
T1.C1 > T2.C1 AND
T1.C3 = ? AND
MOD(T1.C4, 10) = 1
```

Models can only be trained for permanent base tables. Other table object variations including temporary tables, views, etc. are not supported.

The default maximum number of models that can exist in each database is 100. Retraining an existing model will not increase the count of existing models. See the Overriding Default 100-Model Limit subtopic for information on overriding this default.
Only one model can be trained per table at any given time. Any subsequent training attempts on a table with an existing model will retrain that model or replace it with a new one. See the Retraining/Creating New Models subtopic for more information.
The DEFINE_MODEL stored procedure only allows up to twenty columns to be specified.
Commands that modify or alter a table (such as ALTER, RENAME, and DROP COLUMN commands) will not update an existing model on that table. Similarly, dropping a table will not cascade to remove a model associated with that table. When altering a table, it is recommended to train a new model (which will replace any existing models) or to remove the model entirely if the table is dropped or renamed. For further information about how to remove a model, see the Removing Models subtopic.
Real-time statistics may timeout for row-organized tables with models on them.
Model discovery is not limited to the number of training columns it can use. This means that in some cases, model discovery may be too liberal in terms of the training columns chosen.
Models will only be successfully trained for tables with more than two rows.
Models cannot be trained on columns of the following data types: Boolean, Large Object, XML, Anchored, User-Defined.

Disabling and Turning off the Machine Learning Optimizer Feature

There are two ways to prevent the ML Optimizer feature from being used once it has been turned on: disabling models from being used, and turning the feature off entirely to prevent the creation, training, and usage of models. Once models are no longer used, the traditional Db2 optimizer cardinality estimates are used once again.

ML Optimizer model usage can be disabled by running the following command:

db2set -im DB2_SELECTIVITY="ML_PRED_SEL OFF"

Disabling ML Optimizer model usage is useful if you are planning to resume using the models at a later time and do not want to deactivate and re-activate your database.

The ML Optimizer feature can be turned off entirely by running the following command:

db2set DB2_ML_OPT=ENABLE:OFF

After setting this registry variable, you must deactivate and re-activate your database for the new setting to take effect. If models have already been trained and you wish to remove them, see the Removing Models subtopic for more information.

Advanced Configuration Information

None of the information mentioned in this topic is required for basic usage of the ML Optimizer feature.

For those looking to explore additional capabilities or seeking more information regarding the ML Optimizer feature, this section will describe additional topics of interest.

SYSTOOLS.DEFINE_MODEL Stored Procedure

The DEFINE_MODEL stored procedure is the main entry point for manually training a model on a table. It takes as input a schema and table name as well as columns from the provided table to train a model. These columns can be referred to as the "training columns". If no training columns are provided to the stored procedure, up to twenty columns from the provided table will be chosen automatically as the training columns. The DEFINE_MODEL stored procedure also provides output through an output parameter named OUT_TEXT. This parameter provides information to confirm that the procedure invocation and model training were successful.

Once the stored procedure verifies the inputs, it creates and interacts with a table called TABLE_MODELS to process the new model. The model will then be trained, and if successful, the TABLE_MODELS table will be updated with the entry for the newly trained model. For more information about the TABLE_MODELS table, see the SYSTOOLS.TABLE_MODELS Table subtopic.

If at any point model training fails to complete, specific messages will be stored in the OUT_TEXT parameter that pertains to the error encountered. It is also advisable to check the db2diag.log file for additional information on any failures.

For more information about encountered errors, see the Troubleshooting subtopic.

SYSTOOLS.TABLE_MODELS Table

The SYSTOOLS.TABLE_MODELS table is used to store and manage model metadata whenever a model is trained or retrained. Each entry in this table corresponds to information for a unique model that was created via the DEFINE_MODEL stored procedure or model discovery. Each entry contains the columns SCHEMANAME and TABLENAME to indicate the schema and name of the table that particular model was trained on. Other than what is mentioned in the Removing Models subtopic, it is not recommended to attempt to make manual modifications to the SYSTOOLS.TABLE_MODELS table as it can cause changes to the model configuration or database errors to occur.

The SYSTOOLS.TABLE_MODELS DDL is located in DFTDBPATH/sqllib/misc/MLOptimizerCreateTables.ddl.

You may notice that there are sometimes two entries per table in SYSTOOLS.TABLE_MODELS, one with VERSIONID = 1 and one with VERSIONID = 0. You can assume that entries with VERSIONID = 1 contain the active and current model metadata for those particular tables.

In some cases, it may be helpful to verify that training has successfully completed on a particular model (for example, when the output of the DEFINE_MODEL stored procedure was not kept). To verify this for a particular table, the ISTRAINED column can be queried. If the value is '1', the model was successfully trained. If the value is '0', the model was not successfully trained (in which case, refer to the Troubleshooting subtopic). For example, to verify that the model for the MYSCHEMA.MYTABLE table has successfully been trained, the following query can be used:

SELECT ISTRAINED FROM SYSTOOLS.TABLE_MODELS WHERE SCHEMANAME = 'MYSCHEMA' AND TABLENAME = 'MYTABLE' AND VERSIONID = 1

The DEFINE_MODEL stored procedure executes a similar query to verify that the current model being created was successfully trained.

Overriding Default 100-Model Limit

By default, up to 100 trained models may exist for a database at any given time. To override this limit, the DB2_ML_OPT registry variable can be updated to specify a new limit using the MAX_MODELS setting. For example:

db2set DB2_ML_OPT="ENABLE:ON;MAX_MODELS:200"

This will limit the maximum number of trained models to 200. The database must be re-activated for the new setting to take effect.

Registry variables

The ML Optimizer feature introduces a new registry variable called DB2_ML_OPT as well a new setting for the DB2_SELECTIVITY registry variable. An overview is provided below.

DB2_ML_OPT:

Allows the usage and configuration of model training and inference (predictions) for tables. Accepted settings for DB2_ML_OPT are of the form: '<Option> <colon> <value>' with multiple settings delimited with a semi-colon ';'. The entire registry setting should be surrounded in double quotes to ensure the shell does not interpret the semi-colon as the end-of-statement token.
Available settings:
- ENABLE:<ON | OFF>
  - ON: Enable the creation and training of models as well as the usage of models for predictions.
  - OFF: Disable the creation, training, and usage of models (default).
- DISCOVER:<ON | OFF>
  - ON: Enable model discovery to train models during manual or automatic statistics collection (default when the ENABLE setting is set to ON).
  - OFF: Disable model discovery during manual or automatic statistics collection. Note that this only applies to tables with no model. Existing models will still continue to be re-trained via RUNSTATS.
- MAX_MODELS:<integer>
  - integer: The maximum number of trained models that may exist for a single database. The default setting is 100.

DB2_SELECTIVITY:

A new setting has been added to this registry variable to disable the use of model predictions.
Available setting:
- ML_PRED_SEL OFF
  - OFF: If set, the optimizer will not use model predictions.

Feedback

Although the ML Optimizer feature is not supported (since it is a Technology Preview), we would like to hear how your experience has been while using it! Please feel free to send any feedback including questions and concerns to calisto@ca.ibm.com.

[{"Type":"SW","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"ARM Category":[{"code":"a8m500000008PkyAAE","label":"Compiler->Optimizer"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"11.5.6"}]

Tips

Machine Learning Optimizer Technology Preview for Db2 11.5.6

News

Abstract

Content

Was this topic helpful?

Document Information

UID

Share your feedback

Need support?