Running AI queries with SQL Data Insights
SQL Data Insights (SQL DI) is an AI-powered Db2 feature. It combines deep learning in artificial intelligence (AI) with advanced IBM® Z technologies to infuse the Db2 engine with SQL-based semantic queries on user tables and views.
The existing data model in a relational database, SQL queries, text extensions, and user-defined functions are incapable of capturing the semantic relationships among value entities within and across columns or rows. SQL DI uses database embedding, a self-supervised learning approach in deep learning and AI, to train a neural network model and infer semantic meanings for the unique values in a relational table. The inferred meanings in the form of numeric vectors encapsulate the inter-column and intra-column data relationships. SQL DI then uses the trained model that consists of numeric vectors to run AI queries that discover, match, and cluster semantic similarities and dissimilarities in your Db2 data.
SQL DI is seamlessly integrated into your Db2 for z/OS® environment. The feature is comprised of an AI library stack, a data and query engine, and a web application. The AI library stack is provided by the Z Deep Neural Network Library component that resides natively in z/OS. The stack consist of the zDNN, zAIO, and zADE libraries, which enables SQL DI to take full advantage of IBM Z processors.
The data and query engine is built into Db2 for z/OS and provides services for processing data and semantic queries. The core of this processing engine is the set of AI_SIMILARITY
, AI_SEMANTIC_CLUSTER
, AI_ANALOGY
, and AI_COMMONALITY
scalar functions. You can use these built-in functions in SQL statements to ask semantic questions about your data.
The SQL DI web application leverages the analytic framework of IBM Machine Learning for z/OS and the runtime engines of z/OS Spark. The application provides the primary user interface for you to create AI objects, enable and run AI queries, and visualize query results.
With SQL DI, you can quickly tap into the vast amount of mission-critical data in your Db2 and easily uncover the hidden information across tables and views for actionable insights. You can achieve all these without the costly effort of moving massive data across platforms, procuring expensive AI infrastructures, and acquiring advanced AI skills.
How does SQL DI work?
In the web user interface (UI), you connect SQL DI to your Db2, create an AI object from a selected source Db2 table or view, enable the object for AI queries, and run queries on the object at any time.
To enable the object for AI queries, SQL DI preprocesses the data in the object, trains a neural network model on the data, and loads the model into Db2. For data preprocessing, SQL DI uses an embedded Spark cluster to convert Db2 columns selected for enablement as text, known as the AI object text data. After data transformation, SQL DI clusters numeric data type values and replace all numeric values with cluster identifiers. In text format, every numeric or categorical data type value in the data is tagged with its column name. In addition, every numeric data type value is associated a cluster identifier.
After the data is preprocessed, SQL DI uses z/OS AI capabilities provided by the zDNN stack to train the model with input from the AI object text data set. The training process generates numeric vectors for every unique value in the data set. The numeric vectors represent the inter-column and intra-column semantic relationships between different unique values, also referred as the vocabulary, in the object. SQL DI uses the Db2 LOAD utility to load the trained model into a Db2 table.
After successful enablement, you can run similarity, dissimilarity, clustering, analogy, or commonality queries on the object at any time. Simply select a query type, enter a SQL statement, and run the query. Depending on the query type that you select, SQL DI uses a corresponding Db2 AI function to process your query. It displays the first 50 rows of the query results in the UI and loads the remaining rows into the Db2 table. You can export the displayed rows from the SQL DI UI or download the entire result set from Db2.
During model training, SQL DI also collects key data statistics and renders them into column influence and discriminator scores for the object and the model. A column influence score correlates to the number of user-specified NULL values in a column and indicates the column's influence on the training of the object model. On the other hand, a discriminator score correlates to the number of unique values in a column and measures the column's ability to semantically distinguish its values from other values in the table. The visualized scores can help you understand the results of your AI queries on the object.
As your data change, the accuracy of the object model may degrade over time. When needed, you can retrain the model if it's eligible. By default, a new model is eligible for retraining if it is originally trained in an SQL DI environment where all the required maintenances (APARs) are already applied. For a model that exists prior to the APAR application, you must delete it first. After applying the required APARs and upgrading your SQL DI environment, disable the object for AI query. The disablement removes the existing model. Then, re-enable the object for AI query. Successful enablement trains a new model that is automatically eligible for retraining.
SQL DI concepts and definitions
The following concepts and definitions can help you learn, use, and administer SQL DI:
- AI object
- A SQL DI asset that you can create to contain a Db2 user table or view. You can then enable the object for AI query in the SQL DI UI.
- AI object text data
- The data in an AI object that is transformed and formatted as text for AI query enablement. This text data is the input to model training. The number of tokens in the data are the values in every categorical and numeric column that are selected for model training for the AI object.
- Vocabulary
- The total number of unique values in the selected SQL DI categorical columns and the number of clusters associated with the selected numeric columns in the AI object for AI query enablement.
- SQL DI data type
- A SQL DI-specific data
category descriptor that you can assign to a column in an AI object for column configuration and
model training. You can assign the column one of the following SQL DI data types:
- Categorical: The SQL DI
categorical
data type is used for columns with discrete values, each of which is its own entity. - Numeric: The SQL DI
numeric
data type is used for columns with continuous values. - Key: The SQL DI
key
data type is used to indicate that a column represents an entire row.
See Enabling AI query for more information.
- Categorical: The SQL DI
- AI query
- A semantic query that you can run on an AI object to infer hidden relationships between entities
in the object. You can select from the following AI query types:
SQL DI AI query types and corresponding Db2 AI functions
SQL DI query type Description Corresponding Db2 AI function Semantic similarity A similarity query identifies groups of similar records or entities in records. AI_SIMILARITY Semantic dissimilarity A dissimilarity query finds the outliers from the norm in records. AI_SIMILARITY Semantic clustering A clustering query forms a cluster of entities in records and evaluates whether or not an additional entity belongs in the cluster. AI_SEMANTIC_CLUSTER Semantic analogy An analogy query determines if the relationship between two entities applies to that of a second pair of entities. AI_ANALOGY Semantic commonality A commonality query identifies the entities in records that exhibit the most common or uncommon patterns. AI_COMMONALITY See Running an AI query for more information.
- Vector prefetch
- The advanced processes that SQL DI uses to upload the numeric vectors to the zAIO library for calculating the similarity scores. The vectors are generated from an AI object text data set. The existing query processing architecture dictates that Db2 AI functions submit the vectors, record-by-record, to z/OS for processing. To significantly improve the query performance, SQL DI implements vector prefetch, which enables Db2 to upload multiple vectors in a batch at a time.
- ibm-data2vec
- A z/OS native AI function
that SQL DI uses for model
training. The
ibm-data2vec
function is an implementation of a self-supervised database embedding algorithm. The database embedding takes as input a text file that is created from a multi-modal relational table and builds a relationship map between text tokens by using the relational data model. The input training document generated from a relational table consists of string tokens that represent different relational entities in the original Db2 table or view. The ibm-data2vec function views the training document as a set of sentences, where each sentence represents a relational table row.After model training is completed, ibm-data2vec generates a numeric vector of a pre-defined length (dimension) for each token, and the vector encodes the meaning of that token. The core numerical computations of the training process are paralleled by using multiple threads and accelerated by using hardware-accelerated numerical computations. The final trained model is stored as a binary file that uses the Db2 ZLOAD utility. See ibm-data2vec for more information.
- base10Cluster
- A z/OS numerical clustering algorithm that clusters together numerically closer items in different buckets, each of which represents a distinct cluster. SQL DI uses the base10Cluster algorithm to process an input data set and generate an output file that lists the number of buckets and their corresponding minimum values. See base10Cluster for more information.
- Column influence score
- A score that correlates to the number of user-specified and SQL NULL values in a column and indicates the column's influence on the training of the object model. The fewer NULL values the column has, the higher influence score it generates.
- Column discriminator score
- A score that correlates to the number of unique values in a column and measures the column's ability to semantically distinguish its values from other values in the table. The more unique values the column has, the higher discriminator score it generates.