Data quality project settings (Watson Knowledge Catalog)

Configure project settings for all data quality projects globally or for each data quality project individually.

When you configure project settings globally, they are applied to all available data quality projects but can be overwritten at a lower level (at project level or, for some settings, at job level). When you configure settings for a particular data quality project, these settings apply to that project only.

Required permissions:
To access global project settings, go to Governance > Data quality, and click the global settings icon. You must have these user permissions to access and change the global settings:
  • Manage asset discovery
  • Manage data quality
  • Manage users

To access settings for a particular data quality project, open the project and go to the Settings tab.

To view the settings, you must have these user permissions:

  • Access data quality
  • Manage asset discovery

You must also have at least the Business Analyst project role assigned.

To change any settings, you must have these user permissions:

  • Manage asset discovery
  • Manage data quality

You must also have the Business Analyst and Data Operator project roles assigned.

The following sections list settings that you can configure globally or for individual data quality projects:

General settings

These settings are project specific.

Description
Provide a short and long description of a data quality project. The short description is displayed in the Dashboard tab of a data quality project.
Steward
Select a user responsible for the data quality project.
Enable drill-down security
Restrict the ability to drill down into source data. If you enable this option, only users with the Drill Down User project role can drill down into source data.

Column analysis settings

These settings can be configured globally and at data quality project level.

Null threshold
Specify a value in the range 0.01 - 10% as the null threshold. This setting determines whether a column or flat file field allows null values. If a column or flat file field has null values with a frequency percentage equal to or greater than the null threshold, it is determined that the data field allows null values. If null values do not exist in the data field or their frequency percentage is less than the threshold, it is determined that the data field does not allow null values. The default is 1%.
Cardinality
Uniqueness threshold
Specify a value in the range 90 - 100% as the uniqueness threshold. This setting determines whether a data field contains unique values. A column or flat file field is considered unique if it has a percentage of distinct values equal to or greater than the threshold that you set. The default is 99%.
Constant threshold
Specify a value in the range 90 - 100% as the constant threshold. This setting determines whether a column or flat file field contains constant values. It is determined that a field is constant if it has a single distinct value with a frequency percentage equal to or greater than the constant threshold that you set. The default is 99%.
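Taken together, these thresholds are simple frequency checks over a column's value statistics. The following Python sketch is illustrative only; the function and field names are assumptions for this example and do not reflect product internals:

    from collections import Counter

    def classify_column(values, null_threshold=1.0, uniqueness_threshold=99.0,
                        constant_threshold=99.0):
        # Evaluate one non-empty column against the default thresholds.
        total = len(values)
        non_null = [v for v in values if v is not None]
        nulls = total - len(non_null)
        freq = Counter(non_null)
        top_count = freq.most_common(1)[0][1] if freq else 0
        return {
            # Nulls at or above the null threshold: the field allows nulls.
            "allows_nulls": 100.0 * nulls / total >= null_threshold,
            # Distinct-value percentage at or above the uniqueness threshold.
            "is_unique": 100.0 * len(freq) / total >= uniqueness_threshold,
            # A single distinct value at or above the constant threshold.
            "is_constant": 100.0 * top_count / total >= constant_threshold,
        }

For example, classify_column(["A"] * 99 + [None]) reports the field as constant and as allowing nulls with the default thresholds.
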
Frequency distribution settings
Configure the amount of statistics and values that are stored after each column analysis.

  • Store only the analysis statistics - This option runs the fastest and uses the least amount of storage space because it does not generate or store distinct values.
  • Store the analysis statistics and a limited number of distinct values - This option runs efficiently and takes up a limited amount of storage space. It provides a good representation of the values because a third of the stored values are the most frequent, a third are the least frequent, and a third are random (see the sketch after this list). Specify the maximum number of stored values. For optimal performance, specify a value in the range 500 - 5000. The default is 1,000.
  • Store the analysis statistics and all distinct values - This option might significantly slow down the speed of the analysis and use large amounts of storage space in the analysis database.
  • Store the analysis statistics, all distinct values, and all updated values from previous analysis jobs - This option is the slowest one and uses the largest amount of storage space in the analysis database.
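The second option's value selection can be pictured as splitting the stored values into three groups. The following sketch is a plausible reading of that description, not the product's actual algorithm; all names are made up for this example:

    import random
    from collections import Counter

    def sample_distinct_values(values, max_stored=1000, seed=0):
        # Distinct values ordered from most frequent to least frequent.
        by_freq = [v for v, _ in Counter(values).most_common()]
        if len(by_freq) <= max_stored:
            return by_freq                      # everything fits; store all
        third = max_stored // 3
        top = by_freq[:third]                   # most frequent third
        bottom = by_freq[-third:]               # least frequent third
        middle = by_freq[third:-third]
        random.Random(seed).shuffle(middle)     # deterministic random picks
        rest = middle[:max_stored - 2 * third]  # random remainder
        return top + rest + bottom
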
Data classification settings
Specify which data classes can be assigned to a column or flat file during a column analysis. All data classes that are enabled are applied during all column analysis jobs that are run in your data quality project.

Data quality settings

These settings can be configured globally and at data quality project level.

Data quality threshold
Specify the data quality threshold to identify all data assets that have a data quality score below the threshold. For example, if you set the threshold to 95%, the data quality scores of all data assets with a score below 95% violate the threshold, and those assets are marked orange on the data asset tile.
Data quality dimensions
Select the data quality dimensions that you want to apply when you run quality analysis on your data assets.
Ignore new data quality dimensions that are installed
Select this option if you don’t want your data quality scores to be impacted by any updates. Updates include adding custom data quality dimensions, or adding new data quality dimensions as part of an upgrade or fix pack installation.
Enable automation rules
Select this option if you have automation rules and you want them to be applied automatically as part of a data quality or column analysis. Automation rules might update data rule, quality rule, data quality score threshold, and data quality dimension settings.

Enabling this option makes a data quality project a governed project. Whether the project is governed is shown in the information side panel when you open a project.

Keys and relationships analysis settings

These settings are project specific.

Primary key settings
Minimum uniqueness allowed
The minimum uniqueness factor that a column must have to be detected as a primary key candidate. The factor is determined based on the number of unique values in the column.
Compound keys and relationships
Search for compound key relationships
Select this option if you want to run a multiple column analysis to determine which combinations of columns can be used as primary keys. Specify the maximum number of columns that can be part of a searched multiple-column primary key. The default value is 2. You can provide any value in the range 2 - 32. The higher the number, the longer the analysis might take.
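Conceptually, the search tests combinations of columns for uniqueness, which is why the cost grows quickly with the maximum number of columns. A minimal Python sketch, with hypothetical names and rows represented as dictionaries:

    from itertools import combinations

    def find_compound_key_candidates(rows, columns, max_columns=2):
        # A combination is a candidate if its value tuples are unique
        # across all rows (no duplicate compound key values).
        candidates = []
        for size in range(2, max_columns + 1):
            for combo in combinations(columns, size):
                seen = {tuple(row[c] for c in combo) for row in rows}
                if len(seen) == len(rows):
                    candidates.append(combo)
        return candidates

Because the number of combinations of n columns taken k at a time grows combinatorially, raising the maximum from 2 toward 32 can increase the runtime dramatically.
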
Foreign key settings
Maximum percentage of allowed orphan values
Maximum percentage of values not found in the primary key for a column to be considered as a foreign key candidate.
Minimum percentage of common distinct values
Minimum percentage of common distinct values that must exist between two columns to be considered related when a relationship analysis is run. The percentage must be between 0.0 and 100.0. The default is 30%. Raise the percentage if your relationship analysis is generating relationships that you do not consider valid. This reduces false positives. Lower the percentage if you want to detect relationship candidates that have fewer similarities. This might create more false positives.
Minimum confidence for the relationships
Minimum confidence for the relationships to be automatically selected.
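The orphan and overlap settings can be read as two percentage checks on a candidate column pair. The following sketch is illustrative only: the confidence scoring is not modeled, and computing the overlap against the candidate column's distinct values is an assumption of this example, not a documented rule:

    def is_foreign_key_candidate(fk_values, pk_values,
                                 max_orphan_pct, min_common_distinct_pct=30.0):
        pk_set = set(pk_values)
        fk_set = set(fk_values)
        # Orphans: foreign key values with no match in the primary key.
        orphans = sum(1 for v in fk_values if v not in pk_set)
        orphan_pct = 100.0 * orphans / len(fk_values)
        # Overlap: distinct values that the two columns have in common.
        common_pct = 100.0 * len(fk_set & pk_set) / len(fk_set)
        return (orphan_pct <= max_orphan_pct
                and common_pct >= min_common_distinct_pct)
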
Limit columns to speed up analysis
These settings apply to relationship analysis and overlap analysis. To reduce analysis time, define the columns that you want to include in or exclude from the analysis. You can use the following filters (see the sketch after this list):
  • Only columns whose names match the specified Java regular expressions are included or excluded.
  • Only columns of the selected types are included or excluded.
  • Only the first n columns that match the specified conditions are analyzed.
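A sketch of how such filters could be applied, shown in the include direction only. The product uses Java regular expressions; Python's re module is used here as an approximation, and all names are hypothetical:

    import re

    def filter_columns(columns, name_pattern=None, allowed_types=None, first_n=None):
        # 'columns' is a list of (name, type) pairs.
        result = []
        for name, col_type in columns:
            if name_pattern and not re.fullmatch(name_pattern, name):
                continue                        # name does not match the pattern
            if allowed_types and col_type not in allowed_types:
                continue                        # type is not selected
            result.append((name, col_type))
            if first_n and len(result) == first_n:
                break                           # keep only the first n matches
        return result

For example, filter_columns(cols, name_pattern=r"CUST_.*", allowed_types={"VARCHAR"}, first_n=10) keeps at most the first 10 VARCHAR columns whose names start with CUST_.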

Sampling settings

These settings can be configured globally and at project level. Some of these settings can be overwritten at job level.

Based on these settings, a data sample can be created to run an analysis against. The data sample can be used when you run a data quality analysis or data rules. The sample can help you assess performance and provides a preview of your analysis results.

Use sample
Use sample
By default, sampling is enabled globally, and the maximum data sample size for new data quality projects is 1,000 records. If not overwritten at project or job level, the global sampling configuration is applied to all data analyses, data rules, and quality rules that are run in automated discovery jobs in any data quality project. If sampling is enabled at project level, the project settings overwrite any global sampling settings and are applied to all automated discovery jobs in this data quality project. If sampling is not enabled globally or at project level, you can still enable it for a single job. With the appropriate permissions, you can also overwrite the project-level sampling settings at job level.
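The precedence chain can be summarized in a few lines. The following Python sketch is illustrative only; the function and configuration names are assumptions, not a product API:

    def effective_sampling(global_cfg, project_cfg=None, job_cfg=None):
        # The most specific level wins: job settings overwrite project
        # settings, which in turn overwrite the global settings.
        return job_cfg or project_cfg or global_cfg

    # Example: the job-level setting takes effect.
    cfg = effective_sampling({"useSample": True, "size": 1000},
                             job_cfg={"useSample": True, "size": 500})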

Changes to the global sampling settings are reflected only in new data quality projects. In data quality projects that were created before the change, the original sampling settings still apply.

Quick scan jobs use sampling by default regardless of the global or individual project settings.

Set the maximum number of records that you want to include in your data asset sample
The maximum number of records that you specify is the maximum number of records that are returned regardless of how the jobs are run on the parallel engine. For new data quality projects, the default value is 1,000.
Use the first x number of rows where x is the maximum number of records allowed
The sample includes the first x records that you specify. For example, if you have 1,000,000 records and you specify a sample size of 2,000, the sample includes the first 2,000 records.
Use every Nth value up to maximum number of records allowed
The sample selects a record at every nth interval that you specify until the sample size is reached. For example, if you have 1,000,000 records and specify a sample size of 2,000 with an interval of 10, at most 20,000 records are read (2,000 * 10), with every 10th record selected, to retrieve the sample size of 2,000.
Use a random sample
The sample randomly selects records up to your sample size. The formula that is used to determine the maximum number of records that are read is (100/sample_percent) * sample_size * 2. The factor 2 ensures that enough records are read to produce a valid random sample. For example, if you have 1,000,000 records and you specify a sample size of 2,000 and a percentage of 5, the sample returns 2,000 records and reads at most 80,000 records ((100/5) * 2,000 * 2 = 80,000).

In the Seed field, specify a number that is used to initialize a random number generator. The output of the random number generator is used to select the records for the sample. Two samplings that use the same seed value contain the same records.

In the Percentage field, specify the sampling percentage that you want to use for each output data asset. Specify the percentage as an integer value in the range of 0, corresponding to 0.0%, to 100, corresponding to 100.0%.
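The three sampling methods can be sketched as follows. This is an illustrative interpretation of the descriptions above, not the engine's implementation; in particular, the random method only guarantees that at most (100/percent) * size * 2 records are read:

    import random

    def sequential_sample(records, size):
        # First x rows up to the maximum number of records.
        return records[:size]

    def nth_sample(records, size, interval):
        # Every nth record; at most size * interval records are read.
        return records[:size * interval][::interval][:size]

    def random_sample(records, size, percent, seed):
        # percent must be greater than 0; the same seed yields the same sample.
        max_read = int((100 / percent) * size * 2)
        rng = random.Random(seed)
        picked = [r for r in records[:max_read] if rng.random() < percent / 100]
        return picked[:size]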

As an administrator, you can also modify the default sampling options from the command line (within the iis-services pod):

" /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh -set -k com.ibm.iis.ia.default.sample.options -value "{"useSample":true,"size":2000,"sampleType":"SEQUENTIAL"}"

Engine settings

These settings are configured globally and can be overwritten at project level. By default, they are inherited from the global settings that apply to all available data quality projects. If you want to use a different connection for a particular data quality project, clear the Inherit global settings checkbox and specify new connection information.

Specify connection information for an analysis engine:

Host
The host name of the DataStage™ analysis engine.
Port
Specify the port number to connect to the analysis engine.
DataStage project
Specify the internal project in which analysis jobs are run. This project is only used dynamically at runtime. The default project is ANALYZERPROJECT.
Array size
Specify the number of data rows to be grouped as a single operation, which affects performance.

Important: Increasing the array size increases the number of elements that are included in each INSERT statement, but it also increases memory consumption, which can lead to performance degradation.
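The effect of the array size can be pictured with a generic batched INSERT. This sketch uses the Python DB-API; the table, columns, and placeholder style are inventions for this example, not product objects:

    def insert_in_batches(cursor, rows, array_size=2000):
        # Each executemany call sends one group of array_size rows: larger
        # batches mean fewer round trips but more memory per statement.
        sql = "INSERT INTO results (col_a, col_b) VALUES (?, ?)"
        for start in range(0, len(rows), array_size):
            cursor.executemany(sql, rows[start:start + array_size])
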
Use static engine credentials
Select this option if you want to specify the credentials to use to access the analysis engine.

Database settings

These settings are configured globally and can be overwritten at project level.

The analysis database stores analysis results and specifies some details about system-generated tables and columns. By default, the database settings are inherited from the global settings that apply to all available data quality projects. If you want to use a different connection for a particular data quality project, clear the Inherit global settings checkbox and specify new connection information.

Specify connection information for the analysis database:

Host
The host name of the computer that is used to store analysis results.
Data Connection
The name of the data connection that is used to connect to the host.
JDBC Data Source
The name of the data source. The default name that is used if the analysis database is a Db2 database is jdbc/IADB.

Additional settings:

Automatically register data rule output tables as data assets
Select this option if you want the output tables from the data rules to automatically be registered as data assets in the data quality project.
Maximum length for system generated columns
Specifies the maximum number of characters in the names of output columns that are generated by data rules. The default is 30, which is the maximum length permitted by Oracle databases. If you are using another type of database for the analysis database, you can edit this value to specify the maximum length supported by the database. For example, for Db2 and SQL Server databases, the maximum is 128.

This is a global setting and cannot be changed at project level.

Users and groups settings

These settings are project specific. Add users and configure project roles for them. By default, any user that you add to the data quality project has the Information Analyzer Business Analyst project role assigned.

Note that the platform-level permissions that users have also affect what they can do within a data quality project. For example, the default platform-level Data Steward role allows users in a data quality project to add or delete assets regardless of their project role. All data quality project users have access to rules and rule definitions. However, whether they can only view them or also manage them within the data quality project depends on their platform-level permissions.

The Information Analyzer Data Operator and Information Analyzer Drill down project roles can be used only in combination with the Information Analyzer Business Analyst project role.

In the following table, the project role names are shortened for better readability.

Action | Business Analyst | Data Operator | Data Steward | Drill Down User
Open analysis results | X | | X |
Publish analysis results | X | | X |
Delete analysis results | X | | |
Mark data assets as reviewed | X | | X |
Manage virtual columns and SQL virtual tables | X | | |
Run analysis (in the context of a project or a discovery job) | | X | |
Delete quick scan jobs for a given project | X | | |
Drill down into source data (if the Enable drill-down security option is enabled in the general project settings) | | | | X

The list that is displayed when you add users contains a maximum of 1,000 entries, even if the database has more entries. Therefore, the user that you are looking for might not be displayed. When you search for a user by name, all entries in the database are searched.

User groups are currently not supported.

Advanced options settings

These settings are project specific.

Retain the (DataStage) analysis jobs and the job logs
Use this option only for debugging and troubleshooting. Analysis jobs and job logs are normally deleted after a job completes. By selecting this option, you retain the jobs and their logs.
Automatically delete output tables for data rules and rule sets
Select this option to automatically delete output tables, which are generated when you run data rules. Choose one of the following options:
  • Delete tables older than the specified time frame.
  • Delete tables when their number exceeds the specified value. In that case, the oldest tables are deleted first.