Data sampling

Data sampling lets the IBM® Cognos® Analytics artificial intelligence (AI) learn about the data in the underlying data server. The AI uses the sample data to improve its automation and its visualization suggestions.

Without the sample data, some Cognos Analytics features don’t work. For example, the relationships diagram in Explore is displayed only if the sample data is available.

When you load the schema metadata or enrich a package, a sample of statistical data can be retrieved from the underlying data server. To enable this functionality, select the Retrieve sample data checkbox in the related dialog box.

By default, a statistical sample of 1000 rows per table (or per query subject, for packages) is retrieved from the underlying data server. The Cognos Analytics AI uses this sample data to infer characteristics, or "advanced metadata", that guide its automation choices and visualization suggestions.
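
As an illustration of what retrieving a fixed-size sample per table can look like, the following minimal sketch uses reservoir sampling to draw up to 1000 rows from a stream of rows. This is not the documented Cognos Analytics sampling method, which this topic does not describe. The function name `sample_rows` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_rows(rows: Iterable[T], k: int = 1000, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random (reservoir sampling).

    Hypothetical helper for illustration only; the default of 1000 mirrors
    the default sample size mentioned in the text.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

A table with fewer rows than the sample size is returned in full, so small tables are effectively read completely.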

An example of a characteristic that can be inferred from the sample data is the approximate number of unique values in each field. This information helps the AI make visualization type recommendations. For example, a bar chart is recommended only if there are not too many unique values to display as bars. A bubble chart is more appropriate for data fields that have hundreds of unique values.
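
The reasoning in the preceding paragraph can be sketched as follows. The code counts the unique values of a field in a sample and picks a chart type from that count. The threshold of 25 categories and the names `count_unique` and `suggest_chart` are assumptions made for illustration. They are not values or functions from Cognos Analytics, and the real recommender considers more than this one characteristic.

```python
from typing import Hashable, Iterable

# Hypothetical cutoff for "not too many bars"; not a documented product value.
MAX_BAR_CATEGORIES = 25


def count_unique(sample_values: Iterable[Hashable]) -> int:
    """Count distinct non-null values observed in the sample."""
    return len({v for v in sample_values if v is not None})


def suggest_chart(sample_values: Iterable[Hashable],
                  max_bar_categories: int = MAX_BAR_CATEGORIES) -> str:
    """Suggest 'bar' for low-cardinality fields and 'bubble' otherwise."""
    unique = count_unique(sample_values)
    return "bar" if unique <= max_bar_categories else "bubble"


# A field with 3 distinct values fits a bar chart.
print(suggest_chart(["East", "West", "North"] * 10))  # bar
# A field with 500 distinct values is better shown as a bubble chart.
print(suggest_chart(range(500)))  # bubble
```

Because the count comes from a sample rather than the full table, it is only an approximation of the true number of unique values.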

The type and amount of data that is retrieved in the data sample can vary, depending on the following factors:

The user's permissions to query specific tables or columns.
Data security that constrains the rows that the user can see.
Data masking that might change the data that the user can see.
Expressions that might be dynamic because of macros.
Data server connections that might be dynamic because of macros or security on the connections.

Disabling data sampling

To disable data sampling for all tables in the data server, clear the Retrieve sample data checkbox in the metadata loading or package enrichment user interface.

If data sampling is disabled, the Cognos Analytics AI doesn’t know as many characteristics about the data. It still knows some characteristics by looking at the data server metadata, but not as many as it would know if it had access to the sample data. Using the example above, without sampling, a bar chart might be recommended even for data fields that have too many unique values to be appropriate for this visualization type. In summary, the visualization recommender works without the characteristics inferred from the sample data, but it doesn’t work as well.

The following features are negatively affected when data sampling is disabled:

Forecasting
Assistant
Relationship diagram
Decision tree visualization
Spiral visualization
Driver analysis visualization
Sunburst visualization
Natural language details
Insights in visualizations
Correlated insights
Recommended visualizations in Explore
Related visualizations
Recommended visualizations in dashboards

Disabling data sampling is justified in the following situations:

Errors occur when the sample data is retrieved.
The negative performance impact on the system is too significant.

Instead of disabling data sampling entirely, you can keep the Retrieve sample data checkbox selected, but exclude some tables from the process. Both the metadata loading and package enrichment user interfaces include options to deselect tables. For example, you could exclude tables that generate errors. You can also reduce the number of sample rows that are retrieved.