Important:

IBM Cloud Pak® for Data Version 4.6 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.6 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

Profiles of data assets

An asset profile includes generated metadata and statistics about the asset content, and helps you understand what actions to take to improve the data quality. You can see the profile on an asset's Profile page.

Profiles can be created for data assets that contain relational or structured data.

Requirements and restrictions

You can view the profile of assets under the following circumstances.

Required service
Watson Knowledge Catalog service.
Required permissions
To view this page, you can have any role in a project or catalog.
To create or update a profile or to run metadata enrichment in a project, you must have the Admin or Editor role in the project.
To create or update a profile in a catalog, you must have the Admin role in the catalog, or you must have the Editor role and must be an asset owner or an asset member.
Workspaces
You can view the asset profile in these workspaces:
  • Projects
  • Catalogs
Types of assets
These types of assets have a profile:
  • Data assets from relational or nonrelational databases from a connection to the data sources

  • Data assets from partitioned data sets, where a partitioned data set consists of multiple files and is represented by a single folder uploaded from the local file system or from file-based connections to the data sources

  • Data assets from files uploaded from the local file system or from file-based connections to the data sources, with these formats:

    • CSV
    • XLS, XLSM, XLSX (Only the first sheet in a workbook is profiled.)
    • TSV
    • Avro
    • Parquet

    However, structured data files are not profiled when data assets do not explicitly reference them, such as in these circumstances:

    • The files are within a connected folder asset. Files that are accessible from a connected folder asset are not treated as assets and are not profiled.
    • The files are within an archive file. The archive file is referenced by the data asset and the compressed files are not profiled.

Ways to create a profile

Asset profiles can be created in different ways:

  • In governed catalogs, profiles for individual data assets are created automatically when the data assets are added to the catalog, with these exceptions:

    • You disabled automatic profiling for the catalog.
    • The asset comes from a connection that is configured to use personal credentials.
    • The asset was profiled through metadata enrichment before it was published. Such assets already have a profile that's added to the catalog along with the asset.
  • In projects and in catalogs without data protection rule enforcement, you can manually create profiles for individual data assets. You can also create a profile manually in a governed catalog if the asset wasn't profiled before.

  • In projects, you can create and run a metadata enrichment asset to profile large sets of data assets in one go. These asset profiles are available in the project. You can publish the enriched assets with their profiles to any type of catalog. See Managing metadata enrichment.
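
The conditions for automatic profiling in governed catalogs, described in the first item above, can be summarized as a single predicate. The following Python sketch is illustrative only: the function name and its parameters are hypothetical and do not correspond to a product API.

```python
def is_profiled_automatically(
    catalog_is_governed: bool,
    auto_profiling_disabled: bool,
    uses_personal_credentials: bool,
    already_profiled_by_enrichment: bool,
) -> bool:
    """Return True when adding the asset to the catalog triggers profiling.

    Hypothetical helper that encodes the documented exceptions.
    """
    if not catalog_is_governed:
        # Outside governed catalogs, profiles are created manually
        # or through metadata enrichment in a project.
        return False
    return not (
        auto_profiling_disabled
        or uses_personal_credentials
        or already_profiled_by_enrichment
    )
```

An asset for which this sketch returns False in a governed catalog can still be profiled manually if it has no profile yet.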

You can update an individual asset profile from the asset's Profile page in a project or a catalog. If you manually update a profile of a data asset that is included in a metadata enrichment, the profile and analysis information is also reflected in the respective enrichment results. Profiles are also updated when new enrichment results are published.

When you update an existing profile, you can change the data classes to include in the profile.

What is analyzed during profiling?

If you create or update an asset profile from the Profile page in a project or a catalog, columns and data quality are analyzed.

When a single asset is profiled in a project or a catalog, the profile is by default created based on the first 5,000 rows of data. If the data asset has more than 250 columns, the profile is created based on the first 1,000 rows of data. If the profile is created through metadata enrichment, sampling is determined by the metadata enrichment settings.
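
The default sampling rule for a single asset can be expressed as a small function. This is a minimal Python sketch: the function name and constants are hypothetical and only restate the rule described above.

```python
# Hypothetical constants that mirror the documented defaults.
DEFAULT_SAMPLE_ROWS = 5_000        # default for a single asset
WIDE_ASSET_SAMPLE_ROWS = 1_000     # used for assets with more than 250 columns
WIDE_ASSET_COLUMN_THRESHOLD = 250


def default_profile_sample_rows(column_count: int) -> int:
    """Return how many leading rows are profiled for a single asset."""
    if column_count < 0:
        raise ValueError("column_count must be non-negative")
    if column_count > WIDE_ASSET_COLUMN_THRESHOLD:
        return WIDE_ASSET_SAMPLE_ROWS
    return DEFAULT_SAMPLE_ROWS
```

In this sketch, an asset with exactly 250 columns is still profiled on 5,000 rows, and an asset with 251 columns on 1,000 rows. Profiles created through metadata enrichment follow the enrichment sampling settings instead.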

To profile and classify data, and to find inconsistencies and anomalies, analysis includes the following tasks:

  • Compute statistics about the data of each analyzed column.
  • Compute data types for columns and data types distribution.
  • Compute data formats for columns and formats distribution.
  • Classify the data and compute data class candidates for columns.
  • Capture frequency distributions.
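
To make the tasks in the list above more concrete, the following Python sketch computes a data type distribution, a format distribution, and a frequency distribution for one column of raw string values. The inference rules and the function names (`infer_type`, `infer_format`, `analyze_column`) are assumptions chosen for illustration; the actual profiling logic of the service is not described here and is likely more elaborate.

```python
from collections import Counter


def infer_type(value: str) -> str:
    """Classify one raw value as 'int', 'float', or 'string' (illustrative rules)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_format(value: str) -> str:
    """Map digits to '9' and letters to 'A'; keep other characters unchanged."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def analyze_column(values):
    """Return type, format, and frequency distributions for the non-null values."""
    present = [v for v in values if v is not None]
    return {
        "types": Counter(infer_type(v) for v in present),
        "formats": Counter(infer_format(v) for v in present),
        "frequencies": Counter(present),
    }
```

For example, `analyze_column(["12", "7", "3.5", "abc", None, "12"])` counts three integer values, one float value, and one string value, and reports the format `99` twice.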

To identify the structure, content, and overall quality of your data, analysis includes the following tasks:

If you run metadata enrichment on data assets, the enrichment option Profile data does not include data quality analysis. See the information about metadata enrichment objectives.

Profile information

The profile of a data asset shows information about each column in the data asset.

The Profile tab provides some general information and an overview of the analysis results:

  • When the profile was created or last updated.

  • How many columns and rows were analyzed.

  • The overall quality score for the data asset and a separate quality score for each column. Data quality scores for individual columns in the data asset are computed based on quality dimensions. The overall quality score for the entire data asset is the average of the scores for all columns. A dash (—) is shown in profiles generated through metadata enrichment without data quality analysis.

    To prevent records with multiple quality issues from unnecessarily weighing down the data quality score, a value that is identified with more than one issue counts the same against the quality score as a value with only one issue.

  • The inferred data class for each column and the confidence for that data class. Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data or to restrict access to data assets with data protection rules. The data classes appear for each column on the asset's Overview page and on the Profile page.

    The confidence of a data class is the percentage of non-null values that match the data class.

    Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.

  • The percentage of matching, mismatching, or missing data for each column.

  • The frequency distribution for all values identified in a column.

  • Statistics about the data for each column such as the number of distinct values, the percentage of unique values, minimum, maximum, or mean, and sometimes the standard deviation in that column. The number of distinct values indicates how many different values exist in the sampled data for the column. The percentage of unique values indicates the percentage of distinct values that appear only once in the column.

    Depending on a column’s data format, the statistics vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values and a standard deviation value, while statistics for a column of data type string have minimum length, maximum length, and mean length values.
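
Some of the figures in the list above can be sketched in a few lines of Python. The data class confidence and the overall quality score follow the definitions given above. The per-column score is an assumption made for illustration (the percentage of values without any issue), because the exact computation from quality dimensions is not described here. All function names are hypothetical.

```python
from statistics import mean


def data_class_confidence(values, matches_class) -> float:
    """Percentage of non-null values that match the data class."""
    present = [v for v in values if v is not None]
    if not present:
        # Assumption: report 0.0 when there are no non-null values.
        return 0.0
    matched = sum(1 for v in present if matches_class(v))
    return 100.0 * matched / len(present)


def column_quality_score(issues_per_value) -> float:
    """Assumed scoring: percentage of values that have no issue.

    Each item is the set of issues found for one value. A value with several
    issues counts once, the same as a value with a single issue.
    """
    if not issues_per_value:
        return 100.0
    flagged = sum(1 for issues in issues_per_value if issues)
    return 100.0 * (1 - flagged / len(issues_per_value))


def overall_quality_score(column_scores) -> float:
    """Average of the per-column quality scores."""
    return mean(column_scores)
```

For example, a column in which two of four values have issues scores 50 in this sketch, regardless of how many issues each flagged value has. Averaged with a second column that scores 100, the overall score is 75.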

More detailed information about column data is available when you click the column name. See Detailed profiling results.
