Data curation

Data curation is the process of adding data assets to a project or a catalog, enriching them by assigning classifications, data classes, and business terms, and analyzing and improving the quality of the data.

Before you start curating data, set up IBM Knowledge Catalog so that you have a governance framework and at least one catalog for sharing the curated assets. See Planning to implement data governance.

Curation can be a mostly manual process where you curate data assets one at a time. Advanced curation is a more automated process where many of the curation tasks are completed automatically for multiple data assets simultaneously.

Requirements and restrictions

For data curation, the following requirements and restrictions exist.

Data curation tools

You work with these tools:

Required service

Data curation requires IBM Knowledge Catalog, IBM Knowledge Catalog Standard, or IBM Knowledge Catalog Premium. Advanced analysis in the context of metadata enrichment (advanced profiling and in-depth key and relationship analyses) also requires the DataStage service.

Data formats

The following data formats are supported:

  • Tables from relational and nonrelational data sources, Amazon S3 Delta Lake tables
  • Metadata import: Any format from file-based connections to the data sources
  • Metadata enrichment: Tabular: CSV, TSV, Avro, Parquet, Microsoft Excel

For information about supported connectors, see Supported data sources for curation and data quality.
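
As an illustration of the metadata enrichment format list above, the following minimal sketch pre-checks file names before they are added to a project. The extension mapping and the function name `enrichable_format` are assumptions made for this example. The documentation names the formats, not the file extensions, and this helper is not part of IBM Knowledge Catalog.

```python
from pathlib import PurePath
from typing import Optional

# Tabular formats listed for metadata enrichment, mapped to common file
# extensions. The extensions are an assumption for illustration only.
ENRICHABLE_EXTENSIONS = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".avro": "Avro",
    ".parquet": "Parquet",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
}


def enrichable_format(file_name: str) -> Optional[str]:
    """Return the format name if the file extension is in the list, else None."""
    suffix = PurePath(file_name).suffix.lower()
    return ENRICHABLE_EXTENSIONS.get(suffix)
```

A check like this only inspects the file name. Whether a specific file can actually be enriched still depends on the connector and on the file content.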

Data size

Data curation works with data of any size.

Required permissions

Your roles determine which curation tasks you can perform:

  • You must have the CloudPak Data Steward role or a custom role with at least the same set of actions. See Predefined roles.
  • To work with the assets associated with the curation tools, you must also have specific roles in projects and catalogs. For the exact requirements, see the individual tools.

Workspaces

You can perform curation tasks in these workspaces:

  • Projects
  • Catalogs

Depending on the curation tasks you want to perform, you need to work on the data asset in a project, a catalog, or both before the data is ready for use by other users.

A project is a collaborative workspace where you usually prepare and analyze data before you publish it to a catalog to make the data available to other users in your organization. You can also add data to a catalog directly if you can share it without further preparation. Certain types of data can be added to catalogs only.

Curation tasks

These curation tasks let you develop valuable data assets:

  • Add data assets to a project or a catalog:

    • Add assets from a connection to a data source, either manually one at a time or in bulk through metadata import. Your data stays where it is, in the cloud or on premises. Only the asset metadata and the connection information that is required to access the data are added to the project or catalog.
    • Upload individual files to the storage that’s associated with the project or catalog.
    • Manually add assets from a catalog to a project to work with them.
  • Analyze and enrich your data:

    • Profile individual data assets to get basic statistics about the asset content and to assign data classes, within a project or a catalog. See Profiling data assets.

    • Create and run a metadata enrichment in a project. See Enriching data assets.

      • Profile multiple data assets in a single run to automatically assign data classes and identify data types and formats of columns.
      • Run quality analysis on multiple data assets in a single run to scan for common data quality issues like missing values or data class violations.
      • Automatically assign business terms to assets and generate term suggestions based on data classification or machine-learning algorithms.
    • Review the enrichment results. An overall view of the quality scores for the data assets is available in the metadata enrichment asset in the project. You can view the detailed results for each data asset or column by clicking the quality score. Alternatively, you can access the information on an asset's Data quality tab, within a project or a catalog.

    • Rerun the import and the enrichment jobs at intervals to discover and evaluate changes to data assets. You can do this manually or set up schedules for import and enrichment.

  • Assess data quality by running data quality rules.

  • Refine data to improve its quality and usefulness in a project.

  • Publish assets from a project to a catalog.

  • Rate and review data assets within a catalog.

  • Create tags and add them to data assets within a catalog.

  • Add classifications and business terms to individual data assets within a catalog.
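
To make the two quality issues named above (missing values and data class violations) concrete, the following sketch counts them for a single column. It is a simplified illustration, not the scoring method that metadata enrichment uses. The function name `quality_report`, the regular-expression stand-in for a data class, and the sample values are all hypothetical.

```python
import re


def quality_report(values, data_class_pattern):
    """Count missing values and data class violations in one column."""
    pattern = re.compile(data_class_pattern)
    missing = 0
    violations = 0
    for value in values:
        if value is None or str(value).strip() == "":
            missing += 1  # empty or null cell
        elif not pattern.fullmatch(str(value)):
            violations += 1  # value does not match the assigned data class
    total = len(values)
    clean = total - missing - violations
    # Share of values with no issue; an empty column is treated as clean.
    score = clean / total if total else 1.0
    return {"missing": missing, "violations": violations, "score": score}


# Hypothetical column whose data class is a 5-digit postal code.
postal_codes = ["10115", "", "80331", "ABCDE", None, "20095"]
report = quality_report(postal_codes, r"\d{5}")
```

Here `report` records two missing values, one violation, and a score of 0.5, because three of the six values have no issue.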

Curation tasks

Task                                        | Where can you do it manually? | Where can you do it automatically?
--------------------------------------------|-------------------------------|-----------------------------------
Create assets                               | Projects, Catalogs            | Projects, Catalogs
Assign data classes                         | Projects, Catalogs            | Projects, Catalogs
Assign classifications                      | Catalogs                      | -
Assign business terms                       | Projects, Catalogs            | Projects
Analyze data quality (metadata enrichment)  | Projects                      | Projects
Assess data quality (rules)                 | Projects                      | Projects
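
The task table above can also be read as a lookup structure. The following sketch transcribes it into a Python dictionary and answers questions such as "which tasks can run automatically in a catalog?". The names `CURATION_TASKS`, `tasks_available`, and the mode labels `manual` and `automatic` are chosen for this example only.

```python
# Workspaces per curation task, transcribed from the table above.
BOTH = ("Projects", "Catalogs")
CURATION_TASKS = {
    "Create assets": {"manual": BOTH, "automatic": BOTH},
    "Assign data classes": {"manual": BOTH, "automatic": BOTH},
    "Assign classifications": {"manual": ("Catalogs",), "automatic": ()},
    "Assign business terms": {"manual": BOTH, "automatic": ("Projects",)},
    "Analyze data quality (metadata enrichment)": {
        "manual": ("Projects",),
        "automatic": ("Projects",),
    },
    "Assess data quality (rules)": {
        "manual": ("Projects",),
        "automatic": ("Projects",),
    },
}


def tasks_available(workspace, mode):
    """List the tasks that the table allows in a workspace for a given mode."""
    return [
        task
        for task, modes in CURATION_TASKS.items()
        if workspace in modes[mode]
    ]
```

For example, `tasks_available("Catalogs", "automatic")` returns only the two tasks that the table marks as automatable in catalogs: creating assets and assigning data classes.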

Sample flow: advanced curation

A curation flow might have these tasks:

  1. In a project, create and run a metadata import with the goal Discover to do a bulk import of metadata from a connection into the project. You can also configure the metadata import to run on a one-time or a repeating schedule.

  2. In the same project, create and run a metadata enrichment to complete these tasks for the set of imported data assets in a single run:

    • Profile the data assets.
    • Run quality analysis on the data assets.
    • Automatically assign business terms to imported assets and generate term suggestions.

    You can also set up a one-time or a repeating schedule for your metadata enrichment. You can align your enrichment schedule with the schedule configured for the metadata import.

  3. Review the enrichment results for the data assets and their columns.

  4. Publish enriched data assets to the catalog.
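
The order of the four steps above can be sketched as a small pipeline. This is a toy model that only tracks which step each asset has passed. None of the function or class names below correspond to the Watson Data API or to any IBM client library, and the table names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    name: str
    steps_done: list = field(default_factory=list)


def run_metadata_import(table_names):
    """Step 1: bulk-create asset records from a connection's table names."""
    return [DataAsset(name) for name in table_names]


def run_metadata_enrichment(assets):
    """Step 2: profile, analyze quality, and assign terms in a single run."""
    for asset in assets:
        asset.steps_done += ["profiled", "quality analyzed", "terms assigned"]
    return assets


def review(assets):
    """Step 3: keep only the assets whose enrichment completed."""
    return [a for a in assets if "quality analyzed" in a.steps_done]


def publish(assets, catalog):
    """Step 4: add the reviewed assets to the catalog."""
    catalog.extend(a.name for a in assets)
    return catalog


imported = run_metadata_import(["CUSTOMERS", "ORDERS"])
enriched = run_metadata_enrichment(imported)
catalog = publish(review(enriched), [])
```

In the product, the review step involves human judgment about the enrichment results. The filter here only stands in for the position of that step in the flow.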

You can perform most curation tasks with APIs instead of the user interface. Links to the Watson Data API are listed for each applicable task.
