Curating structured data

Curating structured data transforms raw data into trusted, well-documented assets that your organization can confidently use for analytics, reporting, and decision-making.

Required services
IBM watsonx.data intelligence
Some capabilities require additional services:
  • IBM watsonx.data integration, to use the DataStage tool for advanced analysis during metadata enrichment (advanced profiling and in-depth key and relationship analyses)
  • watsonx.ai for all capabilities that use generative AI

Before you start curating data, set up watsonx.data intelligence so that you have a governance framework and at least one catalog for sharing the curated assets (see Planning to implement data governance).

Curation can be a mostly manual process where you curate data assets one at a time. Advanced curation is a more automated process where many of the curation tasks are completed automatically for multiple data assets simultaneously.

Advanced curation has these benefits:

  • Scale: Process thousands of data assets in a single operation.
  • Consistency: Apply the same standards and rules across all assets.
  • Speed: Complete in hours what would take weeks manually.
  • Accuracy: Machine-learning algorithms and LLM-based term generation and assignment improve term and classification assignments over time.
  • Maintenance: Scheduled jobs keep metadata current automatically.

Your watsonx.data intelligence tools are in collaborative workspaces called projects. Create a project, or join one, and use the tools to curate structured data. After assets are properly enriched and validated, you publish them to catalogs for broader consumption.

Data formats

You can work with structured data in the following formats:

  • Tables from relational and nonrelational data sources, and Amazon S3 Delta Lake tables
  • Data from file-based connections to the data sources
  • Tabular data
  • Tool-specific formats from connections to external tools

Not all tools support the same data formats. For details, see the tool-specific information.

Capabilities

You curate structured data in these ways:

Discover and import
Connect to your data sources and import metadata to create data assets. The metadata import tool discovers tables, files, and their structures without moving the actual data. See Importing metadata.
Enrich your data
Examine the data to understand its content, quality, and characteristics. With the metadata enrichment tool, identify data types, detect patterns, and generate statistics. Add business meaning by assigning terms, classifications, and descriptions to connect technical data structures to business concepts. Identify connections between data assets, such as primary and foreign keys, to show how data relates across your organization. See Enriching data assets.
Assess quality
Run quality checks to identify issues like missing values, duplicates, or data that doesn't match expected patterns. Quality scores help users understand data reliability. The metadata enrichment tool can automatically detect and run relevant checks, but you can also manually create and run data quality checks with the data quality tools. See Managing data quality.
Refine data
Cleanse and shape data with the Data Refinery tool. Cleanse data to fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated. Shape data to customize it by filtering, sorting, combining, or removing columns. See Refining data.
Visualize data
Create visualizations to explore data from different perspectives, so that you can identify patterns, connections, and relationships within that data and quickly understand large amounts of information. See Visualizing your data.
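
To make the profiling and quality-check ideas above more concrete, the following Python sketch shows the kind of statistics and checks involved: inferring a data type, counting missing values, flagging candidate keys, counting duplicate rows, and checking values against an expected pattern. It is an illustration only and does not use or represent the watsonx.data intelligence tools or APIs. The sample rows, the function names, the email pattern, and the scoring formula are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
ROWS = [
    {"customer_id": "C001", "email": "ana@example.com", "age": "34"},
    {"customer_id": "C002", "email": "not-an-email", "age": "41"},
    {"customer_id": "C003", "email": None, "age": "29"},
    {"customer_id": "C003", "email": None, "age": "29"},
]

# Deliberately simple pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def infer_type(values):
    """Return 'integer' if every non-missing value parses as an integer, else 'string'."""
    present = [v for v in values if v is not None]
    if present and all(re.fullmatch(r"-?\d+", v) for v in present):
        return "integer"
    return "string"


def profile_column(rows, column):
    """Collect basic statistics for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "type": infer_type(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        # A column is a candidate key if it has no gaps and no repeated values.
        "candidate_key": len(present) == len(values)
        and len(set(present)) == len(values),
    }


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def pattern_violations(rows, column, pattern):
    """Count non-missing values that do not match the expected pattern."""
    return sum(
        1
        for row in rows
        if row[column] is not None and not pattern.match(row[column])
    )


def quality_score(rows, column, pattern=None):
    """Share of values that are present and, if a pattern is given, match it."""
    bad = sum(1 for row in rows if row[column] is None)
    if pattern is not None:
        bad += pattern_violations(rows, column, pattern)
    return 1 - bad / len(rows)


if __name__ == "__main__":
    for name in ("customer_id", "email", "age"):
        print(name, profile_column(ROWS, name))
    print("duplicate rows:", count_duplicate_rows(ROWS))
    print("email quality:", quality_score(ROWS, "email", EMAIL_PATTERN))
```

In this sample, the repeated C003 row means that customer_id is not reported as a candidate key, and the email column scores low because two values are missing and one does not match the pattern. The score here is a simple ratio chosen for readability; the product's own quality scores might be calculated differently.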

Ways to work

You work with these tools in the UI:

  • Metadata import
  • Metadata enrichment
  • Data quality definitions
  • Data quality rules
  • Data Refinery

Create visualizations on the Visualization tab of the project that is your workspace for curating data.

You can perform most curation tasks with APIs instead of the user interface. Links to the IBM Knowledge Catalog API are listed for each applicable task.

Workflow

To begin curating structured data:

  1. Create or join a project for your curation work.
  2. Add connections to your data sources.
  3. Import metadata to create data assets.
  4. Run metadata enrichment to add context and quality information.
  5. Review and refine the results.
  6. Publish curated assets to a catalog.

Each step builds on the previous one, gradually transforming raw data into valuable, trusted assets your organization can use with confidence.
