Curating structured data
Curating structured data transforms raw data into trusted, well-documented assets that your organization can confidently use for analytics, reporting, and decision-making.
Required services
- IBM watsonx.data intelligence
Some capabilities require additional services:
- IBM watsonx.data integration, to use the DataStage tool for advanced analysis during metadata enrichment (advanced profiling and in-depth key and relationship analyses)
- watsonx.ai, for all capabilities that use generative AI
Before you can start curating data, you must set up watsonx.data intelligence so that you have a governance framework and at least one catalog for sharing the curated assets. See Planning to implement data governance.
Curation can be a mostly manual process where you curate data assets one at a time. Advanced curation is a more automated process where many of the curation tasks are completed automatically for multiple data assets simultaneously.
Advanced curation has these benefits:
- Scale: Process thousands of data assets in a single operation.
- Consistency: Apply the same standards and rules across all assets.
- Speed: Complete in hours what would take weeks manually.
- Accuracy: Machine-learning algorithms and LLM-based term generation improve term and classification assignments over time.
- Maintenance: Scheduled jobs keep metadata current automatically.
Your watsonx.data intelligence tools are in collaborative workspaces called projects. Create a project, or join one, and use the tools to curate structured data. After assets are properly enriched and validated, you publish them to catalogs for broader consumption.
Data formats
You can work with structured data in the following formats:
- Tables from relational and nonrelational data sources, and Amazon S3 Delta Lake tables
- Data from file-based connections to the data sources
- Tabular data
- Tool-specific formats from connections to external tools
Not all tools support the same data formats. For details, see the tool-specific information:
- Metadata import, metadata enrichment, and data quality: Supported connectors for discovery, enrichment, and data quality of structured data
- Lineage import: Supported connectors for lineage import
- Data Refinery: Refining data
- Visualizations: Visualizing your data
Capabilities
You curate structured data in these ways:
- Discover and import: Connect to your data sources and import metadata to create data assets. The metadata import tool discovers tables, files, and their structures without moving the actual data. See Importing metadata.
- Enrich your data: Examine the data to understand its content, quality, and characteristics. With the metadata enrichment tool, identify data types, detect patterns, and generate statistics. Add business meaning by assigning terms, classifications, and descriptions to connect technical data structures to business concepts. Identify connections between data assets, such as primary and foreign keys, to show how data relates across your organization. See Enriching data assets.
- Assess quality: Run quality checks to identify issues like missing values, duplicates, or data that doesn't match expected patterns. Quality scores help users understand data reliability. The metadata enrichment tool can automatically detect and run relevant checks, but you can also manually create and run data quality checks with the data quality tools. See Managing data quality.
- Refine data: Cleanse and shape data with the Data Refinery tool. Cleanse data to fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated. Shape data to customize it by filtering, sorting, combining, or removing columns. See Refining data.
- Visualize data: Create visualizations to explore data from different perspectives, so that you can identify patterns, connections, and relationships within that data and quickly understand large amounts of information. See Visualizing your data.
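To make the quality checks named above more concrete, the following minimal Python sketch counts missing values, finds duplicate values, and flags values that don't match an expected pattern. It is an illustration only: it does not call any watsonx.data intelligence API and is not how the product computes quality scores. The sample rows, the column names, the email pattern, and all helper functions are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"customer_id": "C001", "email": "ana@example.com"},
    {"customer_id": "C002", "email": None},
    {"customer_id": "C002", "email": "bo@example.com"},
    {"customer_id": "C004", "email": "not-an-email"},
]

# Assumed pattern for a plausible email address (deliberately simple).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def missing_count(rows, column):
    """Count rows where the column has no value."""
    return sum(1 for r in rows if is_missing(r.get(column)))


def duplicate_values(rows, column):
    """Return the non-missing values that occur more than once."""
    counts = Counter(r[column] for r in rows if not is_missing(r.get(column)))
    return sorted(v for v, n in counts.items() if n > 1)


def pattern_violations(rows, column, pattern):
    """Count non-missing values that do not match the expected pattern."""
    return sum(
        1
        for r in rows
        if not is_missing(r.get(column)) and not pattern.match(r[column])
    )


def pass_rate(failed, total):
    """Share of rows that pass a check, used here as a simple score."""
    return 1.0 if total == 0 else (total - failed) / total


report = {
    "email_missing": missing_count(rows, "email"),
    "customer_id_duplicates": duplicate_values(rows, "customer_id"),
    "email_pattern_violations": pattern_violations(rows, "email", EMAIL_PATTERN),
}
report["email_completeness"] = pass_rate(report["email_missing"], len(rows))
print(report)
```

Each check returns a count or a list, which keeps the results easy to turn into a per-column score such as the completeness ratio shown here.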
Ways to work
You work with these tools in the UI:
- Metadata import
- Metadata enrichment
- Data quality definitions
- Data quality rules
- Data Refinery
Create visualizations on the Visualization tab of the project that is your workspace for curating data.
You can perform most curation tasks with APIs instead of the user interface. Links to the IBM Knowledge Catalog API are listed for each applicable task.
Workflow
To begin curating structured data:
1. Create or join a project for your curation work.
2. Add connections to your data sources.
3. Import metadata to create data assets.
4. Run metadata enrichment to add context and quality information.
5. Review and refine the results.
6. Publish curated assets to a catalog.
Each step builds on the previous one, gradually transforming raw data into valuable, trusted assets your organization can use with confidence.
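To give a feel for what the enrichment step adds, the following Python sketch profiles a few columns: it infers a data type, counts distinct and missing values, and marks columns whose values are all present and unique as key candidates. This is a simplified illustration under stated assumptions, not the metadata enrichment tool's algorithm. The column names, the sample values, the date format, and the `infer_type` and `profile` helpers are hypothetical.

```python
from datetime import datetime

# Hypothetical column values, all read as text as from a file-based source.
columns = {
    "order_id": ["1001", "1002", "1003", "1004"],
    "amount": ["19.99", "5", "", "42.50"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
}


def parse_date(value):
    """Parse an ISO-style date; the format is an assumption for this sketch."""
    return datetime.strptime(value, "%Y-%m-%d")


def infer_type(values):
    """Return the first type whose parser accepts every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    parsers = (("integer", int), ("decimal", float), ("date", parse_date))
    for name, parse in parsers:
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    return "string"


def profile(values):
    """Collect basic statistics for one column."""
    non_empty = [v for v in values if v != ""]
    return {
        "inferred_type": infer_type(values),
        "distinct": len(set(non_empty)),
        "missing": len(values) - len(non_empty),
        # All values present and unique: a candidate for a primary key.
        "key_candidate": bool(values)
        and len(non_empty) == len(values)
        and len(set(values)) == len(values),
    }


profiles = {name: profile(vals) for name, vals in columns.items()}
for name, result in profiles.items():
    print(name, result)
```

In this sample, only `order_id` qualifies as a key candidate, because `amount` has a missing value and `order_date` contains a repeated value.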