What is data curation?

15 April 2025

Authors

Alice Gomstyn

IBM Content Contributor

Alexandra Jonker

Editorial Content Lead

What is data curation?

Data curation is the process of creating and managing datasets so that people can find, access, use and reuse data as necessary. It involves adding data assets (valuable collections of data) to a central repository in order to consolidate asset metadata, enrich them with additional information, and analyze and improve the quality of the data over its lifecycle.
 

Today, organizations generate ever-growing mountains of data, with more than 400 million terabytes worth per day. Much of this data can prove tremendously valuable, but only if businesses can understand and leverage it successfully.

As part of effective data management, data curation helps businesses derive important insights from enterprise data and use these insights for decision-making. Well-curated data is also considered critical to improving the performance of artificial intelligence (AI) initiatives and helping ensure regulatory compliance with data management and data privacy requirements.

Outside of the enterprise, data curation is a key process in research and academic settings. For instance, the curation of research data can improve data sharing and archiving among developers, scientists, healthcare professionals and other researchers.

The data curation process can be manual, or it can be performed with the help of automation, with software designed to execute curation activities at scale.

3D design of balls rolling on a track

The latest AI News + Insights 


Discover expertly curated insights and news on AI, cloud and more in the weekly Think Newsletter. 

Why is data curation important?

At its core, data curation empowers businesses to use their data to find value. But it also helps them manage exponential data growth, support effective and responsible AI initiatives, maintain regulatory compliance and ensure data usability.

Growing data volumes

The exponential growth of data volumes has given organizations more business-relevant data than ever, with some amassing datasets containing terabytes or petabytes of information from a variety of data sources. On a macro level, an estimated 149 zettabytes of data was generated globally in 2024 and that figure is expected to more than double by 2028.

Performing quality assurance and data discovery on such unprecedentedly large and complex datasets known as “big data”, is no simple feat. However, it is a critical one, as enterprise data is increasingly proving to be a source of valuable insights. Annotating and organizing data for data-driven decision-making can deliver a competitive edge and elevate performance for businesses in industries across the board.

Effective artificial intelligence

Addressing data quality and usability challenges has become especially urgent as organizations embrace AI-powered capabilities as a strategic imperative. AI systems have the potential to transform business and elevate productivity, but their data needs are substantial: They require high-quality data to perform effectively. 

Low-quality data can result in poor model performance, a “garbage in, garbage out” scenario. Datasets with data quality issues such as missing values, outliers or inconsistencies can distort analysis and lead to incorrect outputs.

Regulatory compliance

Data curation also helps ensure regulatory compliance, particularly in the context of AI. Many industries, especially those that handle sensitive information such as healthcare or financial services, must navigate an evolving landscape of regulations dictating how they collect, process, store and secure data. 

Effective data curation practices help ensure data is collected, stored, processed and labeled in accordance with these rules. The EU AI Act, for instance, requires that high-risk AI systems adopt rigorous data governance practices to ensure that training, validation and testing data meet specific quality criteria. For example, effective governance around the data collection process is essential.

Data reusability

Data curation is also key to helping ensure the reusability of high-quality datasets. For example, through data curation, organizations can create and maintain a centralized glossary tailored specifically to the business. Through this single source of truth, users across the organization can better understand and use data. When data is accessible and universally usable, it’s more likely that users will repeatedly turn to it for insights.

Mixture of Experts | 11 April, episode 50

Decoding AI: Weekly News Roundup

Join our world-class panel of engineers, researchers, product leaders and more as they cut through the AI noise to bring you the latest in AI news and insights.

What are key steps for data curation?

While data curation practices may vary by organization, researchers have identified curation activities common among data curators, data engineers, data scientists, data stewards and other data management professionals over big data lifecycles.1 They include:

  • Planning
  • Description
  • Preparation
  • Assurance
  • Storage and preservation
  • Discovery and access

Planning

Setting strategies and criteria for data collection, production and ingestion. Data ingestion includes data acquisition from various sources, including structured databases and application programming interfaces (APIs), as well as databases for unstructured data. The planning step of data curation may also consider data governance, which helps ensure data integrity and data security.

Description

Creating, collecting, preserving and maintaining metadata, which is information that describes a data point or dataset, such as author, creation date or file size. Successful metadata management can help make data more findable, enable data lineage tracing and improve system interoperability.

Preparation

Engaging in data preparation methods. For example, data cleaning is the process of identifying and correcting errors and inconsistencies in raw data sets. Data transformation is the conversion of clean, raw data into a usable format for analysis. And the anonymization of sensitive data helps ensure data privacy and regulatory compliance.

Assurance

Assessing and achieving validation of data quality, tracing data provenance and helping ensure the protection of sensitive data. Data quality can be categorized through metrics such as accuracy, completeness and consistency. Meanwhile, tracking data provenance can help confirm the trustworthiness of data and assure that necessary usage permissions from data providers have been obtained.

Storage and preservation

Transferring data from data processing units to data repositories and data storage systems, such as data lakes and data warehouses. Considerations for data preservation may include storing different varieties of data and ensuring data security.

Discovery and access

Making data searchable and accessible by developing taxonomies, standardizing metadata and establishing data retrieval methods.

Data curation software solutions

Manual processes can make data curation a slow, tedious and inefficient endeavor. However, the right data governance and data management solutions can help businesses automate data curation workflows and optimize data pipelines.

Leading solutions might include features such as:

Governed data catalogs

A data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the data they need. Governed data catalogs use data classification and masking functions to enable secure data handling.

Curated glossaries

Glossaries of industry-specific business vocabularies can improve data classification, regulatory compliance and other governance activities.

AI-powered metadata enrichment

Large language models (LLMs) can be deployed for metadata enrichment, adding more context, labels or descriptions to large volumes of data assets at once.

Intelligent search

Intelligent search can improve data accessibility and eliminate silos. Powered by AI, it allows users to extract information from anywhere (inside or outside the company) regardless of format, helping them find the data they need quickly and easily.

Use cases for data curation

Data curation plays an important role in various fields and disciplines. Use cases include:

Advancing medical research

Curated data can help propel advancements and breakthroughs in treating disease. For instance, a US-based healthcare clinic recently announced a partnership with an AI health data platform to curate datasets focused on multiple sclerosis (MS), a chronic neurological disease.

The objective of the project, which will include collected data from over 3,000 patients, is to develop data-driven insights on disease subtypes, disease progression and more.2

Keeping AI in insurance compliant

Data curation can help ensure that organizations adopting AI are doing so in line with applicable regulations and requirements.

For instance, the insurance industry has widely adopted AI and machine learning technologies to modernize. But the regulatory landscape surrounding AI adoption in the industry is complex and dynamic. Relevant laws such as the Solvency II Directive include strict policies for insurers regarding “the sufficiency and quality of relevant data for underwriting and reserving processes.” These regulations also require that data used to test and train AI systems is complete, accurate and appropriate.3

Personalizing consumer marketing

Digital and brick-and-mortar retailers often curate their shopper data by engaging in segmentation processes, organizing customers into groups based on their characteristics, behaviors and preferences. This allows retailers to be more effective in targeting different groups of customers with promotions, product recommendations and other personalized marketing efforts.

For example, an analysis of retail email marketing campaigns determined that segmented emails were read 15% more often than those that were not segmented.4

Related solutions
IBM Knowledge Catalog

Activate data for AI and analytics with intelligent cataloging and policy management. IBM Knowledge Catalog is data governance software that provides a data catalog to automate data discovery, data quality management and data protection.

Discover Knowledge Catalog
IBM Data Intelligence solutions

Transform raw data into actionable insights swiftly, unify data governance, quality, lineage and sharing, and empower data consumers with reliable and contextualized data.

Explore data intelligence solutions
Data and analytics consulting services

Unlock the value of enterprise data with IBM Consulting, building an insight-driven organization that delivers business advantage.

Discover analytics services
Take the next step

Find, understand, curate and access data, knowledge assets and their relationships—wherever they reside—on cloud or on-premises. IBM Knowledge Catalog is data governance software that provides a data catalog to automate data discovery, data quality management and data protection.

Explore IBM Knowledge Catalog Explore data intelligence solutions
Footnotes

Big data curation framework: Curation actions and challenges.” Journal of Information Science. 11 November 2022.

2 “Exclusive: Century Heath, Nira Medical partner to provide AI-curated EHR data.” MobiHealthNews. 14 January 2025.

Consultation Paper: On Opinion on Artificial Intelligence Governance and Risk Management.” European Insurance and Occupational Pensions Authority (EIOPA). 10 February 2025.

4Sophisticated email segmentation boosts open rates, engagement: report.” Retail Dive. Accessed 28 March 2025.