What is data curation?

Authors

Alice Gomstyn

Staff Writer

IBM Think

Alexandra Jonker

Staff Editor

IBM Think

What is data curation?

Data curation is the process of creating and managing datasets so that people can find, access, use and reuse data as necessary. It involves adding data assets (valuable collections of data) to a central repository in order to consolidate the assets' metadata, enrich the assets with additional information, and analyze and improve the quality of the data over its lifecycle.

Today, organizations generate ever-growing mountains of data: more than 400 million terabytes per day. Much of this data can prove tremendously valuable, but only if businesses can understand and use it successfully.

As part of effective data management, data curation helps businesses derive important insights from enterprise data and use these insights for decision-making. Well-curated data is also considered critical to improving the performance of artificial intelligence (AI) initiatives and helping ensure regulatory compliance with data management and data privacy requirements.

Outside of the enterprise, data curation is a key process in research and academic settings. For instance, the curation of research data can improve data sharing and archiving among developers, scientists, healthcare professionals and other researchers.

The data curation process can be manual, or it can be performed with the help of automation, with software designed to execute curation activities at scale.

Why is data curation important?

At its core, data curation empowers businesses to use their data to find value. But it also helps them manage exponential data growth, support effective and responsible AI initiatives, maintain regulatory compliance and ensure data usability.

Growing data volumes

The exponential growth of data volumes has given organizations more business-relevant data than ever, with some amassing datasets containing terabytes or petabytes of information from a variety of data sources. On a macro level, an estimated 149 zettabytes of data was generated globally in 2024 and that figure is expected to more than double by 2028.

Performing quality assurance and data discovery on such unprecedentedly large and complex datasets, known as “big data,” is no simple feat. However, it is a critical one, as enterprise data is increasingly proving to be a source of valuable insights. Annotating and organizing data for data-driven decision-making can deliver a competitive edge and elevate performance for businesses in industries across the board.

Effective artificial intelligence

Addressing data quality and usability challenges has become especially urgent as organizations embrace AI-powered capabilities as a strategic imperative. AI systems have the potential to transform business and elevate productivity, but their data needs are substantial: They require high-quality data to perform effectively.

Low-quality data can result in poor model performance, a “garbage in, garbage out” scenario. Datasets with data quality issues such as missing values, outliers or inconsistencies can distort analysis and lead to incorrect outputs.
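As a rough illustration of how such issues can be surfaced before data reaches a model, the sketch below counts missing values and flags outliers in a single numeric column. The sample `ages` list, the `quality_report` helper and the threshold of 5 median absolute deviations are all assumptions made for this example, not a standard.

```python
from statistics import median

def quality_report(values, threshold=5):
    """Count missing values and flag outliers in a numeric column.

    Outliers are values more than `threshold` median absolute
    deviations (MAD) from the median. The threshold is an assumption.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    med = median(present)
    # MAD is a spread measure that is not itself distorted by outliers
    mad = median(abs(v - med) for v in present)
    outliers = [v for v in present if mad and abs(v - med) / mad > threshold]
    return {"missing": missing, "outliers": outliers}

# Hypothetical column of customer ages with two gaps and one bad entry
ages = [34, 29, None, 41, 38, 250, 33, None]
report = quality_report(ages)
print(report)  # {'missing': 2, 'outliers': [250]}
```

A median-based rule is used here because a single extreme value such as 250 would inflate a mean and standard deviation enough to hide itself.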

Regulatory compliance

Data curation also helps ensure regulatory compliance, particularly in the context of AI. Many industries, especially those that handle sensitive information such as healthcare or financial services, must navigate an evolving landscape of regulations dictating how they collect, process, store and secure data.

Effective data curation practices help ensure data is collected, stored, processed and labeled in accordance with these rules. The EU AI Act, for instance, requires that high-risk AI systems adopt rigorous data governance practices to ensure that training, validation and testing data meet specific quality criteria. Those governance practices extend to the data collection process itself.

Data reusability

Data curation is also key to helping ensure the reusability of high-quality datasets. For example, through data curation, organizations can create and maintain a centralized glossary tailored specifically to the business. Through this single source of truth, users across the organization can better understand and use data. When data is accessible and universally usable, it’s more likely that users will repeatedly turn to it for insights.

What are key steps for data curation?

While data curation practices may vary by organization, researchers have identified curation activities common among data curators, data engineers, data scientists, data stewards and other data management professionals over big data lifecycles.¹ They include:

Planning
Description
Preparation
Assurance
Storage and preservation
Discovery and access

Planning

Setting strategies and criteria for data collection, production and ingestion. Data ingestion includes data acquisition from various sources, including structured databases and application programming interfaces (APIs), as well as databases for unstructured data. The planning step of data curation may also consider data governance, which helps ensure data integrity and data security.

Description

Creating, collecting, preserving and maintaining metadata, which is information that describes a data point or dataset, such as author, creation date or file size. Successful metadata management can help make data more findable, enable data lineage tracing and improve system interoperability.
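A minimal sketch of what a metadata record and a lineage lookup might look like follows. The `DatasetMetadata` fields, the dataset names and the `lineage` helper are hypothetical and chosen only for illustration; real catalogs typically follow a formal metadata schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetMetadata:
    name: str
    author: str
    created: date
    size_bytes: int
    # Names of the upstream datasets this one was derived from
    derived_from: list = field(default_factory=list)
    tags: list = field(default_factory=list)

def lineage(name, registry):
    """Walk derived_from links back to the original sources."""
    chain = [name]
    for parent in registry[name].derived_from:
        chain.extend(lineage(parent, registry))
    return chain

# Hypothetical records: a raw dataset and a cleaned dataset built from it
raw = DatasetMetadata("orders_raw", "ingest-service", date(2025, 1, 14), 2_048_000)
clean = DatasetMetadata("orders_clean", "data-eng", date(2025, 1, 15), 1_900_000,
                        derived_from=["orders_raw"], tags=["sales"])
registry = {m.name: m for m in (raw, clean)}

print(lineage("orders_clean", registry))  # ['orders_clean', 'orders_raw']
```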

Preparation

Engaging in data preparation methods. For example, data cleaning is the process of identifying and correcting errors and inconsistencies in raw datasets. Data transformation is the conversion of cleaned data into a usable format for analysis. And the anonymization of sensitive data helps ensure data privacy and regulatory compliance.
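The three preparation methods can be sketched as small functions applied in sequence. The record layout, field names and salt below are assumptions made for the example. Note also that salted hashing is pseudonymization rather than full anonymization; it stands in here for whatever technique an organization's privacy requirements call for.

```python
import hashlib

def clean(record):
    """Cleaning: trim whitespace and turn empty strings into None."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        out[key] = value
    return out

def transform(record):
    """Transformation: convert the amount from text to a number."""
    out = dict(record)
    if record["amount"] is not None:
        out["amount"] = float(record["amount"])
    return out

def pseudonymize(record, salt="example-salt"):
    """Replace the email with a truncated salted hash (hypothetical salt)."""
    out = dict(record)
    data = (salt + record["email"].lower()).encode("utf-8")
    out["email"] = hashlib.sha256(data).hexdigest()[:12]
    return out

# Hypothetical raw record with stray whitespace and an empty note
raw = {"email": " Ana@Example.com ", "amount": " 19.90 ", "note": "  "}
prepared = pseudonymize(transform(clean(raw)))
print(prepared["amount"], prepared["note"])  # 19.9 None
```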

Assurance

Assessing and validating data quality, tracing data provenance and helping ensure the protection of sensitive data. Data quality can be measured along dimensions such as accuracy, completeness and consistency. Meanwhile, tracking data provenance can help confirm the trustworthiness of data and ensure that necessary usage permissions from data providers have been obtained.
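Two of these measures lend themselves to a simple calculation. In the sketch below, completeness is the share of rows with a value in a field, and consistency is the share of non-missing values that match an allowed set. The records, the `ALLOWED_COUNTRIES` set and both function definitions are assumptions for illustration. Accuracy is omitted because it needs a trusted reference to compare against.

```python
# Hypothetical customer records with gaps and one non-standard country code
records = [
    {"id": 1, "country": "US", "signup": "2024-03-01"},
    {"id": 2, "country": "usa", "signup": None},
    {"id": 3, "country": "DE", "signup": "2024-04-17"},
    {"id": 4, "country": None, "signup": "2024-05-02"},
]
ALLOWED_COUNTRIES = {"US", "DE", "FR"}

def completeness(rows, field):
    """Share of rows where the field has a value."""
    filled = sum(1 for r in rows if r[field] is not None)
    return filled / len(rows)

def consistency(rows, field, allowed):
    """Share of non-missing values that match the allowed set."""
    present = [r[field] for r in rows if r[field] is not None]
    return sum(1 for v in present if v in allowed) / len(present)

print(completeness(records, "signup"))  # 0.75
print(round(consistency(records, "country", ALLOWED_COUNTRIES), 2))  # 0.67
```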

Storage and preservation

Transferring data from data processing units to data repositories and data storage systems, such as data lakes and data warehouses. Considerations for data preservation may include storing different varieties of data and ensuring data security.

Discovery and access

Making data searchable and accessible by developing taxonomies, standardizing metadata and establishing data retrieval methods.
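One simple retrieval method is a keyword index built from standardized metadata. The sketch below is a minimal version of that idea, with a hypothetical three-entry catalog; production search tools add ranking, synonyms and access checks on top.

```python
from collections import defaultdict

# Hypothetical catalog entries with descriptions and tags
catalog = {
    "orders_clean": {"description": "Cleaned customer orders", "tags": ["sales", "orders"]},
    "ms_cohort": {"description": "Patient cohort for MS research", "tags": ["health", "research"]},
    "web_sessions": {"description": "Website sessions per customer", "tags": ["marketing", "sales"]},
}

def build_index(entries):
    """Map each lowercase term to the datasets whose metadata contains it."""
    index = defaultdict(set)
    for name, meta in entries.items():
        terms = meta["description"].lower().split() + [t.lower() for t in meta["tags"]]
        for term in terms:
            index[term].add(name)
    return index

def search(index, query):
    """Return the datasets that match every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*matches))

index = build_index(catalog)
print(search(index, "sales"))     # ['orders_clean', 'web_sessions']
print(search(index, "research"))  # ['ms_cohort']
```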

Data curation software solutions

Manual processes can make data curation a slow, tedious and inefficient endeavor. However, the right data governance and data management solutions can help businesses automate data curation workflows and optimize data pipelines.

Leading solutions might include features such as:

Governed data catalogs

A data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the data they need. Governed data catalogs use data classification and masking functions to enable secure data handling.
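To make the classification-and-masking idea concrete, here is a minimal sketch in which columns classified as personal data are masked for everyone except a privileged role. The `CLASSIFICATION` mapping, the role names and the keep-last-four masking rule are hypothetical; commercial catalogs define their own policies and rule engines.

```python
# Hypothetical column classifications assigned during curation
CLASSIFICATION = {"email": "pii", "ssn": "pii", "order_total": "public"}

def mask(value):
    """Keep the last four characters and hide the rest."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]

def read_row(row, role):
    """Return the row as the given role is allowed to see it."""
    if role == "steward":  # assumed privileged role
        return dict(row)
    return {
        col: mask(val) if CLASSIFICATION.get(col) == "pii" else val
        for col, val in row.items()
    }

row = {"email": "ana@example.com", "ssn": "123-45-6789", "order_total": 19.9}
masked = read_row(row, "analyst")
print(masked["ssn"])  # *******6789
```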

Curated glossaries

Glossaries of industry-specific business vocabularies can improve data classification, regulatory compliance and other governance activities.

AI-powered metadata enrichment

Large language models (LLMs) can be deployed for metadata enrichment, adding more context, labels or descriptions to large volumes of data assets at once.

Intelligent search

Intelligent search can improve data accessibility and eliminate silos. Powered by AI, it allows users to extract information from anywhere (inside or outside the company) regardless of format, helping them find the data they need quickly and easily.

Use cases for data curation

Data curation plays an important role in various fields and disciplines. Use cases include:

Advancing medical research

Curated data can help propel advancements and breakthroughs in treating disease. For instance, a US-based healthcare clinic recently announced a partnership with an AI health data platform to curate datasets focused on multiple sclerosis (MS), a chronic neurological disease.

The objective of the project, which will include collected data from over 3,000 patients, is to develop data-driven insights on disease subtypes, disease progression and more.²

Keeping AI in insurance compliant

Data curation can help ensure that organizations adopting AI are doing so in line with applicable regulations and requirements.

For instance, the insurance industry has widely adopted AI and machine learning technologies to modernize. But the regulatory landscape surrounding AI adoption in the industry is complex and dynamic. Relevant laws such as the Solvency II Directive include strict policies for insurers regarding “the sufficiency and quality of relevant data for underwriting and reserving processes.” These regulations also require that data used to test and train AI systems is complete, accurate and appropriate.³

Personalizing consumer marketing

Digital and brick-and-mortar retailers often curate their shopper data by engaging in segmentation processes, organizing customers into groups based on their characteristics, behaviors and preferences. This allows retailers to be more effective in targeting different groups of customers with promotions, product recommendations and other personalized marketing efforts.

For example, an analysis of retail email marketing campaigns determined that segmented emails were read 15% more often than those that were not segmented.⁴
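Segmentation of this kind can be as simple as a set of rules over purchase behavior. The sketch below assigns each customer to one group; the field names, thresholds and segment labels are invented for illustration, and retailers often use clustering or other statistical methods instead.

```python
def segment(customer):
    """Assign a customer to one segment using hypothetical thresholds."""
    orders = customer["orders_last_year"]
    if orders >= 10 and customer["avg_order"] >= 50:
        return "loyal high-spender"
    if orders >= 10:
        return "frequent buyer"
    if orders == 0:
        return "lapsed"
    return "occasional"

# Hypothetical shopper records
customers = [
    {"id": "c1", "orders_last_year": 14, "avg_order": 72.0},
    {"id": "c2", "orders_last_year": 12, "avg_order": 18.5},
    {"id": "c3", "orders_last_year": 0, "avg_order": 0.0},
    {"id": "c4", "orders_last_year": 3, "avg_order": 40.0},
]

segments = {c["id"]: segment(c) for c in customers}
print(segments)
```

Each group can then receive its own promotions or recommendations.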

Footnotes

¹ “Big data curation framework: Curation actions and challenges.” Journal of Information Science. 11 November 2022.

² “Exclusive: Century Health, Nira Medical partner to provide AI-curated EHR data.” MobiHealthNews. 14 January 2025.

³ “Consultation Paper: On Opinion on Artificial Intelligence Governance and Risk Management.” European Insurance and Occupational Pensions Authority (EIOPA). 10 February 2025.

⁴ “Sophisticated email segmentation boosts open rates, engagement: report.” Retail Dive. Accessed 28 March 2025.
