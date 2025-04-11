Today, organizations generate ever-growing mountains of data, more than 400 million terabytes’ worth per day. Much of this data can prove tremendously valuable, but only if businesses can understand and leverage it successfully.
As part of effective data management, data curation helps businesses derive important insights from enterprise data and use these insights for decision-making. Well-curated data is also considered critical to improving the performance of artificial intelligence initiatives.
Outside of the enterprise, data curation is a key process in research and academic settings. For instance, the curation of research data can improve data sharing and archiving among developers, scientists, healthcare professionals and other researchers.
The data curation process can be manual, or it can be performed with the help of automation, with software designed to execute curation activities at scale.
The exponential growth of data volumes has given organizations more business-relevant data than ever, with some amassing datasets containing terabytes or petabytes of information from a variety of data sources. On a macro level, an estimated 149 zettabytes of data was generated globally in 2024, and that figure is expected to more than double by 2028.
Performing quality assurance and data discovery on such unprecedentedly large and complex datasets, known as “big data,” is no simple feat. However, it is a critical one, as enterprise data is increasingly proving to be a source of valuable insights. Annotating and organizing data for data-driven decision-making can deliver a competitive edge and elevate performance for businesses in industries across the board.
Addressing data quality and usability challenges has become especially urgent as companies embrace AI-powered capabilities as a strategic imperative. AI systems have the potential to transform business and elevate productivity, but their data needs are substantial: They require high-quality data to perform effectively. Low-quality data can result in poor model performance, a “garbage in, garbage out” scenario. Datasets with data quality issues such as missing values, outliers or inconsistencies can distort analysis and lead to incorrect outputs.
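As a rough illustration of the quality issues named above, the following sketch profiles a single numeric column for missing values and outliers. The function name `profile_column`, the sample `order_totals` values and the 1.5 × IQR outlier threshold are assumptions chosen for this example; they are one common convention, not a prescribed method.

```python
import statistics

def profile_column(values):
    """Report missing values and IQR-based outliers for one numeric column."""
    # None is treated as a missing value in this sketch.
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    # Quartiles from the standard library (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}

# Hypothetical order totals; the 500 entry is a deliberately extreme value.
order_totals = [20, 22, 19, None, 21, 23, 500, 20, None, 22]
report = profile_column(order_totals)
print(report)
```

A check like this only flags candidates. Whether a flagged value is an error or a legitimate extreme is a judgment that still belongs to the curator.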
Effective data curation is also key to helping ensure the reusability of high-quality datasets. For example, inadequate column labeling in a table may yield confusion over what the table’s values represent, rendering its reuse unlikely in the future.1
While data curation practices may vary by organization, researchers have identified curation activities common among data curators, data engineers, data scientists, data stewards and other data management professionals over big data lifecycles.2 They include:
Setting strategies and criteria for data collection, production and ingestion. Data ingestion includes data acquisition from various sources, including structured databases and application programming interfaces (APIs), as well as databases for unstructured data. The planning step of data curation may also consider data governance, which helps ensure data integrity and data security.
Creating, collecting, preserving and maintaining metadata, which is information that describes a datapoint or dataset, such as author, creation date or file size. Successful metadata management can help make data more findable, enable data lineage tracing and improve system interoperability.
Engaging in data preparation methods. For example, data cleaning is the process of identifying and correcting errors and inconsistencies in raw datasets. Data transformation is the conversion of cleaned raw data into a usable format for analysis. And the anonymization of sensitive data helps ensure data privacy and regulatory compliance.
Assessing and achieving validation of data quality, tracing data provenance and helping ensure the protection of sensitive data. Data quality can be categorized through metrics such as accuracy, completeness and consistency. Meanwhile, tracking data provenance can help confirm the trustworthiness of data and assure that necessary usage permissions from data providers have been obtained.
Transferring data from data processing units to data repositories and data storage systems, such as data lakes and data warehouses. Considerations for data preservation may include storing different varieties of data and ensuring data security.
Making data searchable and accessible by developing taxonomies, standardizing metadata and establishing data retrieval methods.
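Several of the activities listed above can be sketched in a few lines: cleaning raw records, protecting a sensitive field, measuring completeness and recording descriptive metadata. Everything named here (`REQUIRED_FIELDS`, the sample records, the `crm_export` source label and the salt) is hypothetical. Salted hashing is also pseudonymization rather than full anonymization, so treat this as a minimal sketch rather than a compliance-ready approach.

```python
import hashlib
from datetime import date

REQUIRED_FIELDS = ("email", "country", "age")  # hypothetical schema

def clean(record):
    """Trim whitespace and treat empty strings as missing values."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned

def pseudonymize(record, salt):
    """Replace the email with a salted SHA-256 digest (not full anonymization)."""
    out = dict(record)
    if out.get("email") is not None:
        payload = (salt + out["email"].lower()).encode("utf-8")
        out["email"] = hashlib.sha256(payload).hexdigest()
    return out

def completeness(records):
    """Share of required fields that are populated across all records."""
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(
        1 for r in records for f in REQUIRED_FIELDS if r.get(f) is not None
    )
    return filled / total if total else 0.0

raw = [
    {"email": " Ana@example.com ", "country": "PT", "age": 34},
    {"email": "bo@example.com", "country": "  ", "age": None},
]
curated = [pseudonymize(clean(r), salt="demo-salt") for r in raw]

# Descriptive metadata stored alongside the curated dataset.
metadata = {
    "source": "crm_export",  # hypothetical source name
    "curated_on": date.today().isoformat(),
    "record_count": len(curated),
    "completeness": completeness(curated),
}
print(metadata)
```

Keeping the completeness score in the metadata record means a later user can judge fitness for reuse without reprocessing the data.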
Manual processes can make data curation a slow, tedious and inefficient endeavor. However, the right data governance and data management solutions can help businesses automate data curation workflows and optimize data pipelines.
Leading solutions might include features such as:
A data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the data they need. Governed data catalogs use data classification and masking functions to enable secure data handling.
Glossaries of industry-specific business vocabularies can improve data classification, regulatory compliance and other governance activities.
Large language models (LLMs) can be deployed for metadata enrichment, adding more context, labels or descriptions to large volumes of data assets at once.
Intelligent search, which is also powered by AI, improves data accessibility and provides users with information and answers relevant to their specific queries.
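To make the catalog, masking and search features more concrete, here is a toy in-memory catalog. The `Catalog` and `CatalogEntry` classes and the sample entries are invented for illustration and do not reflect any vendor's API; the search is plain keyword matching, which stands in for the AI-powered search that commercial products provide.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)
    sensitive_columns: set = field(default_factory=set)

class Catalog:
    """A toy inventory of data assets with keyword search and masking."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        """Return names of entries whose name, description or tags match."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )

    def mask(self, name, row):
        """Hide values in columns classified as sensitive for this asset."""
        entry = self._entries[name]
        return {
            k: ("***" if k in entry.sensitive_columns else v)
            for k, v in row.items()
        }

catalog = Catalog()
catalog.register(
    CatalogEntry("customer_orders", "Order history per customer", {"sales"}, {"email"})
)
catalog.register(CatalogEntry("store_inventory", "Stock levels by store", {"supply"}))

matches = catalog.search("customer")
masked = catalog.mask("customer_orders", {"email": "ana@example.com", "total": 42})
print(matches, masked)
```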
Data curation plays an important role in various fields and disciplines. Use cases include:
Curated data can help propel advancements and breakthroughs in treating disease. For instance, a US-based healthcare clinic recently announced a partnership with an AI health data platform to curate datasets focused on multiple sclerosis (MS), a chronic neurological disease.
The objective of the project, which will include collected data from over 3,000 patients, is to develop data-driven insights on disease subtypes, disease progression and more.3
Data curation is central to the functionality and performance of AI models, with machine learning algorithms and generative AI models requiring high-quality data for optimal training.
Recently, data curation proved key to enabling IBM’s Granite Vision, a lightweight large language model, to extract content from tables, charts, infographics and other visual data representations. Data used to train the model included document pages from PDF files originating from 45,000 carefully cataloged web domains.4
Digital and brick-and-mortar retailers often curate their shopper data by engaging in segmentation: organizing customers into groups based on their characteristics, behaviors and preferences. This allows retailers to be more effective in targeting different groups of customers with promotions, product recommendations and other personalized marketing efforts.
For example, an analysis of retail email marketing campaigns determined that segmented emails were read 15% more often than those that were not segmented.5
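Segmentation of this kind can be as simple as a set of rules applied to each customer record. The sketch below assumes a hypothetical `orders_last_year` field and arbitrary thresholds; real retailers typically combine many more signals and may use clustering instead of fixed rules.

```python
from collections import defaultdict

def segment(customer):
    """Assign a segment label from order frequency (thresholds are illustrative)."""
    if customer["orders_last_year"] >= 10:
        return "loyal"
    if customer["orders_last_year"] == 0:
        return "lapsed"
    return "occasional"

def group_by_segment(customers):
    """Map each segment label to the list of customer IDs in it."""
    groups = defaultdict(list)
    for c in customers:
        groups[segment(c)].append(c["id"])
    return dict(groups)

# Hypothetical shopper records.
customers = [
    {"id": "c1", "orders_last_year": 14},
    {"id": "c2", "orders_last_year": 0},
    {"id": "c3", "orders_last_year": 3},
    {"id": "c4", "orders_last_year": 11},
]
segments = group_by_segment(customers)
print(segments)
```

Each resulting group can then receive its own promotion or recommendation set.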
