Data wrangling is the process of cleaning, structuring and enriching raw data to be used in data science, machine learning (ML) and other data-driven applications.
Also known as data munging or data preparation, data wrangling is a way to address data quality issues such as missing values, duplicates, outliers and formatting inconsistencies. The goal of data wrangling is to transform raw, unstructured or problematic data into clean data sets that can be analyzed effectively. Data wrangling helps data scientists, data analysts and other business users apply data in ways that support informed decision-making.
Today, organizations have access to an avalanche of data from different sources. However, this raw data can be messy, inconsistent or unsuitable for use with various processes and tools that turn it into valuable insights. Without proper data wrangling, the results of data analysis can be misleading. Businesses could draw inaccurate conclusions and make flawed business decisions.
Data wrangling is a key way to support high-quality results. It transforms and maps data through a series of steps so that it becomes clean, consistent, reliable and useful for its intended application. The resulting data sets are used for tasks such as building machine learning models, performing data analytics, creating data visualizations, generating business intelligence reports and making informed executive decisions.
As data-driven technologies, including artificial intelligence (AI), grow more advanced, data wrangling becomes more important. AI models are only as good as the data on which they are trained.
The data wrangling process helps ensure that the information used to develop and enhance models is accurate. It improves interpretability, as clean and well-structured data is easier for humans and algorithms to understand. It also aids with data integration, making it easier for information from disparate sources to be combined and interconnected.
The data wrangling process typically involves these steps:
The initial discovery stage focuses on assessing the quality of the complete data set, including data sources and data formats. Is the data coming from databases, application programming interfaces (APIs), CSV files, web scraping or other sources? How is it structured? How will it be used?
The discovery process highlights and addresses quality issues, such as missing data, formatting inconsistencies, errors or bias and outliers that might skew the analysis. The findings are typically documented in a data quality report or a more technical document known as a data profiling report, which includes statistics, distributions and other results.
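As a minimal sketch of how such profiling might start in practice, the pandas calls below surface column types, summary statistics, missing values and duplicates; the file name and columns are hypothetical placeholders, not part of any specific data set.

```python
# A minimal data-profiling sketch with pandas; "shipments.csv" is a hypothetical file.
import pandas as pd

df = pd.read_csv("shipments.csv")   # load the raw data set

df.info()                           # column types and non-null counts
print(df.describe(include="all"))   # summary statistics and distributions
print(df.isna().sum())              # missing values per column
print(df.duplicated().sum())        # count of exact duplicate rows
```

The output of these checks is the kind of material that typically feeds a data quality or data profiling report.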
The data structuring step, sometimes called data transformation, focuses on organizing the data into a unified format so that it is suitable for analysis, for example by converting data types, standardizing field names and reshaping tables into a common layout.
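A simple structuring pass might look like the following sketch, which standardizes column names and enforces consistent types; the specific fields (ship_date, weight_kg) are assumed for illustration.

```python
# A sketch of basic structuring steps with pandas; column names are hypothetical.
import pandas as pd

df = pd.read_csv("shipments.csv")

# Standardize column names to one naming convention.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Enforce consistent types so every downstream tool sees the same format.
df["ship_date"] = pd.to_datetime(df["ship_date"], errors="coerce")
df["weight_kg"] = pd.to_numeric(df["weight_kg"], errors="coerce")
```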
Data cleaning involves handling missing values, removing duplicates and correcting errors or inconsistencies. This process might also involve smoothing “noisy” data, that is, applying techniques that reduce the impact of random variations or other issues in the data. When cleaning, it is important to avoid unnecessary data loss or overcleaning, which can remove valuable information or distort the data.
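The sketch below shows what those cleaning operations can look like in pandas, including a rolling average as one simple way to smooth noisy values; the columns and the seven-day window are assumptions for illustration, and whether smoothing or imputation is appropriate depends on the use case.

```python
# A sketch of common cleaning operations with pandas; column names are hypothetical.
import pandas as pd

df = pd.read_csv("shipments.csv", parse_dates=["ship_date"])

df = df.drop_duplicates()                                            # remove exact duplicate rows
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())   # impute missing weights
df = df.dropna(subset=["ship_date"])                                 # drop rows missing a required field

# Smooth a "noisy" numeric series with a rolling average to reduce random variation.
df["weight_kg_smoothed"] = df["weight_kg"].rolling(window=7, min_periods=1).mean()
```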
Data enrichment involves adding new information to existing data sets to enhance their value. Sometimes called data augmentation, this step starts with assessing what additional information is necessary and where it might come from. The additional information is then integrated with the existing data set and cleaned in the same ways as the original data.
Data enrichment might involve pulling in demographic, geographic, behavioral or environmental data relevant to the intended use case. For example, if the data wrangling project is related to supply chain operations, enriching shipment data with weather information might help predict delays.
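As a rough illustration of that supply chain example, the join below attaches weather conditions to each shipment record; the file names and join keys (origin_city, ship_date) are hypothetical.

```python
# A sketch of enriching shipment data with weather data; files and columns are hypothetical.
import pandas as pd

shipments = pd.read_csv("shipments.csv", parse_dates=["ship_date"])
weather = pd.read_csv("weather.csv", parse_dates=["date"])

# Join on origin city and date so each shipment carries the weather of its departure day.
enriched = shipments.merge(
    weather,
    left_on=["origin_city", "ship_date"],
    right_on=["city", "date"],
    how="left",
)
```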
The data validation step involves verifying the accuracy and consistency of the wrangled data. First, validation rules are established based on business logic, data constraints and other requirements. Then, validation techniques are applied to confirm that the data meets those rules.
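A minimal sketch of such rule-based checks follows; the specific rules and columns are assumed examples of business logic and data constraints, not a fixed set.

```python
# A sketch of rule-based validation checks with pandas; rules and columns are hypothetical.
import pandas as pd

df = pd.read_csv("shipments_clean.csv", parse_dates=["ship_date", "delivery_date"])

# Range check: weights must be positive.
assert (df["weight_kg"] > 0).all(), "Found non-positive weights"

# Consistency check: a shipment cannot be delivered before it ships.
assert (df["delivery_date"] >= df["ship_date"]).all(), "Delivery precedes shipment"

# Completeness and uniqueness checks on the key identifier.
assert df["shipment_id"].notna().all(), "Missing shipment IDs"
assert df["shipment_id"].is_unique, "Duplicate shipment IDs"
```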
After thorough validation, a business might publish the wrangled data or prepare it for use in applications. This process might involve loading the data into a data warehouse, creating data visualizations or exporting the data in a specific format for use with machine learning algorithms.
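For example, publishing might be as simple as the sketch below, which exports the data to a columnar file and loads it into a database table; the paths and table name are hypothetical, and writing Parquet with pandas assumes the pyarrow or fastparquet package is installed.

```python
# A sketch of publishing wrangled data; paths and table names are hypothetical.
import sqlite3
import pandas as pd

df = pd.read_csv("shipments_clean.csv")

# Export to a columnar format for analytics and ML pipelines (requires pyarrow or fastparquet).
df.to_parquet("shipments_clean.parquet", index=False)

# Or load into a database table that reporting tools can query.
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("shipments", conn, if_exists="replace", index=False)
```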
The data wrangling process can be time-consuming, especially as the volume of complex data continues to grow. In fact, research suggests that preparing data and working to transform it into usable forms takes up between 45% and 80% of a data analyst’s time.1, 2
Data wrangling requires a certain level of technical expertise in programming languages, data manipulation techniques and specialized tools. But it ultimately improves data quality and supports more efficient and effective data analysis.
Organizations use various tools and technologies to wrangle data from different sources and integrate it into a data pipeline that supports overall business needs. These include:
Python and R are widely used for data wrangling tasks, including data mining, manipulation and analysis. Structured Query Language (SQL) is essential for querying relational databases and managing the data they store.
Data wranglers use tools such as Microsoft Excel and Google Sheets for basic data cleaning and manipulation, particularly for smaller data sets.
Data wrangling tools provide a visual interface for data cleansing and data transformation, helping to streamline workflows and automate tasks. For example, the data refinery tool available in IBM platforms can quickly transform raw data into a usable form for data analytics and other purposes.
Big data platforms help wrangle large-scale, complex data sets by providing the tools and scaling capabilities needed to handle the volume and variety of big data. Platforms such as Apache Hadoop and Apache Spark are used for wrangling large data sets. They use big data technologies to transform information into a usable form for high-quality data analytics and decision-making.
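As a minimal sketch of the same kind of wrangling at scale, the PySpark snippet below deduplicates, drops incomplete rows and normalizes a date column across a large distributed data set; the storage paths and column names are hypothetical.

```python
# A minimal PySpark sketch of wrangling at scale; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()

df = spark.read.csv("s3://bucket/shipments/*.csv", header=True, inferSchema=True)

cleaned = (
    df.dropDuplicates()                                   # remove duplicate rows
      .na.drop(subset=["shipment_id"])                    # drop rows missing the key identifier
      .withColumn("ship_date", F.to_date("ship_date"))    # normalize the date format
)

cleaned.write.mode("overwrite").parquet("s3://bucket/shipments_clean/")
```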
AI supports data wrangling through automation and advanced analysis. Machine learning models and algorithms might help with issues such as outlier detection and scaling. Other AI tools can process large data sets quickly, handle real-time transformations and recognize patterns to guide cleaning efforts. Natural language processing (NLP) interfaces allow users to interact with data intuitively, which might reduce technical barriers.
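One common pattern for ML-assisted wrangling is automated outlier detection, sketched below with scikit-learn's Isolation Forest; the feature columns and the 1% contamination rate are assumptions, and flagged rows would still typically go to a human for review.

```python
# A sketch of ML-assisted outlier detection with scikit-learn; features are hypothetical.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("shipments_clean.csv")
features = df[["weight_kg", "transit_days"]].dropna()

# Isolation Forest flags roughly the most anomalous 1% of rows for human review.
model = IsolationForest(contamination=0.01, random_state=42)
features["is_outlier"] = model.fit_predict(features) == -1

print(features["is_outlier"].sum(), "potential outliers flagged for review")
```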
1. State of Data Science, Anaconda, July 2020.
2. Hellerstein et al., Principles of Data Wrangling, O’Reilly Media, July 2017.