The low priority that businesses have placed on Data Quality has caused many project failures over the last 10 years. In today’s Big Data era, where massive scale and complex data reign, success comes from prioritizing Data Quality management.
To gain a competitive advantage, many companies perform advanced analytics on Big Data, which is commonly described using the 5 V’s: Volume, Velocity, Variety, Veracity and Value. Social media and the Internet of Things (IoT) are examples of sources producing large Volumes of data at extreme Velocity. Variety represents data types: structured, semi-structured or unstructured.
Data Quality impacts all 5 V’s as highlighted by Anmol Rajpurohit in a KDnuggets article. The two most important for Data Quality are Veracity (the ability to trust the data) and the Value the data enables.
Top Data Quality Issues
Lessons learned from top Data Quality issues that existed a decade ago in traditional relational systems with ‘small data’ remain relevant today: small and Big Data share the same Data Quality issues. An estimated $3.1 trillion is spent in the United States on Data Quality issues, according to the IBM Big Data Hub.
Lack of Data Standards
Metadata Definitions/Quality - Incorrect or missing definitions describing the data within a column (e.g., allowed values)
Manual Human Intervention
Data entry errors and use of spreadsheets for data preparation
Broken Business Processes
Changes in business requirements not properly captured or accounted for, leading to broken business processes (e.g., outdated data feeds)
Poor Data Requirements
Missing or incorrect data configuration rules, mappings or cleansing logic handled by custom application code unknown to stakeholders in a data migration/integration project
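Several of the issues above, such as data entry errors and missing value definitions, can be surfaced with simple rule-based checks against column metadata. The sketch below is a minimal, hypothetical example; the column names and allowed values are illustrative assumptions, not taken from any particular system.

```python
# Hypothetical rule-based Data Quality check: validate that each record's
# values fall within the allowed values declared in column metadata.
# Column names and allowed values are illustrative assumptions.

ALLOWED_VALUES = {
    "status": {"active", "inactive", "pending"},
    "region": {"NA", "EMEA", "APAC"},
}

def find_violations(records):
    """Return (row_index, column, bad_value) for every disallowed value."""
    violations = []
    for i, row in enumerate(records):
        for column, allowed in ALLOWED_VALUES.items():
            value = row.get(column)
            if value not in allowed:
                violations.append((i, column, value))
    return violations

records = [
    {"status": "active", "region": "NA"},
    {"status": "Actve", "region": "EMEA"},   # typo from manual data entry
    {"status": "pending", "region": None},   # missing value
]
print(find_violations(records))  # → [(1, 'status', 'Actve'), (2, 'region', None)]
```

Checks like this only work when the metadata definitions themselves exist and are correct, which is exactly the point of the first issue listed above.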
Big Data Quality
One might argue that Big Data’s sheer volume has increased Data Quality issues. Data Scientists spend 80% of their time on data preparation activities, as indicated by a Forbes article. Much of that time goes to cleaning “dirty data” before the more rewarding tasks of building data models, applying sophisticated algorithms and using machine learning.
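A minimal sketch of what that routine “dirty data” cleanup often looks like in practice: trimming stray whitespace, normalizing case, and dropping duplicate records. The field names and values here are illustrative assumptions.

```python
# A minimal data-preparation sketch: normalize formatting and deduplicate.
# Field names ("name", "email") are illustrative assumptions.

def clean(records):
    seen = set()
    cleaned = []
    for row in records:
        name = row["name"].strip().title()     # trim and normalize case
        email = row["email"].strip().lower()
        key = (name, email)
        if key in seen:                        # drop duplicate entries
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": "  alice SMITH ", "email": "Alice@Example.com"},
    {"name": "Alice Smith",    "email": "alice@example.com"},  # duplicate
    {"name": "bob jones",      "email": "BOB@example.com "},
]
print(clean(raw))  # → two records: Alice Smith and Bob Jones, normalized
```

Real preparation pipelines are far larger, but they are largely compositions of small normalization steps like these, which is why they consume so much of a Data Scientist’s time.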
Data has become more complex in today’s world with new characteristics, but the foundational Data Quality principles remain:
- Focus on business goals that produce Value
- Prioritize data that supports the business goals and use cases
- Institute a Data Quality initiative that identifies data issues inhibiting value
- Execute Data Cleansing where it matters
What has changed today is how Data Quality analysis is executed: it must now consider the 5 V’s. Data Quality thresholds will vary based on how data is produced; social media data will not be held to the same standards as operational data. Data Quality must be scalable and keep up with the growing volume and speed of data. It is no longer sufficient to focus only on automation using profiling tools. The future of Data Quality lies in using Machine Learning (ML) technologies to help detect future issues, or similar issues across varying data sets. How will you use Machine Learning to help your Data Quality program?
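As a simple precursor to full ML-based detection, statistical outlier checks on data-quality metrics can flag issues such as a broken feed automatically. The sketch below flags values that deviate strongly from the mean; the metric (daily row counts) and the threshold are illustrative assumptions.

```python
# A minimal statistical sketch of automated issue detection: flag values
# far from the mean, a precursor to ML-based anomaly detection on
# data-quality metrics. The metric and threshold are illustrative.

import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return indices of values more than z_threshold std devs from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > z_threshold]

# Daily row counts from an ingest feed; day 5 suggests a broken data feed.
row_counts = [1000, 1020, 985, 1010, 990, 0, 1005]
print(flag_outliers(row_counts, z_threshold=2.0))  # → [5]
```

ML approaches generalize this idea: instead of a fixed z-score on one metric, a model learns what “normal” looks like across many metrics and data sets, and flags deviations even when no explicit rule was written.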