It would be difficult to find someone who thinks that data quality isn't important. Certainly, the effects of poor data quality are painfully clear: organizations depend on data to make strategic management decisions, provide customer service, and develop processes and timelines. If that data is obsolete, inconsistent, incoherent, or just plain wrong, it can cost a company time, customers, and revenue. Additionally, demonstrating data quality is often a requirement for regulatory compliance.
Trying to develop an overarching program to maintain and improve data quality can feel like chasing ghosts. This article presents the concepts essential to a successful data quality program and outlines a plan for initiating one through a project tied to a specific business initiative.
What is data quality?
The first step to creating a successful data quality program is to understand what data quality means in the context of a particular organization. Broadly, quality data is "fit for use": it can be trusted and it is suitable for its intended purpose. Assessing whether a specific set of data meets the criteria requires answering several questions: What data is being used, who is using it, how are they using it, when are they using it, and why? This becomes more complex as organizations begin sharing data across lines of business, departments, and other entities. It quickly becomes clear that to measure data quality effectively, it must be defined at the entity or even at the attribute level.
Data quality can be measured in many dimensions, including accuracy, reliability, timeliness, relevance, completeness, and consistency. Of course, different organizations will have different priorities. However, it's important to recognize that there are both technical and business views of data quality, and neither can be ignored. Data that meets technical quality standards (for example, data that is consistent, correctly formatted, and well defined) but that users do not perceive as reliable, accurate, or useful will have little impact on the organization. In short, ensuring data quality requires an awareness of both technical and business requirements.
Strategy and setting goals
One of the best ways to build a data quality program is to tie it to a strategic business project. Data quality isn't the ultimate goal; it is the means to a goal that supports, extends, or enhances the business in some way. For example, a company that sets the goal of increasing sales at retail stores by 20 percent over the coming year might want to create a data quality program that ensures that information delivered to store managers about sales trends of high-value products is accurate, timely, and precise.
The charter, objectives, and plan for a successful data quality project should follow the well-known project management SMART mnemonic: Specific, Measurable, Actionable, Realistic, Time-driven (see sidebar, "SMART data quality"). This is also the time to address high-level organizational issues (such as who will own the program and who will be the major stakeholders) and technical issues (such as the tools to be used and the environment for data analysis).
Scope and definition
Once the goals of the data quality project are established, the next stage is discovery and assessment, starting with identifying the data that is within the scope of the project and who owns it. With the owners of the data entities established, the business and IT teams can move on to defining the data entities and their attributes. For every entity, there should be a business definition (what the data is and why it is meaningful), a technical definition (field sizes, types, relationships, and hierarchies; expected data patterns or formats; and so on), and a quality definition that includes expected and acceptable values along with business rules and formatting rules.
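The three kinds of definition can be captured in a small data structure long before any tool is chosen. The following Python sketch is only an illustration of that idea: the `AttributeDefinition` class, its field names, and the `store_region` example with its region codes are all hypothetical, and none of it reflects how any IBM product stores definitions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeDefinition:
    """Hypothetical record holding the three definitions for one attribute."""
    name: str
    business_definition: str                  # what the data is and why it matters
    data_type: type                           # technical: expected type
    max_length: Optional[int] = None          # technical: field size
    pattern: Optional[str] = None             # technical: expected format (regex)
    allowed_values: frozenset = frozenset()   # quality: acceptable values
    required: bool = True                     # quality: may the value be missing?

    def violations(self, value):
        """Return a list of rule violations for one value (empty if clean)."""
        if value is None or value == "":
            return ["missing value"] if self.required else []
        if not isinstance(value, self.data_type):
            return [f"expected {self.data_type.__name__}"]
        problems = []
        if self.max_length is not None and len(str(value)) > self.max_length:
            problems.append(f"longer than {self.max_length} characters")
        if self.pattern and not re.fullmatch(self.pattern, str(value)):
            problems.append("does not match expected format")
        if self.allowed_values and value not in self.allowed_values:
            problems.append("not an accepted value")
        return problems


# Assumed example attribute; the region codes are invented for illustration.
store_region = AttributeDefinition(
    name="store_region",
    business_definition="Sales region a retail store reports into.",
    data_type=str,
    max_length=2,
    pattern=r"[A-Z]{2}",
    allowed_values=frozenset({"NE", "SE", "MW", "WE"}),
)

print(store_region.violations("NE"))
print(store_region.violations("north"))
```

Keeping the business text next to the technical and quality rules means the same record can be reviewed by business owners and executed by IT, which is the point of agreeing on definitions before profiling begins.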
A tool such as IBM InfoSphere Business Glossary can be helpful at this stage, by providing a repository for data definitions and a simple user interface for entering, searching, and exploring vocabulary and definitions. An enterprise glossary helps ensure that definitions are consistent across projects, supports collaboration between business and IT and across lines of business, and helps build the common vocabulary and understanding of data.
Assessment and profiling
Assessing the actual data based on the criteria established by the business and technical teams is the next step. Here, software such as IBM InfoSphere Information Analyzer is used to profile the data. During profiling, the data is checked at the column, table, and cross-table level to assess its completeness, validity, and conformity to known or expected usage. If the business definitions for the data have been clearly established, the rules can be entered into InfoSphere Information Analyzer, which will use them to validate the data.
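To make the column-level and cross-table checks concrete, here is a minimal Python sketch using only the standard library. It is not InfoSphere Information Analyzer's interface; the function names `profile_column` and `orphaned_keys`, the sample values, and the choice of metrics are assumptions made for illustration.

```python
import re
from collections import Counter


def profile_column(values, pattern=None, allowed=None):
    """Summarize one column: completeness, validity, and value frequencies."""
    total = len(values)
    # A value counts as present if it is not None and not blank.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    # A present value is valid if it passes every rule that was supplied.
    valid = [
        v for v in present
        if (pattern is None or re.fullmatch(pattern, str(v)))
        and (allowed is None or v in allowed)
    ]
    return {
        "rows": total,
        "completeness": len(present) / total if total else 0.0,
        "validity": len(valid) / len(present) if present else 0.0,
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(3),
    }


def orphaned_keys(child_keys, parent_keys):
    """Cross-table check: child values that have no matching parent key."""
    parents = set(parent_keys)
    return sorted({k for k in child_keys if k is not None and k not in parents})


# Invented sample data: one lowercase code, one null, and one blank.
regions = ["NE", "SE", "ne", None, "MW", ""]
region_profile = profile_column(regions, pattern=r"[A-Z]{2}")
print(region_profile)

# Sales rows that point at store IDs missing from the store table.
orphans = orphaned_keys(child_keys=[1, 2, 3, None, 9], parent_keys=[1, 2, 3])
print(orphans)
```

Completeness here is measured against all rows, while validity is measured only against the values that are present, so a column with many nulls is not penalized twice.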
InfoSphere Information Analyzer also provides a central business rule repository, promoting reuse and consistency across different projects and implementations, and shares a metadata repository with InfoSphere Business Glossary, which simplifies data sharing and implementation. Other data quality tools make it possible to perform sophisticated, automated analysis on data depending on the data quality and validation needs (see sidebar, "Resources").
After the assessment, results should be reviewed by both technical and business teams to develop a complete understanding of the data. The next step is deciding what action to take based on the reports. Sometimes, the action will be technical, such as changing a data model or a user interface. Other times, the action will involve a business process or policy change, such as altering who is responsible for gathering and entering the data.
From assessment to program
At this point in the process, the organization should understand what at least part of its data environment looks like and know what its business objectives are. The next step is to create a data quality program that will move the organization from the current state to the desired state.
Building this program is beyond the scope of this article, but a data quality program has three essential elements. First, it uses the structure defined for the data quality assessment to measure data quality on a regular schedule. Second, it assigns stewards to monitor data quality continually. Finally, it provides a process for developing action plans for dealing with data quality issues identified during ongoing monitoring.
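As a rough illustration of how the three elements fit together, the sketch below compares a set of measurements against agreed targets and routes any shortfall to the responsible steward. The `QualityTarget` class, the `review` function, the thresholds, and the steward name are all hypothetical; a real program would define its own metrics and escalation process.

```python
from dataclasses import dataclass


@dataclass
class QualityTarget:
    """Hypothetical target for one metric on one attribute."""
    attribute: str
    metric: str      # for example "completeness" or "validity"
    minimum: float   # lowest acceptable score, from 0.0 to 1.0
    steward: str     # person or role who monitors this attribute


def review(targets, measurements):
    """Compare the latest measurements to targets; return items needing action."""
    actions = []
    for t in targets:
        score = measurements.get((t.attribute, t.metric))
        if score is None:
            # A metric that was never measured is itself an issue to follow up.
            actions.append((t.steward, f"{t.attribute}: no {t.metric} measurement"))
        elif score < t.minimum:
            actions.append(
                (t.steward,
                 f"{t.attribute}: {t.metric} {score:.0%} is below {t.minimum:.0%}")
            )
    return actions


# Assumed targets and one round of measurements, invented for illustration.
targets = [
    QualityTarget("store_region", "completeness", 0.98, "retail data steward"),
    QualityTarget("store_region", "validity", 0.95, "retail data steward"),
]
measurements = {
    ("store_region", "completeness"): 0.99,
    ("store_region", "validity"): 0.75,
}

actions = review(targets, measurements)
for steward, issue in actions:
    print(f"{steward} -> {issue}")
```

Running the same review on every measurement cycle is what turns a one-time assessment into ongoing monitoring; the returned list is the starting point for an action plan, not the plan itself.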
Today, many organizations discover data quality issues only when they impact the business, usually with a negative result. By actively assessing and monitoring data quality, organizations can instead identify data issues and address them before they cause problems. By creating repeatable processes and reusable assets, organizations can ground the abstract concept of data quality in a real-world project, and use it to minimize risk and generate business value.
Resources
- IBM InfoSphere
- IBM InfoSphere Information Analyzer
- IBM InfoSphere Business Glossary
- IBM InfoSphere QualityStage
- IBM InfoSphere Discovery