This week, I've been part of a small team kicking off a new IBM Redbook initiative around Big Data Governance. We brought together solid and diverse backgrounds from across the spectrum of information management solutions and products, with lively and entertaining discussion on the subject -- I'm very excited to be part of the initiative!
The intersection of Big Data and Information Governance makes for a broad array of topics addressing data that ranges from social media feeds to sensor inputs to arrays of log files and beyond. Given my long-time work in Information Quality, I have a lot of interest in how we find and establish an effective quality focus for Big Data. From our discussions this week, two key aspects (really fundamental principles of Information Quality) stood out to me, which I would summarize with the phrases: "Know your data" and "Fit for Purpose".
"Know your data"
For years we've preached that the first pillar for information integration is Understanding. Not only does this not change for Big Data, but in many cases you have to dig deeper and cast aside typical assumptions from the world of structured data. Consider a typical traditional operational system such as a Payroll application. The data represents the salaries of your employees and what they have been paid each pay period. You own the data, you control how it is entered and stored (or your application system manual tells you those details), and you either have the metadata or can get it.
Now consider an external source, perhaps statistics on typical employee salaries by various occupational classes over the last five years. Who created the source? What methodology did they follow in collecting the data? Were only certain occupations or certain classes of individuals included? Did the creators summarize the information? Can you identify how the information is organized and if there is any correlation at any level to information that you have? Has the information been edited or modified by anyone else? Is there any way for you to ascertain this information?
Aspects such as establishing the provenance (and possible lineage), the methods used in data capture, and the methods (statistical or otherwise) used in data filtering and aggregation -- all long taken for granted with traditional data sources -- become core parts of Understanding when addressing Big Data.
"Fit for Purpose"
What's the quality of a tweet? Or a sensor stream? Or a log file? Or a string of bits that define an image? Does the presence or absence of specific data matter?
In the world of structured data, we look at a payroll record and say it is complete when the employee ID, payroll date, pay amount, and certain other fields contain values. We say it has integrity when the values in those fields have the right formats and correctly link to data in other tables. We say it has validity when the payroll date is the system date and the pay amount is in an established range. We set these rules when we established what was fit for purpose.
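To make those structured-data rules concrete, here is a minimal sketch of the three checks described above applied to a payroll record. The field names, formats, and pay range are illustrative assumptions, not an actual payroll schema:

```python
from datetime import date

# Illustrative quality rules for a payroll record: completeness,
# integrity, and validity as described above. Field names and the
# pay range are assumptions for the sake of the example.

REQUIRED_FIELDS = ("employee_id", "payroll_date", "pay_amount")

def is_complete(record):
    """Completeness: all required fields contain values."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def has_integrity(record, known_employee_ids):
    """Integrity: the employee ID has the right format and links
    to data in another table (here, a set of known IDs)."""
    emp = str(record.get("employee_id", ""))
    return emp.isdigit() and emp in known_employee_ids

def is_valid(record, system_date, pay_range=(0, 50_000)):
    """Validity: the payroll date is the system date and the pay
    amount falls in an established range."""
    lo, hi = pay_range
    return (record.get("payroll_date") == system_date
            and lo <= record.get("pay_amount", -1) <= hi)

record = {"employee_id": "1042",
          "payroll_date": date(2013, 3, 15),
          "pay_amount": 3200.00}
print(is_complete(record))                      # True
print(has_integrity(record, {"1042", "2071"}))  # True
print(is_valid(record, date(2013, 3, 15)))      # True
```

The point is that each rule is cheap to state precisely, because we own the schema and decided up front what "fit for purpose" meant.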
In the world of Big Data, though, with such a variety and volume of data coming in at high velocity, it's hard to ascertain what information quality means, and many of the traditional information quality measures seem to fall short. Is a tweet complete? Is it correctly formatted? Is it valid? The questions appear nonsensical. So we need to step back and ask "what is fit for our purpose?" And that leads to another question: "what business objective am I trying to address and what value do I expect from that?" If you can answer this second question, you can start building the parameters that establish what is fit for your purpose -- i.e., your Business Requirements.
Intersecting your Understanding of the data with your Business Requirements brings you back to the point where you can establish the Information Quality needed for your Big Data initiative. These may not be the traditional structured data measurements. Completeness may indicate that a tweet contains one or more hashtags that you care about -- other tweets should be filtered out. You may need to look at Continuity as a dimension with sensor readings -- did I receive a continuous stream of information, and if not, is the gap within a tolerable range for the data?
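The two redefined dimensions above can be sketched the same way. This is a hypothetical illustration, assuming purely invented hashtags, timestamps, and a made-up gap tolerance:

```python
# Illustrative "fit for purpose" rules for two Big Data sources:
# a tweet is "complete" if it carries a hashtag we care about, and a
# sensor stream has "continuity" if no gap between readings exceeds
# a tolerable threshold. All names and thresholds are assumptions.

def tweet_is_complete(text, hashtags_of_interest):
    """Completeness redefined: does the tweet mention a relevant hashtag?"""
    tags = {w.lstrip("#").lower() for w in text.split() if w.startswith("#")}
    return bool(tags & hashtags_of_interest)

def stream_is_continuous(timestamps, max_gap_seconds=5.0):
    """Continuity: is every gap between successive readings tolerable?"""
    gaps = (later - earlier for earlier, later in zip(timestamps, timestamps[1:]))
    return all(gap <= max_gap_seconds for gap in gaps)

tweets = ["Kicking off our #BigData governance work", "lunch was great"]
keep = [t for t in tweets if tweet_is_complete(t, {"bigdata"})]
print(keep)  # only the first tweet survives the filter

readings = [0.0, 1.0, 2.1, 3.0, 9.5]   # arrival times in seconds
print(stream_is_continuous(readings))  # False: a 6.5-second gap
```

Neither rule resembles a traditional completeness check, yet both flow directly from a stated business requirement -- which is exactly the point.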
Back to the Basics
These questions are not rocket science. In my mind, they are the basics of data analysis (and data science). Information Quality does not disappear with Big Data -- instead, Big Data requires us to strip away the assumptions of the structured data world view and ask the questions anew.
And as always, the postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.