One of my colleagues sent me a link to the Bad Data Handbook by Q. Ethan McCallum. I will state clearly upfront that I have NOT read the book. However, given my long history with Information Quality products and solutions, I certainly found the title intriguing and, of course, provocative!
I immediately had images of saying: "Bad datum! Bad datum! Off to your room this instant and don't come out until I call you for dinner!"
But is there really such a thing as "Bad Data"?
The summary of the book notes: "Bad data is data that gets in the way…." My first response is: "Gets in the way of what?" Reflecting back on my last post and the notions of "Know your data" and "Fit for purpose", the idea of good data or bad data really comes back to the context in which the data is placed and used.
Consider the following piece of data:
Is it good? Is it bad? Do I or can I even have an opinion on it? Not without establishing some context, some criteria of fitness, and an ability to assess or understand it against the context and criteria.
Putting data in context
If I told you this was all or part of a tweet, I've given you some additional understanding about the data, but not provided any additional context or criteria to say whether it is good, bad, or otherwise. I add a context: I'm collecting tweets to ascertain when my customers are most or least likely to shop for certain goods so I can improve my marketing campaign. Well, with that I can look at the data and say it has a name and a day of the month. Good so far? I still can't say, as there are no criteria to judge it on.
I add some criteria: the data must contain names of customers and a positive or negative sentiment about the day with regard to shopping. Let's assume that the name does match a name in our customer master data system. But there's no statement of sentiment, just the day of the month. At that point I can say that the data does not meet my criteria for my context -- it is not "Fit for purpose", and I can conclude the data is "bad" in that context.
Now I change the context: I'm following tweets by my friends indicating their available days to see a movie. My criteria change along with the context: the name matches the name of a friend and the date given appears sufficient for my context. With this change in context and criteria, I conclude the data is "good".
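To make the point concrete, here is a minimal sketch of judging one record against the two contexts above. Everything in it is my own illustration: the `Record` fields, the name "John", the day 23, and the `CUSTOMERS` and `FRIENDS` sets are hypothetical stand-ins, not the actual data or any real system. The only idea it is meant to show is that the verdict comes from the criteria, not from the data alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical shape: a name, a day of the month, and an optional sentiment.
    name: str
    day_of_month: int
    sentiment: Optional[str] = None  # "positive", "negative", or None if absent


Criterion = Callable[[Record], bool]


def fit_for_purpose(record: Record, criteria: Dict[str, Criterion]) -> Optional[List[str]]:
    """Return the labels of failed criteria, or None when there is nothing to judge against."""
    if not criteria:
        return None  # no criteria: neither "good" nor "bad" can be concluded
    return [label for label, check in criteria.items() if not check(record)]


# Hypothetical reference data for the two contexts.
CUSTOMERS = {"John"}
FRIENDS = {"John"}

marketing_criteria: Dict[str, Criterion] = {
    "name is a known customer": lambda r: r.name in CUSTOMERS,
    "states a shopping sentiment": lambda r: r.sentiment in ("positive", "negative"),
}
movie_criteria: Dict[str, Criterion] = {
    "name is a friend": lambda r: r.name in FRIENDS,
    "gives a valid day of the month": lambda r: 1 <= r.day_of_month <= 31,
}

record = Record(name="John", day_of_month=23)  # invented example values

no_context = fit_for_purpose(record, {})
marketing_failures = fit_for_purpose(record, marketing_criteria)
movie_failures = fit_for_purpose(record, movie_criteria)

print(no_context)          # None
print(marketing_failures)  # ['states a shopping sentiment']
print(movie_failures)      # []
```

The same record comes back undecidable with no criteria, "bad" for the marketing context (it fails the sentiment check), and "good" for the movie context (no failures).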
And now, for something completely different, it's...
Going back to the data, if it turns out that this data is not part of a tweet but the contents of a file called JOHNSPASSWORDS.txt, then I'm likely applying a totally different context and my understanding of it changes completely. If I'm a security specialist for an organization tracking unencrypted passwords, then this data may meet those criteria and fall into the realm of "bad data". If I'm a hacker looking to find access into an organization's systems, then this may in fact be very "good" data!
Once you've established the context and criteria, and provided some understanding of the data against those, then you can start making statements about value, cost, risk, or compliance -- the measures that indicate the degree to which the data supports or hinders those targets.
The Big Bad Data, or the Bad Big Data?
Particularly as we move into the realm of Big Data with more data volume, more data variety, higher velocity or influx of data, and more questions about the veracity of the data (or even parts of it), I think the need for establishing the right context, criteria, and understanding becomes imperative.
An ongoing series of daily tweets for analyzing immediate social trends may prove to meet my needs or it may not (and it may take some time to ascertain the value). But those tweets may have a limited shelf-life. If I'm still storing them a year from now, have I turned them from a value-added asset into bad data that is now just a cost? Probably, since the criterion of immediate trending has passed. Each Big Data case, though, is likely distinct -- working through what is and what may become "bad" data is going to be an ongoing Information Quality challenge.
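The shelf-life idea can be sketched the same way: age becomes one more criterion tied to the purpose. The 30-day window and the dates below are assumptions I picked purely for illustration; the right window depends on the case at hand.

```python
from datetime import datetime, timedelta, timezone

# Assumed shelf life for "immediate trend" analysis; not a recommended value.
TREND_SHELF_LIFE = timedelta(days=30)


def is_past_shelf_life(collected_at: datetime, now: datetime,
                       shelf_life: timedelta = TREND_SHELF_LIFE) -> bool:
    """True once the data is older than the window in which it serves its purpose."""
    return now - collected_at > shelf_life


collected = datetime(2013, 1, 15, tzinfo=timezone.utc)  # hypothetical collection date

week_later = is_past_shelf_life(collected, now=collected + timedelta(days=7))
year_later = is_past_shelf_life(collected, now=collected + timedelta(days=365))

print(week_later, year_later)  # False True
```

A tweet that passes this check a week after collection fails it a year later, even though not a single character of the data has changed.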
I'm curious to see what the author discusses in this work and how it fits into the broader contexts of Information Governance and Big Data (Big Bad Data!). Have you read the book? If so, what are your thoughts on "Bad Data" and emerging challenges?
As always, the postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.