Welcome to my Blog: Big Data Governance Meets Reality
Here, you'll find a wide variety of opinions, information, and resources devoted to Big Data Information Governance (BDIG). While many think that BDIG is just an extension of Information Governance, there are some important differences:
1. Size and scale: lots of data means lots of processing. It's not just about increased volume, variety, and velocity, as many industry pundits have stated. Traditional architectures were not designed to govern data on Hadoop file systems, nor to work within a MapReduce processing framework. Products that didn't scale well in their original environments face a major challenge processing this much data at the speed required, especially when much of the need is real time, such as data streaming from devices. Solutions need to be either adapted or re-invented to accommodate massive scale.
2. Different architecture, few 'real' products- See #1 above. Just putting 'Big Data' in front of a product name does not mean it was designed for this purpose. Caveat emptor!
3. Processing methods- force 'classification in reverse'. In other words, we may not even know that data is sensitive until it is actually processed, so classification is only possible in real time or after the fact. Again, new solutions and new methods are required for governing privacy. How will metadata repositories evolve?
4. Data 'surprises' from re-identification- when public data is combined with corporate data, the result is increased legal risk and exposure, as evidenced by recent legal actions against companies like Spokeo and Skout.
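To make the re-identification risk in #4 concrete, here is a minimal Python sketch. All of the data, field names, and the `reidentify` helper are hypothetical; the point is simply that an "anonymized" record set that drops names can still be matched back to individuals when quasi-identifiers (here, ZIP code and birth year) line up with a public directory.

```python
# Hypothetical public directory, e.g. scraped from an information consolidator.
public_directory = [
    {"name": "Pat Smith", "zip": "33101", "birth_year": 1954},
    {"name": "Lee Jones", "zip": "60601", "birth_year": 1980},
]

# "Anonymized" corporate records: names removed, quasi-identifiers remain.
corporate_records = [
    {"customer_id": "C-001", "zip": "33101", "birth_year": 1954, "balance": 12500},
]

def reidentify(record, directory):
    """Return public entries whose quasi-identifiers match the record."""
    return [p for p in directory
            if p["zip"] == record["zip"]
            and p["birth_year"] == record["birth_year"]]

for rec in corporate_records:
    matches = reidentify(rec, public_directory)
    if matches:
        print(rec["customer_id"], "may be", [m["name"] for m in matches])
```

Real-world re-identification uses richer quasi-identifiers and fuzzier matching, but even this toy join shows how quickly a "surprise" can surface once public and corporate data meet.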
Some interesting links follow:
Big Data Governance: A Framework to Assess Maturity: http://ibmdatamag.com/2012/04/big-data-governance-a-framework-to-assess-maturity/ . This article discusses how the original Information Governance Framework can be applied to Big Data. In it, the authors suggest a list of questions and considerations for those going down the Big Data path.
Do you trust your Big Data? How do you assess its accuracy, especially if you are using publicly available data in conjunction with your organization's data to drive critical business decisions?
There is a web site that shall remain nameless; it advertises itself as 'Not your grandma's phone book'. Just as well that it stays nameless, since it's a great example of misuse of Big Data and of public information in general. Said web site consolidates public information on most everyone and then posts it without regard to its accuracy. It includes information such as your name, address, phone number, real estate value, age, marital status, and much more. As an example, it lists my dad as being in his 90s and living in Florida when in fact, he died back in 1979. While my mom does now live in Florida, he never did. Meanwhile, she has since remarried and was subsequently widowed. Relying on an information consolidator such as this to augment customer information could indeed be misleading and costly.
Jeff Jonas has some interesting ideas around Privacy By Design. http://www.e-comlaw.com/data-protection-law-and-policy/hottopics_template.asp?id=Jonas
Metadata, Classification and Discovery:
How do you automate classification, especially for machine-generated data where volumes are huge and tried-and-true methods may not scale? Take a look at the work we're doing with the infogov community and join in... http://www.infogovcommunity.com . High-level summary: define the high-level ontology, then use crawlers/automation to classify and tag the masses.
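The ontology-then-crawl approach can be sketched in a few lines of Python. This is a toy illustration, not the community's actual method: the `ONTOLOGY` classes and regex patterns are my own made-up examples, standing in for a real, much richer ontology and detection logic.

```python
import re

# Tiny stand-in ontology: a sensitive-data class mapped to a detection pattern.
ONTOLOGY = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def classify(text):
    """Return the set of ontology tags whose patterns match the text."""
    return {tag for tag, pattern in ONTOLOGY.items() if pattern.search(text)}

# A "crawler" would apply classify() to masses of records and store the tags.
record = "Contact: jane@example.com, SSN 123-45-6789"
print(sorted(classify(record)))
```

At scale, this kind of tagging would run inside the processing framework itself (e.g. as a map step), which is exactly why classification ends up happening in real time or after the fact rather than up front.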
Do you have to cleanse everything? Well... um... no! Read here and give your opinion. I've shared mine....