I'm in the midst of moving to a new laptop. While the new laptop offers the promise of faster performance, more processing power, and more disk space, there's the usual challenge of getting everything configured and all my old files moved. And, unfortunately, I'm also one of those people who has saved a lot of stuff in a lot of files over the years.
As with any move, whether physical or electronic, I'm immediately faced with the question: do I really need to bring this stuff along, or can I finally get rid of it?
On the plus side, I tend to have everything categorized in a couple levels of folders that I can make sense of quickly. On the down side, that level of organization means a lengthy process of reviewing folders and determining what to keep or throw away. For instance, my general knowledge base folder contains 100 separate folders, each typically containing 5-30 files. Some of the detail folders are fairly static at this point, while others, such as my BigData folder, are actively growing.
I've got some basic tools to help me decide what to keep or discard. Generally, I recognize file names and have a rough sense of their content. I can use tools to assess when the files were created, updated, or last accessed. For something like my knowledge base, most likely I'll bring over the whole high-level folder just to make sure I don't lose anything I need. After all, there's still a reasonable size limit, since the overall contents are bounded by the size of my current hard drive.
Jumping into the Big Data Lake
So what does this small example have to do with Big Data? To me, it illustrates one of the key governance challenges facing Big Data. The concept of the Big Data Lake emerged around two years ago (see: Big Data Requires a Big, New Architecture). In general, the Data Lake allows organizations to store "the data in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers."
Tools have been developed to support many ways to jump in and get at all that Big Data. IBM's InfoSphere Data Explorer is one example, intended "to help users of all kinds find and share information more easily and to help organizations launch big data initiatives more quickly" (see: IBM InfoSphere Data Explorer).
Just as I can browse my own local laptop directory, the various Big Data tools allow us to search, find, tag, explore, and provision the data in the Big Data Lake. And we work with this Big Data with the goal of finding those really valuable diamonds -- information that can drive new business insight.
Pulling up Old Shoes
With a lot of people and processes adding data into the Big Data Lake, there's ample opportunity for that lake to turn into a Big Data Swamp or, perhaps worse, a Big Data Landfill! There will be a strong tendency to treat such Big Data Lakes as landing zones in which to put anything of potential use for subsequent analysis -- landing areas perceived to have unlimited storage capacity as well. Instead of working with a small set of known directories as on a laptop, you may be looking at hundreds or thousands of directories with untold numbers of files, often with cryptic names.
Consider what a partial directory listing from a test environment might look like:
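The entries below are invented for illustration -- the names, owners, and sizes are all hypothetical -- but the flavor will be familiar:

```
-rw-r--r--  1 etl     hdfs  8.2G  Mar 14 02:10  part-00000.lzo
-rw-r--r--  1 etl     hdfs  8.1G  Mar 14 02:11  part-00001.lzo
-rw-r--r--  1 jsmith  hdfs  412M  Jan 07 11:32  extract_v2_final.dat
-rw-r--r--  1 jsmith  hdfs  409M  Jan 07 09:15  extract_v2.dat
-rw-r--r--  1 batch   hdfs   67M  Nov 21  2012  tmp_cust_20121121.csv.gz
-rw-r--r--  1 batch   hdfs    3K  Nov 21  2012  run_log_old.txt
```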
In this case, there is very little information to go on unless you open each and every file (and maybe not even then depending on the data format), or hope that the tools you have available can give you more insight.
As the volume and variety of these files increase and their velocity, or frequency of arrival, also expands, users fall back on what they know and have personal confidence in. That at least increases the likelihood of pulling up something of interest rather than the flotsam and jetsam of the Big Data Lake. But it also may diminish the value of the Big Data.
Retain or Remove?
When working with files on my laptop, I have the advantage of knowing when they were created, what they contain, when they were last used, and most importantly how valuable they are to my work. With hundreds or thousands of users, great volumes of files created by users or by automated processes, and likely little understanding of who else is using a given file and why, there's an immediate challenge in managing and governing this Big Data Lake. Add to that the ever-changing nature of an organization, where the users who add and understand content move to new roles or leave altogether, and there's an increasing likelihood that a lot of data will exist that is, in effect, orphaned.
One aspect of Information Governance in the Big Data context is how we manage the lifecycle of this data. These are fundamentally policy questions supported by people and process, with tools as facilitators, not dictators. Questions to address for this Big Data Lake include the following (a small sketch of how such policies might be captured appears after the list):
How long will the organization retain this data?
- If the data is used in making certain kinds of business decisions, are there policies that dictate this retention period?
- If part of the value in Big Data is finding unexpected trends over time, is there value in retaining some of this data to increase the likelihood of finding those trends?
- Are there ways to readily categorize the data between what only has immediate, time-sensitive value and what has longer-term value?
Will some of this data be moved to historical or archived locations?
- If so, will there be any different approach to finding, accessing, and utilizing this data?
How will the data be disposed of?
- If the data contains content of a particularly sensitive nature, are there policies that dictate the disposal practice?
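None of these questions is answered by technology alone, but even a rough sketch shows how a retention policy can be captured as data and applied mechanically. The sketch below is purely illustrative and assumes a simple file-system view of the lake: the root path, the category names, and the retention periods are all made up, and last-modified time stands in for "last useful."

```python
# Illustrative only: encode a hypothetical retention policy as data and apply
# it to a file-system view of a data lake. Real policies would come from the
# Information Governance program, and a real lake would more likely be
# inspected through its catalog or storage APIs than with os.walk.
import os
import time

# Hypothetical policy: top-level folder name -> days to retain before review.
RETENTION_DAYS = {
    "landing": 90,        # immediate, time-sensitive value only
    "analytics": 730,     # kept longer to look for unexpected trends over time
    "sensitive": 365,     # disposal governed by separate, stricter policy
}
DEFAULT_RETENTION_DAYS = 180  # uncategorized data gets a default period

def classify(root):
    """Yield (path, age_in_days, suggested_action) for every file under root."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        # Treat the first folder under the root as the policy category.
        rel = os.path.relpath(dirpath, root)
        category = rel.split(os.sep)[0] if rel != "." else ""
        limit = RETENTION_DAYS.get(category, DEFAULT_RETENTION_DAYS)
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = (now - os.path.getmtime(path)) / 86400
            action = "review for archive/disposal" if age_days > limit else "retain"
            yield path, int(age_days), action

if __name__ == "__main__":
    for path, age_days, action in classify("/data/lake"):  # hypothetical root
        print(f"{age_days:>5}d  {action:<27}  {path}")
```

The code itself isn't the point; the point is that the retention periods live in one visible place where the governance program -- not individual users guessing at what's safe to delete -- decides what they should be.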
All of these questions raise considerations for an organization as part of its Information Governance program. Given my own current migration process, I'm curious whether your organization is addressing these aspects of Information Lifecycle Management in its Big Data context.
As always, the postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.