Jeff Jonas on analytics
What is big data, and what makes it different from regular data? In an interview with the Data Protection & Law Policy newsletter, Jeff Jonas answers those questions and more. Jeff is an IBM Fellow and Chief Scientist of the IBM Entity Analytics Group. His full bio can be found on his blog.
Data protection challenge of the future: what is big data?
The three Vs — volume, velocity and variety — are the essential characteristics of big data. While data protection and privacy laws are still busy catching up with the technologies of yesterday, big data is growing at lightning speed every day. How can companies deal with the data-protection challenges brought about by big data, in order to truly benefit from the opportunities it introduces? First, one must truly grasp what big data is.
When did data become big?
Big data did not become big overnight. What I think happened is that data started getting generated faster than organizations could get their hands around it. Then one day you simply wake up and feel like you are drowning in data. On that day, data felt big.
Can you explain and elaborate on the characteristics of big data?
"Big data" means different things to different people.
Personally, my favorite definition is: "something magical happens when very large corpuses of data come together." Some examples of this can be seen at Google, for example Google Flu Trends and Google Translate. In my own work, I first witnessed this in 2006, when a particular system started producing higher-quality predictions, faster, as it ingested more data. This is so counterintuitive.
The easiest way to explain this, though, is to consider the familiar process of putting a puzzle together at home. Why is it, do you think, that the last few pieces are as easy as the first few — even though you have more data in front of you than ever before? The same thing, really, is happening in my systems these days. It’s rather exciting, to tell you the truth.
To elaborate briefly on the new physics of big data: I pinpointed the three phenomena of big data physics in my blog entry "Big Data. New Physics.", drawing on 14 years of personal experience designing and deploying a number of multi-billion-row, context-accumulating systems:
1. Better prediction. Simultaneously lower false positives and lower false negatives.
2. Bad data good. More specifically, natural variability in data including spelling errors, transposition errors, and even professionally fabricated lies — all helpful.
3. More data faster. Less compute effort as the database gets bigger.
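The first two phenomena can be illustrated with a toy sketch. This is a hypothetical illustration, not Jonas's actual system: an entity's "context" accumulates attributes from every record seen so far, so later records — even ones with spelling variants, the "bad data" — overlap with more accumulated context and match with higher confidence.

```python
# Toy sketch (hypothetical, not any real entity-analytics product) of
# context accumulation: each record about an entity adds attributes to
# its context, so later records match against a richer picture.

def match_score(record, context):
    """Fraction of the record's attributes already present in the context."""
    if not record:
        return 0.0
    return len(record & context) / len(record)

context = set()   # everything learned about this entity so far
records = [
    {"name:jon smith", "city:las vegas"},
    {"name:john smith", "city:las vegas", "phone:555-0100"},  # spelling variant
    {"phone:555-0100", "email:jsmith@example.com"},
]

scores = []
for rec in records:
    scores.append(match_score(rec, context))
    context |= rec   # fold the record's attributes into the entity's context

print(scores)  # each record overlaps more with the accumulated context
```

Note how the misspelled second record still contributes: its variant spelling becomes part of the context, which is exactly why natural variability in data can help rather than hurt.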
Another definition of big data is related to the ability of organizations to harness data sets previously believed to be "too large to handle." Historically, big data meant too many rows, too much storage and too much cost for organizations that lacked the tools and ability to really handle data of such quantity. Today, we are seeing ways to explore and iterate cheaply over big data.
When did data become big for you? What is your "big data" processing experience?
As previously mentioned, for me, big data is about the magical things that happen when a critical mass is reached. To be honest, big data does not feel big to me unless it is hard to process and make sense of. A few billion rows here and a few billion rows there — such volumes once seemed like a lot of data to me. Then helping organizations think about dealing with volumes of 100 million or more records a day seemed like a lot. Today, when I think about the volumes at Google and Facebook, I think: "Now that really is big data!"
My personal interest and primary focus in big data these days is how to make sense of data in real time — that is, fast enough to do something about a transaction while the transaction is still happening. While you swipe that credit card, there are only a few seconds to decide whether that is really you or someone pretending to be you. If an unauthorized user is inside your network and data starts getting pumped out, an organization needs sub-second "sense and respond" capabilities. End-of-day batch processes producing great answers are simply too late!
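The credit-card example can be sketched as a streaming check that runs while the swipe is in flight. This is a minimal illustration with made-up thresholds, not any vendor's fraud engine: it applies a simple "impossible travel" heuristic — the same card appearing in two different cities faster than anyone could travel between them.

```python
# Hedged sketch of real-time "sense and respond" (hypothetical heuristic
# and thresholds): decide on a swipe while the transaction is in flight,
# instead of waiting for an end-of-day batch.

last_seen = {}  # card_id -> (timestamp_seconds, city)

def check_swipe(card_id, ts, city, min_travel_secs=3600):
    """Return 'review' if the same card shows up in a different city
    implausibly fast; otherwise 'approve'."""
    decision = "approve"
    if card_id in last_seen:
        prev_ts, prev_city = last_seen[card_id]
        if city != prev_city and ts - prev_ts < min_travel_secs:
            decision = "review"   # impossible-travel: flag before approving
    last_seen[card_id] = (ts, city)
    return decision

print(check_swipe("card-1", 0, "Las Vegas"))    # first swipe: approve
print(check_swipe("card-1", 600, "New York"))   # 10 minutes later: review
```

The point of the sketch is the latency model, not the heuristic: the decision is made inside the transaction path, in memory, rather than discovered hours later in a batch report.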
What technologies are in use to process big data?
The availability of big data technologies seems to be growing by leaps and bounds and on many fronts. We are seeing large corporate investments resulting in commercial products — at IBM, two examples would be IBM InfoSphere Streams for big data in motion and IBM InfoSphere BigInsights for pattern discovery over data at rest. There are also many big data open source efforts under way: for example Hadoop, Cassandra and Lucene. If one were to divide these into types, one would find some well suited for streaming analytics and others for batch analytics. Some help organizations harness structured data while others are ideal for unstructured data. One thing is for sure: there are many options, and there will be many more choices to come as big data continues to attract investment.
How can companies benefit from the use of big data?
I’d like to think consumers benefit too, just to be clear. To illustrate my point, I find it very helpful when Google responds to my search with "did you mean ______?" To pull off this very smart stunt, Google must remember the typographical errors of the world, and that, I do believe, would qualify as big data. Moreover, I think that health care is benefiting from big data…or let’s hope so. Organizations such as financial institutions and insurance companies are also benefiting from big data, using these insights to run more efficient operations and mitigate risks.
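The "did you mean" idea can be sketched in a few lines. This is a toy, assuming a tiny query log stands in for Google's remembered typos; real systems are far more sophisticated. The core move is the same, though: a rare spelling is mapped to a much more popular spelling one edit away.

```python
# Toy "did you mean" sketch (hypothetical query log, not Google's method):
# remembered queries let us redirect a rare spelling to a popular neighbor.

from collections import Counter

query_log = ["analytics"] * 50 + ["analytic"] * 3 + ["puzzle"] * 20
counts = Counter(query_log)

def edits1(word):
    """All strings one edit (delete/replace/insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + replaces + inserts)

def did_you_mean(query):
    """Suggest the most frequent known query within one edit, if any."""
    candidates = [w for w in edits1(query) if counts[w] > counts[query]]
    return max(candidates, key=counts.__getitem__) if candidates else None

print(did_you_mean("analitics"))  # -> analytics
```

A common spelling returns no suggestion, since nothing nearby is more frequent — which is exactly why the feature only works once the corpus of remembered queries is large.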
We, you and I, are responsible in part for generating so much big data. These social media platforms we use to speak our mind and stay connected are responsible for massive volumes of data. Companies know this and are paying attention. For example, my friend’s wife complained on Twitter about a specific company’s service. Not long thereafter they reached out to her because they too were listening. They fixed the problem and she was as happy as ever. How did the company benefit? They kept a customer.
What is the trend of processing big data?
I think a lot of big data systems are running as periodic batch processes, for example, once a week or once a month. My suspicion is that as these systems begin to generate more and more relevant insight, it will not be long before the users say: "Why did I have to wait until the end of the week to learn that? They already left the web site" or "I already denied their loan when it is now clear I should have granted them that loan."
What are the complications dealing with the privacy implications brought about by big data compared to average-sized data?
Lots of privacy complications come along with big data. Consumers, for example, often want to know what data an organization collects and the purpose of the collection. Something that further complicates this: I think many consumers would be surprised to know what is computationally possible with big data. For example, where you are going to be next Thursday at 5:35 p.m., or [who are] your three best friends and which two of them are not on Facebook. Big data is making it harder to have secrets. To illustrate using lines from my blog entry "Using Transparency As A Mask":
"Unlike two decades ago, humans are now creating huge volumes of extraordinarily useful data as they self-annotate their relationships and yours, their photographs and yours, their thoughts and their thoughts about you…and more. With more data comes better understanding and prediction. The convergence of data might reveal your 'discreet' rendezvous or the fact that you are no longer on speaking terms with your best friend. No longer secret is your visit to the porn store and the subsequent change in your home’s late-night energy profile, another telling story about who you are…again out of the bag, and little you can do about it. Pity…you thought that all of this information was secret."
What are the privacy concerns and threats big data might bring about — to companies and to individuals whose data are contained in big data?
My number one recommendation to organizations is "avoid consumer surprise."
How are companies applying privacy protection principles before and after big data has been processed?
I think that many best practices are being adopted. One of my favorites involves letting consumers opt in instead of opting them in automatically and then requiring them to opt out. One new thing I would like to see become a best practice is a place on the web site — for example, my bank’s — where I can see a list of the third parties with whom my bank has shared my data. I think this transparency would be good and certainly would make consumers more aware.
What is "big data," according to Jeff Jonas?
Big data is a pile of data so big — and harnessed so well — that it becomes possible to make substantially better predictions: for example, which web page would be the absolute best to place first in your search results, just for you.