Data managers have created immense value from structured information. Now the challenge is to pull in data from the unstructured world and mix it with internal data stores to gain fresh perspectives. There is no shortage of unstructured data to analyze: David Boloker, chief technology officer for emerging technologies at IBM, estimates that of the 15 petabytes of data created around the world each day, about 80 percent comes from unstructured sources.
"The most daunting part of the challenge isn't collecting unstructured data, it's getting value from it," says Boloker. "Take the example of a pharmaceutical company that has a drug in clinical trials. Much of clinical data is unstructured, handwritten in patient records and then digitized. If there was a way to quickly and easily refine that data into a more structured form, the company might confirm benefits of the drug much earlier in the process or spot subtle problems that might otherwise be missed."
The British Library was grappling with just such a challenge. Tasked with archiving information from across the published spectrum, the staff needed a way to turn massive amounts of data from Web sites and other unstructured sources into a viable resource. Working with IBM, the library successfully implemented a prototype analytics technology called IBM BigSheets.
With IBM BigSheets software, users can access vast archives of data, submit queries to research the information, analyze it in a spreadsheet-like format, and explore it in other familiar visual contexts. For example, users can see search results in a pie chart and look at the data in a tag cloud. "As a data manager, my question is, 'How can I make all the unstructured information coming at me useful to my organization?' Now I have an answer," says Boloker.
Under the hood, BigSheets is built on the Apache Hadoop open-source framework for parallel processing of large data sets on compute clusters, and it uses the Hadoop Distributed File System (HDFS) for high-throughput access to application data. The BigSheets software collects information from a variety of source applications, extracts the data, annotates it with tags, and enriches it for display.
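The collect, extract, annotate, and enrich steps described above can be pictured with a small single-machine sketch. This is not BigSheets code or its API: the sample pages, the stopword list, and the function names `extract`, `annotate`, `enrich`, and `tag_counts` are all assumptions made for illustration, and a real deployment would spread this work across a Hadoop cluster rather than loop over a list in memory.

```python
import re
from collections import Counter

# Hypothetical sample documents standing in for collected web pages.
PAGES = [
    {"url": "http://example.org/a", "html": "<p>The library archives web pages.</p>"},
    {"url": "http://example.org/b", "html": "<p>Archives of web data help research.</p>"},
]

# Hypothetical stopword list; a real annotator would be far richer.
STOPWORDS = {"the", "of", "a", "and", "to"}


def extract(page):
    """Strip markup from a page and return its plain text."""
    return re.sub(r"<[^>]+>", " ", page["html"]).strip()


def annotate(text):
    """Tag the text with lowercase keywords, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOPWORDS]


def enrich(pages):
    """Build spreadsheet-like rows: one row per page, with its tags."""
    return [{"url": page["url"], "tags": annotate(extract(page))} for page in pages]


def tag_counts(rows):
    """Count tags across all rows, the raw material for a tag cloud."""
    counts = Counter()
    for row in rows:
        counts.update(row["tags"])
    return counts


rows = enrich(PAGES)
counts = tag_counts(rows)
```

Each row plays the role of a spreadsheet line, and the tag counts are the kind of summary that could feed a tag cloud or pie chart.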
BigSheets is already enabling the British Library to extract big value from unstructured data, but Boloker expects the technology to have an even larger impact in science, academia, and the private sector. "A business could do things like matching unstructured data from a given zip code to internal sales data and see what's causing an up or down trend," he explains. "We're now able to use information that was lost in the unstructured world and compare or contrast it with what we already have. It's really a new day for data managers and their clients."
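Boloker's zip code example amounts to a join between tagged external text and internal sales figures. The sketch below shows one way that comparison might look, assuming the unstructured items have already been tagged with a zip code and a simple sentiment score. The records, the field names, and the `join_by_zip` function are hypothetical and are not drawn from BigSheets or from any real data set.

```python
from collections import defaultdict

# Hypothetical mentions already extracted from unstructured sources.
# The sentiment scores (+1 / -1) are illustrative placeholders.
MENTIONS = [
    {"zip": "10001", "text": "New store opening downtown", "sentiment": 1},
    {"zip": "10001", "text": "Great service last weekend", "sentiment": 1},
    {"zip": "94105", "text": "Road closure blocks access to shops", "sentiment": -1},
]

# Hypothetical internal sales totals for two consecutive periods.
SALES = {
    "10001": {"prev": 100, "curr": 120},
    "94105": {"prev": 200, "curr": 150},
}


def join_by_zip(mentions, sales):
    """Pair each zip code's sales trend with the mentions found for it."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["zip"]].append(mention)

    report = {}
    for zip_code, figures in sales.items():
        matched = grouped.get(zip_code, [])
        report[zip_code] = {
            # Fractional change in sales between the two periods.
            "sales_change": (figures["curr"] - figures["prev"]) / figures["prev"],
            "mention_count": len(matched),
            "net_sentiment": sum(m["sentiment"] for m in matched),
            "texts": [m["text"] for m in matched],
        }
    return report


report = join_by_zip(MENTIONS, SALES)
```

An analyst could then scan the report for zip codes where a sales drop coincides with negative mentions and read the matched texts for possible causes. A co-occurrence like this is a starting point for investigation, not proof of what caused the trend.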