Smarter is...: Adding structure to an unstructured world

IBM BigSheets helps organizations extract big value from unstructured data

To turn its huge volume of unstructured data (spanning decades of Web pages, online documents, video, images and other sources) into a usable, accessible resource that could be easily analyzed and queried, the British Library used IBM BigSheets analytics software to extract data from source applications, add tags, and display it in an easily consumed format.

This article was originally published in IBM Data magazine

Chris Young (chris.young@tdagroup.com), Contributing Writer, IBM Data Management magazine

Chris Young is a technology writer based in the Pacific Northwest.



30 April 2010

Data managers have created immense value from structured information. Now the challenge is to pull in data from the unstructured world—and mix it with internal data stores to gain fresh perspectives. And there's no problem finding unstructured data to analyze: David Boloker, chief technology officer for emerging technologies at IBM, estimates that of the 15 petabytes of data created around the world each day, about 80 percent comes from unstructured sources.

"The most daunting part of the challenge isn't collecting unstructured data, it's getting value from it," says Boloker. "Take the example of a pharmaceutical company that has a drug in clinical trials. Much of clinical data is unstructured, handwritten in patient records and then digitized. If there was a way to quickly and easily refine that data into a more structured form, the company might confirm benefits of the drug much earlier in the process or spot subtle problems that might otherwise be missed."

The British Library was grappling with just such a challenge. Tasked with archiving information from across the published spectrum, the staff needed a way to turn massive amounts of data from Web sites and other unstructured sources into a viable resource. Working with IBM, the library successfully implemented a prototype analytics technology called IBM BigSheets.

BigSheets screen shot

With IBM BigSheets software, users are able to access vast archives of data, submit queries to easily research the information, analyze it in a format that is organized like a spreadsheet, and explore it in other familiar visual contexts. For example, users can see search results in a pie chart and look at the data in a tag cloud. "As a data manager, my question is, 'How can I make all the unstructured information coming at me useful to my organization?' Now I have an answer," says Boloker.

Under the hood, BigSheets is built on the Apache Hadoop open-source framework for parallel processing of large data sets on compute clusters, and it uses the Hadoop Distributed File System (HDFS) for high-throughput access to application data. The BigSheets software collects information from a variety of source applications, extracts the data, annotates it with tags, and enriches it for display.
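The collect, extract, annotate, and display steps can be illustrated with a minimal Python sketch. This is not BigSheets code and does not use the Hadoop API; the `Record` class, the `TAG_KEYWORDS` vocabulary, the function names, and the sample pages are all hypothetical, chosen only to show the shape of such a pipeline on a single machine.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical tag vocabulary: each tag is triggered by any of its keywords.
TAG_KEYWORDS = {
    "health": {"drug", "clinical", "patient"},
    "retail": {"sales", "store", "price"},
}


@dataclass
class Record:
    source: str                      # where the document came from
    text: str                        # plain text extracted from the raw page
    tags: set = field(default_factory=set)


def extract(source, raw_html):
    """Strip markup and collapse whitespace to get plain text."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return Record(source=source, text=" ".join(text.split()))


def annotate(record):
    """Attach every tag whose keywords appear in the record's text."""
    words = set(re.findall(r"[a-z]+", record.text.lower()))
    record.tags = {tag for tag, kws in TAG_KEYWORDS.items() if words & kws}
    return record


def tag_counts(records):
    """Count how many records carry each tag (input for a chart or tag cloud)."""
    return Counter(tag for r in records for tag in r.tags)


# "Collected" pages, stubbed in memory for illustration.
pages = {
    "http://example.org/a": "<p>Clinical trial of a new <b>drug</b></p>",
    "http://example.org/b": "<p>Store sales rose; patient numbers fell</p>",
}

records = [annotate(extract(src, html)) for src, html in pages.items()]
counts = tag_counts(records)
for tag, n in counts.most_common():
    print(tag, n)
```

In a Hadoop setting, the per-document `extract` and `annotate` steps would correspond roughly to a map phase and `tag_counts` to a reduce phase, which is what lets the work spread across a cluster.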

BigSheets is already enabling the British Library to extract big value from unstructured data, but Boloker expects the technology to have an even larger impact in science, academia, and the private sector. "A business could do things like matching unstructured data from a given zip code to internal sales data and see what's causing an up or down trend," he explains. "We're now able to use information that was lost in the unstructured world and compare or contrast it with what we already have. It's really a new day for data managers and their clients."
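The zip code scenario Boloker describes amounts to a join between tagged unstructured data and an internal table. The following Python sketch shows one way that match might look; the `mentions` and `sales_change` data, the tag names, and both function names are invented for illustration and do not come from BigSheets or from any real data set.

```python
from collections import defaultdict

# Hypothetical tagged mentions already extracted from unstructured sources.
mentions = [
    {"zip": "98101", "tag": "store closing"},
    {"zip": "98101", "tag": "store closing"},
    {"zip": "97201", "tag": "new competitor"},
]

# Hypothetical internal sales data: zip code -> percent change in sales.
sales_change = {"98101": -12.0, "97201": 3.5, "10001": 0.8}


def tags_by_zip(mentions):
    """Group mentions by zip code and count occurrences of each tag."""
    grouped = defaultdict(lambda: defaultdict(int))
    for m in mentions:
        grouped[m["zip"]][m["tag"]] += 1
    return grouped


def join_sales_with_tags(sales_change, mentions):
    """Pair each zip code's sales trend with its most frequent tag, if any."""
    grouped = tags_by_zip(mentions)
    rows = []
    for zip_code, change in sorted(sales_change.items()):
        tags = grouped.get(zip_code, {})
        top_tag = max(tags, key=tags.get) if tags else None
        rows.append((zip_code, change, top_tag))
    return rows


rows = join_sales_with_tags(sales_change, mentions)
for zip_code, change, top_tag in rows:
    print(zip_code, change, top_tag)
```

A zip code with no matching mentions keeps its sales figure and gets `None` for the tag, so gaps in the unstructured data stay visible instead of silently dropping rows.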
