BigSheets for the common man

Convert data to visualizations that can be used for future queries and additional processing

The first time you play with BigSheets technology in InfoSphere® BigInsights™, you wonder where the tool has been all this time. BigSheets takes your big data and makes it easy to browse, read, and identify. It also provides a basis on which to run queries and identify patterns, before you write the more complex queries and tools that dig into your data. This article shows how to turn analyzed information into visualized information that can be used to build and develop future queries and address processing needs.

Martin C. Brown, Director of Documentation

Martin BrownA professional writer for over 15 years, Martin (MC) Brown is the author and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS and more. He currently works as the director of documentation for Continuent.



24 December 2013

Also available in Chinese Russian

Get the data into BigSheets

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

BigSheets, the web-based spreadsheet-like interface that enables users to easily analyze large volumes of data, is part of InfoSphere BigInsights. It includes built-in functions to extract names, addresses, organizations, email, locations, and phone numbers. BigSheets has a good deal in common with a typical spreadsheet application, such as Microsoft® Excel®. The benefit is the ease with which you can open, browse, and ultimately create graphical views in the form of graphs and tag clouds on your data. As an entirely web-based interface to the data viewing and analysis process, it is very easy to use, but not without its complexities.

Getting the data into the BigSheets environment is a required step in the production process. Unlike more traditional Hadoop systems in which the data can come after the initial setup, with BigSheets, you open the pre-loaded data and use that to process the actual content.

BigSheets either takes the data you provide and builds a visualized version of that information in the form of a graph, or it processes raw information to provide a summarized view of the data. This enables BigSheets to support some basic processing alongside its core visualization role.

Getting the raw data files into BigSheets is much the same as any other InfoSphere BigInsights process. You can upload the file through the Files tab, process data through one of the built-in applications, or run a MapReduce process on other data loaded into the system.

Figure 1. Uploading a file through the file tab
Image shows files tab with accesslog directory highlighted

For example, using the file system, you can upload the data by selecting a local file and sending it through your browser interface. This is convenient, but can occasionally lead to problems during the loading process that cause the files to be truncated.

Once the raw data has been uploaded or generated from the previous processing, you need to open the file within BigSheets, then determine how the content and data should be extracted.

Projects within BigSheets are organized by workbooks, and a workbook may define a collection of different analysis, relationship, and visualizations of the data to achieve its goals.


Choose a visualization technique that suits the data format

To create a new BigSheets workbook and identify the data you want to build, you need to consider the data and the visualization techniques available.

Open the InfoSphere BigInsights application through your browser (hostname:8080), click the BigSheets tab, then click New Workbook. You will be presented with a browser dialog to select the location of the source data files. For example, select the chicago.csv data in the Chicago directory, as shown in Figure 2. By default, BigSheets reads the data as single lines of information.

Figure 2. Selecting the location of the source data files
New workbook window with chicago.csv selected

Data is likely to be in a number of formats. The Chicago data is in the comma-separated value (CSV) format, so clicking the Pencil icon enables us to select the data file type. The types supported include the following (also shown in Figure 3):

Figure 3. Supported types
List of supported file types starting with basic crawler data
  • Basic crawler data— Information extracted from web pages using the BigInsights web crawler.
  • Character delimited data— Tab, tilde, or other field-based information separated by characters.
  • Character-delimited data with text qualifier— Delimited data where the fields are enclosed by quotation marks or other characters to contain the data.
  • Comma-separated value (CSV) data— Standard CSV format, including qualified fields (quotation marks or escape characters), with or without header rows.
  • Hive read— Data read from an existing Hive data table.
  • JSON array— An array of JSON data.
  • JSON object read— Multiple lines of JSON object data.
  • Line reader— Each line is taken as an entry of discrete data. This format is useful when a separate job has output a list of unique words.
  • Sheets data— Data generated by a previous sheet's processing job.
  • Tab-separated value (TSV) data— Field-based information separated by tabs, with or without a header row.

The different data types supported affect the different ways in which the information can be visualized or analyzed.

To process the CSV data, select the CSV data type and disable the header row since it is not included in the data file. The structure and format of the data is displayed to confirm that you have picked the correct parsing method.

Once the data is in the correct format, click the check box at the bottom of the import page. BigSheets then parses and displays the data file.

After you have identified the basic structure of the data so it can be parsed into individual columns, you can use the inline browser to start selecting the data types and formats of the information. Click on the array next to each column to select the field type. BigSheets will try to determine the type automatically, but you might want to adjust the structure accordingly. Type identification is important because it affects how the data is analyzed and summarized.

Use the browser to manually dig deeper into the data by viewing the extracted rows, fields, and other data. You can review the entire date set to find the information you need or to better understand the scope of the data.

The purpose of BigSheets is to convert the large volume of data stored or produced within the typical Hadoop cluster.


Convert worksheet data to a graph

The basic analysis process is to convert your imported worksheet data within BigSheets to a graph that simplifies the information. There are a variety of graph types, and each type is particularly useful and approachable for a specific type of data.

Figure 4. Imported worksheet data in BigSheets
Image shows list of data entries for Chicago analysis

It's also important to remember that BigSheets is not a tool for extracting, searching, or filtering the data, although it is capable of these processes. These tasks should have taken place before the visualization process is applied within BigSheets. Keep in mind that it is important to align your data and structure with the type of graph you want to output.

Figure 5. Opening a CSV file
Image shows Row 3, Column 1 highlighted in the chicago.csv file

Visualize data with the cloud function

The cloud function generates a tag cloud, similar to that seen on websites and often used to display keyword, tag, or other counts when it's helpful to display the relative frequency of different elements within the data set. The cloud function is generally most useful with countable or quantifiable data, such as the incidences of words. BigSheets will do the counting for you, if you want it to. For example, you might use some access log data that counts the number of operation types.

To use the cloud function to display relative frequency, select the column to count, then use the Count occurrences of value field. This performs a basic accumulation on unique values within the field across the entire set of data. You can also set the order and limit. Set the limit using the number of values to display function. Click the check box to complete the step.

After the chart definition has been created, the chart must be run to generate the actual chart information. The process may take some time, depending upon the amount of data that needs to be processed.

Figure 6 shows the different operations displayed in a font size relative to the incidences. If you hover over the word, you'll get the actual data counts.

Figure 6. Different operations are displayed in a font size relative to the incidences
Image shows tag cloud with words active and recipeview in large font

Visualize data with heat maps and visual maps

You can produce heat maps and visual maps of your data that overlay the information about the location of different entities on a real map of the United States or of the world. This option can be useful if you want to show geographic values. For example, the heat map of city populations uses latitude and longitude data combined with population counts.

Figure 7. Heat map of city populations
Image shows world map that shows regions color coded to signify population

Visualize data using traditional charts

More traditional chart types, such as lines, bar graphs, and others enable you to build visualizations in a more familiar — if less fun — format. For example, you can create a bar chart of the access log information by performing a similar count, but by displaying the information as a horizontal bar chart.

Figure 8. Bar chart of the access log information
Y-axis shows values (active, recipeview, ingredient,....) X-axis show numbers from 0 to 1,000,000.

It should be noted that for each of these charts, it is the definition given by BigSheets that is the most useful. Once the definition has been created, your original source file with the data in it can be updated or created, and the graphs can be rerun on the source data to produce the information you need.

Therefore, you can continually update your source data and view the latest visualized version of the information. Better still, you can make the process part of a workflow that performs a number of different steps on the raw data.


Integrate with other data using new workbooks

After you have the basic process for analyzing your data sorted, you will probably want to move on to more advanced analysis. BigSheets includes the ability to create new workbooks based on your existing workbook solutions, a process that enables you to create new sheets that can be used for analyzing data.

For example, to output the recipe titles from the access log information that contains only recipe views and then combine that with the recipe title information stored in another file, we need to perform a join between the access log data and the recipe ID and title:

  1. Open the workbook that contains the access log information.
  2. Click Build New WorkBook. This action creates a new workbook that contains multiple stages. Each stage contains the definition to load, merge, and output the combined data. First, click Add sheets, and select the Filter operation, then configure the access_log to select only recipeview.
  3. Click Build New Workbook and select Load. The Load function configures the workbook to include the data from another workbook. Select the recipe workbook to get the joinable data.
  4. Now, click Build New Workbook, select Join and select the columns to join: Column 6 (recipeid) from the access_log and Column 1 from the recipes.
  5. Once the workbook definition has been completed, you can run the join process, which parses the entire access_log with the recipe title information. You need to run the process and confirm that the join completes without errors.

We've created a workbook that includes the filter (for selecting recipeviews), the load of the recipe data, and the join that joins the recipe data together. With the basic filter and join in place, we can recreate the tag cloud chart and produce a chart.

Figure 9. Tag cloud chart
Example of a tag cloud with some recipe titles in large fonts

Conclusion

BigSheets does a remarkably simple job of visualizing data, but you don't have to be a Hadoop expert, or even an Excel wizard, to produce good looking charts and information from your data. It doesn't matter whether that data is the raw information or previously processed information, BigSheets still provides you with the visuals you need.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Big data and analytics on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics
ArticleID=957658
ArticleTitle=BigSheets for the common man
publish-date=12242013