Contents


BigSheets for the common man

Convert data to visualizations that can be used for future queries and additional processing

Comments

Get the data into BigSheets

BigSheets, the web-based spreadsheet-like interface that enables users to easily analyze large volumes of data, is part of InfoSphere BigInsights. It includes built-in functions to extract names, addresses, organizations, email, locations, and phone numbers. BigSheets has a good deal in common with a typical spreadsheet application, such as Microsoft® Excel®. The benefit is the ease with which you can open, browse, and ultimately create graphical views in the form of graphs and tag clouds on your data. As an entirely web-based interface to the data viewing and analysis process, it is very easy to use, but not without its complexities.

Getting the data into the BigSheets environment is a required step in the production process. Unlike more traditional Hadoop systems in which the data can come after the initial setup, with BigSheets, you open the pre-loaded data and use that to process the actual content.

BigSheets either takes the data you provide and builds a visualized version of that information in the form of a graph, or it processes raw information to provide a summarized view of the data. This enables BigSheets to support some basic processing alongside its core visualization role.

Getting the raw data files into BigSheets is much the same as any other InfoSphere BigInsights process. You can upload the file through the Files tab, process data through one of the built-in applications, or run a MapReduce process on other data loaded into the system.

Figure 1. Uploading a file through the file tab
Image shows files tab with accesslog directory highlighted
Image shows files tab with accesslog directory highlighted

For example, using the file system, you can upload the data by selecting a local file and sending it through your browser interface. This is convenient, but can occasionally lead to problems during the loading process that cause the files to be truncated.

Once the raw data has been uploaded or generated from the previous processing, you need to open the file within BigSheets, then determine how the content and data should be extracted.

Projects within BigSheets are organized by workbooks, and a workbook may define a collection of different analysis, relationship, and visualizations of the data to achieve its goals.

Choose a visualization technique that suits the data format

To create a new BigSheets workbook and identify the data you want to build, you need to consider the data and the visualization techniques available.

Open the InfoSphere BigInsights application through your browser (hostname:8080), click the BigSheets tab, then click New Workbook. You will be presented with a browser dialog to select the location of the source data files. For example, select the chicago.csv data in the Chicago directory, as shown in Figure 2. By default, BigSheets reads the data as single lines of information.

Figure 2. Selecting the location of the source data files
New workbook window with chicago.csv selected
New workbook window with chicago.csv selected

Data is likely to be in a number of formats. The Chicago data is in the comma-separated value (CSV) format, so clicking the Pencil icon enables us to select the data file type. The types supported include the following (also shown in Figure 3):

Figure 3. Supported types
List of supported file types starting with basic crawler data
List of supported file types starting with basic crawler data
  • Basic crawler data— Information extracted from web pages using the BigInsights web crawler.
  • Character delimited data— Tab, tilde, or other field-based information separated by characters.
  • Character-delimited data with text qualifier— Delimited data where the fields are enclosed by quotation marks or other characters to contain the data.
  • Comma-separated value (CSV) data— Standard CSV format, including qualified fields (quotation marks or escape characters), with or without header rows.
  • Hive read— Data read from an existing Hive data table.
  • JSON array— An array of JSON data.
  • JSON object read— Multiple lines of JSON object data.
  • Line reader— Each line is taken as an entry of discrete data. This format is useful when a separate job has output a list of unique words.
  • Sheets data— Data generated by a previous sheet's processing job.
  • Tab-separated value (TSV) data— Field-based information separated by tabs, with or without a header row.

The different data types supported affect the different ways in which the information can be visualized or analyzed.

To process the CSV data, select the CSV data type and disable the header row since it is not included in the data file. The structure and format of the data is displayed to confirm that you have picked the correct parsing method.

Once the data is in the correct format, click the check box at the bottom of the import page. BigSheets then parses and displays the data file.

After you have identified the basic structure of the data so it can be parsed into individual columns, you can use the inline browser to start selecting the data types and formats of the information. Click on the array next to each column to select the field type. BigSheets will try to determine the type automatically, but you might want to adjust the structure accordingly. Type identification is important because it affects how the data is analyzed and summarized.

Use the browser to manually dig deeper into the data by viewing the extracted rows, fields, and other data. You can review the entire date set to find the information you need or to better understand the scope of the data.

The purpose of BigSheets is to convert the large volume of data stored or produced within the typical Hadoop cluster.

Convert worksheet data to a graph

The basic analysis process is to convert your imported worksheet data within BigSheets to a graph that simplifies the information. There are a variety of graph types, and each type is particularly useful and approachable for a specific type of data.

Figure 4. Imported worksheet data in BigSheets
Image shows list of data entries for Chicago analysis
Image shows list of data entries for Chicago analysis

It's also important to remember that BigSheets is not a tool for extracting, searching, or filtering the data, although it is capable of these processes. These tasks should have taken place before the visualization process is applied within BigSheets. Keep in mind that it is important to align your data and structure with the type of graph you want to output.

Figure 5. Opening a CSV file
Image shows Row 3, Column 1 highlighted in the chicago.csv file
Image shows Row 3, Column 1 highlighted in the chicago.csv file

Visualize data with the cloud function

The cloud function generates a tag cloud, similar to that seen on websites and often used to display keyword, tag, or other counts when it's helpful to display the relative frequency of different elements within the data set. The cloud function is generally most useful with countable or quantifiable data, such as the incidences of words. BigSheets will do the counting for you, if you want it to. For example, you might use some access log data that counts the number of operation types.

To use the cloud function to display relative frequency, select the column to count, then use the Count occurrences of value field. This performs a basic accumulation on unique values within the field across the entire set of data. You can also set the order and limit. Set the limit using the number of values to display function. Click the check box to complete the step.

After the chart definition has been created, the chart must be run to generate the actual chart information. The process may take some time, depending upon the amount of data that needs to be processed.

Figure 6 shows the different operations displayed in a font size relative to the incidences. If you hover over the word, you'll get the actual data counts.

Figure 6. Different operations are displayed in a font size relative to the incidences
Image shows tag cloud with words active and recipeview in large font
Image shows tag cloud with words active and recipeview in large font

Visualize data with heat maps and visual maps

You can produce heat maps and visual maps of your data that overlay the information about the location of different entities on a real map of the United States or of the world. This option can be useful if you want to show geographic values. For example, the heat map of city populations uses latitude and longitude data combined with population counts.

Figure 7. Heat map of city populations
Image shows world map that shows regions color coded to signify population
Image shows world map that shows regions color coded to signify population

Visualize data using traditional charts

More traditional chart types, such as lines, bar graphs, and others enable you to build visualizations in a more familiar — if less fun — format. For example, you can create a bar chart of the access log information by performing a similar count, but by displaying the information as a horizontal bar chart.

Figure 8. Bar chart of the access log information
Y-axis shows values (active, recipeview, ingredient,....) X-axis                     show numbers from 0 to 1,000,000.
Y-axis shows values (active, recipeview, ingredient,....) X-axis show numbers from 0 to 1,000,000.

It should be noted that for each of these charts, it is the definition given by BigSheets that is the most useful. Once the definition has been created, your original source file with the data in it can be updated or created, and the graphs can be rerun on the source data to produce the information you need.

Therefore, you can continually update your source data and view the latest visualized version of the information. Better still, you can make the process part of a workflow that performs a number of different steps on the raw data.

Integrate with other data using new workbooks

After you have the basic process for analyzing your data sorted, you will probably want to move on to more advanced analysis. BigSheets includes the ability to create new workbooks based on your existing workbook solutions, a process that enables you to create new sheets that can be used for analyzing data.

For example, to output the recipe titles from the access log information that contains only recipe views and then combine that with the recipe title information stored in another file, we need to perform a join between the access log data and the recipe ID and title:

  1. Open the workbook that contains the access log information.
  2. Click Build New WorkBook. This action creates a new workbook that contains multiple stages. Each stage contains the definition to load, merge, and output the combined data. First, click Add sheets, and select the Filter operation, then configure the access_log to select only recipeview.
  3. Click Build New Workbook and select Load. The Load function configures the workbook to include the data from another workbook. Select the recipe workbook to get the joinable data.
  4. Now, click Build New Workbook, select Join and select the columns to join: Column 6 (recipeid) from the access_log and Column 1 from the recipes.
  5. Once the workbook definition has been completed, you can run the join process, which parses the entire access_log with the recipe title information. You need to run the process and confirm that the join completes without errors.

We've created a workbook that includes the filter (for selecting recipeviews), the load of the recipe data, and the join that joins the recipe data together. With the basic filter and join in place, we can recreate the tag cloud chart and produce a chart.

Figure 9. Tag cloud chart
Example of a tag cloud with some recipe titles in large fonts
Example of a tag cloud with some recipe titles in large fonts

Conclusion

BigSheets does a remarkably simple job of visualizing data, but you don't have to be a Hadoop expert, or even an Excel wizard, to produce good looking charts and information from your data. It doesn't matter whether that data is the raw information or previously processed information, BigSheets still provides you with the visuals you need.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics
ArticleID=957658
ArticleTitle=BigSheets for the common man
publish-date=12242013