Get the data into BigSheets
BigSheets, the web-based spreadsheet-like interface that enables users to easily analyze large volumes of data, is part of InfoSphere BigInsights. It includes built-in functions to extract names, addresses, organizations, email, locations, and phone numbers. BigSheets has a good deal in common with a typical spreadsheet application, such as Microsoft® Excel®. The benefit is the ease with which you can open, browse, and ultimately create graphical views in the form of graphs and tag clouds on your data. As an entirely web-based interface to the data viewing and analysis process, it is very easy to use, but not without its complexities.
Getting the data into the BigSheets environment is a required step in the production process. Unlike more traditional Hadoop systems in which the data can come after the initial setup, with BigSheets, you open the pre-loaded data and use that to process the actual content.
BigSheets either takes the data you provide and builds a visualized version of that information in the form of a graph, or it processes raw information to provide a summarized view of the data. This enables BigSheets to support some basic processing alongside its core visualization role.
Getting the raw data files into BigSheets is much the same as any other InfoSphere BigInsights process. You can upload the file through the Files tab, process data through one of the built-in applications, or run a MapReduce process on other data loaded into the system.
Figure 1. Uploading a file through the file tab
For example, using the file system, you can upload the data by selecting a local file and sending it through your browser interface. This is convenient, but can occasionally lead to problems during the loading process that cause the files to be truncated.
Once the raw data has been uploaded or generated from the previous processing, you need to open the file within BigSheets, then determine how the content and data should be extracted.
Projects within BigSheets are organized by workbooks, and a workbook may define a collection of different analysis, relationship, and visualizations of the data to achieve its goals.
Choose a visualization technique that suits the data format
To create a new BigSheets workbook and identify the data you want to build, you need to consider the data and the visualization techniques available.
Open the InfoSphere BigInsights application through your browser
hostname:8080), click the BigSheets tab,
then click New Workbook. You will be presented with a
browser dialog to select the location of the source data files. For
example, select the chicago.csv data in the Chicago directory, as shown in
Figure 2. By default, BigSheets reads the data as single lines of
Figure 2. Selecting the location of the source data files
Data is likely to be in a number of formats. The Chicago data is in the comma-separated value (CSV) format, so clicking the Pencil icon enables us to select the data file type. The types supported include the following (also shown in Figure 3):
Figure 3. Supported types
- Basic crawler data— Information extracted from web pages using the BigInsights web crawler.
- Character delimited data— Tab, tilde, or other field-based information separated by characters.
- Character-delimited data with text qualifier— Delimited data where the fields are enclosed by quotation marks or other characters to contain the data.
- Comma-separated value (CSV) data— Standard CSV format, including qualified fields (quotation marks or escape characters), with or without header rows.
- Hive read— Data read from an existing Hive data table.
- JSON array— An array of JSON data.
- JSON object read— Multiple lines of JSON object data.
- Line reader— Each line is taken as an entry of discrete data. This format is useful when a separate job has output a list of unique words.
- Sheets data— Data generated by a previous sheet's processing job.
- Tab-separated value (TSV) data— Field-based information separated by tabs, with or without a header row.
The different data types supported affect the different ways in which the information can be visualized or analyzed.
To process the CSV data, select the CSV data type and disable the header row since it is not included in the data file. The structure and format of the data is displayed to confirm that you have picked the correct parsing method.
Once the data is in the correct format, click the check box at the bottom of the import page. BigSheets then parses and displays the data file.
After you have identified the basic structure of the data so it can be parsed into individual columns, you can use the inline browser to start selecting the data types and formats of the information. Click on the array next to each column to select the field type. BigSheets will try to determine the type automatically, but you might want to adjust the structure accordingly. Type identification is important because it affects how the data is analyzed and summarized.
Use the browser to manually dig deeper into the data by viewing the extracted rows, fields, and other data. You can review the entire date set to find the information you need or to better understand the scope of the data.
The purpose of BigSheets is to convert the large volume of data stored or produced within the typical Hadoop cluster.
Convert worksheet data to a graph
The basic analysis process is to convert your imported worksheet data within BigSheets to a graph that simplifies the information. There are a variety of graph types, and each type is particularly useful and approachable for a specific type of data.
Figure 4. Imported worksheet data in BigSheets
It's also important to remember that BigSheets is not a tool for extracting, searching, or filtering the data, although it is capable of these processes. These tasks should have taken place before the visualization process is applied within BigSheets. Keep in mind that it is important to align your data and structure with the type of graph you want to output.
Figure 5. Opening a CSV file
Visualize data with the cloud function
The cloud function generates a tag cloud, similar to that seen on websites and often used to display keyword, tag, or other counts when it's helpful to display the relative frequency of different elements within the data set. The cloud function is generally most useful with countable or quantifiable data, such as the incidences of words. BigSheets will do the counting for you, if you want it to. For example, you might use some access log data that counts the number of operation types.
To use the cloud function to display relative frequency, select the column to count, then use the Count occurrences of value field. This performs a basic accumulation on unique values within the field across the entire set of data. You can also set the order and limit. Set the limit using the number of values to display function. Click the check box to complete the step.
After the chart definition has been created, the chart must be run to generate the actual chart information. The process may take some time, depending upon the amount of data that needs to be processed.
Figure 6 shows the different operations displayed in a font size relative to the incidences. If you hover over the word, you'll get the actual data counts.
Figure 6. Different operations are displayed in a font size relative to the incidences
Visualize data with heat maps and visual maps
You can produce heat maps and visual maps of your data that overlay the information about the location of different entities on a real map of the United States or of the world. This option can be useful if you want to show geographic values. For example, the heat map of city populations uses latitude and longitude data combined with population counts.
Figure 7. Heat map of city populations
Visualize data using traditional charts
More traditional chart types, such as lines, bar graphs, and others enable you to build visualizations in a more familiar — if less fun — format. For example, you can create a bar chart of the access log information by performing a similar count, but by displaying the information as a horizontal bar chart.
Figure 8. Bar chart of the access log information
It should be noted that for each of these charts, it is the definition given by BigSheets that is the most useful. Once the definition has been created, your original source file with the data in it can be updated or created, and the graphs can be rerun on the source data to produce the information you need.
Therefore, you can continually update your source data and view the latest visualized version of the information. Better still, you can make the process part of a workflow that performs a number of different steps on the raw data.
Integrate with other data using new workbooks
After you have the basic process for analyzing your data sorted, you will probably want to move on to more advanced analysis. BigSheets includes the ability to create new workbooks based on your existing workbook solutions, a process that enables you to create new sheets that can be used for analyzing data.
For example, to output the recipe titles from the access log information that contains only recipe views and then combine that with the recipe title information stored in another file, we need to perform a join between the access log data and the recipe ID and title:
- Open the workbook that contains the access log information.
- Click Build New WorkBook. This action creates a new
workbook that contains multiple stages. Each stage contains the
definition to load, merge, and output the combined data. First, click
Add sheets, and select the
Filter operation, then configure the access_log
to select only
- Click Build New Workbook and select Load. The Load function configures the workbook to include the data from another workbook. Select the recipe workbook to get the joinable data.
- Now, click Build New Workbook, select
Join and select the columns to join: Column 6
recipeid) from the access_log and Column 1 from the recipes.
- Once the workbook definition has been completed, you can run the join process, which parses the entire access_log with the recipe title information. You need to run the process and confirm that the join completes without errors.
We've created a workbook that includes the filter (for selecting
recipeviews), the load of the recipe data, and the join that
joins the recipe data together. With the basic filter and join in place,
we can recreate the tag cloud chart and produce a chart.
Figure 9. Tag cloud chart
BigSheets does a remarkably simple job of visualizing data, but you don't have to be a Hadoop expert, or even an Excel wizard, to produce good looking charts and information from your data. It doesn't matter whether that data is the raw information or previously processed information, BigSheets still provides you with the visuals you need.
- Get a technical introduction to BigSheets on Slideshare.
- Read the SQL to Hadoop and back again series, which demonstrates the contents for the Chicago Traffic Tracker data set: Part 1: Basic data interchange techniques, Part 2: Leveraging HBase and Hive, Part 3: Direct Transfer and Live Data Exchange).
- Check out the Chicago Traffic Tracker.
- Read "What's the big deal about Big SQL?" to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop.
- "Analyzing social media and structured data with InfoSphere BigInsights" teaches the basics of using BigSheets to analyze social media and structured data collected through sample applications provided with BigInsights.
- Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights' architecture and underlying technologies.
- Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.
- Read "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more about the InfoSphere BigInsights web console.
- Visit the BigInsights Technical Enablement wiki for links to technical materials, demos, training courses, news items, and more.
- Learn about the IBM Watson research project.
- Check out Big Data University for free courses on Hadoop and big data.
- Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies.
- Visit the Apache Hadoop Project Web site.
- Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN 1449314597, 2011; and Hadoop: The Definitive Guide, by Tom White, O'Reilly Media, ISBN 1449389732, 2010, offer more information.
- "A Survey of Large Scale Data Management Approaches in Cloud Environments" gives a comprehensive survey of numerous approaches and mechanisms of deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both research and industrial communities.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere BigInsights forum.
- Ask questions and get answers in the InfoSphere Streams forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.
- Check out IBM big data and analytics on Facebook.