Identifying your data resources

Data is stored in various formats and sources. With a few high-level steps, data can be imported into the BigInsights® distributed file system in a usable format.

Before you begin

To use BigInsights to answer your business questions, you must first follow some basic steps to define your problem and your data resources.

About this task

You can use BigInsights with your existing infrastructure or data warehouse to import data and content in its original formats. You can import all of the data for analysis, not just subsets. BigInsights can handle low-latency streams and high-velocity, high-volume feeds, as well as huge volumes of data at rest or incoming data in motion.

Procedure

  1. Identify your business need.

    You must determine what insight you are trying to discover before the data can be structured for analysis.

    For example:
    • What is the customer sentiment about a product launch?
    • Are there patterns that can help indicate the potential for credit card fraud?
    • Will traffic and weather logistics affect distribution plans?
  2. Identify the sources of your information.

    You need to identify every source that holds information relevant to your business need.

    For example, customer survey data might be stored in data warehouses, web log information on web servers, and email information in structured or unstructured flat files.

  3. Identify your data types.
    You can determine the best method to use for importing by data type. Consider that the data in each source is of one of the following types:
    Data at rest
    Data at rest is complete and can be loaded as is.
    For Sample Outdoor Company, examples of data at rest include company email data, which might be stored in flat files, and semistructured or unstructured data such as log information.
    Data in motion
    Data in motion is continually updated, and it can be semistructured or unstructured data. New data might be added regularly to these data sources; data might be appended to a file, or numerous logs might be merged into one log. Determine an update frequency if you are going to use this type of data. Examples of data in motion include:
    • Data from a web server such as WebSphere® Application Server or Apache
    • Data in server logs or application logs, for example, browsing histories from the Sample Outdoor Company online retail store
    Data from a data warehouse
    Data from a data warehouse (DB2®, Netezza®) is typically in a structured format, such as in relational tables.
    Because new data is added regularly to these data sources, the data must be refreshed periodically, and you must determine the frequency intervals for these updates. An example of this type of data is Sample Outdoor Company's quarterly sales target data, or vendor information for each of the geographic companies, based on postal codes.
  4. Determine the frequency of your update intervals.

    For any data source that is regularly updated, you must determine how often to propagate changes to the distributed file system.

    If your data is updated monthly, quarterly, or annually, you can schedule the updates accordingly. However, if your data is updated every minute, hourly, or daily, you might need to update your data more frequently or consider a data in motion feed.

  5. Export the data.

    The format of the exported data is usually determined by your data source, but you might have to compress large volumes of information for input into BigInsights.

  6. Import the data from the various sources in your enterprise.
    After you identify the sources and locations of the data that you want to analyze, such as tables in warehouses or web server logs, you import the data into the distributed file system.
    Import data
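
Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not a BigInsights API: the function names `choose_import_strategy` and `compress_export`, the interval labels, and the strategy strings are all hypothetical. The sketch maps an update interval to the kind of refresh that step 4 suggests, and compresses an exported file with gzip before it is imported.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical interval labels; they mirror the intervals named in step 4.
SCHEDULED_INTERVALS = {"monthly", "quarterly", "annually"}
FREQUENT_INTERVALS = {"minute", "hourly", "daily"}


def choose_import_strategy(update_interval: str) -> str:
    """Suggest how to propagate changes to the distributed file system."""
    interval = update_interval.strip().lower()
    if interval in SCHEDULED_INTERVALS:
        return "scheduled batch import"
    if interval in FREQUENT_INTERVALS:
        return "frequent import or data-in-motion feed"
    raise ValueError(f"unknown update interval: {update_interval!r}")


def compress_export(path: Path) -> Path:
    """Gzip an exported file and return the path of the compressed copy."""
    compressed = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return compressed
```

For example, `choose_import_strategy("quarterly")` returns `"scheduled batch import"`, and `compress_export(Path("sales.csv"))` writes `sales.csv.gz` next to the original file. The file name `sales.csv` is only a placeholder.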

Example

For example, Sample Outdoor Company wants to compare the regional sales data from last quarter with browsing histories and user preferences. After that data is combined, the company can break down the integrated data by postal code to generate sales leads and to identify market trends to share with its vendors. The following table shows the different types of data in the various sources that Sample Outdoor Company wants to import and integrate.

Table 1. Data that Sample Outdoor Company must locate

Data                             Data source                     Type of data
Customer service emails          Email data in flat files        Data at rest: unstructured flat files
Browsing histories               Web server logs                 Data in motion: semistructured web server logs
Survey data or postal code data  Data warehouse tables           Relational tables in a warehouse
User preferences                 Preferences data in flat files  Data at rest: semistructured flat files
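
The inventory in Table 1 can also be kept in machine-readable form so that the sources can be grouped by data type when you choose an import method. The sketch below is only an illustration: the `DataResource` class, its field names, the short type labels, and the `group_by_type` helper are hypothetical and are not part of BigInsights.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataResource:
    data: str
    source: str
    data_type: str  # hypothetical labels: "at rest", "in motion", "warehouse"
    structure: str


# Rows transcribed from Table 1.
INVENTORY = [
    DataResource("Customer service emails", "Email data in flat files",
                 "at rest", "unstructured"),
    DataResource("Browsing histories", "Web server logs",
                 "in motion", "semistructured"),
    DataResource("Survey data or postal code data", "Data warehouse tables",
                 "warehouse", "relational"),
    DataResource("User preferences", "Preferences data in flat files",
                 "at rest", "semistructured"),
]


def group_by_type(resources):
    """Return a dict that maps each data type to the names of its resources."""
    groups = defaultdict(list)
    for resource in resources:
        groups[resource.data_type].append(resource.data)
    return dict(groups)
```

Calling `group_by_type(INVENTORY)` places the customer service emails and user preferences together under `"at rest"`, which makes it easier to see which sources can share one import method.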

What to do next

When the data is in the format that you need, you can develop and run complex analytics to gain significant business insights. Sample Outdoor Company can now determine whether the day of the week or the time of day correlates with the number of customer purchases. If the company chooses to reallocate storage options or product marketing efforts, it can also determine which region will be the most receptive.
Analyze data
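
As a rough illustration of the day-of-week question, the following Python sketch counts purchases per weekday from ISO 8601 timestamps. The function name `purchases_by_weekday` and the timestamp format are assumptions made for this example; in practice the analysis would run on the data that was imported into BigInsights.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so that the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def purchases_by_weekday(timestamps):
    """Count purchases per weekday from ISO 8601 timestamp strings."""
    counts = Counter(
        WEEKDAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )
    return dict(counts)
```

Weekdays with many more purchases than the others are candidates for a closer look, for example by repeating the count per region or per postal code.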