Identifying your data resources
Data is stored in various formats and sources. With a few high-level steps, data can be imported into the BigInsights® distributed file system in a usable format.
Before you begin
To use BigInsights to answer your business questions, you must first follow some basic steps to define your problem and your data resources.
About this task
You can use BigInsights with your existing infrastructure or data warehouse to import data and content in their original formats. You can import all of the data for analysis, not just subsets. BigInsights can handle low-latency, high-velocity streams as well as large volumes of data at rest and incoming data in motion.
- Identify your business need.
You must determine what insight you are trying to discover before the data can be structured for analysis. For example:
- What is the customer sentiment about a product launch?
- Are there patterns that can help indicate the potential for credit card fraud?
- Will traffic and weather logistics affect distribution plans?
- Identify the sources of your information.
Your information might be spread across many sources, and you need to identify all of them.
For example, customer survey data might be stored in data warehouses, web log information on web servers, and email information in structured or unstructured flat files.
- Identify your data types. The data type determines the best method to use for importing. The data in each source is one of the following types:
- Data at rest
- Data at rest is complete and can be loaded as is.
- For Sample Outdoor Company, examples of data at rest include company email data, which might be in flat files or semistructured, and unstructured data such as log information.
- Data in motion
- Data in motion is continually updated, and it can be semistructured or unstructured data. New data might be added regularly to these data sources; data might be appended to a file, or numerous logs might be merged into one log. Determine an update frequency if you are going to use this type of data. Examples of data in motion include:
- Data from a web server such as WebSphere® Application Server or Apache
- Data in server logs or application logs, for example, the browsing histories from the Sample Outdoor Company online retail store.
- Data from a data warehouse
- Data from a data warehouse (DB2®, Netezza®) is typically in a structured format, such as in relational tables.
- Because new data is added regularly to these data sources, the data is periodically updated, and you must determine the intervals for these updates. Examples of this type of data are Sample Outdoor Company's quarterly sales target data and the vendor information for each of the geographic companies that is based on postal codes.
- Determine the frequency of your update intervals.
For any data source that is regularly updated, you must determine how often to propagate changes to the distributed file system.
If your data is updated monthly, quarterly, or annually, you can schedule the updates accordingly. However, if your data is updated every minute, hourly, or daily, you might need to update your data more frequently or consider a data in motion feed.
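The decision described in this step can be sketched in a few lines of Python. This is only an illustration: the function name `suggest_update_strategy` and the one-day cut-off are assumptions made for the sketch, because the text contrasts daily-or-faster sources with monthly-or-slower ones without saying where the boundary lies.

```python
from datetime import timedelta

# Assumed cut-off (not from the product documentation): sources that change
# daily or faster are treated as candidates for frequent updates or a feed.
FREQUENT_CUTOFF = timedelta(days=1)


def suggest_update_strategy(source_interval: timedelta) -> str:
    """Suggest how to propagate a source that changes every `source_interval`."""
    if source_interval <= timedelta(0):
        raise ValueError("source_interval must be positive")
    if source_interval <= FREQUENT_CUTOFF:
        return "frequent updates or data-in-motion feed"
    return "scheduled batch update"


if __name__ == "__main__":
    # Hourly web logs versus quarterly sales targets.
    print(suggest_update_strategy(timedelta(hours=1)))
    print(suggest_update_strategy(timedelta(days=90)))
```

Adjust the cut-off to match how quickly your analysis needs to see new data.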
- Export the data.
The format of the exported data is usually determined by your data source, but you might have to compress large volumes of information for input into BigInsights.
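As a rough sketch of the compression part of this step, the following Python function compresses an exported file with gzip before it is transferred. The choice of gzip and the helper name `compress_export` are assumptions for illustration; check which compression formats your BigInsights installation accepts before you rely on one.

```python
import gzip
import shutil
from pathlib import Path


def compress_export(src: Path) -> Path:
    """Write a gzip copy of `src` next to it and return the new path.

    The original file is left in place so that the export can be verified
    before the uncompressed copy is deleted.
    """
    dest = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dest, "wb") as fout:
        # Stream in chunks so that large exports do not have to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, `compress_export(Path("sales_export.csv"))` would produce `sales_export.csv.gz` in the same directory; the file name here is hypothetical.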
- Import the data from the various sources in your enterprise.
After you identify the sources and locations of the data that you want to analyze, such as tables in warehouses or web server logs, you import the data into the distributed file system.
For example, Sample Outdoor Company wants to compare the regional sales data from last quarter with browsing histories and user preferences. After that data is combined, the company can break down the integrated data by postal code to generate sales leads and to identify market trends to share with its vendors. The following table shows the different types of data in the various sources that Sample Outdoor Company wants to import and integrate.
| Data | Data source | Type of data |
|---|---|---|
| Customer service emails | Email data in flat files | Data at rest: unstructured flat files |
| Browsing histories | Web server logs | Data in motion: semistructured web server logs |
| Survey data or postal code data | Data warehouse tables | Relational tables in a warehouse |
| User preferences | Preferences data in flat files | Data at rest: semistructured flat files |
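One way to keep track of such an inventory is to record each row of the table above as a small data structure and flag the sources that need an update interval. The following Python sketch does that. The class and function names are hypothetical, and the rule that data in motion and warehouse data need an interval is taken from the data type descriptions earlier in this topic.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    AT_REST = "Data at rest"
    IN_MOTION = "Data in motion"
    WAREHOUSE = "Data from a data warehouse"


@dataclass(frozen=True)
class DataSource:
    data: str
    source: str
    data_type: DataType


# The Sample Outdoor Company rows from the table above.
INVENTORY = [
    DataSource("Customer service emails", "Email data in flat files", DataType.AT_REST),
    DataSource("Browsing histories", "Web server logs", DataType.IN_MOTION),
    DataSource("Survey data or postal code data", "Data warehouse tables", DataType.WAREHOUSE),
    DataSource("User preferences", "Preferences data in flat files", DataType.AT_REST),
]


def needs_update_interval(src: DataSource) -> bool:
    """Data at rest is loaded as is; the other two types are refreshed periodically."""
    return src.data_type in (DataType.IN_MOTION, DataType.WAREHOUSE)


if __name__ == "__main__":
    for src in INVENTORY:
        if needs_update_interval(src):
            print(f"{src.data}: decide an update interval ({src.data_type.value})")
```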