Integrate PureData System for Analytics and InfoSphere BigInsights for email analysis

Combine the advanced text analytic capabilities of BigInsights with your warehouse

PureData™ System for Analytics is IBM's core warehousing platform. This article explains how to integrate it with InfoSphere® BigInsights™, IBM's enterprise-ready Hadoop distribution. We discuss the integration concepts, as well as a primary use case for this integration: combining the advanced text analytics capabilities of BigInsights with your warehouse. We illustrate how to create a small scenario that enriches employee data with information extracted from email. Then we demonstrate how to extract this information in BigInsights and how to upload the extracted results to PureData System for Analytics. We also show how to access data stored in BigInsights from the warehouse to facilitate drill-through capabilities.

Benjamin Leonhardi, Software Engineer, IBM

Benjamin Leonhardi is the team lead for the Big Data/Warehousing partner enablement team. Before that, he was a software developer for InfoSphere Warehouse at the IBM Research and Development Lab in Boeblingen, Germany, where he worked on data mining, text mining, and mining reporting solutions.



Piotr Pruski (piotr.pruski@ca.ibm.com), Technical Enablement Specialist, IBM

Piotr Pruski is a partner enablement engineer within the Information Management Business Partner Ecosystem team at IBM. His main focus is to help accelerate sales and partner success by reaching out to and engaging business partners, enabling them to work with products within the IM portfolio, namely IBM InfoSphere BigInsights. In previous roles, he has worked with DB2, IBM Smart Analytics System 5710, and briefly with the IBM PureData System for Transactions.



Andre Albuquerque (andreda@ca.ibm.com), Technical Enablement Specialist, IBM

Andre Albuquerque is the Big Data technical lead for the IBM Information Management PureData Ecosystems team. He joined IBM Canada Labs in 2009 and works with various IBM products, such as DB2 and the IBM big data platform. As a technical lead, he is the content owner of the Big Data Fundamentals Bootcamp.



30 April 2013


Introduction

Enterprise integration is a unique value that IBM brings. Some customer use cases for big data technologies can be realized only through integration, which means combining existing information in the enterprise with new streams of information from outside the enterprise. The desired types of analytics cannot be realized unless structured, semi-structured, and unstructured data are analyzed together. We will see here that traditional and non-traditional data sources and technologies increase in value when they are brought together.

Before continuing, please note that we cover quite an extensive amount of ground throughout this article and have omitted some information that we believe is basic or fundamental. We therefore assume at least a rudimentary understanding of BigInsights, AQL, JAQL, PureData System for Analytics, and Cognos®. More information on all of these technologies can be found in the Resources section.

PureData System for Analytics (formerly known as Netezza)

The PureData System for Analytics, powered by Netezza technology, is a simple data appliance for serious analytics. It simplifies and optimizes performance of data services for analytic applications, enabling complex algorithms to run in minutes instead of hours. You may know the PureData System for Analytics as Netezza Appliances. The product has been renamed and we may use the two names interchangeably.

Highlights:

  • Speed: 10-100 times faster than traditional custom systems
  • Simplicity: Fast, easy setup for a simplified user experience
  • Scalability: Peta-scale user data capacity
  • Smart: Complex algorithms in minutes; rich library of integrated analytics

InfoSphere BigInsights

InfoSphere BigInsights helps companies analyze and discover business insights hidden in large volumes of raw data, such as web logs, clickstreams, newsfeeds, social networking sites, sensor data, email, and instant messages. BigInsights complements your existing information management infrastructure, including data warehouses and analytic tools, extending your analytic capabilities to include a greater variety and volume of rapidly changing data.

Highlights:

  • Enables companies to rapidly explore large volumes of raw data to gain insight into operational challenges that were previously too large or complex to tackle.
  • Processes a variety of data, including unstructured and semi-structured formats often ignored by traditional technologies. Such data may be "messy" and contain varying structures with potentially complex interrelationships.
  • Enables enterprise-class analytics with massive scalability on low-cost commodity hardware.
  • Enhances and integrates with popular information management software, including data warehouses, relational DBMSes, data streaming platforms, business intelligence tools, and more.

Integration

The following table describes the foundational view of the data warehouse and BigInsights for this article.

Table 1. Data Warehouse and BigInsights compared
Data types
  • Data Warehouse: Structured data
  • BigInsights: Any type of data, structured or unstructured

Data loading
  • Data Warehouse: Data is cleansed/structured before going into the warehouse to maximize its utility.
  • BigInsights: Raw data may be stored as-is, without any modification, preserving fidelity and lineage.

Analytic approach
  • Data Warehouse: High-value, structured data; repeated operations and processes (e.g., transactions, reports, BI); relatively stable sources; well-understood requirements; optimized for fast access and analysis
  • BigInsights: Highly variable data and content, some of which may be high value; iterative, exploratory analysis (e.g., scientific research, behavioral modeling); volatile sources; ad-hoc questions and changing requirements; optimized for flexibility

Hardware
  • Data Warehouse: Powerful appliance and optimized systems
  • BigInsights: Inexpensive commodity hardware

An effective big data platform and approach requires integration with the rest of your IT infrastructure. The last thing you need in your architecture is another technology or data silo. Big data technologies should work in tandem with and extend the value of your existing data warehouse and analytics technologies.

The following points form the basis for the remainder of this article and explain why we want and need to achieve this integration.

PureData System for Analytics can process terabytes of structured data but has limited unstructured capabilities. InfoSphere BigInsights has limitations when processing structured information, but it really shines for unstructured analytics. By combining the strengths of the two systems, we can:

  • Analyze large amounts of unstructured data in BigInsights and write the structured results into the PureData System for Analytics or read them on demand with the Netezza UDFs.
  • Use standard reporting solutions to access the PureData System and enrich structured (OLAP, for example) reports with unstructured information.

Use case description

The following figure showcases the use case we will base our integration scenario on.

Figure 1. BigInsights PureData System for Analytics Integration use case

The demo shows the integration of the PureData System for Analytics and BigInsights and uses Cognos as a front end. Employee records are stored in PureData System for Analytics, and email is stored and analyzed in BigInsights. An AQL annotator scores the email as work or private. AQL (Annotation Query Language) is a programming language similar to SQL that is used to extract structured information from text. The scoring results are then uploaded from BigInsights into PureData System for Analytics through a JAQL Netezza module. The results are aggregated and shown in a Cognos report (employees and their number of private and work email messages). The report also allows a drill-through to the underlying email texts, which are then retrieved from BigInsights using the Netezza BigInsights connector functionality to read from HDFS.


Document analysis

Here we describe how to create a simple AQL annotator and use it to extract structured information from a set of documents. The annotator will distinguish work email from private email. The next step is to import the email into BigInsights. Then we use a JAQL script to annotate the documents and upload the results into PureData System for Analytics.

That encapsulates the BigInsights part of our integration, but we still need to create a drill-through capability from the Cognos report to the underlying email data. To do this, we need to be able to access the HDFS file system from the PureData System for Analytics data warehouse. This is achieved through the Netezza UDF extension shipped with InfoSphere BigInsights. We will describe the installation process and show one way to implement the drill-through function.

We cover a lot of ground, so the implementation will be greatly simplified. We plan to provide more details and alternatives to some of the described steps in future articles.

AQL annotator development

We will briefly describe how to develop a dictionary-based AQL annotator that can distinguish between work and private email. Such an annotator can be used, for example, to identify non-work-related emails that do not need to be archived, thereby reducing backup costs. Since we are limited by the scope of the article, we will only use a small portion of the capabilities of AQL; the intention is to give you an idea of what is possible. Your development environment will look like the following.

Figure 2. InfoSphere BigInsights Development Studio

Our annotator will have two dictionaries containing words indicative of private and work email. In a second step, we will use AQL to count all occurrences of both categories of words and finally compute a single number, or total score, for each email. An email with more private words than work words is identified as private and will yield a negative number, while an email with more work words than private words is identified as work and will yield a positive number. We will also demonstrate the creation of an AQL Java™ UDF, since AQL does not have its own subtraction function; it is trivial to implement this kind of functionality in Java technology.
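
For example, an email containing three matches from the work dictionary and one match from the private dictionary would receive a total score of 3 - 1 = +2 and be classified as work email, while an email with one work match and four private matches would score 1 - 4 = -3 and be classified as private.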

Of course each email set might have different requirements, so we used a training set of email messages that have been divided manually into work and private email; in this limited example, there are about 10 for each. From these, we identified the word lists and created the dictionaries. We assume that you have a fundamental knowledge of AQL and the development environment. You can find more details and examples in the Resources section.

AQL is a modular language, so we created a project with two modules: Common and Emails. Common is used to define the UDFs we might want to reuse in other projects. We then use it in the main Emails module to extract the email score.

Developing the subtraction UDF

To compute the total score, we need to implement a Java UDF that can subtract two integer numbers. This can be implemented in the same project where you implement your AQL since a BigInsights Eclipse project is derived from a Java Eclipse Project. Following is the code of our UDF:

package com.ibm.imte.taudfs;

// Simple helper exposed to AQL as a user-defined function:
// returns the difference of two integers (i - j).
public class Sub {
	public Integer sub(Integer i, Integer j) {
		return i - j;
	}
}

Next, we created an AQL module called Common that declares this function and exports it for use in other modules. Below is the source code of this module. We also compiled the Java code above into a JAR and put it into the same module folder; this folder is automatically on the classpath of the AQL code.

module Common;

create function sub(i Integer ,  j Integer)
return Integer
external_name 'udfs.jar:com.ibm.imte.taudfs.Sub!sub'
language java 
deterministic
return null on null input;

export function sub;

Since we exported the function, we can now import and use this function in other AQL modules. It is best practice to put all your UDFs into a separate module.

In the second step, we will utilize two dictionaries containing words that indicate private or work email. Note that we are not aiming for 100-percent precision. There will be situations where work words are used in private email and vice-versa, but in the end we compute a total number for each document. AQL also provides many options to filter results and enhance the precision of the annotator, but for this small example, we use a very basic approach. Here are sample entries in the dictionaries.

Dictionary private.dict

mom
sister
love
party
happy easter
christmas
x-mas
...

Dictionary work.dict

business unit
invoice
sap
sybase
database
it system
solution area
kick-off
workshop
reporting solution
...

Now we need to create the main module, called Emails. In the dictionaries folder of this module, we have added the two dictionary files shown above. Now we can code the annotator in the main AQL file. First, we need to add a module identifier and import the Common module we created earlier.

module Emails;

import module Common;

Now we need to create the private dictionary from its file and define an AQL view over the documents that extracts all occurrences of dictionary terms in each document. For more information about AQL, see Resources.

create dictionary PrivateDict
from file 'dictionaries/private.dict'
with language as 'en';

create view PrivateWords as
extract dictionary 'PrivateDict' 
	on R.text as match
from Document R;
output view PrivateWords;

Similarly, we repeat this for the work dictionary.

create dictionary WorkDict
from file 'dictionaries/work.dict'
with language as 'en';

create view WorkWords as
extract dictionary 'WorkDict' 
	on R.text as match
from Document R;
output view WorkWords;

We are not interested in the individual match instances but want a single number, so we count the number of matches for each category. AQL is very similar to SQL, so we can simply compute a count on the two views WorkWords and PrivateWords.

create view NumPrivateWords as 
select Count(P.match) as num from PrivateWords P;
output view NumPrivateWords;

create view NumWorkWords as 
select Count(W.match) as num from WorkWords W;
output view NumWorkWords;

Now we have the number of findings of private and work words for each email. In the next step, we want to create a single score for each email, so we subtract the number of private findings from the number of work findings. If there are predominantly work findings, the score will be positive; if there are predominantly private findings, the total score will be negative. To compute the difference, we use the AQL UDF we defined earlier. Finally, we output the view TotalScore so we can use it externally in Java and JAQL programs. This is different from the export statement, which is used for module dependencies.

create view TotalScore as
select Common.sub(w.num, p.num) as totalscore
from NumWorkWords w, NumPrivateWords p;

output view TotalScore;

We have now created our simple annotator and we can give it a test in the development environment using the extraction plan view. As we can see in the results, we have a working annotator that identifies all 10 sample work email messages as work and all private email as private.

Figure 3. Result of AQL execution

Finally, we need to export the generated AQL annotator as TAM files. Using Export > BigInsights > Export Text Analytics Extractor creates one TAM file for each of the two modules. To utilize them with JAQL, we need to copy these files to BigInsights. For the purpose of this article, we assume they are in the folder /home/biadmin/nz-demo/Annotator. In the following section, we will use JAQL to execute this annotator on a set of email messages stored in a BigInsights cluster.

Importing data into BigInsights

Now that we have created the AQL annotator, we need to import the email data into BigInsights. There are many ways to achieve this. Essentially, we need to make sure the data is stored in large files that are splittable, so the file loaders can split each file into multiple blocks and take advantage of the parallel Hadoop architecture. We also need to keep the ability to associate an email ID with each email text. Files that contain newlines are automatically splittable, and multiple files in a compressed folder are also splittable as long as a splittable storage format, such as LZO, is used.

In our example, we will go with pipe-delimited files that have two columns: LABEL and TEXT. We have replaced all newline characters in the email messages with spaces since this does not influence our annotators. You may need to escape these characters differently if newlines are important for your analysis.

In many situations, you will have a set of files, each containing a text document such as an email. In our example, we used a simple Perl script to transform these into a delimited file. Alternatively, JAQL can handle sets of files as well, but this is beyond the scope of this article. Our data file is a delimited file with one line per email and two columns: LABEL, a unique email ID used to associate the email with an employee, and TEXT, a field containing the email content.

1|Message-ID: <29394058.1075855815542.JavaMail@test> Date: Mon, 5 Jun...     
2|Message-ID: <9986382.1075855895922.JavaMail@test> Date: Wed, 16 Aug...
...

The following JAQL command loads the file into HDFS:

read(del("file:///home/biadmin/nz-demo/combinedEmails.txt", 
	{schema: schema{label, text}, delimiter: "|", quoted: false})) 
-> write(del("hdfs:/tmp/emails.del", 
	{schema: schema{label, text}, delimiter: "|", quoted: false}));

We have decided to keep the file in HDFS as an uncompressed delimited file. For better performance and a smaller storage footprint, we could compress it during load. For best performance, we could also store the file as a sequence file. Sequence files store records in JAQL's internal binary JSON representation and therefore increase loading speed, but they are not human-readable, so we did not use them for this demo. Notice the syntax for specifying the file schema and the file format. For complete documentation of these features, refer to the information center (see Resources).
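
As a minimal sketch of the sequence-file alternative (assuming your JAQL version provides the hdfs() I/O descriptor, which stores records as binary SequenceFiles, and using /tmp/emails.seq as an example path), the load could look like this:

read(del("file:///home/biadmin/nz-demo/combinedEmails.txt", 
	{schema: schema{label, text}, delimiter: "|", quoted: false})) 
-> write(hdfs("/tmp/emails.seq"));

The annotation step would then read the data back with read(hdfs("/tmp/emails.seq")) instead of the del() form shown below.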

Analyzing documents with JAQL

Now we need to annotate the results. To do this, we need to import the systemT module. This JAQL module contains the functions for annotating text documents: import systemT;.

Then we need to create a variable that specifies how to read the file in HDFS. Since we do not materialize the command with :=, it will only be executed in the actual annotation step:

emails = read(del("hdfs:/tmp/emails.del", 
{schema: schema{label, text}, delimiter: "|", quoted: false}));

Now we do the annotation:

textemails = emails -> transform { label: $.label,  score: 
	systemT::annotateDocument( $, ["Common", "Emails"], 
			["file:///home/biadmin/nz-demo/Annotator"], 
			tokenizer="multilingual", 
			outputViews=["Emails.TotalScore"])};

The transform operator in JAQL allows us to compute new fields in the JSON stream. We specify two fields: one is the label (our email ID), and the other is computed by our AQL annotator. To do this, we use the systemT::annotateDocument function, which computes the email score we described in the AQL section and adds it next to the label. We call the function with the following parameters:

  • $: The first parameter is the document we want to annotate. In JAQL, $ denotes the current record. By default, annotateDocument expects two fields, the label and the text.
  • ["Common", "Emails"]: The second parameter specifies the TAM files we want to utilize. You specify them as an array of strings, where the strings are the names of the TAM files.
  • ["file:///home/biadmin/nz-demo/Annotator"]: The directory or directories containing the TAM files. This can also be a compressed folder.
  • tokenizer="multilingual": This parameter specifies that we want to use the LanguageWare tokenizer. This is a powerful tokenizer with syntax analysis and multi-language support. For simple tasks, the whitespace tokenizer may be sufficient, but when in doubt use LanguageWare.
  • outputViews=["Emails.TotalScore"]: The last parameter specifies which output views in the annotators we want to add to the JSON records.

Now we have extracted the email score and added it to the JSON records. The results will look as follows:

[{ "label": "1",
  "score": {
  	  "Emails.TotalScore": [{"totalscore": -3}],
  	  "label": "8",
  	  "text": "Message-ID: <384643.10758531370..."
  }
},
...
{ "label": "9",
  "score": {
  	  "Emails.TotalScore": [{"totalscore": -4}],
  	  "label": "9",
  	  "text": "Message-ID: <8489243.10758620563..."
  }
}]

As you can see, the results are added as sub-arrays to the stream. This is necessary because an AQL output view normally has more than one result. There could be multiple persons, addresses, etc. in the text file. In our example, we aggregate a total number so we will always only have a single end result. We also want to be able to load the results in a Netezza database table. To achieve this, we need to flatten the array (i.e., we want to have a list of email IDs and their respective scores in a flat table). This can be achieved in JAQL with the expand unroll command. It will move the sub-array up one level and create one parent result row for every row of the sub-array.

scoredemails =  textemails -> transform {label:$.label, 
				score: $.score."Emails.TotalScore".totalscore, } 
				-> expand unroll $.score 
				-> transform {label: $.label, score: $.score} ;

We have done two things with this piece of code: we unrolled the sub-arrays and merged them into the parent array, and we then used transform to change the naming of the fields so they correspond to the naming in the Netezza table. Finally, we need to write the results into HDFS so we can later move them to the Netezza database. To do this, we use the following command.

scoredemails -> write(del("hdfs:/tmp/emails/scoredemails.del", 
		{schema: schema{label, score}, 
		delimiter: "|", quoted: false}));

We now have a flat delimited file with two columns, email ID and computed score, saved in HDFS. The file looks like the following:

1|-3
9|-4
10|10
11|19
12|16
13|9
14|16
...

We can now run the complete JAQL script with jaqlshell -b analyzeemails.jaql. Alternatively, we can package it as an application and schedule it or invoke it over a REST API.

Upload of results to PureData System for Analytics

Here, we describe how to upload the results from the previous section to a Netezza table. We will utilize the JAQL Netezza connector for this. Alternatively, you can also pull results through the Netezza UDFs, which we will describe in the next section. In our demonstration, however, the results are pushed from BigInsights. We have separated the analysis and upload steps for operational reasons; you could, of course, do both in a single step.

We need to import the Netezza Connector into our JAQL script: import dbms::netezza;.

You can, of course, use the standard DBMS module, but this will not result in the best performance. The Netezza module uses external tables for loads and unloads and runs parallel loads for optimal performance. You also need to copy the Netezza JDBC driver JAR to the JAQL folder $JAQL_HOME/modules/dbms/netezza or add its location to the classpath.

Next, we need to set up a database connection. This requires the netezza::connect function with a JDBC connection string and user account properties. We use the database BIGIDEMO as our sample database.

db := netezza::connect( 
		url = 'jdbc:netezza://netezza_host:5480/BIGIDEMO', 
		properties = { user: "admin", password: "password"});

Now we need to read the file we created earlier. Note that we changed the schema field names; they must match the column names used in the Netezza table, including case.

tblData = read(del("hdfs:/tmp/emails/scoredemails.del", 
			   {schema: schema{EMAIL_ID: long, WORK_PRIVATE: long}, 
			   	delimiter: "|", quoted: false}));

Now we need to prepare the insert statement. This works similarly to JDBC. We specify the database connection and the table name for the insert: dataIn = netezza::prepareInsert( db, 'EMAIL_SCORE' );.
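
For reference, the target table must already exist on the Netezza side. The following is a minimal sketch of its definition; only the table name EMAIL_SCORE and the column names EMAIL_ID and WORK_PRIVATE come from the scenario above, while the column types and the distribution key are assumptions.

CREATE TABLE EMAIL_SCORE (
    EMAIL_ID     BIGINT NOT NULL,  -- unique email ID, matches the LABEL column of the scored file
    WORK_PRIVATE BIGINT            -- total score: positive = work, negative = private
)
DISTRIBUTE ON (EMAIL_ID);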

Due to the unique architecture of PureData System for Analytics, there is no need to specify commit counts or any other parameters. JAQL automatically loads data in four parallel loads, splitting the data on the data slice ID. You can specify splits, which define the number of parts the data is divided into. This is useful if the loaded data is used for further computations, which would otherwise be single-threaded if no splits are defined. For our upload, we do not need them.

Finally, we can execute the upload by streaming the email file into the insert statement: tblData -> write( dataIn );.

We have finished the upload of our data into PureData System for Analytics. In the next sections, we need to implement a drill-down function. To do this, we need to truncate the text field so PureData System for Analytics can load it. PureData System for Analytics has an upper limit of 64,000 bytes for normal character fields and 16,000 bytes for Unicode NCHAR fields. To truncate the text field, we use the substring function.

read(del("hdfs:/tmp/emails.del", 
	{schema: schema{label, text}, delimiter: "|", quoted: false}))
	->transform {ID: toNumber($.label), TEXT: substring($.text, 0, 15000)} 
	-> write(del("hdfs:/tmp/email_text.del", 
		{schema: schema{ID, TEXT}, 
		delimiter: "|",quoted:true}));

We have uploaded the email score into a Netezza table and prepared the email file for drill-through. We can now run the complete JAQL script with jaqlshell -b uploadscoring.jaql.

Netezza UDFs installation

Here, we will describe how to install the Netezza UDFs. These user-defined functions allow you to access a BigInsights system from PureData System for Analytics. Three functions are provided:

  • HDFS_READ: This function allows you to read files from the remote HDFS file system.
  • HDFS_WRITE: This function allows you to write files to HDFS.
  • JAQL_SUBMIT: This function executes a JAQL string on BigInsights.

You can find the installer for the Netezza UDFs on the welcome page of the BigInsights web console. Download them and copy them over to the PureData System for Analytics host. Then install the Netezza extensions with the install.sh script in the scripts folder of the installation package.

This unpacks the UDFs and puts them in the Netezza extension folder. We still need to deploy them to a database. In our example, we have created a database called BIGIDEMO, which contains our sample tables. To deploy the UDFs into that database and specify the connection parameters, you can use the following command. It deploys the UDFs and stores the connection information for the BigInsights server. You could connect to two different BigInsights servers by installing the UDFs in separate databases.

./nzbiuser --user admin --pw password --db bigidemo 
			--bilocation http://bi_host:8080 
			--biuser biadmin --bipw password

The Netezza BigInsights UDFs are now deployed and ready to use. Note the JAQL_SUBMIT function uses the JAQL ad-hoc application in BigInsights, so you need to have this application deployed to use this UDF.

Development of drill-down function

Here, we describe how to develop the drill-down function that allows you to select a specific employee in the Cognos report and load the corresponding email messages from BigInsights. With that, we will have shown the integration of BigInsights and PureData System for Analytics in both directions: in the previous sections, we uploaded information from BigInsights to PureData System for Analytics, and now we connect from PureData System for Analytics to BigInsights. The function we use is included below. It utilizes the HDFS_READ function to read the email text file from HDFS, specifying load parameters similar to a normal external table load definition. Note that we need to specify the underlying part files, since JAQL splits written files into multiple part files, one per reduce task; this is transparent to JAQL, but not to the HDFS_READ function. HDFS_READ is a Java table function, which means it needs to be executed in the FROM clause of the query and you need to prefix it with the TABLE WITH FINAL keywords.

SELECT 
EMPLOYEE.E_ID, 
EMPLOYEE_EMAIL.EMAIL_ID,
T.TEXT
from 
	BIGIDEMO..EMPLOYEE EMPLOYEE,
	BIGIDEMO..EMPLOYEE_EMAIL,
	table with final (hdfs_read('/tmp/email_text.del/part*', 
		'DELIMITER "|"', 'ID INT, TEXT NVARCHAR(15000)')) AS T
WHERE 
	EMPLOYEE.E_ID = EMPLOYEE_EMAIL.E_ID  AND EMPLOYEE_EMAIL.EMAIL_ID = T.ID

The result of this query is the following:

E_ID | EMAIL_ID | TEXT
5    |        3 | "Message-ID: <13093069.1075849668865 ... "
6    |        8 | "Message-ID: <384643.1075853137096.Ja... "
...

This query reads all the email text into PureData System for Analytics, and the statement can be used like any other table. We also join in the employee and email IDs so we can easily filter on the employee ID. For performance reasons, you would normally prefix this with a JAQL_SUBMIT call that pushes the filtering into BigInsights. We have omitted this here because it would force us to create a stored procedure that first does the JAQL_SUBMIT and then reads the filtered document. We would also need to either add the employee ID to the email documents or push the requested email IDs into the JAQL string. This would go beyond the scope of the article, so take the following code as inspiration:

CALL JAQL_SUBMIT('
	read(del("hdfs:/tmp/email_text.del", {schema: schema{id, text}}))
		-> filter $.id == "5" 
		-> write(del("hdfs:/tmp/single_employee_emails.del", 
				{schema: schema{id, text}}));
	');

We have now finished all preparations and can start creating our Cognos reports.


Report creation

We will briefly describe the demonstration report we created, along with the database tables of our sample application. We will first show the data model layer and then the created reports. It is a straightforward Cognos report, so we go into minimal detail. You can find more information on how to create Cognos reports in the Resources section.

Data model

Our data model consists of six query subjects: four physical tables, one derived query subject for easier report creation, and one dynamic query subject based on a query:

  • NATION: Every employee record has a nation ID; this is a lookup table containing the nation information.
  • EMPLOYEE: A table with employee records, containing employee ID, nation ID, first name, and last name.
  • EMPLOYEE_EMAIL: A table mapping employee IDs to email IDs, so we can match email to a specific employee.
  • EMAIL_SCORE: This is the table used to store the email score in our JAQL script. It has two columns: email ID and work_private score.
  • EMAIL_SCORE_FILTERED: This query subject is created on top of EMAIL_SCORE to make report creation easier.
  • EMAIL_TEXT: This query subject allows us to drill down to the email text for a given employee using the function described above.

We imported the first four query subjects in Cognos based on the underlying Netezza tables of the same name. We also need to create relationships between all query subjects, so Cognos can join them correctly on nation ID, email ID, or employee ID when columns from multiple query subjects are used in the same report query.

The query subject EMAIL_SCORE_FILTERED converts the work/private score for each email ID into two query items (WORK_MAILS and PRIVATE_MAILS), each of which can be 0 or 1. It uses CASE expressions: if the WORK_PRIVATE score is positive, the work item is 1 and the private item is 0; if it is negative, the work item is 0 and the private item is 1. This allows us to easily count the number of private and work email messages each employee has been sending.
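
The following SQL fragment is a minimal sketch of the kind of expression behind this query subject; the exact query items generated by Cognos will differ, and the handling of a score of exactly 0 is an assumption.

SELECT
    EMAIL_ID,
    CASE WHEN WORK_PRIVATE > 0 THEN 1 ELSE 0 END AS WORK_MAILS,
    CASE WHEN WORK_PRIVATE < 0 THEN 1 ELSE 0 END AS PRIVATE_MAILS
FROM EMAIL_SCORE;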

The query subject EMAIL_TEXT utilizes the drill-through function we developed in the previous section. It is created on a native SQL query; some minor syntax changes have been made to accommodate Cognos. When this query subject is referenced in a Cognos report, the underlying query is executed, fetching the email text into PureData System for Analytics and then into Cognos. The full data model looks like the following figure.

Figure 4. Data model

Putting it all together: The Cognos report

Our Cognos demo consists of two report pages connected via drill-through functionality. The main report shows columns from the static tables NATION, EMPLOYEE, and EMAIL_SCORE. It is a table of employees with personal information (employee ID, first name, last name, nation) and information from the text annotation (i.e., how many private and work email messages were sent). We also implemented drill-through functionality on the employee ID field. If a specific employee ID is selected, the second report is called with the employee ID as an input parameter to fetch the email text for this employee.

Figure 5. Main report page

The second report page contains a simple table of all the email messages of the specific employee. It displays the email ID and the truncated email text fetched from BigInsights. We also display the work_private score.

Figure 6. Drill-through report page

Conclusion

We have demonstrated one way to integrate the PureData System for Analytics and the BigInsights Hadoop environment, and we tried to show how each can be used for its strengths. PureData System for Analytics is used as a warehouse and base platform for running Cognos BI reports, and BigInsights is used to store and analyze text files. In BigInsights, we used the powerful text-extraction language AQL to extract structured information from the unstructured text files. We used JAQL to run the text annotation and upload the results into PureData System for Analytics. Finally, we used the Netezza UDFs to implement a drill-down capability from PureData System for Analytics to BigInsights, showing both integration directions. We brought all of this together using the example of email analysis, storing and analyzing email data in BigInsights and the structured employee data in PureData System for Analytics.

We covered a lot of ground, so we have not been able to investigate each part in full detail. In a real production solution, the AQL would be much more complex. We would also store the data using JAQL in a compressed and optimized fashion, and the drill-through would be implemented in a way that is parallelizable. Nevertheless, we hope this article has given you a good starting point.


Download

Description: Scripts and demo files used in this article
Name: NZ-BigInsights-Article.zip
Size: 96KB

Resources

