Cognos Business Intelligence 10.2 reporting on InfoSphere BigInsights

Learn how to consume IBM® InfoSphere BigInsights data through IBM® Cognos® Business Intelligence 10.2.

Jason Tavoularis, Product Manager, IBM China

Jason has spent the past several years engaging with IBM Cognos customers in roles spanning customer support, demonstrations and enablement, and product management. He is currently a Product Manager focused on IBM Cognos Business Intelligence data access.



January 2013 (First published 25 September 2012)

Also available in Chinese

Overview

Big Data marks a new era in business analytics. Organizations now have the opportunity to make informed decisions on matters where the associated data was traditionally unusable due to its volume, velocity, and variety.

IBM is uniquely positioned to offer the full breadth of technology and services needed to draw actionable insights from Big Data. Through its interoperability with IBM InfoSphere BigInsights, IBM Cognos Business Intelligence now has easy access to unstructured source data, giving business analysts exposure to the key conclusions found in large volumes of text.

This article provides guidance on consuming InfoSphere BigInsights data through Cognos Business Intelligence.

Applicability, exclusions, and exceptions

All references to IBM Cognos Business Intelligence imply Version 10.2.0. All references to IBM InfoSphere BigInsights imply Version 1.4. The techniques and product behaviors outlined in this document may not be applicable to future releases.

Cognos Business Intelligence 10.2 integration with InfoSphere BigInsights 1.4

Hive is a data warehouse system that resides on top of Hadoop. Hive offers extract, transform, and load (ETL) tools that can add structure even to text-based source data, and it enables the analysis of high volumes of data stored in the Hadoop Distributed File System (HDFS).

Cognos Business Intelligence generates Hive QL to query the BigInsights file system. As shown in Figure 1, the Cognos Business Intelligence server connects to BigInsights through the BigInsights Hive JDBC driver. The metadata from the Hive Catalog can be imported into Cognos Framework Manager enabling a business intelligence modeler to publish packages of BigInsights content to the Cognos Business Intelligence server. These packages can be used to create compelling Cognos reports (including Active Reports), dashboards, and workspaces, all while leveraging BigInsights' Map/Reduce capabilities.

Figure 1. High-level architecture

Establish the data source connection

Make the Hive JDBC files available to the IBM Cognos Business Intelligence server

The required JDBC files can be downloaded through a URL exposed on a BigInsights server. Note that the version of the Hive JDBC driver must be the same one that is shipped with the BigInsights server it will connect to.

Procedure

  1. From the BigInsights web interface, locate the Enable your Eclipse development environment for BigInsights application development option in the Quick Links area, as shown in Figure 2. This launches a pop-up with a link to download a large ZIP file called BigInsightsEclipseTools.zip.
    Figure 2. Quick Links option for BigInsights Eclipse Tools download
  2. Extract the BigInsightsEclipseTools.zip file and then extract the com.ibm.biginsights.hive.lib_1.1.0.v20120518_1947 and com.ibm.hadoop_1.0.0.v20120605_0341 JAR files that can be found within the \plugins directory shown in Figure 3.
    Figure 3. BigInsights Eclipse tools
  3. Copy the following Hive and Hadoop JAR files into the <c10_location>\webapps\p2pd\WEB-INF\lib folder of each Cognos install that will be connecting to the BigInsights server:
    • hive-exec-0.8.0.jar
    • hive-metastore-0.8.0.jar
    • libthrift.jar
    • hive-jdbc-0.8.0.jar
    • libfb303.jar
    • hive-service-0.8.0.jar
    • hadoop-core-1.0.0.jar
    Figure 4. BigInsights Hive JDBC JAR files

    Author's note: All Hive JAR files that must be copied to the Cognos BI server for interoperability with IBM InfoSphere BigInsights Version 2.0 are included in the Download Hive JDBC package option on the BigInsights 2.0 web interface.

  4. Stop and restart the Cognos service.
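The JAR copy in step 3 can be sketched as a short shell script. The paths here are purely illustrative: the demo below uses temporary directories with empty placeholder files so that it can run anywhere; on a real system, point C10_LOCATION at your actual Cognos install root and copy the JARs extracted from BigInsightsEclipseTools.zip.

```shell
# Demonstration with placeholder files in temporary directories.
# On a real system, C10_LOCATION is the Cognos install root and the
# source JARs come from the extracted BigInsights Eclipse tools plugins.
C10_LOCATION=$(mktemp -d)
LIB="$C10_LOCATION/webapps/p2pd/WEB-INF/lib"
mkdir -p "$LIB"

SRC=$(mktemp -d)
for jar in hive-exec-0.8.0.jar hive-metastore-0.8.0.jar hive-jdbc-0.8.0.jar \
           hive-service-0.8.0.jar libthrift.jar libfb303.jar hadoop-core-1.0.0.jar; do
  touch "$SRC/$jar"   # stand-in for the real JAR file
done

cp "$SRC"/*.jar "$LIB/"
ls "$LIB" | grep -c '\.jar$'   # 7 JARs now in the Cognos lib folder
```

Remember to repeat this copy on each Cognos install that will connect to the BigInsights server, then stop and restart the Cognos service.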

Define the data source connection in Cognos Administration

After making the Hive JDBC files available to the Cognos Business Intelligence server, a new data source must be created in IBM Cognos Administration.

Procedure

  1. When creating a new data source in the Data Source Administration area of IBM Cognos Administration, select JDBC in the connection Type field and then click Next.
  2. Select the IBM InfoSphere BigInsights (Hive) entry in the Type field shown in Figure 5.
    Figure 5. InfoSphere BigInsights entry on the JDBC connection string screen
  3. Enter the BigInsights server URI into the JDBC URL field in the format of //<host>:<port>/<dbname>. For example: //localhost:10000/default
  4. Hive Version 0.8.0, which comes with InfoSphere BigInsights V1.4, does not support authentication; therefore, a dummy user name and password should be entered when defining the connection in Cognos Administration, as shown in Figure 6.
    Figure 6. Dummy user ID and password
  5. At the bottom of the screen, click Next to advance to where connection command blocks can be specified, as shown in Figure 7. Connection command blocks are the mechanism by which the Cognos Business Intelligence server can send additional context to BigInsights. (See Listing 1 for the XML command block.)
    Figure 7. Example of open session command block
    Listing 1. XML database session command block
    <commandBlock>
      <commands>
        <sqlCommand>
         <sql>set hive.auto.convert.join = true</sql>
        </sqlCommand>   
      </commands>
    </commandBlock>

Metadata modeling

The standard guidelines documented in the Cognos Framework Manager user guide apply to modeling metadata from the BigInsights Hive Catalog, but note that Hive in BigInsights V1.4 does not support certain SQL concepts, such as primary key-foreign key definitions and non-equality joins. After data source query subjects are created following a metadata import of the Hive Catalog, relationships and determinants must be defined.

Note that Hive provides a MapJoin optimization: when a small table is joined with a large table, the small table is held in memory by the Mapper and the join processing is performed entirely during the Map stage. MapJoins can greatly improve query performance by avoiding the sorting and merging of files in the Reduce stage. Hive converts a join operation into a MapJoin automatically based on the number of rows and the size of each table. This behavior is controlled by the flag hive.auto.convert.join=true, which can be set as an open session command block when defining the data source connection in Cognos Administration (Figure 7). Other settings that control the thresholds for initiating a MapJoin are documented online.
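As a sketch, the session settings relevant to automatic MapJoin conversion can be issued as open session commands; the threshold value below is only an illustration, so check the Hive documentation for the defaults in your release.

```sql
-- Let Hive convert eligible joins into MapJoins automatically.
set hive.auto.convert.join = true;

-- Size threshold (in bytes) below which a table is considered small
-- enough to be held in memory for a MapJoin; this value is illustrative.
set hive.mapjoin.smalltable.filesize = 25000000;
```

Each statement can be placed in its own sqlCommand element within the command block shown in Listing 1.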


Timestamps

Due to a known issue with Hive Version 0.8.0, you should avoid using Hive timestamp columns with Cognos Business Intelligence. Such columns should be declared as String before being imported into Cognos Framework Manager. From Cognos Framework Manager or in one of the authoring interfaces of Cognos Business Intelligence, you can cast the String columns back to timestamps to allow any date/time operations to be performed locally on the Cognos Business Intelligence server. For example, you can enter an expression like Select cast(c1 as timestamp) From T1 inside of a data source query subject in Cognos Framework Manager.
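A minimal sketch of this workaround, assuming a hypothetical Hive table T1 with a timestamp-like column c1 stored as a string (the table and column names follow the example above):

```sql
-- In Hive, declare the column as STRING rather than TIMESTAMP.
CREATE TABLE T1 (
  c1 STRING   -- holds values such as '2012-09-25 10:30:00'
);

-- In a Cognos Framework Manager data source query subject, cast the
-- string back to a timestamp so that date/time operations are performed
-- locally on the Cognos Business Intelligence server.
Select cast(c1 as timestamp) From T1
```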


Performance considerations for Big Data reporting

In general, all the Cognos Business Intelligence best practices for authored reporting and interactive analysis should be considered when making BigInsights data available for consumption. This section emphasizes a few points that are important considering the nature of Big Data.

Prompt-based filters can be used to ensure that the Cognos Business Intelligence server retrieves only the required records. BigInsights may store extremely high volumes of data; therefore, filtering is the most important mechanism for ensuring acceptable report execution times. Note that the population of prompt values and other data containers can benefit from previously captured result sets stored in the Cognos query service's in-memory cache.

The architecture of Hive is better suited to large query processing than to handling many light queries, because the setup time for a query is significantly higher than in most popular RDBMSs. Therefore, consider scheduling pre-authored reports when ad hoc analysis is not absolutely necessary. Scheduling reports in batches benefits from the Cognos Business Intelligence query service's cache management system, minimizing the amount of data that is fetched from BigInsights and significantly reducing report execution times.

If interactivity is required on large volumes of data, consider whether the analysis requirements can be met by scheduling Cognos Active Reports. If ad-hoc analysis is required, educate business users on how to perform multiple operations with a single drag-and-drop gesture. (Modelers should consider adding stand-alone filters and calculations to the package so that multi-click operations to define the filters/calculations can be avoided.)


Partitioning

Hive provides the capability to organize tables into partitions based on the value of a partitioned column. Appropriate use of partitions can greatly improve performance, especially when consumers are primarily interested only in a particular subset of data.

For instance, consider a sample table table1 that holds transactional data for the years 2001 to 2012. In the absence of a partition field, a query such as select product, sum(sales_amount) from table1 where year = 2001 will scan all of the data in table1. If the table is partitioned on a year column, the data for the year 2001 is kept together, and the same query will process only the partition corresponding to year 2001, minimizing the overhead of the query. Prompt values and other such results can be returned significantly faster from a partitioned column than from a non-partitioned column.
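The scenario above might be set up as in the following sketch; the column names and types are illustrative:

```sql
-- table1 partitioned by year; each year's rows are stored together in HDFS.
CREATE TABLE table1 (
  product STRING,
  sales_amount DOUBLE
)
PARTITIONED BY (year INT);

-- Only the year = 2001 partition is scanned, not the full table.
SELECT product, SUM(sales_amount)
FROM table1
WHERE year = 2001
GROUP BY product;
```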

Note that it is possible to create sub-partitions within a partition. Effective partitioning presumes an understanding of how business user questions will be posed and the resulting query load. Once this understanding has been achieved, you will know how many levels of partitioning you need; for example, by country and, within each country, by city.

Limitations of using partitions:

  • A filter value used in the query must exactly match the data in the partition field.
  • If the partition field is not used to filter the data, the query will be executed against all partitions.

Hive tables and partitions can be further subdivided into buckets, which can be sorted by one or more columns. The advantage of buckets is that MapJoins run substantially faster when tables are joined on bucketed fields, and the performance of group by operations improves as well.
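For example, a bucketed variant of the partitioned table might look like this sketch; the bucket count of 32 and the choice of bucketing column are arbitrary:

```sql
-- Bucket rows by product into 32 sorted buckets within each partition.
CREATE TABLE table1_bucketed (
  product STRING,
  sales_amount DOUBLE
)
PARTITIONED BY (year INT)
CLUSTERED BY (product) SORTED BY (product) INTO 32 BUCKETS;

-- Ask Hive to enforce bucketing when populating the table.
set hive.enforce.bucketing = true;
```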


User-defined functions

Sometimes the desired query cannot be easily expressed using built-in functions. By writing a user-defined function (UDF), you can plug in your own processing code and invoke it from a Hive query. UDFs must be written in Java™, the language Hive itself is written in. Note that UDFs are also a means by which you can execute custom Map/Reduce jobs from a Hive query.

Procedure

  1. Create the UDF Java sources, then compile and package them within a JAR file.
  2. Make the JAR file known to the Hive runtime by either:
    • adding the JAR file to the Hive aux path within hive-site.xml, which makes it visible to all Hive sessions, or
    • executing a command such as add jar /home/hdpadmin/udf/mystrip.jar. In this case, the JAR file is visible only to the current session, so you will need to enter this as an open session command when defining the BigInsights data source connection in Cognos Administration.
  3. If, and only if, the JAR file was not registered in hive-site.xml, you must define the function as an open session command such as create temporary function mystrip as 'com.ibm.cognos.bigdata.udf.MyStrip'.
  4. Call upon the function in the Expression Editor interface of Cognos Business Intelligence.
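As a sketch, the MyStrip class referenced in the commands above might look like the following; this assumes the hive-exec and hadoop-core JARs are on the compile classpath, and the trimming logic is purely illustrative:

```java
package com.ibm.cognos.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical implementation of the UDF registered as "mystrip".
// Hive calls evaluate() once per input row.
public final class MyStrip extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;   // preserve SQL NULL semantics
    }
    return new Text(input.toString().trim());
  }
}
```

Once registered, the function can be invoked from a Hive query, for example select mystrip(c1) from T1, and surfaced to report authors through the Cognos Expression Editor.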

Conclusion

Through BigInsights, many new sources of information can now be analyzed in Cognos Business Intelligence; sources that were previously discarded due to technological limitations. This lets you combine Cognos' strengths in areas like dashboarding, distribution, and visualization with BigInsights' massively parallel processing power. The interoperability described in this article is merely the first milestone in a very promising future of integration between IBM Cognos Business Intelligence and IBM InfoSphere BigInsights.


Acknowledgements

Special thanks to Amol Sonawane, Bharath Ganesh, Santosh Ghule and Nigel Campbell for their contributions and reviews of this article.

