Big Data marks a new era in business analytics. Organizations now have the opportunity to make informed decisions on matters where the associated data was traditionally unusable due to its volume, velocity, and variety.
IBM is uniquely positioned to offer the full breadth of technology and services needed to draw actionable insights from Big Data. Through its interoperability with IBM InfoSphere BigInsights, IBM Cognos Business Intelligence now has easy access to unstructured source data providing business analysts with exposure to the key conclusions found in large volumes of text.
This article provides guidance on consuming InfoSphere BigInsights data through Cognos Business Intelligence.
Applicability, exclusions, and exceptions
All references to IBM Cognos Business Intelligence imply Version 10.2.0. All references to IBM InfoSphere BigInsights imply Version 1.4. The techniques and product behaviors outlined in this document may not be applicable to future releases.
Cognos Business Intelligence 10.2 integration with InfoSphere BigInsights 1.4
Hive is a data warehouse system that resides on top of Hadoop. Hive offers extract, transform, and load (ETL) tools that can add structure to even text-based source data and enables the analysis of high volumes of data stored in the Hadoop Distributed File System (HDFS).
Cognos Business Intelligence generates Hive QL to query the BigInsights file system. As shown in Figure 1, the Cognos Business Intelligence server connects to BigInsights through the BigInsights Hive JDBC driver. The metadata from the Hive Catalog can be imported into Cognos Framework Manager enabling a business intelligence modeler to publish packages of BigInsights content to the Cognos Business Intelligence server. These packages can be used to create compelling Cognos reports (including Active Reports), dashboards, and workspaces, all while leveraging BigInsights' Map/Reduce capabilities.
Figure 1. High-level architecture
Establish the data source connection
Make the Hive JDBC files available to the IBM Cognos Business Intelligence server
The required JDBC files can be downloaded through a URL exposed on a BigInsights server. Note that BigInsights requires the version of the Hive JDBC driver to be same one that is shipped with the server it will be connecting to.
- From the BigInsights web interface, locate the option for
Enable your Eclipse development environment for
BigInsights application development in the Quick Links
area as shown in Figure 2. This will launch a
pop-up with a link to download a large ZIP file called
Figure 2. Quick Links option for BigInsights Eclipse Tools download
- Extract the BigInsightsEclipseTools.zip file and then extract the
com.ibm.hadoop_1.0.0.v20120605_0341 JAR files that can be found within
\pluginsdirectory shown in Figure 3.
Figure 3. BigInsights Eclipse tools
- Copy the following Hive and Hadoop JAR files into the
<c10_location>\webapps\p2pd\WEB-INF\libfolder of each Cognos install that will be connecting to the BigInsights server:
Figure 4. BigInsights Hive JDBC JAR files
Author's note: All Hive JAR files that must be copied to the Cognos BI server for interoperability with IBM InfoSphere BigInsights Version 2.0 are included in the Download Hive JDBC package option on the BigInsights 2.0 web interface.
- Stop and restart the Cognos service.
Define the data source connection in Cognos Administration
After making the Hive JDBC files available to the Cognos Business Intelligence server, a new data source must be created in IBM Cognos Administration.
- When creating a new data source in the Data Source Administration area of IBM Cognos Administration, select JDBC in the connection Type field and then click Next.
- Select the IBM InfoSphere BigInsights (Hive) entry in
the Type field shown in Figure 5.
Figure 5. InfoSphere BigInsights entry on the JDBC connection string screen
- Enter the BigInsights server URI into the JDBC URL field in the format
//<host>:<port>/<dbname>. For example:
- Hive Version 0.8.0, which comes with InfoSphere BigInsights V1.4, does
not support authentication, therefore a dummy username and password
should be entered when defining the connection in Cognos
Administration, as shown in Figure 6.
Figure 6. Dummy user ID and password
- At the bottom of the screen, click
Next to advance to where connection command
blocks can be specified, as shown in Figure 7.
Connection command blocks are the mechanism for which the Cognos
Business Intelligence server may send additional context to
BigInsights. (See Listing 1 for the XML command
Figure 7. Example of open session command block
Listing 1. XML database session command block
<commandBlock> <commands> <sqlCommand> <sql>set hive.auto.convert.join = true</sql> </sqlCommand> </commands> </commandBlock>
The standard guidelines documented in the Cognos Framework Manager user guide are applicable for modeling the metadata from the BigInsights' Hive Catalog, but it should be noted that the Hive in BigInsights V1.4 does not support certain SQL concepts like primary-foreign key definitions and non-equality joins. After data source query subjects are created following a metadata import of the Hive Catalog, relationships and determinants must be defined.
Note that Hive provides a MapJoin optimization whereby when joining a small
table with a large table, the data of the small table is held in the
Mapper and the processing associated with the join is performed during the
Map stage exclusively. MapJoins can greatly improve query performance by
avoiding the sorting and merging of files in the Reduce stage. Hive
converts a join operation into a MapJoin automatically based on number of
rows and size of each table. This is controlled by the flag
hive.auto.convert.join= true, which can be set
as an open session command block when defining the data source connection
in Cognos Administration (Figure 7). You can find other
settings that control the thresholds for initiating a MapJoin documented online.
Due to a known issue with Hive Version 0.8.0, you should avoid using Hive
timestamp columns with Cognos Business Intelligence. Such columns should
be declared as String before being imported into Cognos Framework Manager.
From Cognos Framework Manager or in one of the authoring interfaces of
Cognos Business Intelligence, you can cast the String columns back to
timestamps to allow any date/time operations to be performed locally on
the Cognos Business Intelligence server. For example, you can enter an
cast(c1 as timestamp) From T1 inside of
a data source query subject in Cognos Framework Manager.
Performance considerations for Big Data reporting
In general, all the Cognos Business Intelligence best practices for authored reporting and interactive analysis should be considered when making BigInsights data available for consumption. This section emphasizes a few points that are important considering the nature of Big Data.
Prompt-based filters can be used to ensure that Cognos Business Intelligence server retrieves only the required records. BigInsights may be storing extremely high volumes of data; therefore filtering is the most important mechanism for ensuring acceptable report execution times. Note that the population of prompt values and other data containers can benefit from previously captured result sets that are stored in the Cognos query service's in-memory cache.
The architecture of Hive is better suited for large query processing than handling many light queries because the set up time for a query is significantly higher compared to the most popular RDBMS. Therefore, consider scheduling pre-authored reports when ah-hoc analysis is not absolutely necessary. Scheduling reports in batches will benefit from Cognos Business Intelligence query service's cache management system, minimizing the amount of data that is fetched from BigInsights and significantly reducing report execution times.
If interactivity is required on large volumes of data, consider whether the analysis requirements can be met by scheduling Cognos Active Reports. If ad-hoc analysis is required, educate business users on how to perform multiple operations with a single drag-and-drop gesture. (Modelers should consider adding stand-alone filters and calculations to the package so that multi-click operations to define the filters/calculations can be avoided.)
Hive provides the capability to organize tables into partitions based on the value of a partitioned column. Appropriate use of partitions can greatly improve performance, especially when consumers are primarily interested only in a particular subset of data.
For instance, consider a sample table
that has transactional data for year 2001 to 2012. A query such as
select product, sum(sales amount)
from table1 where year = 2001 in the absence
of partitioned field will scan all of the data from table1. If this table
is partitioned on a year column, the data for the year 2001 would be kept
together and the same query will only process the partition corresponding
to year 2001, thus minimizing the overhead of the query. Prompt values and
other such results can be returned significantly faster with a partitioned
column compared to a non-partitioned column.
Note that it is possible to create sub-partitions within a partition. Effective partitioning presumes an understanding of how the business user questions will be posed and the reflected query load. Once this understanding has been achieved, you will know how many levels of partitioning you need, for example, by country and within country by city.
Limitations of using partitions:
- A filter value used in the query must exactly match the data in the partition field.
- If the partitioned field is not used to filter the data, a query will be executed against all partitions
Hive tables and partitions can further be subdivided into buckets, which can be sorted by one or more columns. The advantages of buckets is that MapJoins will run substantially faster if the tables are joined on bucket fields, and the performance of group by operations will improve as well.
Sometimes the desired query cannot be easily defined using built-in functions. By writing a user-defined function (UDF), you can easily plug in your own processing code and invoke it from a Hive query. UDFs have to be written in Java™, the language that Hive is written in. Note that UDFs are also a means for which you can execute custom Map/Reduce jobs from a Hive query.
- Create the UDF Java sources and compile and package within a JAR file.
- Make the JAR file known to Hive runtime by either:
- adding the JAR file to the Hive aux path within hive-site.xml. This will make it visible to all Hive sessions.
- or executing a command such as,
add jar/home/hdpadmin/udf/mystrip.jar. In this case, the JAR file would be visible to only the current session so you will need to enter this as an open session command when defining the BigInsights data source connection in the Cognos Administration.
- If, and only if, the Java sources were not registered in the
hive-site.xml, then you must define the function as an open session
command such as,
create temporary function mystrip as ‘com.ibm.cognos.bigdata.udf.MyStrip'.
- Call upon the function in the Expression Editor interface of Cognos Business Intelligence.
Through BigInsights, there are many new sources of information that can now be analyzed in Cognos Business Intelligence; sources that were previously discarded due to technological limitations. This allows you to take advantage of Cognos' strengths in areas like dashboarding, distribution, and visualization, and BigInsights' massively parallel-processing power. The interoperability described in this article is merely the first milestone in a very promising future of integration between IBM Cognos Business Intelligence and IBM InfoSphere BigInsights.
Special thanks to Amol Sonawane, Bharath Ganesh, Santosh Ghule and Nigel Campbell for their contributions and reviews of this article.
- The IBM Cognos 10 Dynamic Query Cookbook (developerWorks Feb 2012): Gain a better understanding of techniques and product behaviours when dealing with the Dynamic Query Mode delivered with IBM Cognos 10.
- IBM Cognos Business Intelligence V10.1 Handbook: Check out this IBM Redbooks publication to learn more about core features.
- IBM InfoSphere BigInsights Information Center: Find more IBM resources for BigInsights.
- Apache Hive Wiki: Visit the wiki to learn more about this data warehouse software facilitates querying and managing large datasets residing in distributed storage.
- developerWorks Business analytics: Find more analytic technical resources for developers.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- Evaluate InfoSphere BigInsights Basic Edition: Try an integrated, tested and pre-configured, no-charge download. You can also use this product on the cloud.
- Evaluate Cognos Business Intelligence: Try a no-charge download for 30 days. You can also use this product in a sandbox environment or on the cloud.
- Evaluation software: Download or explore other online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products.
- developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
- developerWorks profile: Create your profile today and set up a watchlist.