IBM® BigInsights® BigIntegrate and BigQuality allow IBM InfoSphere® Information Server to be deployed on Apache Hadoop, using resources in the Hadoop cluster through the Hadoop resource manager, Yet Another Resource Negotiator (YARN). This offering introduces data locality: the logic in existing and new IBM InfoSphere DataStage® jobs can run on the Hadoop data nodes where the Hadoop Distributed File System (HDFS) blocks reside.
Figure 1 shows where InfoSphere Information Server on Hadoop fits into the broader Hadoop architecture.
Figure 1. Hadoop Architecture with InfoSphere Information Server on Hadoop
As an InfoSphere® Information Server support engineer, I often help IBM® BigInsights® BigIntegrate and BigQuality clients with questions about the installation and configuration of InfoSphere Information Server on Hadoop. For example, clients often ask:
- “What method of installation for InfoSphere Information Server on Hadoop should I use?”
- “How does Kerberos affect my Hadoop cluster?”
- “What are the steps to install InfoSphere Information Server on Hadoop?”
- “How can I determine the required permissions for job processing?”
- “What are the key environment variables and configuration parameters for InfoSphere Information Server on Hadoop?”
- “What are the main log files I should review if jobs fail?”
To answer these and other questions, and to share my experience with a wider audience, I wrote the IBM Redbooks Analytics Web Doc “IBM BigInsights BigIntegrate and BigQuality: IBM InfoSphere Information Server on Hadoop Deployment and Configuration Guide” (TIPS1339). This document is intended to jump-start your deployment and configuration of the IBM BigInsights BigIntegrate and BigQuality solution.
This document covers the following topics:
- InfoSphere Information Server on Hadoop installation
- InfoSphere Information Server on Hadoop configuration
- The APT_CONFIG_FILE environment variable
- Container resource requirements
- Log files
- IBM JDK recommendations
Scott Brokaw began his career with IBM as a Software Engineer in the IBM InfoSphere Information Server area. Scott is currently a member of the IBM Analytics Advanced Technical Support team, where he focuses on resolving complex issues and escalations. He is a subject matter expert in InfoSphere Information Server authentication, IBM PureData System for Analytics (powered by Netezza) connectivity, source control integration, and Hadoop integration. Scott is a graduate of Providence College, Rhode Island, USA.