Configuring analysis and data rule jobs to run on Spark

To perform an InfoSphere® Information Analyzer analysis or run data rules on files located in a Hadoop cluster, you can use an Apache Spark service in that cluster to run the analysis and data rule jobs.

Before you begin

  • All Apache Spark nodes must be licensed. For more information, see http://www-03.ibm.com/software/sla/sladb.nsf/search?OpenForm.
  • If you have a Hive data source that is co-located with Apache Spark, configure the Hive data source as your analysis database. If your job does not write results to the analysis database, you do not need to configure Hive as your analysis database. For example, if you run a column analysis without storing frequency distributions, you do not need to use Hive as your analysis database. You also do not need to configure Hive as your analysis database if you want to run only primary key analysis jobs on Spark; in that case, the analysis results are written to Kafka and then to the metadata repository database.

  • You must have a Hadoop environment that has Apache Spark version 2.0 or later installed as a service and a Livy server that is configured to connect to that Spark service. For Livy, you can use either Cloudera Livy version 0.3.0 or Apache Livy version 0.5.0-incubating. If you are using Hortonworks Data Platform (HDP), see the HDP documentation for installing Apache Spark and for configuring a corresponding Apache Livy server.
  • You must have the Data Administrator role.
  • If you store frequency distribution results, consider using the iisAdmin command to change the com.ibm.iis.ia.server.spark.parallel.writes property so that the frequency distribution results are written by parallel jobs, which improves performance. A sketch of the command follows this list.
  • If you created custom data classes, data quality dimensions, or data rule functions, you need to run the following commands in order to make them available to Apache Spark jobs:
    1. Use the IAAdmin tool to issue the following command to see which custom libraries exist on your system:
      <IIS_Install>/ASBServer/bin/IAAdmin -user admin -password admin 
      -url https://myhost:9443 -listLibraries
      If some of your libraries are not listed, such as custom data classifier libraries that were deployed before installing the latest InfoSphere Information Server fix pack, deploy the libraries by issuing the following command:
      <IIS_Install>/ASBServer/bin/IAAdmin -user admin -password admin 
      -url https://myhost:9443 -updateDataClasses myDataClasses.jar
      After running the command, verify that they appear in the list of libraries.
    2. Each library listed has a corresponding jar file in the <IIS_Install>/ASBServer/lib/ directory. Copy all of these jar files to the <IIS_Install>/ASBServer/apps/lib/iis/odf/ directory.
    3. Issue the following command to deploy the custom libraries:
      <IIS_Install>/ASBServer/bin/iisAdmin 
      -deploy -libModule odf 
      -libName iis -srcDir <IIS_Install>/ASBServer/apps/lib/iis/odf 
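
For example, here is a minimal sketch of the frequency distribution setting that is mentioned in the list above, following the same iisAdmin -s -k/-val pattern that is used later in this topic. The value is left as a placeholder because the valid values for this property depend on your release; confirm them for your installation before you set it:
    <IIS_Install>/ASBServer/bin/iisAdmin.sh -s 
    -k com.ibm.iis.ia.server.spark.parallel.writes -val <value>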

About this task

You can run column analysis, data quality analysis, primary key analysis, and data rules on Apache Spark from the InfoSphere Information Analyzer thin client and workbench. All other types of analysis jobs, such as relationship analysis, overlap analysis, and cross-domain analysis, can be run only as InfoSphere DataStage® jobs. Analysis on Apache Spark is supported for HDFS data sets (CSV, ORC, Avro, and Parquet file formats) and for Hive tables that are located on the same cluster as the Spark service that is configured to run the jobs.

Procedure

  1. Enable the Apache Spark service and register the Livy server that you want to use for your analysis jobs by issuing the following command:
    /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh 
    -url https://localhost:9443 -user <user_name> -password <password> 
    -registerLivyServer -dataConnectionName <dataConnectionName>
    -livyHost <livyHost> -livyPort <livyPort> -livyMaxNbOfSessions <maxNbOfSessions> 
    -livyAuthID <authID> -livyKeytabPath <keytabPath> -livyProxyUser <proxyUser> 
    -executorMemory <executorMemory> -numExecutors <NumberOfExecutors> 
    -executorCores <CoresPerExecutor> -driverMemory <driverMemory> 
    -driverCores <numberOfDriverCores> -queue <YarnQueue> 
    -sparkConfProperties <Comma-separated configuration properties> 
    -keepConnectionAlive <Use session pooling>
    This command registers the Apache Livy server and enables Apache Spark for all analysis jobs that run on data sets that are imported into InfoSphere Information Analyzer by using the data connection that you specify with the -dataConnectionName parameter. The command takes the following parameters:
    -livyHost
    The host name of the machine that hosts the Livy server.
    -livyPort
    The port number for the Livy server. The default port number is 8998. This parameter is optional.
    -livyMaxNbOfSessions
    The maximum number of Livy sessions (each with its own Spark context) that can run on the Livy server. The default number of sessions is 6. This parameter is optional.
    -livyAuthID
    The principal user name that is used to authenticate against the Livy server. Recommended: Use the livy ID. This parameter is optional.
    -livyKeytabPath
    The path on the InfoSphere Information Server services tier where the keytab file for the principal user can be found. This parameter is optional.
    -livyProxyUser
    The proxy user to switch to when running Spark jobs. You must enable the impersonation feature in Livy to use this parameter. This parameter is optional.
    -executorMemory
    The amount of memory to use per executor process. The default value is 1G. This parameter is optional.
    -numExecutors
    The number of executors that are started for the current session. The default value is 2. This parameter is optional.
    -executorCores
    The number of cores that are used for each executor. The default value is 1. This parameter is optional.
    -driverMemory
    The amount of memory that is used for the driver process. The default value is 1024M. This parameter is optional.
    -driverCores
    The number of cores that are used for the driver process. The default value is 1. This parameter is optional.
    -queue
    The name of the YARN queue to which the job is submitted. The default value is default. This parameter is optional.
    -sparkConfProperties
    A comma-separated list of Spark configuration properties. For example, to use the Kryo serializer, specify -sparkConfProperties spark.serializer:org.apache.spark.serializer.KryoSerializer.
    Each property is a key-value pair, where key and value are separated by a colon (:).
    This parameter is optional.
    -keepConnectionAlive
    Use session pooling to reuse existing sessions. By default, session pooling is used.
    If you do not want to use session pooling, set the parameter to false. Sessions that are already pooled are not affected; you must close those sessions to disable pooling on them.
    This parameter is optional.
    Note: If you want to use the settings that you specify in your Apache Spark and Livy server configuration, do not set the -executorMemory, -numExecutors, -executorCores, -driverMemory, -driverCores, and -queue parameters.

    If your Hadoop cluster is secured, choose a user ID that is used for authentication when InfoSphere Information Server connects to the Livy server. Copy the (Kerberos) keytab file of that user ID to each machine on the InfoSphere Information Server services tier; the keytab file path must be the same on every machine of the services tier. If authentication is used, you must specify the -livyAuthID and -livyKeytabPath parameters, as in the second example below.

    For example, if the Livy server on lived1.fyre.ibm.com listens on port 8999 (configured for Spark2) and the data connection is lived1-hdfs, then run the following command:
    
    /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh -user isadmin 
    -password pass1word -registerLivyServer -livyPort 8999 
    -livyHost lived1.fyre.ibm.com -dataConnectionName lived1-hdfs 
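    If your Hadoop cluster is secured, the same registration might look like the following sketch. The principal name and keytab path shown here are placeholders for illustration only; use the values from your own Kerberos configuration:
    
    /opt/IBM/InformationServer/ASBServer/bin/IAAdmin.sh -user isadmin 
    -password pass1word -registerLivyServer -livyPort 8999 
    -livyHost lived1.fyre.ibm.com -dataConnectionName lived1-hdfs 
    -livyAuthID livy -livyKeytabPath <path_to_keytab_on_services_tier>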
    1. Verify that the registration succeeded by using the IAAdmin command-line tool with the -getLivyServer option. For example:
      IAAdmin -user <user_name> -password <password> -getLivyServer 
      -dataConnectionName <data_connection_name>
      
  2. If your analysis database is a Hive database, use the IAAdmin command-line tool to run the -setIADBParams command with the -hiveHDFSPath parameter to set the file path where you want the analysis results saved. For example:
    IAAdmin -user userName -password password -url https://localhost:9443 
    -setIADBParams -iaDBHost <HiveHost> -iaDBDataConnection <Hive_IADB_DataConnection> 
    -iaDataSource <JNDI Hive data source> 
    -hiveHDFSPath hdfs://nm1.domain.com:8020/apps/hive/warehouse 
    -bucketsInHiveTable 3
    

    If the Hive database is the same as the HDFS source on which you are running analysis, you can specify just the Hive path. For example, /apps/hive/warehouse.
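    In that case, the command might look like the following sketch; the host and data connection names are placeholders:
    IAAdmin -user userName -password password -url https://localhost:9443 
    -setIADBParams -iaDBHost <HiveHost> -iaDBDataConnection <Hive_IADB_DataConnection> 
    -iaDataSource <JNDI Hive data source> -hiveHDFSPath /apps/hive/warehouse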

    Note: The -hiveHDFSPath parameter is a uniform resource identifier (URI) that is prefixed with the value of the fs.defaultFS Hadoop HDFS property of the target HDFS cluster.

    Optionally, you can specify the -bucketsInHiveTable parameter to define the number of buckets that you want the Hive table distributed across.

    1. Verify that the user that runs jobs through the Livy server has the required permissions to write to the HDFS application directory.
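      One way to spot-check write access from a cluster node is to use the HDFS command-line client as that user. The directory shown here is only an example; substitute your own target path:
      # run these commands as the user that the Livy server uses to submit jobs
      hdfs dfs -ls /apps/hive/warehouse
      hdfs dfs -touchz /apps/hive/warehouse/.ia_write_test
      hdfs dfs -rm /apps/hive/warehouse/.ia_write_test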
  3. If you want to analyze Hive files in the ORC, Avro, and Parquet formats, complete the following steps:
    1. Set the com.ibm.iis.odf.livy.ugconnector.enabled property to true so that the correct file delimiter is identified when the file content is read. Run the following command:
      /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh -s -k com.ibm.iis.odf.livy.ugconnector.enabled -val true
    2. Configure settings on the engine tier so that you can import these files by using InfoSphere Metadata Asset Manager. For details, see the Adding Hive files to the InfoSphere Information Analyzer thin client topic.
  4. Run a column analysis to start the analysis job on Spark.

What to do next

To verify that the analysis job is being run on the Hadoop cluster, open the monitoring application for the Spark cluster and look for a Livy application with the name livy-session-nnn and the type SPARK. For example, if you are using the Hortonworks Data Platform with the standard resource manager (YARN), and the resource manager application address in the yarn-site.xml file is server1.domain.com:8088, then open the following monitoring application: http://server1.domain.com:8088/cluster/apps.
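
If you have command-line access to a cluster node, you can also look for the session with the YARN command-line client; the grep filter is just a convenience:
  # list running YARN applications and keep only Livy sessions
  yarn application -list -appStates RUNNING | grep livy-session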

You can also check the Spark logs or the WebSphere Application Server logs in InfoSphere Information Server for entries that indicate that a Spark job was created, that a connection to the Livy server was established, and so on.
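
For example, one quick way to scan the application server log for Livy-related entries, assuming a traditional WebSphere Application Server profile (the exact log location and server name depend on your installation and are placeholders here):
  # search the application server log for Livy-related messages
  grep -i livy <WAS_profile_home>/logs/<server_name>/SystemOut.log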