Configuring analysis and data rule jobs to run on Spark
To perform an InfoSphere® Information Analyzer analysis or run data rules on files located in a Hadoop cluster, you can use an Apache Spark service in that cluster to run the analysis and data rule jobs.
Before you begin
- All Apache Spark nodes must be licensed. For more information, see http://www-03.ibm.com/software/sla/sladb.nsf/search?OpenForm.
- If you have a Hive data source that is co-located with Apache Spark, configure the Hive data source as your analysis database. If your job does not write results to the analysis database, you do not need to configure Hive as your analysis database. For example, if you run a column analysis without storing frequency distributions, you do not need to use Hive as your analysis database. You also do not need to configure Hive as your analysis database if you want to run only primary key analysis jobs on Spark; the analysis results are written to Kafka and then to the metadata repository database.
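To confirm that the Hive data source is reachable before you configure it as the analysis database, you can connect to it with Beeline. This is a minimal sketch; the host name, port, and database name are placeholders for your environment:
beeline -u "jdbc:hive2://hive-host.example.com:10000/default" -e "SHOW TABLES;"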
- You must have a Hadoop environment with an Apache Spark version 2.0 or later service installed, and a Livy server configured that allows you to connect to that Spark service. For Livy, you can use either Cloudera Livy version 0.3.0 or Apache Livy version 0.5.0-incubating. If you are using Hortonworks Data Platform (HDP), see the HDP documentation for installing Apache Spark and for configuring a corresponding Apache Livy server.
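To confirm that the Livy server is reachable from the InfoSphere Information Server services tier, you can query its REST API. A minimal sketch, assuming Livy listens on its default port 8998 (the host name is a placeholder):
curl http://livy-host.example.com:8998/sessions
The call returns a JSON document that lists the active Spark sessions; an empty list still confirms connectivity.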
- You must have the Data Administrator role.
- If you store frequency distribution results, consider using the iisAdmin command to change the com.ibm.iis.ia.server.spark.parallel.writes parameter so that the frequency distribution results are written by parallel jobs, which improves performance.
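A hedged sketch of setting and then verifying the property with the iisAdmin tool; the value 4 is illustrative, so check the valid range for your release:
<IIS_Install>/ASBServer/bin/iisAdmin -set -key com.ibm.iis.ia.server.spark.parallel.writes -value 4
<IIS_Install>/ASBServer/bin/iisAdmin -display -key com.ibm.iis.ia.server.spark.parallel.writes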
- If you created custom data classes, data quality dimensions, or data rule functions, run the following commands to make them available to Apache Spark jobs:
  - Use the IAAdmin tool to issue the following command to see which custom libraries exist on your system:
    <IIS_Install>/ASBServer/bin/IAAdmin -user admin -password admin -url https://myhost:9443 -listLibraries
  - If some of your libraries are not listed, such as custom data classifier libraries that were deployed before installing the latest InfoSphere Information Server fix pack, deploy the libraries by issuing the following command:
    <IIS_Install>/ASBServer/bin/IAAdmin -user admin -password admin -url https://myhost:9443 -updateDataClasses myDataClasses.jar
    After running the command, verify that they appear in the list of libraries.
  - Each library listed has a corresponding jar file in the <IIS_Install>/ASBServer/lib/ directory. Copy all of these jar files to the <IIS_Install>/ASBServer/apps/lib/iis/odf/ directory.
  - Issue the following command to deploy the custom libraries:
    <IIS_Install>/ASBServer/bin/iisAdmin -deploy -libModule odf -libName iis -srcDir <IIS_Install>/ASBServer/apps/lib/iis/odf
About this task
You can run column analysis, data quality analysis, primary key analysis, and data rules on Apache Spark from the InfoSphere Information Analyzer thin client and workbench. All other types of analysis jobs, such as relationship analysis, overlap analysis, and cross-domain analysis, can be run only as InfoSphere DataStage® jobs. Analysis on Apache Spark is supported for HDFS data sets (CSV, ORC, Avro, and Parquet file formats) and for Hive tables that are located on the same cluster as the Spark service that is configured to run the jobs.
Procedure
What to do next
You can monitor the Spark applications in the YARN Resource Manager UI, for example: http://server1.domain.com:8088/cluster/apps. You can also check the Spark logs or the WebSphere Application Server logs in InfoSphere Information Server for entries that indicate that a Spark job was created, a connection to the Livy server was established, and so on.
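If you prefer the command line, the YARN CLI on the cluster shows the same application list. A minimal sketch; filter the output as needed, because application names vary by release:
yarn application -list -appStates RUNNING,FINISHED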