Running the Oozie workflows from the command line

You can run Big Match applications for deriving, comparing, and linking data (and more) as Oozie workflows from a command-line interface.

About this task

The oozieApps and oozieAppPropTemp directories are stored on HDFS under the /bigmatch/oozie directory. To copy a directory from HDFS to the local file system, use a command such as the following:
hadoop fs -get /bigmatch/oozie/oozieAppPropTemp
The general syntax for the commands is:
${OOZIE_HOME}/bin/oozie job -oozie ${OOZIE_URL} -config /home/oozie/{application_name}.properties -run
where OOZIE_HOME is the Oozie installation directory, OOZIE_URL is the URL of the Oozie service, and {application_name} is the name of the application to run. For example, the following command runs the PME Derive application for an Oozie installation at /usr/hdp/current/oozie-client with the Oozie service running at http://mdmbigmatch01.somedomain.com:11000/oozie:
/usr/hdp/current/oozie-client/bin/oozie job -oozie http://mdmbigmatch01.somedomain.com:11000/oozie \
-config /home/oozie/derive.properties -run
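The general syntax can be parameterized so that only the application name changes between runs. The following is a minimal sketch, not part of the product: the function name oozie_run_command is hypothetical, and the default values are copied from the example above. It only prints the command so that you can review it before running it.

```shell
#!/usr/bin/env bash
# Defaults mirror the example above; override them for your environment.
OOZIE_HOME="${OOZIE_HOME:-/usr/hdp/current/oozie-client}"
OOZIE_URL="${OOZIE_URL:-http://mdmbigmatch01.somedomain.com:11000/oozie}"

# Hypothetical helper: print the "oozie job ... -run" command line for one
# application name (for example, "derive"). It does not submit anything.
oozie_run_command() {
  local app_name="$1"
  printf '%s/bin/oozie job -oozie %s -config /home/oozie/%s.properties -run\n' \
    "${OOZIE_HOME}" "${OOZIE_URL}" "${app_name}"
}

# Print the command for PME Derive.
oozie_run_command derive
```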
Note: If Oozie is running with SSL enabled, the bigmatch user must have access to the Oozie client. Set up the client as follows:
  1. Copy the oozie.truststore file onto the client machine and ensure that the bigmatch user can access it.
  2. Pass the truststore location to the Oozie client JVM. The command syntax is as follows:
    export OOZIE_CLIENT_OPTS='-Djavax.net.ssl.trustStore=<path to oozie.truststore>'
    For example:
    export OOZIE_CLIENT_OPTS='-Djavax.net.ssl.trustStore=/home/bigmatch/oozie.truststore'
With SSL enabled, the command syntax for running a Big Match Oozie workflow is unchanged, but the Oozie URL uses https and the secure port. The secure Oozie URL has the format -oozie https://<oozie_server>:<secure_port>/oozie. For example:
-oozie https://node1.domain.com:11443/oozie
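The SSL setup can be combined into a small helper that checks the truststore before exporting the option. This is a sketch under stated assumptions: the function name use_oozie_truststore is hypothetical, and the paths and host name in the usage comment come from the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that the current user can read the truststore,
# then export OOZIE_CLIENT_OPTS so that the Oozie client JVM uses it.
# Note: this overwrites any value already set in OOZIE_CLIENT_OPTS.
use_oozie_truststore() {
  local truststore="$1"
  if [ ! -r "${truststore}" ]; then
    echo "Cannot read truststore: ${truststore}" >&2
    return 1
  fi
  export OOZIE_CLIENT_OPTS="-Djavax.net.ssl.trustStore=${truststore}"
}

# Example usage as the bigmatch user (values taken from the examples above):
#   use_oozie_truststore /home/bigmatch/oozie.truststore &&
#     "${OOZIE_HOME}/bin/oozie" job -oozie https://node1.domain.com:11443/oozie \
#       -config /home/oozie/derive.properties -run
```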
Running a workflow with the -run option returns a job ID. You can then use the following command to see the status of the Oozie workflow:
${OOZIE_HOME}/bin/oozie job -oozie ${OOZIE_URL} -info ${JOB_ID}
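To avoid copying the job ID by hand, you can capture it from the output of the run command. The sketch below assumes that the Oozie client prints the ID on a line of the form "job: <id>"; verify this against the output of your Oozie version. The function name extract_job_id is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the output of "oozie job ... -run" on stdin and
# print the job ID, assuming the client prints a line such as "job: <id>".
extract_job_id() {
  sed -n 's/^job:[[:space:]]*//p' | head -n 1
}

# Example usage (OOZIE_HOME and OOZIE_URL as described above):
#   JOB_ID="$("${OOZIE_HOME}/bin/oozie" job -oozie "${OOZIE_URL}" \
#     -config /home/oozie/derive.properties -run | extract_job_id)"
#   "${OOZIE_HOME}/bin/oozie" job -oozie "${OOZIE_URL}" -info "${JOB_ID}"
```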
As explained elsewhere, the derive, compare, and link applications run automatically by default as you load data. If you run the applications manually, you typically run them in the following order:
  1. PME Derive
  2. PME Compare
  3. PME Link
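The order above can be scripted so that each application is submitted only after the previous one succeeds. This is a rough sketch, not a supported tool. It assumes that the properties files for compare and link are named compare.properties and link.properties (only derive.properties appears in this topic), that the run command prints "job: <id>", and that the -info output contains a line of the form "Status : <STATE>". The names OOZIE_BIN, POLL_SECONDS, run_and_wait, and run_pme_sequence are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed settings; OOZIE_HOME and OOZIE_URL are as described above.
OOZIE_BIN="${OOZIE_BIN:-${OOZIE_HOME}/bin/oozie}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# Submit one application and poll until the workflow reaches a final state.
run_and_wait() {
  local app_name="$1" job_id status
  # Assumption: "-run" prints the job ID as "job: <id>".
  job_id="$("${OOZIE_BIN}" job -oozie "${OOZIE_URL}" \
    -config "/home/oozie/${app_name}.properties" -run |
    sed -n 's/^job:[[:space:]]*//p' | head -n 1)"
  if [ -z "${job_id}" ]; then
    echo "No job ID returned for ${app_name}" >&2
    return 1
  fi
  while true; do
    # Assumption: "-info" prints a line such as "Status        : RUNNING".
    status="$("${OOZIE_BIN}" job -oozie "${OOZIE_URL}" -info "${job_id}" |
      sed -n 's/^Status[[:space:]]*:[[:space:]]*\([A-Z_]*\).*/\1/p' | head -n 1)"
    case "${status}" in
      SUCCEEDED) return 0 ;;
      KILLED|FAILED) echo "${app_name} ended with ${status}" >&2; return 1 ;;
      *) sleep "${POLL_SECONDS}" ;;  # Any other state, or no status: keep polling.
    esac
  done
}

# Run derive, compare, and link in order; stop at the first failure.
run_pme_sequence() {
  local app
  for app in derive compare link; do
    run_and_wait "${app}" || return 1
  done
}

# Example usage: run_pme_sequence
```

The loop has no timeout, so a workflow that stays suspended, or a server that cannot be reached, keeps the script polling until you interrupt it.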
Depending on your needs, you can run any of the following applications individually:
  • Batch Processing
    • PME Derive
    • PME Generate Weights
    • PME Compare
    • PME Link
    • PME Re-index
    • PME Unlink
  • Analysis
    • PME Bulk Search
    • PME Bucket Analysis
    • PME Entity Analysis
    • PME Score Analysis
    • PME Export Sample Pairs
    • PME Token Frequency Analysis
  • Administration
    • PME Export Records
    • PME Extract Entities
    • PME Cache Indexes

These applications are not necessarily part of a typical workflow. See the information about each application for more detail.