
Hive Scanner Guide

IBM Manta Data Lineage is a powerful data lineage platform that simplifies data management by supporting automated lineage harvesting of the Hive database. This guide will walk you through the steps to set up Hive lineage harvesting.

Step 1: Verify the Prerequisites

Clarify and verify the following details before starting the connection configuration; doing so significantly reduces the amount of troubleshooting needed later.

  1. Who is your Hadoop platform vendor: Cloudera, Hortonworks, or another vendor? You will need the appropriate JDBC driver and documentation from your vendor, as some aspects may differ significantly between distributions.

    1. The easiest and most reliable way to determine your Hadoop platform vendor is to refer to your system’s documentation or configuration. The vendor’s name is usually mentioned explicitly in the system’s documentation, installation guides, or configuration files.

    2. Some Hadoop vendors provide custom command-line tools or utilities. Running these tools might provide information about the vendor. For example, Cloudera might have commands prefixed with cloudera and Hortonworks (now part of Cloudera) might use hdp-select.

  2. Is your Hive platform Kerberized? If so, is it configured to use keytab or cache?
    Obtain the keytab and krb5 files or the JAAS file from your Hive administrator, depending on the authentication setup.

  3. Does your Hive platform use Zookeeper?

  4. Will your firewall/proxy allow communication from Manta Server to Hive’s nodes?

  5. Verify the other requirements and privileges detailed in Hive Integration Requirements.

To check the Hive and Hadoop versions, use the hive --version and hadoop version commands.

Step 2: Install the JDBC Driver

Because the Hive driver depends heavily on the Hive vendor and, in some cases, even on the specific version of Hive, Manta Data Lineage does not bundle a Hive driver with the installation. The Manta administrator must place the downloaded Hive JDBC driver JAR file(s) in the designated mantaflow/agent/manta-flow-agent-dir/lib-ext directory (as of Manta Data Lineage R41, the Manta Agent extractor method should be used for all Hive connections). The Hive JDBC driver is generally bundled with the Hive installation and can be provided by the Hive administrator. If not, it can be downloaded separately from your Hive vendor’s website. Make sure that you identify your Hive vendor correctly; drivers from other vendors will not work.

Manta Data Lineage cannot use the Apache Hive standalone driver because that driver contains incompatible versions of third-party libraries. Instead, use the individual libraries listed below. The best way to get the driver is to:

  1. go to the Maven repository (https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc) and download hive-jdbc for your version of Hive.

  2. download all of its dependencies as well (some are transitive, so check the dependencies of the dependencies, too)

  3. from the list of downloaded jars, remove those that already exist in mantaflow/cli/scenarios/manta-dataflow-cli/ (or in the corresponding directory for the Agent). Check that the version of each library already available in Manta Data Lineage falls within the range of supported versions listed in Maven

  4. place the remaining jars into mantaflow/agent/manta-flow-agent-dir/lib-ext/ for Agent.

For example, for hive-jdbc-3.1.2-bin.tar, you will need to include:

| Artifact | Version | Link |
|---|---|---|
| hive-jdbc | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/3.1.2 |
| curator-framework | 2.12.0 | https://mvnrepository.com/artifact/org.apache.curator/curator-framework/2.12.0 |
| curator-client | 2.12.0 | https://mvnrepository.com/artifact/org.apache.curator/curator-client/2.12.0 |
| hive-common | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-common/3.1.2 |
| hive-service | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-service/3.1.2 |
| hive-serde | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-serde/3.1.2 |
| hive-metastore | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-metastore/3.1.2 |
| hive-shims | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-shims/3.1.2 |
| hive-service-rpc | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2 |
| libthrift | 0.9.3 | https://mvnrepository.com/artifact/org.apache.thrift/libthrift/0.9.3 |
| zookeeper | 3.4.6 | https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.4.6 |
| hadoop-shaded-protobuf_3_7 | 1.1.1 | https://mvnrepository.com/artifact/org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7/1.1.1 |
| hive-exec | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.2 |

As of Manta Data Lineage R40, also add the following.

| Artifact | Version | Link |
|---|---|---|
| commons-lang | 2.6 | https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6 |

Manta Agent deployments require the above libraries to be deployed in agent/manta-flow-agent-dir/lib-ext. Additionally, the following libraries are required.

| Artifact | Version | Link |
|---|---|---|
| httpclient | 4.5.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13 |
| httpcore | 4.4.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.13 |
| commons-logging | 1.2 | https://mvnrepository.com/artifact/commons-logging/commons-logging/1.2 |
| hadoop-common | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/3.3.4 |
| hadoop-auth | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-auth/3.3.4 |
| hadoop-shaded-guava | 1.1.1 | https://mvnrepository.com/artifact/org.apache.hadoop.thirdparty/hadoop-shaded-guava/1.1.1 |
| slf4j-api | 1.7.30 | https://mvnrepository.com/artifact/org.slf4j/slf4j-api/1.7.30 |
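Fetching these jars can be scripted against Maven Central. The following is a minimal sketch, assuming the artifact versions from the tables above; the maven_url helper is hypothetical, not a Manta tool.

```shell
# Sketch: assemble Maven Central download URLs for the jars listed above.
# Group paths and versions follow the Hive 3.1.2 tables; adjust them for
# your vendor and version.
maven_url() {  # usage: maven_url <group-path> <artifact> <version>
  echo "https://repo1.maven.org/maven2/$1/$2/$3/$2-$3.jar"
}

HIVE_VER=3.1.2
for a in hive-jdbc hive-common hive-service hive-serde hive-metastore \
         hive-shims hive-service-rpc hive-exec; do
  maven_url org/apache/hive "$a" "$HIVE_VER"
done
maven_url org/apache/thrift libthrift 0.9.3
maven_url org/apache/zookeeper zookeeper 3.4.6
# Download each printed URL (e.g. with curl -LO) and place the jars in
# mantaflow/agent/manta-flow-agent-dir/lib-ext/ for the Agent.
```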

Step 3: Configure the Connection

Create a new connection in the Admin UI (http(s)://<domain_name>:8181/manta-admin-gui/app/index.html?#/platform/connections/) to enable automated extraction and lineage analysis of Hive DDL files and the database dictionary by Manta Data Lineage. You will need one Hive connection in Manta Data Lineage for each Hive server you want to connect to.

Mandatory Properties to Configure

Hive Connection Verification Using Beeline

You can verify that the Manta host server is able to connect to the Hive server using Hive’s beeline command-line tool. It works both with and without Kerberos. When using Kerberos authentication, a ticket must first be acquired in the default cache (bash variable KRB5CCNAME; verify with klist).

It is not necessary to have the beeline tool installed on the Manta host machine (in fact, most host machines do not have it installed), but it can be useful for troubleshooting Manta Data Lineage Hive connections.

Use the following beeline command to test establishing a connection to the Hive server.

beeline -u [jdbc connection string]
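For example, assuming a hypothetical host hive-host.example.com on port 10000 and the default database, the connection string might be built like this (the Kerberized variant appends the Hive service principal):

```shell
# Hypothetical host, port, and principal -- substitute your own values.
HIVE_HOST=hive-host.example.com
HIVE_PORT=10000
PRINCIPAL="hive/${HIVE_HOST}@EXAMPLE.COM"   # Hive service principal

# Plain connection string:
PLAIN_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/default"
# Kerberized connection string (a TGT must already be in the cache; check with klist):
KRB_URL="${PLAIN_URL};principal=${PRINCIPAL}"

echo "$KRB_URL"
# beeline -u "$KRB_URL"   # run this where beeline is installed
```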

Configuration of Apache Hive - Base authentication (Apache or any non-Cloudera distribution)

  1. Check the major version of Hive.

  2. Set the authentication method to Base in the Hive connection.

  3. If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.

Configuration of Cloudera Hive - Base authentication

  1. Download the latest Cloudera driver (hive-jdbc driver) from https://www.cloudera.com/downloads/connectors/impala/jdbc/ or get the driver from your Hive DB administrator. Because Cloudera’s driver is very different from the Apache Hive driver, the following configuration changes are necessary.

  2. Place the downloaded Hive driver jar file in mantaflow/cli/scenarios/manta-dataflow-cli/lib-ext/

  3. Update the HiveServer2 driver class name in Hive Connection configuration to com.cloudera.hive.jdbc.HS2Driver
    hive.extractor.hiveserver2.driver=com.cloudera.hive.jdbc.HS2Driver

    1. To support JDBC 4.0, classes with the following FQCNs are available:

      • com.cloudera.hive.jdbc4.HS1Driver

      • com.cloudera.hive.jdbc4.HS2Driver

    2. To support JDBC 4.1, classes with the following FQCNs are available:

      • com.cloudera.hive.jdbc41.HS1Driver

      • com.cloudera.hive.jdbc41.HS2Driver

  4. Set the authentication method to Base in the Hive connection.

  5. If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.

Kerberos Authentication

Important: Neither the Kerberos nor the JAAS authentication method can be configured for Cloudera Public Cloud deployments. Because Cloudera Public Cloud operates in a managed cloud environment, its design and security model preclude Kerberos/JAAS authentication; managing Kerberos within such a cloud service is complex and differs significantly from traditional on-premises deployments. Also note that properties are case-sensitive, and properties must not be duplicated in the connection URL.

Connecting to Hadoop HIVE using Kerberos Authentication

Kerberos Verification Before Configuring Manta Data Lineage

If the host machine has Kerberos tools installed (kinit, klist, kvno), follow these steps to verify that Kerberos is configured correctly.

This verification is not required to set up Kerberos authentication, but it rules out most Kerberos configuration issues. It is only applicable if Kerberos is used.

  1. If using a ticket, simply verify that klist lists the acquired ticket. A specific cache file can be given with klist -c [cache file] when needed

  2. If using a keytab, check that these steps complete without error

    1. Check that the bash variable KRB5_CONFIG points to a krb5.conf file

    2. Check which principals’ credentials are saved in the keytab with klist -k [keytab location]

    3. Run kinit -kt [keytab location] [principal name]

  3. Now, with a Ticket Granting Ticket (TGT) in the cache file (from kinit), check that you can obtain a service ticket for Hive

    1. Get Hive service principal from the Hive administrator (this value usually matches the following pattern: hive/host@DOMAIN.COM)

    2. Verify with: kvno [service principal] (e.g. kvno hive/host@DOMAIN.COM)

      1. If kinit is not possible on the host machine, it is also possible to use only the cache file: kvno -c [cache file] [service principal]

Configuration

To configure the authentication to use Kerberos, you must provide Manta Data Lineage with the principal of the service (principal of Hive Server 2) and a principal that can obtain a service ticket for the service. Manta Data Lineage will then get a service token using this principal.

The properties that need to be configured are hive.extractor.hiveserver2.kerberos.principal (set to the Hive Server 2 service principal) and hive.extractor.hiveserver2.kerberos.method.

Optionally, you can provide the krb5.conf file path, if the file isn’t installed in the default location, using the hive.extractor.hiveserver2.krbPath property.

Note that the connection strings may differ across Hive JDBC drivers. For more information about troubleshooting / different configurations, see the Additional Details section below.

There are three ways you can provide the Kerberos credentials to Manta Data Lineage.

1. System Ticket

The system ticket method is the simplest way to configure Kerberos, as Manta Data Lineage will use a TGT (ticket-granting ticket) already obtained by the system. The TGT must be valid (TGTs have limited validity). Configure the connection with hive.extractor.hiveserver2.kerberos.method set to System Ticket.

2. Keytab

Keytab is a secure credential storage that allows Manta Data Lineage to obtain a TGT for the configured principal/user. TGT validity is handled by Manta Data Lineage. The properties that need to be configured are as follows.

3. JAAS

The JAAS (Java Authentication and Authorization Service) file is a universal configuration file for the JDBC driver to configure the authentication. Some drivers (notably the Cloudera driver) may require this method. Note that you may be required to set the connection string parameters to use the JAAS config. For more information, see Additional Details below.

Important: As of R41, some drivers (notably Cloudera Private Cloud/Hortonworks) need to be configured using the JAAS config. Set the authentication method in the connection properties to Kerberos - JAAS and set the JAAS file path to /mantaflow/cli/scenarios/manta-dataflow-cli/etc/jaas/jaas.config

Here are two examples of jaas.conf files.
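For instance, a keytab-based entry might look like the following sketch; the entry name (Client), the keytab path, and the principal are assumptions, so use the entry name your driver expects:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/manta.keytab"
  principal="manta@EXAMPLE.COM"
  doNotPrompt=true;
};
```

A ticket-cache-based entry, which reuses a TGT already acquired with kinit, might look like this:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  doNotPrompt=true;
};
```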

SSL Protected Connection

If HTTPS is enabled, the SSL certificate needs to be obtained and placed on the Manta server. If the certificate used for TLS/SSL is not included in your Java truststore, get an existing truststore containing the certificate or create a new one (to be used in the next step).

  1. If you are using Manta Data Lineage R41 or older, add the SSL values to the JDBC connection string.

    1. For Cloudera, add the SSL values to the JDBC connection string
      SSL=1;SSLTrustStore=/path/to/truststore;SSLTrustStorePwd=pwd

    2. For Apache, add the SSL values to the JDBC connection string
      ssl=true;sslTrustStore=/path/to/cacerts;trustStorePassword=changeit

  2. If you are using Manta Data Lineage R42 or newer, do not add anything to the JDBC connection string; if the proper fields are set, validation will succeed. When an SSL-protected connection is used, hive.extractor.truststore.path and hive.extractor.truststore.password are appended to hive.extractor.hiveserver2.url, and the following URL is used to connect to Hive Server 2.

jdbc:hive2://<host>:<port>/<database>;ssl=<hive.extractor.enabled.encryptConnection>;sslTrustStore=<hive.extractor.truststore.path>;trustStorePassword=<hive.extractor.truststore.password>

Since the result must be a valid URL, the truststore path and truststore password must not contain any of the URL reserved characters !*'();:@&=+$,/?%#[] or space characters. URL-encoding the forbidden characters is not supported by the Hive JDBC driver either.
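As a concrete illustration of the assembled URL (the host, database, truststore path, and password below are placeholders):

```shell
# Placeholders only -- substitute your own host, database, and truststore.
HOST=hive-host.example.com
PORT=10000
DB=default
TRUSTSTORE=/opt/manta/truststore.jks   # must contain no URL-reserved characters
TRUSTSTORE_PWD=changeit

URL="jdbc:hive2://${HOST}:${PORT}/${DB};ssl=true;sslTrustStore=${TRUSTSTORE};trustStorePassword=${TRUSTSTORE_PWD}"
echo "$URL"
```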

Optional Properties to Configure

For additional configuration definitions, review Hive Resource Configuration.

Step 4: Validating the Connection

After configuring all the necessary parameters, initiate the connection validation by selecting the Save button. For base authentication, it is strongly recommended to configure a single extracted database before finalizing the connection. Note that with the Kerberos authentication method, Manta Data Lineage bypasses the validation stage and directly saves the connection settings; in that case, run the Hive extractor scenario afterward to verify that the connection works.

Step 5: Variable Substitution Configuration

If Hive variable substitution is used in the manually provided Hive SQL scripts, you must provide the substitution configuration so that the analyzed scripts are preprocessed the same way as in Hive. Note that without the correct substitution configuration, the resulting lineage might be inaccurate or the input script may even fail to be valid Hive SQL.

By default, the variable substitution configuration is unique to each Hive source system and should be placed in a file named variableSubstitution.properties in the same directory as the manually provided Hive SQL scripts (${manta.dir.input}/hive/${hive.dictionary.id}).

The configuration format is as follows; variable names are processed case-insensitively:

name1=value1
name2=value2
...

Example:

hiveconf:year=2016
hivevar:month=01
name=John Doe
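To illustrate the effect of this configuration, here is a rough sketch of the substitution using sed (Manta Data Lineage performs the preprocessing internally; the table name is hypothetical):

```shell
# Illustration only: emulate the substitutions year=2016 and month=01
# from the example configuration above.
script='SELECT * FROM sales_${hiveconf:year}_${hivevar:month};'
result=$(printf '%s' "$script" \
  | sed -e 's/\${hiveconf:year}/2016/g' -e 's/\${hivevar:month}/01/g')
echo "$result"   # SELECT * FROM sales_2016_01;
```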

Additional Details

This section provides supplementary details that are helpful when configuring a Hive connection. Note that not all details in this section apply to every Hive configuration, given the diverse deployment options available for Hive.

Common Errors

This section summarizes the most common errors that may occur when setting up the connection. These errors can be found in Hive extraction logs.