
Hive Scanner Guide

IBM Manta Data Lineage is a powerful data lineage platform that simplifies data management by supporting automated lineage harvesting of the Hive database. This guide will walk you through the steps to set up Hive lineage harvesting.

Step 1: Verify the Prerequisites

Clarify and verify the following details before starting the connection configuration; doing so significantly reduces the amount of troubleshooting needed later.

  1. Who is your Hadoop platform vendor: Cloudera, Hortonworks, or another vendor? You will need the appropriate JDBC driver and documentation from your vendor, as some aspects may differ significantly between distributions.

    1. The easiest and most reliable way to determine your Hadoop platform vendor is to refer to your system’s documentation or configuration. The vendor’s name is usually mentioned explicitly in the system’s documentation, installation guides, or configuration files.

    2. Some Hadoop vendors provide custom command-line tools or utilities. Running these tools might provide information about the vendor. For example, Cloudera might have commands prefixed with cloudera and Hortonworks (now part of Cloudera) might use hdp-select.

  2. Is your Hive platform Kerberized? If so, is it configured to use keytab or cache?
    Obtain the keytab and krb5 files or the JAAS file from your Hive administrator, depending on the authentication setup.

  3. Does your Hive platform use Zookeeper?

  4. Will your firewall/proxy allow communication from Manta Server to Hive’s nodes?

  5. Verify the other requirements and privileges detailed in Hive Integration Requirements.

To check the Hive and Hadoop versions, use the hive --version and hadoop version commands.

Step 2: Install the JDBC Driver

Because the Hive driver depends heavily on the Hive vendor and, in some cases, even on the specific version of Hive, Manta Data Lineage does not bundle a Hive driver with the installation. The Manta administrator must place the downloaded Hive JDBC driver JAR file(s) in the designated mantaflow/agent/manta-flow-agent-dir/lib-ext directory (as of Manta Data Lineage R41, the Manta Agent extractor method should be used for all Hive connections). The Hive JDBC driver is generally bundled with the Hive installation and can be provided by the Hive administrator. If not, it can be downloaded separately from your Hive vendor’s website. Make sure that you identify your Hive vendor correctly; drivers from other vendors will not work.

Manta Data Lineage cannot use the Apache Hive standalone driver because that driver contains incompatible versions of third-party libraries. Instead, use the individual libraries listed below. The best way to get the driver is to:

  1. go to the Maven repository (https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc) and download hive-jdbc for your version of Hive.

  2. download all of its dependencies as well (some are transitive, so check the dependencies of the dependencies, too)

  3. from the list of downloaded jars, remove those that already exist in mantaflow/cli/scenarios/manta-dataflow-cli/ (or in the corresponding directory for the Agent). Check that the version of each library already available in Manta Data Lineage falls within the range of supported versions listed in Maven

  4. place the remaining jars into mantaflow/agent/manta-flow-agent-dir/lib-ext/ for Agent.

For example, for hive-jdbc-3.1.2-bin.tar, you will need to include:

| Artifact | Version | Link |
|---|---|---|
| hive-jdbc | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/3.1.2 |
| curator-framework | 2.12.0 | https://mvnrepository.com/artifact/org.apache.curator/curator-framework/2.12.0 |
| curator-client | 2.12.0 | https://mvnrepository.com/artifact/org.apache.curator/curator-client/2.12.0 |
| hive-common | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-common/3.1.2 |
| hive-service | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-service/3.1.2 |
| hive-serde | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-serde/3.1.2 |
| hive-metastore | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-metastore/3.1.2 |
| hive-shims | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-shims/3.1.2 |
| hive-service-rpc | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2 |
| libthrift | 0.9.3 | https://mvnrepository.com/artifact/org.apache.thrift/libthrift/0.9.3 |
| zookeeper | 3.4.6 | https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.4.6 |
| hadoop-shaded-protobuf_3_7 | 1.1.1 | https://mvnrepository.com/artifact/org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7/1.1.1 |
| hive-exec | 3.1.2 | https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.2 |

As of Manta Data Lineage R40, also add the following.

| Artifact | Version | Link |
|---|---|---|
| commons-lang | 2.6 | https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6 |

Manta Agent deployments require the above libraries to be deployed in agent/manta-flow-agent-dir/lib-ext. Additionally, the following libraries are required.

| Artifact | Version | Link |
|---|---|---|
| httpclient | 4.5.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13 |
| httpcore | 4.4.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.13 |
| commons-logging | 1.2 | https://mvnrepository.com/artifact/commons-logging/commons-logging/1.2 |
| hadoop-common | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/3.3.4 |
| hadoop-auth | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-auth/3.3.4 |
| hadoop-shaded-guava | 1.1.1 | https://mvnrepository.com/artifact/org.apache.hadoop.thirdparty/hadoop-shaded-guava/1.1.1 |
| slf4j-api | 1.7.30 | https://mvnrepository.com/artifact/org.slf4j/slf4j-api/1.7.30 |
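Fetching these jars can be scripted against Maven Central. The following is a minimal sketch, assuming the artifact versions from the tables above; the maven_url helper is hypothetical, not a Manta tool.

```shell
# Sketch: assemble Maven Central download URLs for the jars listed above.
# Group paths and versions follow the Hive 3.1.2 tables; adjust them for
# your vendor and version.
maven_url() {  # usage: maven_url <group-path> <artifact> <version>
  echo "https://repo1.maven.org/maven2/$1/$2/$3/$2-$3.jar"
}

HIVE_VER=3.1.2
for a in hive-jdbc hive-common hive-service hive-serde hive-metastore \
         hive-shims hive-service-rpc hive-exec; do
  maven_url org/apache/hive "$a" "$HIVE_VER"
done
maven_url org/apache/thrift libthrift 0.9.3
maven_url org/apache/zookeeper zookeeper 3.4.6
# Download each printed URL (e.g. with curl -LO) and place the jars in
# mantaflow/agent/manta-flow-agent-dir/lib-ext/ for the Agent.
```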

Step 3: Configure the Connection

Create a new connection in the Admin UI (http(s)://<domain_name>:8181/manta-admin-gui/app/index.html?#/platform/connections/) to enable automated extraction and lineage analysis of Hive DDL files and the database dictionary by Manta Data Lineage. You will need one Hive connection in Manta Data Lineage for each Hive server you want to connect to.

Mandatory Properties to Configure

Hive Connection Verification Using Beeline

You can verify that the Manta host server is able to connect to the Hive server using Hive’s beeline command-line tool. It works both with and without Kerberos. When using Kerberos authentication, a ticket must first be acquired in the default cache (bash variable KRB5CCNAME; verify with klist).

It is not necessary to have the beeline tool installed on the Manta host machine (in fact, most host machines do not have it installed), but it can be useful for troubleshooting Manta Data Lineage Hive connections.

Use the following beeline command to test establishing a connection to the Hive server.

beeline -u [jdbc connection string]
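For example, assuming a hypothetical host hive-host.example.com on port 10000 and the default database, the connection string might be built like this (the Kerberized variant appends the Hive service principal):

```shell
# Hypothetical host, port, and principal -- substitute your own values.
HIVE_HOST=hive-host.example.com
HIVE_PORT=10000
PRINCIPAL="hive/${HIVE_HOST}@EXAMPLE.COM"   # Hive service principal

# Plain connection string:
PLAIN_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/default"
# Kerberized connection string (a TGT must already be in the cache; check with klist):
KRB_URL="${PLAIN_URL};principal=${PRINCIPAL}"

echo "$KRB_URL"
# beeline -u "$KRB_URL"   # run this where beeline is installed
```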

Configuration of Apache Hive - Base authentication (Apache or any non-Cloudera distribution)

  1. Check the major version of Hive.

  2. Set the authentication method to Base in the Hive connection.

  3. If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.

Configuration of Cloudera Hive - Base authentication

  1. Download the latest Cloudera driver (hive-jdbc driver) from https://www.cloudera.com/downloads/connectors/impala/jdbc/ or get the driver from your Hive DB administrator. Because Cloudera’s driver is very different from the Apache Hive driver, the following configuration changes are necessary.

  2. Place the downloaded Hive driver jar file in mantaflow/cli/scenarios/manta-dataflow-cli/lib-ext/

  3. Update the HiveServer2 driver class name in Hive Connection configuration to com.cloudera.hive.jdbc.HS2Driver
    hive.extractor.hiveserver2.driver=com.cloudera.hive.jdbc.HS2Driver

    1. To support JDBC 4.0, classes with the following FQCNs are available:

      • com.cloudera.hive.jdbc4.HS1Driver

      • com.cloudera.hive.jdbc4.HS2Driver

    2. To support JDBC 4.1, classes with the following FQCNs are available:

      • com.cloudera.hive.jdbc41.HS1Driver

      • com.cloudera.hive.jdbc41.HS2Driver

  4. Set the authentication method to Base in the Hive connection.

  5. If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.

Kerberos Authentication

Important: Neither the Kerberos nor the JAAS authentication method can be configured for Cloudera Public Cloud deployments. Because Cloudera Public Cloud operates in a managed cloud environment, its design and security model preclude Kerberos/JAAS authentication; managing Kerberos within such a cloud service is complex and differs significantly from traditional on-premises deployments. Also note that properties are case-sensitive, and properties must not be duplicated in the connection URL.

Connecting to Hadoop HIVE using Kerberos Authentication

Kerberos Verification Before Configuring Manta Data Lineage

If the host machine has Kerberos tools installed (kinit, klist, kvno), follow these steps to verify that Kerberos is configured correctly.

This verification is not required to set up Kerberos authentication, but it rules out most Kerberos configuration issues. It is only applicable if Kerberos is used.

  1. If using a ticket, simply verify that klist lists the acquired ticket. A specific cache file can be given with klist -c [cache file] when needed

  2. If using a keytab, check that these steps complete without error

    1. Check that the bash variable KRB5_CONFIG points to a krb5.conf file

    2. Check which principals’ credentials are saved in the keytab with klist -k [keytab location]

    3. Run kinit -kt [keytab location] [principal name]

  3. Now, with a Ticket Granting Ticket (TGT) in the cache file (from kinit), check that you can obtain a service ticket for Hive

    1. Get Hive service principal from the Hive administrator (this value usually matches the following pattern: hive/host@DOMAIN.COM)

    2. Verify with: kvno [service principal] (e.g. kvno hive/host@DOMAIN.COM)

      1. If kinit is not possible on the host machine, it is also possible to use only the cache file: kvno -c [cache file] [service principal]

Configuration

To configure the authentication to use Kerberos, you must provide Manta Data Lineage with the principal of the service (principal of Hive Server 2) and a principal that can obtain a service ticket for the service. Manta Data Lineage will then get a service token using this principal.

The properties that need to be configured are hive.extractor.hiveserver2.kerberos.principal (set to the Hive Server 2 service principal) and hive.extractor.hiveserver2.kerberos.method.

Optionally, you can provide the krb5.conf file path, if the file isn’t installed in the default location, using the hive.extractor.hiveserver2.krbPath property.

Note that the connection strings may differ across Hive JDBC drivers. For more information about troubleshooting / different configurations, see the Additional Details section below.

There are three ways you can provide the Kerberos credentials to Manta Data Lineage.

1. System Ticket

The system ticket method is the simplest way to configure Kerberos, as Manta Data Lineage will use a TGT (ticket-granting ticket) already obtained by the system. The TGT must be valid (TGTs have limited validity). Configure the connection with hive.extractor.hiveserver2.kerberos.method set to System Ticket.

2. Keytab

Keytab is a secure credential storage that allows Manta Data Lineage to obtain a TGT for the configured principal/user. TGT validity is handled by Manta Data Lineage. The properties that need to be configured are as follows.

3. JAAS

The JAAS (Java Authentication and Authorization Service) file is a universal configuration file for the JDBC driver to configure the authentication. Some drivers (notably the Cloudera driver) may require this method. Note that you may be required to set the connection string parameters to use the JAAS config. For more information, see Additional Details below.

Important: As of R41, some drivers (notably Cloudera Private Cloud/Hortonworks) need to be configured using the JAAS config. Set the authentication method in the connection properties to Kerberos - JAAS and set the JAAS file path to /mantaflow/cli/scenarios/manta-dataflow-cli/etc/jaas/jaas.config

Here are two examples of jaas.conf files.
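For instance, a keytab-based entry might look like the following sketch; the entry name (Client), the keytab path, and the principal are assumptions, so use the entry name your driver expects:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/manta.keytab"
  principal="manta@EXAMPLE.COM"
  doNotPrompt=true;
};
```

A ticket-cache-based entry, which reuses a TGT already acquired with kinit, might look like this:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  doNotPrompt=true;
};
```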

SSL Protected Connection

If HTTPS is enabled, the SSL certificate needs to be obtained and placed on the Manta server. If the certificate used for TLS/SSL is not included in your Java truststore, get an existing truststore containing the certificate or create a new one (to be used in the next step).

  1. If you are using Manta Data Lineage R41 or older, add the SSL values to the JDBC connection string.

    1. For Cloudera, add the SSL values to the JDBC connection string
      SSL=1;SSLTrustStore=/path/to/truststore;SSLTrustStorePwd=pwd

    2. For Apache, add the SSL values to the JDBC connection string
      ssl=true;sslTrustStore=/path/to/cacerts;trustStorePassword=changeit

  2. If you are using Manta Data Lineage R42 or newer, do not add anything to the JDBC connection string; if the proper fields are set, validation will succeed. When an SSL-protected connection is used, hive.extractor.truststore.path and hive.extractor.truststore.password are appended to hive.extractor.hiveserver2.url, and the following URL is used to connect to Hive Server 2.

jdbc:hive2://<host>:<port>/<database>;ssl=<hive.extractor.enabled.encryptConnection>;sslTrustStore=<hive.extractor.truststore.path>;trustStorePassword=<hive.extractor.truststore.password>

Since the result must be a valid URL, the truststore path and truststore password must not contain any of the URL reserved characters !*'();:@&=+$,/?%#[] or space characters. URL-encoding the forbidden characters is not supported by the Hive JDBC driver either.
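As a concrete illustration of the assembled URL (the host, database, truststore path, and password below are placeholders):

```shell
# Placeholders only -- substitute your own host, database, and truststore.
HOST=hive-host.example.com
PORT=10000
DB=default
TRUSTSTORE=/opt/manta/truststore.jks   # must contain no URL-reserved characters
TRUSTSTORE_PWD=changeit

URL="jdbc:hive2://${HOST}:${PORT}/${DB};ssl=true;sslTrustStore=${TRUSTSTORE};trustStorePassword=${TRUSTSTORE_PWD}"
echo "$URL"
```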

Optional Properties to Configure

For additional configuration definitions, review Hive Resource Configuration.

Step 4: Validating the Connection

After configuring all the necessary parameters, initiate the connection validation by selecting the Save button. For base authentication, it is strongly recommended to configure a single extracted database before finalizing the connection. Note that with the Kerberos authentication method, Manta Data Lineage bypasses the validation stage and directly saves the connection settings; in that case, run the Hive extractor scenario afterward to verify that the connection works.

Step 5: Variable Substitution Configuration

If Hive variable substitution is used in the manually provided Hive SQL scripts, you must provide the substitution configuration so that the analyzed scripts are preprocessed the same way as in Hive. Note that without the correct substitution configuration, the resulting lineage might be inaccurate or the input script may even fail to be valid Hive SQL.

By default, the variable substitution configuration is unique to each Hive source system and should be placed in a file named variableSubstitution.properties in the same directory as the manually provided Hive SQL scripts (${manta.dir.input}/hive/${hive.dictionary.id}).

The configuration format is as follows; variable names are processed case-insensitively:

name1=value1
name2=value2
...

Example:

hiveconf:year=2016
hivevar:month=01
name=John Doe
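To illustrate the effect of this configuration, here is a rough sketch of the substitution using sed (Manta Data Lineage performs the preprocessing internally; the table name is hypothetical):

```shell
# Illustration only: emulate the substitutions year=2016 and month=01
# from the example configuration above.
script='SELECT * FROM sales_${hiveconf:year}_${hivevar:month};'
result=$(printf '%s' "$script" \
  | sed -e 's/\${hiveconf:year}/2016/g' -e 's/\${hivevar:month}/01/g')
echo "$result"   # SELECT * FROM sales_2016_01;
```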

Additional Details

This section provides supplementary details that are helpful when configuring a Hive connection. Note that not all details in this section apply to every Hive configuration, given the diverse deployment options available for Hive.

Common Errors

This section summarizes the most common errors that may occur when setting up the connection. These errors can be found in Hive extraction logs.