Hive Scanner Guide
IBM Manta Data Lineage is a powerful data lineage platform that simplifies data management by supporting automated lineage harvesting from Hive databases. This guide walks you through the steps to set up Hive lineage harvesting.
Step 1: Collect the Required Information Related to Your Hive Environment
Clarify and verify the following details before starting the connection configuration to significantly reduce the amount of troubleshooting needed.
- Who is your Hadoop platform vendor: Cloudera, Hortonworks, or someone else? You will need the appropriate JDBC driver and documentation from the vendor, as some aspects may differ significantly.
  - The easiest and most reliable way to determine your Hadoop platform vendor is to refer to your system's documentation or configuration. The vendor's name is usually mentioned explicitly in the system's documentation, installation guides, or configuration files.
  - Some Hadoop vendors provide custom command-line tools or utilities. Running these tools might reveal the vendor. For example, Cloudera might have commands prefixed with cloudera, and Hortonworks (now part of Cloudera) might use hdp-select.
- Is your Hive platform Kerberized? If so, is it configured to use keytab or cache? Obtain the keytab and krb5 files or the JAAS file from your Hive administrator, depending on the authentication setup.
- Does your Hive platform use Zookeeper?
- Will your firewall/proxy allow communication from Manta Server to Hive's nodes?
- Verify the other requirements and privileges detailed in Hive Integration Requirements.
- You can check your Hive and Hadoop versions with the hive --version and hadoop version commands (see the example commands below).
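For illustration, the following commands (run on a cluster or edge node where the Hive client is installed; hdp-select exists only on Hortonworks/HDP systems) can help identify the vendor and versions:
  hive --version        # prints the Hive version; vendor builds often include a vendor suffix (e.g., cdh, hdp)
  hadoop version        # prints the Hadoop version, often including the vendor distribution string
  hdp-select versions   # Hortonworks/HDP only: lists the installed HDP stack versions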
Step 2: Install the JDBC Driver
Because the Hive driver depends so much on the Hive vendor and, in some cases, even on the specific version of Hive, Manta Data Lineage does not bundle a Hive driver with the installation. The Manta administrator must place the downloaded Hive JDBC
driver JAR file(s) in the designated
mantaflow/agent/manta-flow-agent-dir/lib-ext
directory (the Manta Agent extractor method should be used for all Hive connections as of Manta Data Lineage R41). The Hive JDBC driver is generally bundled with the Hive installation and
can be provided by the Hive administrator. If not, it can be downloaded separately from your Hive vendor’s website. Ensure that you identify your Hive vendor; drivers from other vendors will not work.
- If you're using Cloudera, you can find their drivers here: Cloudera Drivers Source.
- For Apache Hive, the drivers can be downloaded from Apache Hive Drivers.
Manta Data Lineage cannot use the Apache Hive standalone driver because it contains incompatible versions of third-party libraries. Instead, assemble the individual libraries yourself. The best way to get the driver is to:
- Go to the Maven repository and download hive-jdbc (https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc) for your respective version of Hive.
- Download all of its dependencies as well (some may be transitive, i.e., check the dependencies of dependencies). A Maven-based sketch for collecting these JARs follows the dependency tables below.
- From the list of JARs that have been downloaded, remove those that already exist in mantaflow/cli/scenarios/manta-dataflow-cli/ (or the corresponding directory for the Agent). Check that the version of each library already available in Manta Data Lineage falls within the range of supported versions listed in Maven.
- Place the remaining JARs into mantaflow/agent/manta-flow-agent-dir/lib-ext/ for the Agent.
For example, for hive-jdbc-3.1.2-bin.tar, you will need to include:
As of Manta Data Lineage R40, also add:
Artifact | Version | Link |
---|---|---|
commons-lang | 2.6 | https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6 |
Manta Agent deployments require the above libraries deployed into agent/manta-flow-agent-dir/lib-ext. Additionally, the following are also required.
Artifact | Version | Link |
---|---|---|
httpclient | 4.5.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13 |
httpcore | 4.4.13 | https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.13 |
commons-logging | 1.2 | https://mvnrepository.com/artifact/commons-logging/commons-logging/1.2 |
hadoop-common | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/3.3.4 |
hadoop-auth | 3.3.4 | https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-auth/3.3.4 |
hadoop-shaded-guava | 1.1.1 | https://mvnrepository.com/artifact/org.apache.hadoop.thirdparty/hadoop-shaded-guava/1.1.1 |
slf4j-api | 1.7.30 | https://mvnrepository.com/artifact/org.slf4j/slf4j-api/1.7.30 |
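As an illustration only, assuming Hive 3.1.2 (adjust the version to match your server) and a standard Maven installation, a throwaway pom.xml can be used to collect hive-jdbc and its transitive dependencies into one directory for review before copying them to lib-ext. The project coordinates below are hypothetical placeholders.
  <project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <!-- throwaway project used only to collect the Hive JDBC driver and its dependencies -->
    <groupId>example</groupId>
    <artifactId>hive-driver-collector</artifactId>
    <version>1.0</version>
    <dependencies>
      <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>3.1.2</version> <!-- match your Hive server version -->
      </dependency>
    </dependencies>
  </project>
Running mvn dependency:copy-dependencies -DoutputDirectory=target/hive-driver-jars in the project directory copies hive-jdbc and its transitive dependencies into target/hive-driver-jars; remove the JARs already present in Manta Data Lineage and place the rest in lib-ext as described above.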
Step 3: Configure the Connection
Create a new connection in the Admin UI (http(s)://<domain_name>:8181/manta-admin-gui/app/index.html?#/platform/connections/) to enable automated extraction and lineage analysis of Hive DDL files and the database dictionary by Manta Data Lineage. You will need one Hive connection in Manta Data Lineage for each Hive server that you want to connect to.
Mandatory Properties to Configure
- Hive Connection Name: Name of the connection configuration in Manta Data Lineage.
- Hive Dictionary ID: Name of a resource representing this Hive server, known as a dictionary ID, used as an output subdirectory name for extracted DDL files and the database dictionary.
- Extracted Databases (this field is not required but is highly recommended to reduce the time it takes to extract metadata from the host Hive server): List of database names that should be included in the Hive extraction.
- HiveServer2 Connection String: The JDBC connection string for the Hive server.
- HiveServer2 Hostname: The address of the machine where the HiveServer2 service is running. This critical piece of information is required to establish a connection between Manta Data Lineage and the HiveServer2 service.
- HiveServer2 Driver Class Name: The class that acts as a bridge between your application and the HiveServer2 instance when connecting to a Hive database using JDBC. The driver class name is a reference to the implementation of the JDBC driver for HiveServer2; it is used by Manta Data Lineage to load and instantiate the appropriate driver when establishing a connection to the HiveServer2 instance. In the context of Hive and JDBC, the HiveServer2 driver class name is typically a fully qualified Java class name, such as org.apache.hive.jdbc.HiveDriver for Apache distributions or com.cloudera.hive.jdbc.HS2Driver for Cloudera distributions.
- Authentication Mode: This can be set to Base (base authentication is a straightforward method of user authentication involving the transmission of an encoded username and password in the HTTP header) or Kerberos (a strong authentication protocol that requires clients to authenticate with a Kerberos Key Distribution Center (KDC) before accessing Hive services).
  - For base authentication, you'll be guided to configure the essential username and password fields with the necessary credentials.
  - For Kerberos authentication, the setup involves configuring the Kerberos principal, which embodies the interaction between clients, the KDC, and servers. This interaction employs encrypted tickets and session keys to establish secure communication within a networked environment. You'll also select a Kerberos authentication method.
    - For the Kerberos system ticket authentication method, termed the system ticket method, secure and trusted communication between clients and servers is established using encrypted tickets and session keys.
    - The keytab authentication method involves using a keytab file, a secure repository containing pairs of principal names and encrypted keys. Principals encompass users, services, or hosts. To utilize this method, the following is required.
      - Kerberos user: This is an authenticated and authorized entity, like a user or service, for accessing network resources via the Kerberos protocol. It is represented by a unique principal name within the Kerberos realm, and its encrypted keys are stored in the keytab file.
      - Keytab path: The location on the Manta host machine where the keytab file resides, for example /opt/mantaflow/cli/scenarios/manta-dataflow-cli/conf/keytabs/manta.keytab.
      - Krb5 configuration file path (not obligatory but recommended): This file configures how Kerberos clients interact with the authentication infrastructure, governing realms, servers, behaviors, and mappings for secure communication.
    - For the JAAS authentication method, which is based on Java, user authentication and access control are handled through a JAAS configuration file. If this method is selected, users must provide the JAAS file path.
- For R42, enhanced Cloudera support was added. Configure all fields in the Admin UI.
  - The Hive distribution should be Cloudera.
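For illustration only (every value below is a placeholder, not a default), a basic Apache Hive connection using base authentication might be filled in like this:
  Hive Connection Name:           Hive Production
  Hive Dictionary ID:             hive-prod
  Extracted Databases:            sales,marketing
  HiveServer2 Connection String:  jdbc:hive2://hiveserver.example.com:10000/default
  HiveServer2 Hostname:           hiveserver.example.com
  HiveServer2 Driver Class Name:  org.apache.hive.jdbc.HiveDriver
  Authentication Mode:            Base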
Hive Connection Verification Using Beeline
It is possible to verify that the Manta host server can connect to the Hive server using Hive's beeline command line tool. It works both with and without Kerberos. When using Kerberos authentication, a ticket needs to be acquired in the default cache (bash variable KRB5CCNAME; verify with klist).
It is not necessary to have the beeline tool installed on the Manta Host machine (in fact most host machines do not have it installed), but it may be beneficial to install it for troubleshooting Manta Data Lineage Hive connections.
Use the following beeline command to test establishing a connection to the Hive server.
beeline -u [jdbc connection string]
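For illustration only (hostname, port, database, credentials, and principal are placeholders), typical invocations look like this:
  # base authentication
  beeline -u "jdbc:hive2://hiveserver.example.com:10000/default" -n myuser -p mypassword
  # Kerberos authentication (a valid TGT must already be in the default cache; check with klist)
  beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/hiveserver.example.com@EXAMPLE.COM"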
Configuration of Apache Hive - Base authentication (non-Cloudera or any other distribution)
- Check the major version of Hive.
- Set the authentication method to Base in the Hive connection.
- If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.
Configuration of Cloudera Hive - Base authentication
- Download the latest Cloudera driver (hive-jdbc driver) from https://www.cloudera.com/downloads/connectors/impala/jdbc/ or get the driver from your Hive DB administrator. Because Cloudera's driver is very different from the Apache Hive driver, it is necessary to make the following configuration changes.
- Place the downloaded Hive driver JAR file in mantaflow/cli/scenarios/manta-dataflow-cli/lib-ext/.
- Update the HiveServer2 driver class name in the Hive connection configuration to com.cloudera.hive.jdbc.HS2Driver:
  hive.extractor.hiveserver2.driver=com.cloudera.hive.jdbc.HS2Driver
  - To support JDBC 4.0, classes with the following FQCNs are available:
    - com.cloudera.hive.jdbc4.HS1Driver
    - com.cloudera.hive.jdbc4.HS2Driver
  - To support JDBC 4.1, classes with the following FQCNs are available:
    - com.cloudera.hive.jdbc41.HS1Driver
    - com.cloudera.hive.jdbc41.HS2Driver
- Set the authentication method to Base in the Hive connection.
- If you are connecting to a Hive server that has Secure Sockets Layer (SSL) enabled, review the SSL Protected Connection section below.
Kerberos Authentication
Connecting to Hadoop Hive Using Kerberos Authentication
Kerberos Verification Before Configuring Manta Data Lineage
If the host machine has Kerberos tools installed (kinit, klist, kvno), follow the steps to verify that Kerberos is configured correctly.
This verification is not necessary to set up the Kerberos authentication, but it rules out most of the Kerberos configuration issues. Only applicable if Kerberos is used.
- If using a ticket, just verify that klist lists the acquired ticket. It is possible to specify the cache file with klist -c [cache file] when needed.
- If using a keytab, check that these steps complete without error:
  - Check that the bash variable KRB5_CONFIG points to a krb5.conf file.
  - Check which principals' credentials are saved in the keytab with klist -k [keytab location].
  - Run kinit -kt [keytab location] [principal name].
- Now, with a Ticket Granting Ticket (TGT) in the cache file (from kinit), check that you are able to get a service ticket for Hive. A consolidated command sequence is sketched after this list.
  - Get the Hive service principal from the Hive administrator (this value usually matches the following pattern: hive/host@DOMAIN.COM).
  - Verify with kvno [service principal] (e.g., kvno hive/host@DOMAIN.COM).
    - It is also possible to use only the cache file, if kinit is not possible on the host machine, with kvno -c [cache file] [service principal].
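For illustration only (keytab path, principal, and realm are placeholders), the keytab verification steps can be run end to end like this:
  export KRB5_CONFIG=/etc/krb5.conf                     # krb5 configuration used by the Kerberos tools
  klist -k /path/to/manta.keytab                        # list the principals stored in the keytab
  kinit -kt /path/to/manta.keytab manta@EXAMPLE.COM     # obtain a TGT using the keytab
  klist                                                 # confirm the TGT is now in the default cache
  kvno hive/hiveserver.example.com@EXAMPLE.COM          # request a service ticket for HiveServer2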
Configuration
To configure the authentication to use Kerberos, you must provide Manta Data Lineage with the principal of the service (principal of Hive Server 2) and a principal that can obtain a service ticket for the service. Manta Data Lineage will then get a service token using this principal.
The properties that need to be configured are hive.extractor.hiveserver2.kerberos.principal
(set to the Hive Server 2 service principal) and hive.extractor.hiveserver2.kerberos.method
.
Optionally, you can provide the krb5.conf
file path, if the file isn’t installed in the default location, using the hive.extractor.hiveserver2.krbPath
property.
Note that the connection strings may differ across Hive JDBC drivers. For more information about troubleshooting / different configurations, see the Additional Details section below.
There are three ways you can provide the Kerberos credentials to Manta Data Lineage.
1. System Ticket
The system ticket method is the simplest way to configure Kerberos as Manta Data Lineage will use a TGT (ticket granting ticket) obtained by the system. The TGT must be valid (TGTs have limited validity), and the connection is to be configured with
the hive.extractor.hiveserver2.kerberos.method
set to System Ticket.
2. Keytab
Keytab is a secure credential storage that allows Manta Data Lineage to obtain a TGT for the configured principal/user. TGT validity is handled by Manta Data Lineage. The properties that need to be configured are as follows.
- hive.extractor.hiveserver2.kerberos.method set to Keytab
- hive.extractor.hiveserver2.keytabPath set to the filesystem path of the keytab file
- hive.extractor.hiveserver2.kerberosUser set to the principal with credentials saved in the keytab file
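A minimal sketch, assuming the example keytab path used earlier in this guide and a placeholder principal:
  hive.extractor.hiveserver2.kerberos.method=Keytab
  hive.extractor.hiveserver2.keytabPath=/opt/mantaflow/cli/scenarios/manta-dataflow-cli/conf/keytabs/manta.keytab
  hive.extractor.hiveserver2.kerberosUser=manta@EXAMPLE.COM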
3. JAAS
The JAAS (Java Authentication and Authorization Service) file is a universal configuration file for the JDBC driver to configure the authentication. Some drivers (notably the Cloudera driver) may require this method. Note that you may be required to set the connection string parameters to use the JAAS config. For more information, see Additional Details below.
Place the JAAS configuration file at, for example, /mantaflow/cli/scenarios/manta-dataflow-cli/etc/jaas/jaas.config.
Here are two examples of jaas.conf files.
- Create jaas.config in /mantaflow/cli/scenarios/manta-dataflow-cli/etc/jaas/.
  - If you're using keytab authentication:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/path/to/some.keytab"
      principal="principal.in.keytab@DOMAIN.COM"
      doNotPrompt=true;
    };
  - If you're using a cache:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=false
      principal="principal.in.cache@DOMAIN.COM"   // this line is not required
      doNotPrompt=true
      useTicketCache=true
      ticketCache="/etc/krb5cache";
    };
- JDBC connection string configuration
  - Use the following JDBC string to connect to Hive.
    jdbc:hive2://node1.example.com:10000;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hs2node1.example.com;KrbServiceName=hive;KrbAuthType=1
  - Set the AuthMech property to 1 for Kerberos.
  - If your Kerberos setup has a default realm, omit the KrbRealm property. If not, set it to the realm of the Hive server.
  - Define the fully qualified domain name of the Hive server host using the KrbHostFQDN property.
  - Optionally, specify how the driver obtains the Kerberos subject by setting KrbAuthType.
    - To auto-detect the subject method, use 0.
    - To use LoginContext from JAAS, use 1.
    - To use LoginContext from the Kerberos ticket cache, use 2.
SSL Protected Connection
If https is enabled, the SSL certificate will need to be obtained and placed on the Manta server. If the certificate used for TLS/SSL is not included in your Java truststore, get an existing truststore containing the certificate or create a new one (to be used in the next step); a keytool sketch is shown below.
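For illustration only (the certificate file, truststore path, alias, and password are placeholders), a truststore can be created or extended with the standard JDK keytool:
  keytool -importcert -alias hiveserver -file /path/to/hiveserver-certificate.pem -keystore /opt/mantaflow/hive-truststore.jks -storepass changeit -noprompt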
- If you are using Manta Data Lineage R41 or older, add the SSL values to the JDBC connection string.
  - For Cloudera, add the SSL values to the JDBC connection string:
    SSL=1;SSLTrustStore=/path/to/truststore;SSLTrustStorePwd=pwd
  - For Apache, add the SSL values to the JDBC connection string:
    ssl=true;sslTrustStore=/path/to/cacerts;trustStorePassword=changeit
- If you are using Manta Data Lineage R42 or newer, do not add anything to the JDBC connection string. If configured correctly, the validation will succeed once the proper fields are set. When an SSL-protected connection is used, hive.extractor.truststore.path and hive.extractor.truststore.password are appended to hive.extractor.hiveserver2.url, and the following URL is used to connect to Hive Server 2.
  jdbc:hive2://<host>:<port>/<database>;ssl=<hive.extractor.enabled.encryptConnection>;sslTrustStore=<hive.extractor.truststore.path>;trustStorePassword=<hive.extractor.truststore.password>
  Since the result must be a valid URL, the truststore path and truststore password must not contain any of the URL reserved characters !*'();:@&=+$,/?%#[] or space characters. URL-encoding the forbidden characters is not supported by the Hive JDBC driver either.
Optional Properties to Configure
- To control the scope of the extraction, use hive.extractor.dbsInclude and hive.extractor.dbsExclude to restrict the analysis of design-time lineage; a minimal sketch follows.
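A minimal sketch with placeholder database names (check Hive Resource Configuration for the exact value syntax, such as the list separator, expected by your release):
  hive.extractor.dbsInclude=sales,marketing
  hive.extractor.dbsExclude=tmp_staging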
For additional configuration definitions, review Hive Resource Configuration.
Step 4: Validating the Connection
Once all the necessary parameters are configured, start the connection validation by selecting the Save button. For base authentication, it is strongly recommended that you set up a single extracted database in the configuration before finalizing the connection. Note that if the Kerberos authentication method is chosen, Manta Data Lineage bypasses the validation stage and saves the connection settings directly; in that case, you must subsequently run the Hive extractor scenario to verify that the connection works.
Step 5: Variable Substitution Configuration
If Hive variable substitution is used in the manually provided Hive SQL scripts, you must provide the substitution configuration so that the analyzed scripts are preprocessed the same way as in Hive. Note that without the correct substitution configuration, the resulting lineage might be inaccurate or the input script may even fail to be valid Hive SQL.
By default, the variable substitution configuration is unique for each Hive source system and should be placed in a file named
variableSubstitution.properties
in the same directory as the manually provided Hive SQL scripts (
${manta.dir.input}/hive/${hive.dictionary.id}
).
The configuration format is as follows; variable names are processed case-insensitively:
name1=value1
name2=value2
...
Example:
hiveconf:year=2016
hivevar:month=01
name=John Doe
Additional Details
This segment offers supplementary insights beneficial for configuring a Hive connection. It’s important to acknowledge that not all details within this section will be universally relevant to every Hive configuration, given the diverse deployment options available for Hive.
- In Manta Data Lineage releases prior to R42, when utilizing the Cloudera Hive driver with HTTPS enabled through the advanced settings for the connection, the application will include supplementary parameters in the JDBC string to enable HTTPS communication. Among these parameters is ssl=true. However, the application may generate an error due to this parameter, stating Connection property ssl has invalid value of true. Valid values are: 0, 1. This is because the Cloudera driver expects a numeric value for the ssl parameter. To rectify this, ensure that the Connecting with Encryption advanced setting is set to false. Subsequently, modify the JDBC string directly in the HiveServer2 Connection String field. For instance: jdbc:hive2://<data_warehouse_host_name>:<port_num>/default;ssl=1.
  Note that if your account employs Cloudera Public Cloud, no additional parameters are needed in the JDBC string. However, if you are utilizing the Cloudera Private Cloud solution, you will need to include the sslTrustStore and trustStorePassword parameters in the JDBC string. For example: jdbc:hive2://<host>:<port>/<database>;ssl=1;sslTrustStore=<hive.extractor.truststore.path>;trustStorePassword=<hive.extractor.truststore.password>.
. -
When considering connecting to Cloudera Public Cloud, be aware that establishing a connection to Cloudera Data Hub is not feasible. This is primarily because Cloudera Data Hub doesn’t host Hive databases; they are housed within Cloudera Data Warehouse. An indicator that your JDBC string may be pointing to a Data Hub URL is if the initial two characters of the site name are
dh
in the formatdh-xxxxxx.cloudera.site
. In such cases, the JDBC string might inadvertently point to a Data Hub URL instead of a Cloudera Data Warehouse URL.To address this, within Cloudera, you can navigate to the virtual warehouse section to access the correct JDBC string for the virtual warehouse. After copying the JDBC string, it’s essential to modify the
ssl
parameter, changing it from the defaulttrue
to1
, thus ensuring a proper connection. -
In Manta Data Lineage releases prior to R42, when establishing connections to Cloudera Private Cloud environments, the Manta-generated JDBC string might lack the
AuthMech
property. In such scenarios, users are required to manually include theAuthMech
property in the JDBC string used by Manta Data Lineage. The structure should resemble the following:jdbc:hive2://localhost:10000;AuthMech=1;principal=hive/hs2.example.com@EXAMPLE.COM
. The value assigned toAuthMech
will align with the specific authentication type. (See https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf, section Driver Configuration Options | AuthMech for more details.)-
0 = Without authentication
-
1 = Kerberos authentication
-
2 = User name authentication
-
- Ensure that you have the latest cryptography libraries installed.
- Check that you've configured your /etc/krb5.conf successfully (a minimal krb5.conf sketch is shown after this list).
  - If you've done this correctly, you should be able to run kinit in a terminal and create a ticket without issue.
- For JAAS authentication, adding the following lines to the /mantaflow/cli/platform/bin/mantar[.sh|.bat] file may be necessary as well.
  - -Djava.security.krb5.conf=c:\kerberos\krb5.ini
    - Note: this is the Windows equivalent of /etc/krb5.conf.
  - -Djava.security.auth.login.config=c:\mantaflow\cli\scenarios\manta-dataflow-cli\etc\jaas\
    - Note: this is the Windows equivalent of /opt/mantaflow/cli/scenarios/manta-dataflow-cli/etc/jaas/.
  - Success has also been reported with the following jaas.conf file and keytab usage:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      debug=true
      doNotPrompt=true
      useKeyTab=true
      keyTab="C:\mantaflow\keytab.keytab"
      useTicketCache=true
      renewTGT=true
      principal="{user}@FOO.BAR";
    };
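A minimal krb5.conf sketch, assuming a single placeholder realm and KDC host (real values are typically provided by the Kerberos or Hive administrator):
  [libdefaults]
    default_realm = EXAMPLE.COM

  [realms]
    EXAMPLE.COM = {
      kdc = kdc.example.com
      admin_server = kdc.example.com
    }

  [domain_realm]
    .example.com = EXAMPLE.COM
    example.com = EXAMPLE.COM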
Common Errors
This section summarizes the most common errors that may occur when setting up the connection. These errors can be found in Hive extraction logs.
- If you get Caused by: java.lang.ClassNotFoundException: com.cloudera.hive.jdbc.HS2Driver, Manta Data Lineage could not load the Cloudera JDBC driver. Verify that Step 2 has been successfully completed, and verify that the version of the Cloudera driver is specified in the hive.extractor.hiveserver2.driver property.
  For example: hive.extractor.hiveserver2.driver=com.cloudera.hive.jdbc41.HS2Driver
- Invalid status 21 means the Hive server is encrypted.
  - Use SSL values in the JDBC connection string.
  - It can also mean that Cloudera Hive is being used with the Apache Hive driver.
- java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format) occurs if the library versions in Manta Data Lineage do not match your version of Hive.
- Caused by: java.sql.SQLException: No suitable driver found for jdbc:hive2://xyz1.customer.com:2181,xyz2.customer.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;LogLevel=6;LogPath=/app/mantaflow/server/logs/;KrbRealm=HADOOP.CUSTOMER.COM;KrbHostFQDN=happ01.hadoop.customer.com;KrbServiceName=hive;;principal=hive/svc-bd-itd-prod@HADOOP.CUSTOMER.COM can be caused by:
  - An unconfigured Hive Server 2 driver class
  - A typo in the JDBC URL (in this case, there are two semicolons ;; before principal=..., which the Cloudera JDBC driver doesn't seem to handle gracefully)
- org.springframework.beans.factory.BeanCreationException... arises when there is a conflict between the classes in the supplied Hive driver and other drivers provided with Manta Data Lineage. To mitigate this issue, either change the Hive driver file extension (for example, to .bkp) or move the Hive driver to a server location outside the lib-ext directory. This keeps the driver resources separate and resolves the exception.