Hive Resource Configuration

Before you configure the scanner, make sure you meet the prerequisites; see our guide on Hive integration requirements to double-check.

Source System Properties

This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection under Admin UI > Connections > Databases > Hive > specific connection. A new connection can also be created via the Manta Orchestration API.

The granularity of an IBM Automatic Data Lineage connection for Hive is one Hive server. Use the database filters to limit the scope of analysis as needed. Using multiple connections against a single Hive server may result in within-system lineage not being connected properly.
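For example, a connection that should analyze only production databases might combine the include and exclude filters described in the Common Source System Properties table below (the values here are illustrative, not defaults):

    hive.extractor.dbsInclude=sales_.*,finance
    hive.extractor.dbsExclude=.*_tmp

Each part of the include list is evaluated as a regular expression, so sales_.* matches every database whose name starts with sales_.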

Common Source System Properties

hive.dictionary.id
Description: Name of a resource representing this Hive server, known as the dictionary ID; used as the output subdirectory name for the extracted DDL files and the database dictionary.
Example: DWH

hive.extractor.dbsInclude
Description: List of databases to extract, separated by commas; each part is evaluated as a regular expression.
Example: database1,database2,stage_.*

hive.extractor.dbsExclude
Description: List of databases and schemas to exclude from extraction, separated by commas.
Example: database1,database2,stage_.*

hive.extraction.method
Description: Set to Agent:default to use the default Manta Extractor Agent, to Agent:{remote_agent_name} to use a remote agent, or to Git:{git.dictionary.id} to use the Git ingest method. For more information on setting up a remote extractor agent, refer to the Manta Flow Agent Configuration for Extraction documentation. For additional details on configuring the Git ingest method, refer to the Manta Flow Agent Configuration for Extraction: Git Source documentation.
Values: default, Git, agent

hive.ddl.encoding
Description: Encoding of automatically extracted Hive DDL scripts. See Encodings for applicable values.
Example: utf8

hive.script.encoding
Description: Encoding of manually provided Hive SQL scripts that are executed on this Hive source system. See Encodings for applicable values.
Example: utf8

hive.extractor.hiveserver2.url
Description: JDBC connection string to HiveServer2.
Example: jdbc:hive2://localhost:10000/databasename

hive.extractor.hiveserver2.host
Description: Host name of the server where HiveServer2 is installed.
Example: localhost

hive.extractor.hiveserver2.driver
Description: JDBC driver class used to connect to HiveServer2.
Example: org.apache.hive.jdbc.HiveDriver

hive.extractor.hiveserver2.distribution
Description: Hive vendor: Cloudera, Apache, or another vendor.
Values: Cloudera, Apache, Other

hive.extractor.hiveserver2.user
Description: User name used to connect to HiveServer2.
Example: hive_user

hive.extractor.hiveserver2.password
Description: Password used to connect to HiveServer2.
Example: hive_password

hive.extractor.hiveserver2.kerberos.principal
Description: User/principal of the HiveServer2 service. Only applicable when hive.extractor.hiveserver2.distribution is set to Apache.
Example: hive/server.example.com@EXAMPLE.COM

hive.extractor.hiveserver2.krbPath
Description: Path to the Kerberos KRB5 configuration file. Only applicable when hive.extractor.hiveserver2.distribution is set to Apache.
Example: /etc/krb5.conf

hive.extractor.hiveserver2.kerberos.method
Description: Kerberos authentication method. Only applicable when hive.extractor.hiveserver2.distribution is set to Apache.
Values: System ticket, Keytab, JAAS

hive.extractor.hiveserver2.kerberosUser
Description: Kerberos user for the Manta extractor. Only applicable when hive.extractor.hiveserver2.kerberos.method is set to Keytab and hive.extractor.hiveserver2.distribution is set to Apache.
Example: admin/kadmin@EXAMPLE.COM

hive.extractor.hiveserver2.keytabPath
Description: Path to the Kerberos keytab. Only applicable when hive.extractor.hiveserver2.kerberos.method is set to Keytab and hive.extractor.hiveserver2.distribution is set to Apache.
Example: /opt/mantaflow/cli/scenarios/manta-dataflow-cli/conf/keytabs/manta.keytab

java.security.auth.login.config
Description: Path to the JAAS configuration file. Only applicable when hive.extractor.hiveserver2.kerberos.method is set to JAAS and hive.extractor.hiveserver2.distribution is set to Apache.
Example: /etc/jaas.conf

hive.extractor.hiveserver2.authentication.mode
Description: Authentication mode used for the Cloudera connection. Only applicable when hive.extractor.hiveserver2.distribution is set to Cloudera.
Values: Kerberos, Base

hive.extractor.hiveserver2.cloudera.kerberos.realm
Description: Kerberos realm. If your Kerberos setup does not define a default realm, or if the realm of your Hive server is not the default, set the appropriate realm using this property. Only applicable when hive.extractor.hiveserver2.distribution is set to Cloudera.
Example: EXAMPLE.COM

hive.extractor.hiveserver2.cloudera.kerberos.hostfqdn
Description: Fully qualified Kerberos domain name of the HiveServer2 host. Only applicable when hive.extractor.hiveserver2.distribution is set to Cloudera.
Example: hs2.example.com

hive.extractor.hiveserver2.cloudera.kerberos.servicename
Description: Service name of HiveServer2. Only applicable when hive.extractor.hiveserver2.distribution is set to Cloudera.
Example: hive
Files referenced by configuration path properties can be located anywhere in the file system as long as Automatic Data Lineage can access them.
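To make the individual properties more concrete, the following is a minimal sketch of a connection to a Kerberized Cloudera HiveServer2 instance, shown in key=value form for readability; the same values can be entered in the Admin UI connection form or supplied through the Manta Orchestration API. All host names, realm names, and credentials are illustrative placeholders, not defaults.

    hive.dictionary.id=DWH
    hive.extraction.method=Agent:default
    hive.extractor.hiveserver2.distribution=Cloudera
    hive.extractor.hiveserver2.url=jdbc:hive2://hs2.example.com:10000/default
    hive.extractor.hiveserver2.host=hs2.example.com
    hive.extractor.hiveserver2.driver=org.apache.hive.jdbc.HiveDriver
    hive.extractor.hiveserver2.user=hive_user
    hive.extractor.hiveserver2.password=hive_password
    hive.extractor.hiveserver2.authentication.mode=Kerberos
    hive.extractor.hiveserver2.cloudera.kerberos.realm=EXAMPLE.COM
    hive.extractor.hiveserver2.cloudera.kerberos.hostfqdn=hs2.example.com
    hive.extractor.hiveserver2.cloudera.kerberos.servicename=hive
    hive.ddl.encoding=utf8
    hive.script.encoding=utf8
    hive.extractor.dbsInclude=database1,database2,stage_.*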

Common Scanner Properties

This configuration is common for all Hive source systems and for all Hive scenarios, and it is configured in Admin UI > Configuration > CLI > Hive > Hive Common. It can be overridden at the individual connection level.

hive.dictionary.dir
Description: Directory with data dictionaries extracted from Hive.
Example: ${manta.dir.temp}/hive

filepath.lowercase
Description: Whether paths to files should be lowercase (false for case-sensitive file systems, true otherwise).
Values: false, true

hive.dll.output
Description: Directory for automatically extracted Hive DDL scripts (for the extraction phase).
Example: ${manta.dir.temp}/hive/${hive.dictionary.id}/ddl

hive.script.input
Description: Directory with manually provided Hive SQL scripts that are executed on this Hive source system (for the analysis phase).
Example: ${manta.dir.input}/hive/${hive.dictionary.id}/

hive.script.replace
Description: Path to the CSV file with the replacements to be applied to the provided HiveQL scripts; see Placeholder Replacement in Input Scripts.
Example: ${manta.dir.input}/hive/${hive.dictionary.id}/replace.csv

hive.script.replace.regex
Description: Flag specifying whether the replacements for HiveQL scripts in the CSV file specified in hive.script.replace should be interpreted as regular expressions (true) or as simple text (false).
Values: false, true

hive.analyze.parallelCount
Description: Number of parallel threads that analyze DDL and Hive SQL scripts.
Example: 4

hive.variableSubstitutionFile
Description: Path to the user-provided variable substitution configuration.
Example: ${manta.dir.input}/hive/${hive.dictionary.id}/variableSubstitution.properties

hive.dictionary.mappingFile
Description: Path to the automatically generated mappings for Hive databases (the file will be created if missing).
Example: ${manta.dir.temp}/hive/hiveDictionaryMantaMapping.csv

hive.dictionary.mappingManualFile
Description: Path to the manually provided mappings for Hive databases (the file must exist, even if empty; create it if it is missing).
Example: ${manta.dir.scenario}/conf/hiveDictionaryMantaMappingManual.csv

hive.extractor.enabled.encryptConnection
Description: Use an SSL-protected connection.
Values: false, true

hive.analyze.retainUnusedResultSetColumns
Description: Flag specifying whether the data lineage should include sub-query result set columns that do not have any downstream lineage. Set to false by default.
Values: false, true
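As an illustration of how the default paths fit together: with hive.dictionary.id set to DWH, manually provided inputs are read from the connection's input directory, while automatically extracted DDL is written under the temporary directory. The script file name below is hypothetical.

    ${manta.dir.input}/hive/DWH/
        daily_load.hql                     manually provided HiveQL script
        replace.csv                        optional placeholder replacements
        variableSubstitution.properties    optional variable substitution
    ${manta.dir.temp}/hive/DWH/ddl/        automatically extracted DDL scripts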

Manual Mapping Properties

Mappings for Hive servers can be configured manually in the file ${hive.dictionary.mappingManualFile} listed in the table above. To configure a manual mapping, copy the relevant line from the automatically generated file ${hive.dictionary.mappingFile} and change the Connection ID column to the external connection identifier.

Each mapping has its own row with the following parts separated by semicolons; an example row is shown after the table below.

Dictionary ID
Description: Name of a resource representing the Hive server, known as the dictionary ID; used as an output subdirectory name for the extracted DDL files and the database dictionary.
Example: Hive

Host name
Description: Name of the host where the databases are running.
Example: localhost

Connection ID
Description: External Hive connection ID in third-party tools; it can be left empty.
Example: Hive

Included databases
Description: List of databases to extract, separated by commas; leave blank to include all databases on the server.
Example: database1,database2

Excluded databases
Description: List of databases NOT to extract, separated by commas; leave blank to not restrict extraction.
Example: database1,database2
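A minimal example of a manual mapping row, using illustrative values (the connection identifier HIVE_PROD is hypothetical):

    Hive;localhost;HIVE_PROD;database1,database2;staging

This maps the dictionary Hive on host localhost to the external connection HIVE_PROD, restricts extraction to database1 and database2, and excludes the staging database.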