Apache Hive connection
To access your data in Apache Hive, create a connection asset for it.
Apache Hive is a data warehouse software project, built on top of Apache Hadoop, that provides data query and analysis.
Supported versions
- Amazon Elastic MapReduce 2.1.4+
- Apache Hadoop Hive
- Cloudera CDH3 update 4+
- Hortonworks 1.3+
- MapR 1.2+
- Pivotal HD Enterprise 2.0.1
Prerequisite for Kerberos authentication
To use Kerberos authentication, the data source must be configured for Kerberos and the service that you plan to use the connection in must support Kerberos. For information, see Enabling platform connections to use Kerberos authentication.
Create a connection to Apache Hive
To create the connection asset, you need these connection details:
- Database name
- Hostname or IP address
- Port number
- HTTP path (optional): The path of the endpoint, such as gateway, default, or hive, if the server is configured for the HTTP transport mode.
- SSL certificate (if required by the database server)
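These details map directly onto most Hive client libraries if you want to verify them outside the platform first. The following is a minimal sketch that assumes the open-source PyHive library, the default binary transport mode, and placeholder host, database, and credential values:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# All values below are placeholders; substitute your own details.
conn = hive.Connection(
    host="hive.example.com",   # hostname or IP address
    port=10000,                # default HiveServer2 port
    database="default",        # database name
    username="hiveuser",
    password="********",
    auth="LDAP",               # username/password auth over SASL
)
```

If your server uses the HTTP transport mode instead, the HTTP path is typically supplied through a JDBC-style client rather than this binary-transport sketch.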
Authentication method
You can choose Kerberos credentials or Username and password.
For Kerberos credentials, you must complete the prerequisite for Kerberos authentication, and you need the following connection details:
- Service principal name (SPN) that is configured for the data source
- User principal name to connect to the Kerberized data source
- The keytab file for the user principal name that is used to authenticate to the Key Distribution Center (KDC)
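To illustrate how these details fit together, here is a minimal sketch that again assumes PyHive, with placeholder names. It presumes that a ticket has already been obtained from the KDC with the keytab, and that the service part of the SPN is hive:

```python
from pyhive import hive  # requires SASL/Kerberos support (pyhive[hive])

# Assumes a valid ticket was obtained beforehand, for example:
#   kinit -kt user.keytab user@EXAMPLE.COM
conn = hive.Connection(
    host="hive.example.com",
    port=10000,
    database="default",
    auth="KERBEROS",
    # Service part of the SPN, e.g. hive/_HOST@EXAMPLE.COM -> "hive"
    kerberos_service_name="hive",
)
```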
ZooKeeper discovery (optional)
Select Use ZooKeeper discovery so that the connection remains available if the Apache Hive server that you log in to fails.
Prerequisites for ZooKeeper discovery:
- ZooKeeper must be configured in your Hadoop cluster.
- The Hive service in the Hadoop cluster must be configured for ZooKeeper, along with the ZooKeeper namespace.
- Alternative servers for failover.
Enter the ZooKeeper namespace and a comma-separated list of alternative servers in this format:
hostname1:port-number1,hostname2:port-number2,hostname3:port-number3
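For comparison, Hive's JDBC clients carry the same information in the connection URL. A sketch that assumes the standard Hive JDBC URL syntax and placeholder hosts and namespace:

```python
# Placeholder ZooKeeper ensemble and namespace; replace with your own values.
zk_ensemble = "host1.example.com:2181,host2.example.com:2181,host3.example.com:2181"
namespace = "hiveserver2"

# Standard Hive JDBC URL form for ZooKeeper service discovery.
jdbc_url = (
    f"jdbc:hive2://{zk_ensemble}/default;"
    f"serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}"
)
print(jdbc_url)
```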
For Credentials and Certificates, you can use secrets if a vault is configured for the platform and the service supports vaults. For information, see Using secrets from vaults in connections.
Choose the method for creating a connection based on where you are in the platform:
- In a project
- Click Assets > New asset > Data access tools > Connection. See Adding a connection to a project.
- In a catalog
- Click Add to catalog > Connection. See Adding a connection asset to a catalog.
- In a deployment space
- Click Add to space > Connection. See Adding connections to a deployment space.
- In the Platform assets catalog
- Click New connection. See Adding platform connections.
Next step: Add data assets from the connection
Where you can use this connection
You can use the Apache Hive connection in the following workspaces and tools:
Projects
- Data quality rules (Watson Knowledge Catalog)
- Data Refinery (Watson Studio or Watson Knowledge Catalog)
- DataStage (DataStage service). See Connecting to a data source in DataStage.
- Metadata enrichment (Watson Knowledge Catalog)
- Metadata import (Watson Knowledge Catalog). For information about the supported product versions and other prerequisites when connections are based on MANTA Automated Data Lineage for IBM Cloud Pak for Data scanners, see the Lineage Scanner Configuration section in the MANTA Automated Data Lineage on IBM Cloud Pak for Data Installation and Usage Manual. This documentation is available at https://www.ibm.com/support/pages/node/6597457.
  For metadata import (lineage), advanced metadata import must be enabled and a MANTA Automated Data Lineage license key must be installed. See Installing Watson Knowledge Catalog and Enabling lineage import.
  In addition, a specific driver must be uploaded to the manta-dataflow pod or the manta-admin-gui pod. See Uploading JDBC drivers for lineage import. For Kerberos authentication, some prerequisite configuration is also required. See Configuring Hive with Kerberos for lineage imports.
- Notebooks (Watson Studio). Use the insert-to-code function to get the connection credentials and load the data into a data structure. See Load data from data source connections. A sketch of this load pattern follows the list.
- SPSS Modeler (SPSS Modeler service)
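As a minimal sketch of the load pattern that the Notebooks item describes, the following assumes PyHive and pandas rather than the snippet that insert-to-code generates; the table name and connection values are placeholders:

```python
import pandas as pd
from pyhive import hive

# Placeholder details; in a notebook, the insert-to-code function
# supplies the real credentials for the connection asset.
conn = hive.Connection(host="hive.example.com", port=10000,
                       database="default", username="hiveuser",
                       password="********", auth="LDAP")

# Load the query result into a pandas DataFrame.
df = pd.read_sql("SELECT * FROM sales LIMIT 100", conn)
print(df.head())
```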
Catalogs
- Platform assets catalog
- Other catalogs (Watson Knowledge Catalog)
- Watson Query service
- You can connect to this data source from Watson Query.
Data discovery and Data quality
- Automated discovery
Restriction
For all services except DataStage, you can use this connection only for source data. You cannot write data or export data with this connection. In DataStage, you can use this connection as a target if you select Use DataStage properties in the connector's properties.
Running SQL statements
To ensure that your SQL statements run correctly, refer to the SQL Operations section in the Apache Hive documentation for the correct HiveQL syntax.
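Because HiveQL differs from ANSI SQL in places, it can help to test a statement through a cursor first. A short, self-contained sketch, again assuming PyHive; the table and column names are placeholders:

```python
from pyhive import hive

# Connection values are placeholders (see the earlier sketches).
conn = hive.Connection(host="hive.example.com", port=10000,
                       database="default", username="hiveuser",
                       password="********", auth="LDAP")
cursor = conn.cursor()

# HiveQL aggregate query; check constructs against the SQL Operations docs.
cursor.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
cursor.close()
```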
Learn more
- Apache Hive setup
Parent topic: Supported connections