This article explores the expanded capability in IBM Cloud to build and manage cloud data lakes on IBM Cloud Object Storage.
In particular, it explains the role of table metadata and how the IBM Cloud Data Engine service delivers this important component for your data lake.
We recommend that you also watch the replay of the recent webinar, “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud,” as well as the accompanying demo video to see the broader ecosystem in which this capability fits.
It’s not breaking news that metadata is a major element that needs to be managed for data and analytics solutions. Most people immediately associate this subject with data governance, and that is well justified: governance metadata ensures easy discoverability, data protection and lineage tracking for your data.
However, metadata comprises more than just data governance. Most importantly, it also includes the so-called technical metadata. This is information about the schema of a data set, its data types and statistical information about the values in each column. Technical metadata is especially relevant when we talk about data lakes because, unlike integrated database repositories such as an RDBMS — which have built-in technical metadata — the technical metadata in a data lake is a separate component that needs to be set up and maintained explicitly.
Often, this component is referred to as the metastore or the table catalog. It’s technical information about your data that is required to compile and execute analytic queries — in particular, SQL statements.
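To make the idea of technical metadata concrete, the following sketch shows the kind of information a metastore records for a single table. The field names and values here are purely illustrative (they are not the actual Hive Metastore schema); the point is that the catalog stores schema, format and location, while the data itself stays on Object Storage.

```python
# Illustrative (hypothetical) catalog entry for one table in a metastore.
# The data files themselves remain on Object Storage; the metastore only
# records how to interpret and locate them.
table_entry = {
    "database": "sales",                       # logical namespace
    "table": "orders",                         # table name used in SQL
    "columns": [                               # schema: (name, type) pairs
        ("order_id", "bigint"),
        ("amount", "decimal(10,2)"),
        ("ts", "timestamp"),
    ],
    "format": "parquet",                       # physical file format
    "location": "cos://us-geo/my-bucket/orders/",  # placeholder bucket path
    "partition_keys": ["ts_date"],             # partitioning scheme
}
```

A query engine compiles `SELECT amount FROM sales.orders` by looking up exactly this kind of entry: without it, the engine would see only anonymous files in a bucket.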
The recent trend toward data lakehouse technology is pushing some technical metadata to be collocated and stored along with the data itself in compound table formats like Iceberg and Delta Lake. However, this does not eliminate the need for a dedicated, central metastore component, because table formats can only handle table-level metadata. Data is typically stored across multiple tables in a more or less complex table schema, which sometimes also includes information about referential relationships between tables or logical data models on top of tables as so-called views.
For these reasons, every data lake requires a metastore component or service. The most widely established metastore interface, supported by a broad set of big data query and processing engines and libraries, is the Hive Metastore. As the name reveals, its origins are in the Hadoop ecosystem. However, it is no longer tied to or dependent on Hadoop, and it is frequently deployed and consumed in Hadoop-less environments, such as in a cloud data lake solution stack.
The metadata in a Hive Metastore is just as important as your data in the data lake itself and must be handled accordingly. This means that its metadata must be made persistent, highly available and included in any disaster recovery setup.
IBM launches IBM Cloud Data Engine
In our ongoing journey to expand IBM Cloud’s built-in data lake functionality, we launched the IBM Cloud Data Engine in May 2022. It expands the established serverless SQL processing service (formerly known as IBM Cloud SQL Query) by adding a fully managed Hive Metastore functionality.
Each serverless instance of IBM Cloud Data Engine is now also a dedicated instance and namespace of a Hive Metastore that can be used to configure, store and manage your table and data model metadata for all your data lake data on IBM Cloud Object Storage. You don’t have to worry about backups — the Hive Metastore data is highly available as part of the entire Data Engine service itself. The serverless consumption model of Data Engine also applies to the Hive Metastore function, which means that you are only charged for actual requests. There are no standing costs for having a Data Engine instance with metadata in its Hive Metastore.
This seamlessly integrates with the serverless SQL-based data ingestion, data transformation and analytic query functions that IBM Cloud Data Engine inherits from the IBM Cloud SQL Query service:
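As a sketch of that serverless SQL-based ingestion and transformation, the snippet below uses the `ibmcloudsql` Python client (the published client package for the service) to submit a transformation that reads raw CSV and writes curated Parquet back to Object Storage. The bucket URLs, API key and instance CRN are placeholders, and the exact client method names should be verified against the current Data Engine documentation, which is why the client calls are shown as comments.

```python
# Sketch of a serverless ingestion/transformation job with IBM Cloud Data Engine.
# The SQL dialect shown (FROM cos://... INTO cos://... STORED AS ...) is the
# documented Data Engine pattern; bucket names below are placeholders.

# import ibmcloudsql  # pip install ibmcloudsql
# sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)

transform_sql = """
SELECT order_id,
       CAST(amount AS DECIMAL(10,2)) AS amount,
       ts
FROM cos://us-geo/raw-bucket/orders.csv STORED AS CSV
INTO cos://us-geo/curated-bucket/orders/ STORED AS PARQUET
"""

# Submitting and waiting for the job (requires real credentials, so commented):
# job_id = sql_client.submit_sql(transform_sql)
# sql_client.wait_for_job(job_id)
```

Because the service is serverless, you pay only for the scanned and written data of such a job; there is no cluster to size or keep running.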
But Data Engine can now also be used as a Hive Metastore with other big data runtimes that you deploy and provision elsewhere. For instance, you can use the Spark runtime services in IBM Cloud Pak for Data with IBM Watson Studio or IBM Analytics Engine to connect to your instance of Data Engine as the Hive Metastore that serves as your relational table catalog for your Spark SQL jobs. The following diagram visualizes this architecture:
Using Data Engine with Spark aaS in IBM Cloud
Using Data Engine as your table catalog is very easy when you use the built-in Spark runtime services in IBM Cloud and IBM Cloud Pak for Data. The required connectors to the Data Engine Hive Metastore are already deployed there out of the box. A few lines of PySpark code set up a SparkSession object that is configured with your own instance of IBM Cloud Data Engine:
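The sketch below shows the general shape of such a setup using standard Spark and Hive settings. The Data Engine SparkSession builder wraps these details for you, so treat this as an illustration under stated assumptions: `hive.metastore.uris` is a standard Hive setting, but the endpoint host shown is a placeholder, and the additional authentication settings (your instance CRN and API key) must be taken from the Data Engine documentation.

```python
# Hedged sketch: wiring a SparkSession to a remote Hive Metastore.
# <metastore-endpoint> is a placeholder; Data Engine additionally requires
# authentication settings (instance CRN, API key) per its documentation.

conf = {
    # Standard Hive setting: the Thrift endpoint of the remote metastore
    "spark.hadoop.hive.metastore.uris": "thrift://<metastore-endpoint>:9083",
}

# Building the session requires a Spark installation, so it is commented here:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("data-engine-catalog").enableHiveSupport()
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```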
You can now use the SparkSession as usual; for instance, to get a listing of the currently defined tables and to submit SQL statements that access these tables:
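Once wired up, the session behaves like any Hive-backed catalog. The following sketch illustrates the usual calls; the table name is a placeholder, and the `spark.sql(...)` invocations are commented because they require a live Spark session connected to your Data Engine instance.

```python
# Typical catalog interactions once the SparkSession is configured.
# "orders" is a placeholder table name, not one that exists by default.

list_tables_sql = "SHOW TABLES"
query_sql = "SELECT ts_date, COUNT(*) AS n FROM orders GROUP BY ts_date"

# With a live session:
# spark.sql(list_tables_sql).show()   # tables registered in Data Engine
# spark.sql(query_sql).show()         # query resolved via the metastore
# spark.catalog.listTables()          # the equivalent programmatic catalog API
```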
Using Data Engine with your custom Spark deployments
When you manage your own Spark runtimes, you can use the same mechanisms as above. However, you first have to set up the Data Engine connector libraries in your Spark environment:
Install the Data Engine SparkSession builder
- Download the jar file for the SparkSession builder and place it in a folder in the classpath of your Spark installation (normally you should use the folder “user-libs/spark2”).
- Download the Python library to a local directory on the machine of your Spark installation and install it with pip:
Install and activate the Data Engine Hive client library
- Download the Hive client from this link and store it in a directory on your machine where you run Spark.
- Specify that directory name as an additional parameter when building the SparkSession with Data Engine as the catalog:
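One way this can look in practice is via Spark's standard Hive metastore settings, which let you point a session at an external Hive client library. `spark.sql.hive.metastore.jars` and `spark.sql.hive.metastore.version` are documented Spark configuration keys; the directory path and version number below are placeholders you would adapt to your download location and the Hive version Data Engine requires.

```python
# Hedged sketch: pointing a self-managed Spark at a downloaded Hive client.
# The path and version are placeholders; check the Data Engine documentation
# for the exact values and parameter your Spark version expects.

hive_client_conf = {
    "spark.sql.hive.metastore.version": "2.3.9",            # placeholder version
    "spark.sql.hive.metastore.jars": "/opt/hive-client/*",  # downloaded jars dir
}

# Applied while building the session (requires a Spark installation):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.enableHiveSupport()
# for key, value in hive_client_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```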
For more details, please refer to the Hive Metastore documentation of Data Engine. You can also use our Data Engine demo notebook, which you can download for local usage in your own Jupyter notebook environment or in the Watson Studio notebook service in Cloud Pak for Data.
In chapter 10 of the notebook, you can find a detailed setup and usage demo for Spark with Hive Metastore in Data Engine. You can also see a short demo of that notebook at minute 14:35 in the aforementioned demo video for the “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud” webinar.
With the new Hive Metastore as a Service capability in IBM Cloud described in this article, you get a central element for state-of-the-art data lakes in IBM Cloud, delivered fully out of the box. There is no Day 1 setup or Day 2 operational overhead that you have to plan for. Just go and set up a serverless, cloud-native data lake by provisioning an IBM Cloud Object Storage instance for your data and a Data Engine instance for your metadata.
Then, you can start ingesting, preparing, curating and using your data lake data with the Data Engine service itself or with your custom Spark applications, the Analytics Engine service, Spark runtimes in Watson Studio or a completely custom Spark runtime anywhere — all connected to the same data on Object Storage and the same metadata in Data Engine.
Learn more about IBM Cloud Data Engine.