February 21, 2023 By Torsten Steinbach 5 min read

Exploring the expanded capability in IBM Cloud to build and manage cloud data lakes on IBM Cloud Object Storage.

In particular, this article explains the role of table metadata and how the IBM Cloud Data Engine service delivers this important component for your data lake.

We recommend that you also watch the replay of the recent webinar “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud” as well as the accompanying demo video to see the broader ecosystem in which this capability fits.

Context

It’s not breaking news that metadata is a major element that needs to be managed for data and analytics solutions. Most people immediately associate data governance with this subject, and this is well justified because this is the type of metadata that ensures easy discoverability, data protection and tracking of the lineage for your data.

However, metadata comprises more than just data governance. Most importantly, it also includes the so-called technical metadata. This is information about the schema of a data set, its data types and statistical information about the values in each column. This technical metadata is especially relevant when we talk about data lakes. Unlike integrated database repositories such as RDBMS, which have built-in technical metadata, a data lake keeps the technical metadata in a separate component that needs to be set up and maintained explicitly.

Often, this component is referred to as the metastore or the table catalog. It’s technical information about your data that is required to compile and execute analytic queries — in particular, SQL statements.
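To make this concrete, here is a minimal Python sketch of the kind of information a table catalog entry holds and how a query compiler might consult it. The class and function names (`ColumnMetadata`, `TableMetadata`, `resolve_columns`), the location placeholder and the sample statistics are hypothetical illustrations; this is not the Hive Metastore data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnMetadata:
    """Technical metadata for one column: its type plus optional statistics."""
    data_type: str
    min_value: Optional[object] = None
    max_value: Optional[object] = None
    null_count: Optional[int] = None


@dataclass
class TableMetadata:
    """One catalog entry: where the data lives and what its schema looks like."""
    name: str
    location: str     # object storage prefix that holds the data files
    file_format: str  # for example "parquet"
    columns: Dict[str, ColumnMetadata] = field(default_factory=dict)


def resolve_columns(table: TableMetadata, referenced: List[str]) -> Dict[str, str]:
    """Return the data types of the referenced columns, or raise if one is unknown."""
    missing = [c for c in referenced if c not in table.columns]
    if missing:
        raise KeyError(f"unknown column(s) in {table.name}: {missing}")
    return {c: table.columns[c].data_type for c in referenced}


# Hypothetical entry for a customer table stored as Parquet objects.
my_customers = TableMetadata(
    name="my_customers",
    location="cos://<endpoint>/<bucket>/customers/",  # placeholder, not a real path
    file_format="parquet",
    columns={
        "customer_id": ColumnMetadata("bigint", min_value=1, max_value=5000, null_count=0),
        "country": ColumnMetadata("string", null_count=12),
    },
)

print(resolve_columns(my_customers, ["country"]))  # prints {'country': 'string'}
```

A query engine needs exactly this kind of lookup before it can plan a statement: without the schema it cannot tell whether `country` exists, and without the location and format it cannot find or decode the data.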

The recent trend toward data lakehouse technology is pushing part of the technical metadata to be collocated and stored along with the data itself in compound table formats like Iceberg and Delta Lake. However, this does not eliminate the need for a dedicated and central metastore component because table formats can only handle table-level metadata. Data is typically stored across multiple tables in a more or less complex table schema, which sometimes also includes information about referential relationships between tables or logical data models on top of tables as so-called views.

For these reasons, every data lake requires a metastore component or service. The most widely established metastore interface that is supported by a broad set of big data query and processing engines and libraries is the Hive Metastore. As the name reveals, its origins are in the Hadoop ecosystem. However, it is no longer tied to or dependent on Hadoop at all, and it is frequently deployed and consumed in Hadoop-less environments, such as in a cloud data lake solution stack.

The metadata in a Hive Metastore is just as important as your data in the data lake itself and must be handled accordingly. This means that its metadata must be made persistent, highly available and included in any disaster recovery setup.

IBM launches IBM Cloud Data Engine

In our ongoing journey to expand IBM Cloud’s built-in data lake functionality, we launched the IBM Cloud Data Engine in May 2022. It expands the established serverless SQL processing service (formerly known as IBM Cloud SQL Query) by adding fully managed Hive Metastore functionality.

Each serverless instance of IBM Cloud Data Engine is now also a dedicated instance and namespace of a Hive Metastore that can be used to configure, store and manage your table and data model metadata for all your data lake data on IBM Cloud Object Storage. You don’t have to worry about backups — the Hive Metastore data is highly available as part of the entire Data Engine service itself. The serverless consumption model of Data Engine also applies to the Hive Metastore function, which means that you are only charged for actual requests. There are no standing costs for having a Data Engine instance with metadata in its Hive Metastore.

This seamlessly integrates with the serverless SQL-based data ingestion, data transformation and analytic query functions that IBM Cloud Data Engine inherits from the IBM Cloud SQL Query service.

But Data Engine can now also be used as a Hive Metastore with other big data runtimes that you deploy and provision elsewhere. For instance, you can use the Spark runtime services in IBM Cloud Pak for Data with IBM Watson Studio or IBM Analytics Engine to connect to your instance of Data Engine as the Hive Metastore that serves as your relational table catalog for your Spark SQL jobs. The following diagram visualizes this architecture:

Using Data Engine with Spark aaS in IBM Cloud

Using Data Engine as your table catalog is very easy when you use built-in Spark runtime services in IBM Cloud and IBM Cloud Pak for Data. The required connectors to the Hive Metastore of Data Engine are already deployed there out of the box. The following few lines of PySpark code set up a SparkSession object that is configured with your own instance of IBM Cloud Data Engine:

from dataengine import SparkSessionWithDataengine

# Replace both placeholders with the values for your own Data Engine instance
instancecrn = "<your Data Engine instance ID>"
apikey = "<your API key to access your Data Engine instance>"

# Build a SparkSession that uses Data Engine as its Hive Metastore
session_builder = SparkSessionWithDataengine.enableDataengine(instancecrn, apikey)
spark = session_builder.appName("My Spark App").getOrCreate()
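The snippet above assigns the instance identifier and API key directly in the code. If you prefer to keep them out of notebooks and source files, one option is to read them from environment variables. The following is a minimal sketch under that assumption; the helper `load_dataengine_credentials` and the variable names `DATAENGINE_INSTANCE_CRN` and `DATAENGINE_APIKEY` are hypothetical and not part of the Data Engine library.

```python
import os


def load_dataengine_credentials(env=os.environ):
    """Read the instance identifier and API key from environment variables.

    The variable names are hypothetical; choose whatever fits your deployment.
    """
    try:
        instancecrn = env["DATAENGINE_INSTANCE_CRN"]
        apikey = env["DATAENGINE_APIKEY"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None
    return instancecrn, apikey


# Usage, assuming both variables are set in the Spark driver environment:
# instancecrn, apikey = load_dataengine_credentials()
# session_builder = SparkSessionWithDataengine.enableDataengine(instancecrn, apikey)
```

Passing the mapping as a parameter keeps the helper easy to test with a plain dictionary.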

You can now use the SparkSession as usual; for instance, to get a listing of the currently defined tables and to submit SQL statements that access these tables:

spark.sql('show tables').show()
spark.sql('select count(*), country from my_customers group by country').show()
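A table only shows up in `show tables` once it has been registered in the catalog. As a rough sketch, the helper below assembles a Spark SQL `CREATE TABLE ... USING ... LOCATION` statement for data that already exists on Object Storage. The function `create_table_ddl`, the column definitions and the location placeholder are assumptions made for illustration. Whether your setup accepts this exact DDL is also an assumption, so check the Hive Metastore documentation of Data Engine.

```python
def create_table_ddl(table, columns, location, file_format="parquet"):
    """Build a Spark SQL CREATE TABLE statement for data that already exists at `location`.

    Identifiers are inserted as-is, so only pass trusted names.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({column_list}) "
        f"USING {file_format} LOCATION '{location}'"
    )


ddl = create_table_ddl(
    "my_customers",
    {"customer_id": "bigint", "country": "string"},
    "cos://<endpoint>/<bucket>/customers/",  # placeholder, not a real path
)
print(ddl)

# With a SparkSession configured for Data Engine you would then run:
# spark.sql(ddl)
# spark.sql('show tables').show()
```

Because the statement is stored in the central metastore, any other Spark runtime connected to the same Data Engine instance would see the same table definition.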

Using Data Engine with your custom Spark deployments

When you manage your own Spark runtimes, you can use the same mechanisms as above. However, you first have to set up the Data Engine connector libraries in your Spark environment.

Install the Data Engine SparkSession builder

  1. Download the jar file for the SparkSession builder and place it in a folder in the classpath of your Spark installation (normally you should use the folder “user-libs/spark2”).
  2. Download the Python library to a local directory on the machine of your Spark installation and install it with pip:
    pip install --force-reinstall <download dir>/dataengine_spark-1.0.10-py3-none-any.whl

Install and activate the Data Engine Hive client library

  1. Download the Hive client from this link and store it in a directory on your machine where you run Spark.
  2. Specify that directory name as an additional parameter when building the SparkSession with Data Engine as the catalog:
    session_builder = SparkSessionWithDataengine.enableDataengine(instancecrn, apikey, pathToHiveMetastoreJars="<directory name with hive client>")
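A wrong directory name is an easy mistake to make in this step, so it can help to check the directory before building the session. The sketch below assumes that the Hive client is delivered as one or more `.jar` files; the helper `find_hive_client_jars` is hypothetical and not part of the Data Engine library.

```python
from pathlib import Path


def find_hive_client_jars(directory):
    """Return the names of the jar files in `directory`, or raise if there are none."""
    path = Path(directory)
    if not path.is_dir():
        raise FileNotFoundError(f"not a directory: {path}")
    jars = sorted(p.name for p in path.glob("*.jar"))
    if not jars:
        raise FileNotFoundError(f"no jar files found in {path}")
    return jars


# Usage before building the session (the directory name is your own choice):
# hive_client_dir = "<directory name with hive client>"
# print(find_hive_client_jars(hive_client_dir))
# session_builder = SparkSessionWithDataengine.enableDataengine(
#     instancecrn, apikey, pathToHiveMetastoreJars=hive_client_dir)
```

Failing early with a clear message is usually easier to diagnose than a class loading error raised later from inside the Spark session.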

For more details, please refer to the Hive Metastore documentation of Data Engine. You can also use our Data Engine demo notebook, which you can download for local usage in your own Jupyter notebook environment or in the Watson Studio notebook service in Cloud Pak for Data.

In chapter 10 of the notebook, you can find a detailed setup and usage demo for Spark with Hive Metastore in Data Engine. You can also see a short demo of that notebook at minute 14:35 in the aforementioned demo video for the “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud” webinar.

Conclusion

With the new Hive Metastore as a Service capability in IBM Cloud described in this article, you get a central element for state-of-the-art data lakes in IBM Cloud delivered fully out of the box. There is no Day 1 setup or Day 2 operational overhead that you have to plan for. Just go and set up a serverless cloud-native data lake by provisioning an IBM Cloud Object Storage instance for your data and a Data Engine instance for your metadata.

Then, you can start ingesting, preparing, curating and using your data lake data with the Data Engine service itself or with your custom Spark applications. These can run in the Analytics Engine service, in Spark runtimes in Watson Studio or in your completely custom Spark runtime anywhere, all connected to the same data on Object Storage and the same metadata in Data Engine.

Learn more about IBM Cloud Data Engine.
