June 24, 2021 By Torsten Steinbach 6 min read

IBM Cloud Pak for Data now comes with an integrated connector to data lakes on IBM Cloud Object Storage.

IBM Cloud provides a fully managed data lake platform for data stored on IBM Cloud Object Storage. The central service of this platform is IBM Cloud SQL Query, which provides data ingestion (streaming-based), data preparation, data analytics and an integrated relational table catalog.

IBM Cloud Pak® for Data as a Service — a managed platform for a broad range of data analytics and artificial intelligence (AI) capabilities — now comes with a built-in connector to the IBM Cloud SQL Query service. This allows you to set up connections to data lakes on IBM Cloud Object Storage and to import data lake tables as assets into projects and catalogs in Cloud Pak for Data.

You can create and use these cloud data lake connections like any other data source connection in Cloud Pak for Data: in your own projects, in a shared catalog or in a deployment space. This post walks through creating a cloud data lake connection in a project.

Adding a cloud data lake connection

Navigate to one of your projects and click Add to project:

Select Connection as the asset type to add:

For connection type, select SQL Query and then click Select:

Fill out the connection properties for your SQL Query service instance. See below for details on how to identify the values for your account:

Finding the CRN of the IBM Cloud SQL Query service

Navigate to the IBM Cloud console for your services and select the SQL Query instance to which you want to connect. In the service dashboard, you’ll find the CRN of the instance, as seen in the following screenshot:

Identifying the target Cloud Object Storage

IBM Cloud SQL Query persists query results to a defined IBM Cloud Object Storage (COS) location, which is why you must specify this location in the connection. You can select any COS bucket to which you have write access. A recommended best practice is to use the COS bucket that is created automatically when you launch the SQL Query web console for the first time. Click Launch SQL Query UI. In the SQL Query UI, copy the value shown under Target location to your clipboard:

Getting an API key

In the IBM Cloud console, navigate to Manage > Access (IAM):

Select API keys and then click Create an IBM Cloud API key:

Specify a descriptive name for your new API key and click Create:

The API key is created and can immediately be copied to the clipboard or downloaded to your local disk. Note that you will not be able to retrieve the value of the API key after you close this dialog, so make sure to download or save it at this point:
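To illustrate how the three connection values fit together, here is a minimal sketch that assumes you use the ibmcloudsql Python SDK for SQL Query. The CRN, target location and API key shown are placeholders; replace them with the values from your own account:

```python
import ibmcloudsql

# Placeholder values -- use the CRN, target location and API key from your own account
api_key = "<your IBM Cloud API key>"
instance_crn = "crn:v1:bluemix:public:sql-query:us-south:a/<account-id>:<instance-id>::"
target_cos_url = "cos://us-south/<your-results-bucket>/connection-test/"

# The client takes exactly the three values that the connection dialog asks for
sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url=target_cos_url)
sql_client.logon()

# Run a trivial query to verify that the CRN, API key and target location are valid
result_df = sql_client.run_sql("SELECT 1 AS connection_test")
print(result_df)
```

The same three values are all that the Cloud Pak for Data connection dialog needs.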

Importing a single data asset from a Cloud Data Lake

In your project, click Add to project and select asset type Connected data:

In the following dialog, click Select source, select your new Cloud Data Lake connection from the list of connections and select a table that you want to import. After a few seconds (once a sample query that is executed in the background has finished), you can click Select.

Note that the table catalog of a fresh instance of SQL Query is empty. In this case, no tables are available to import. You first need to populate the catalog with the table definitions for the data that makes up your data lake. See the section Creating tables in Cloud Data Lake at the end of this blog post:
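If you want to check programmatically which tables your instance's catalog already contains, a short sketch along the same lines (again with placeholder credentials, and assuming the catalog's SHOW TABLES statement returns its result through the normal job mechanism) could look like this:

```python
import ibmcloudsql

# Placeholder credentials -- the same values as in the connection dialog
sql_client = ibmcloudsql.SQLQuery(
    "<your IBM Cloud API key>",
    "<your SQL Query instance CRN>",
    target_cos_url="cos://us-south/<your-results-bucket>/catalog-check/",
)
sql_client.logon()

# Lists the tables registered in this instance's catalog;
# an empty result means there is nothing to import yet
tables_df = sql_client.run_sql("SHOW TABLES")
print(tables_df)
```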

Provide a name for the new data asset and click Create:

After the data asset has been created successfully, you can navigate to it and, for instance, preview its content:

You can also generate a data profile in order to get a statistical overview of the content of a data asset:

Batch importing multiple data assets from a Cloud Data Lake

If you already have a cloud data lake set up with a larger data model of tables, you can import all of its content in one operation using the Metadata Import function.

In your project, click Add to project and select asset type Metadata Import:

Provide a name for the import job, select your connection to the Cloud Data Lake as the source and then click Next:

Select all tables that you want to import. Note that a background query is submitted for each table that you select. Then, click Choose:

Review the batch import details before clicking Create to run the import job:

After a short while, you will see all selected Cloud Data Lake tables as data assets in your project:

Creating tables in Cloud Data Lake

Each instance of IBM Cloud SQL Query provides its own table catalog. You can create your own tables with SQL DDL statements, very similar to how you would in any relational database system. You can find more details about the metastore catalog in the SQL Query documentation. In addition, you will find a few ready-to-use DDL examples right inside the SQL Query UI in the Samples menu. You can create tables by explicitly specifying the list of columns and their data types. SQL Query also allows you to infer the table schema automatically from the data on IBM Cloud Object Storage; in that case, you just have to specify the location of the data:
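As an illustration, the following sketch registers two tables through the ibmcloudsql Python SDK, one with an explicit column list and one with a schema inferred from Parquet data. The table names, columns and COS locations are placeholders for your own data lake layout; check the DDL examples in the Samples menu for the exact options available:

```python
import ibmcloudsql

# Placeholder credentials and locations -- adjust to your own account and buckets
sql_client = ibmcloudsql.SQLQuery(
    "<your IBM Cloud API key>",
    "<your SQL Query instance CRN>",
    target_cos_url="cos://us-south/<your-results-bucket>/ddl/",
)
sql_client.logon()

# Table with an explicitly specified schema over CSV data
create_customers = """
CREATE TABLE customers (
  customer_id STRING,
  company_name STRING,
  country STRING
)
USING CSV
LOCATION cos://us-geo/<your-datalake-bucket>/customers/
"""

# Table whose schema is inferred from the Parquet data itself
create_orders = """
CREATE TABLE orders
USING PARQUET
LOCATION cos://us-geo/<your-datalake-bucket>/orders/
"""

for ddl in (create_customers, create_orders):
    job_id = sql_client.submit_sql(ddl)       # submit the DDL statement as a job
    status = sql_client.wait_for_job(job_id)  # blocks until the job finishes
    print(job_id, status)                     # 'completed' once the table is registered
```

Once these DDL jobs complete, the new tables appear in the catalog and can be imported into Cloud Pak for Data as described above.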
