June 24, 2021 By Torsten Steinbach 6 min read

IBM Cloud Pak for Data now comes with an integrated connector to data lakes on IBM Cloud Object Storage.

IBM Cloud provides a fully managed data lake platform for data stored on IBM Cloud Object Storage. The central service of this platform is IBM Cloud SQL Query, which provides data ingestion (streaming-based), data preparation, data analytics and an integrated relational table catalog.

IBM Cloud Pak® for Data as a Service — a managed platform for broad range of data analytics and artificial intelligence (AI) — now comes with a built-in connector to the IBM Cloud SQL Query service. This allows you to set up connections to data lakes on IBM Cloud Object Storage and import the data lake table’s assets into projects and catalogs in Cloud Pak for Data.

You can create and use these cloud data lake connections like any other data source connection in Cloud Pak for Data. You can create them in your own projects, in a shared catalog and in a deployment. This post provides a short description of creating a cloud data lake connection in a project.

Adding a cloud data lake connection

Navigate to one of your projects and click Add to project:

Select Connection as the asset type to add:

For connection type, select SQL Query and then click Select:

Fill out the connection properties for your SQL Query service instance. See below for details on how to identify the values for your account:

Finding the CRN of the IBM Cloud SQL Query service

Navigate to the IBM Cloud console for your services and select the SQL Query instance to which you want to connect. In the service dashboard, you’ll find the CRN of the instance, as seen in the following screenshot:

Identifying the target Cloud Object Storage

IBM Cloud SQL Query uses a defined IBM Cloud Object Storage (COS) location to persist query results, and that’s why you must specify this location now. You can select any COS bucket to which you have write access. A recommended best practice is that you specify the COS bucket that is being created when you launch the SQL Query web console for the first time. Click Launch SQL Query UI. In the SQL Query UI, you can copy the value in Target location to your clipboard:

Getting an API key

In IBM Cloud console, navigate to Manage > Access (IAM):

Select API keys and then click Create an IBM Cloud API key:

Specify a revealing name for your new API key and click Create:

The API key is created and can immediately be copied to clipboard or downloaded to your local disk. Note that you will not be able to retrieve the actual value of the API key after you close this dialog, so make sure to download or save it otherwise at this point:

Importing a single data asset from Cloud Data Lake

In your project, click Add to project and select asset type Connected data:

In the following dialog, click Select source, select your new Cloud Data Lake connection from the list of connections and select a table that you want to import. After a few seconds (after a sample query finishes that is executed in the background), you can click Select.

Note the table catalog of a fresh instance if SQL Query is empty. In this case, no table is available to import. You first need to populate it with the table definitions for the data that make up your data lake. See the section Creating tables in Cloud Data Lake at the end of this blog post:

Provide a name for the new data asset and click Create:

After successful creation of the data asset, you can navigate to it and, for instance, do a preview of its content:

You can also generate a data profile in order get a statistical overview of the content of a data asset:

Batch importing multiple data assets from a Cloud Data Lake

If you have a cloud data lake set up already with a larger data model of tables, you can import all of its content in one operation using the Metadata Import function.

In your project, click Add to project and select asset type Metadata Import:

Provide a name for the import job and select your connection to the Cloud Data Lake as source and then click Next:

Select all tables that you want to import. Note that there is a background query submitted for each table that you select. Then, click Choose:

Review the batch import details before clicking Create to run the import job:

After a short while, you will see the all selected Cloud Data Lake tables as data assets in your project:

Creating tables in Cloud Data Lake

Each instance of IBM Cloud SQL Query provides its own table catalog. You can create your own tables with DDL SQL statements, very similar to how you do that in any relational database system. You can find more details about the metastore catalog in the SQL Query documentation. In addition, you will find a few ready-to-use DDL examples right inside the SQL Query UI in the Samples menu. You can create tables by explicitly specifying the list of columns and their data types. SQL Query also allows you to automatically infer the schema of the table from the data on IBM Cloud Object Storage. You then just have to specify the location of the data:

Further resources

Was this article helpful?
YesNo

More from Cloud

A clear path to value: Overcome challenges on your FinOps journey 

3 min read - In recent years, cloud adoption services have accelerated, with companies increasingly moving from traditional on-premises hosting to public cloud solutions. However, the rise of hybrid and multi-cloud patterns has led to challenges in optimizing value and controlling cloud expenditure, resulting in a shift from capital to operational expenses.   According to a Gartner report, cloud operational expenses are expected to surpass traditional IT spending, reflecting the ongoing transformation in expenditure patterns by 2025. FinOps is an evolving cloud financial management discipline…

IBM Power8 end of service: What are my options?

3 min read - IBM Power8® generation of IBM Power Systems was introduced ten years ago and it is now time to retire that generation. The end-of-service (EoS) support for the entire IBM Power8 server line is scheduled for this year, commencing in March 2024 and concluding in October 2024. EoS dates vary by model: 31 March 2024: maintenance expires for Power Systems S812LC, S822, S822L, 822LC, 824 and 824L. 31 May 2024: maintenance expires for Power Systems S812L, S814 and 822LC. 31 October…

24 IBM offerings winning TrustRadius 2024 Top Rated Awards

2 min read - TrustRadius is a buyer intelligence platform for business technology. Comprehensive product information, in-depth customer insights and peer conversations enable buyers to make confident decisions. “Earning a Top Rated Award means the vendor has excellent customer satisfaction and proven credibility. It’s based entirely on reviews and customer sentiment,” said Becky Susko, TrustRadius, Marketing Program Manager of Awards. Top Rated Awards have to be earned: Gain 10+ new reviews in the past 12 months Earn a trScore of 7.5 or higher from…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters