June 24, 2021 By Torsten Steinbach 6 min read

IBM Cloud Pak for Data now comes with an integrated connector to data lakes on IBM Cloud Object Storage.

IBM Cloud provides a fully managed data lake platform for data stored on IBM Cloud Object Storage. The central service of this platform is IBM Cloud SQL Query, which provides data ingestion (streaming-based), data preparation, data analytics and an integrated relational table catalog.

IBM Cloud Pak® for Data as a Service — a managed platform for broad range of data analytics and artificial intelligence (AI) — now comes with a built-in connector to the IBM Cloud SQL Query service. This allows you to set up connections to data lakes on IBM Cloud Object Storage and import the data lake table’s assets into projects and catalogs in Cloud Pak for Data.

You can create and use these cloud data lake connections like any other data source connection in Cloud Pak for Data. You can create them in your own projects, in a shared catalog and in a deployment. This post provides a short description of creating a cloud data lake connection in a project.

Adding a cloud data lake connection

Navigate to one of your projects and click Add to project:

Select Connection as the asset type to add:

For connection type, select SQL Query and then click Select:

Fill out the connection properties for your SQL Query service instance. See below for details on how to identify the values for your account:

Finding the CRN of the IBM Cloud SQL Query service

Navigate to the IBM Cloud console for your services and select the SQL Query instance to which you want to connect. In the service dashboard, you’ll find the CRN of the instance, as seen in the following screenshot:

Identifying the target Cloud Object Storage

IBM Cloud SQL Query uses a defined IBM Cloud Object Storage (COS) location to persist query results, and that’s why you must specify this location now. You can select any COS bucket to which you have write access. A recommended best practice is that you specify the COS bucket that is being created when you launch the SQL Query web console for the first time. Click Launch SQL Query UI. In the SQL Query UI, you can copy the value in Target location to your clipboard:

Getting an API key

In IBM Cloud console, navigate to Manage > Access (IAM):

Select API keys and then click Create an IBM Cloud API key:

Specify a revealing name for your new API key and click Create:

The API key is created and can immediately be copied to clipboard or downloaded to your local disk. Note that you will not be able to retrieve the actual value of the API key after you close this dialog, so make sure to download or save it otherwise at this point:

Importing a single data asset from Cloud Data Lake

In your project, click Add to project and select asset type Connected data:

In the following dialog, click Select source, select your new Cloud Data Lake connection from the list of connections and select a table that you want to import. After a few seconds (after a sample query finishes that is executed in the background), you can click Select.

Note the table catalog of a fresh instance if SQL Query is empty. In this case, no table is available to import. You first need to populate it with the table definitions for the data that make up your data lake. See the section Creating tables in Cloud Data Lake at the end of this blog post:

Provide a name for the new data asset and click Create:

After successful creation of the data asset, you can navigate to it and, for instance, do a preview of its content:

You can also generate a data profile in order get a statistical overview of the content of a data asset:

Batch importing multiple data assets from a Cloud Data Lake

If you have a cloud data lake set up already with a larger data model of tables, you can import all of its content in one operation using the Metadata Import function.

In your project, click Add to project and select asset type Metadata Import:

Provide a name for the import job and select your connection to the Cloud Data Lake as source and then click Next:

Select all tables that you want to import. Note that there is a background query submitted for each table that you select. Then, click Choose:

Review the batch import details before clicking Create to run the import job:

After a short while, you will see the all selected Cloud Data Lake tables as data assets in your project:

Creating tables in Cloud Data Lake

Each instance of IBM Cloud SQL Query provides its own table catalog. You can create your own tables with DDL SQL statements, very similar to how you do that in any relational database system. You can find more details about the metastore catalog in the SQL Query documentation. In addition, you will find a few ready-to-use DDL examples right inside the SQL Query UI in the Samples menu. You can create tables by explicitly specifying the list of columns and their data types. SQL Query also allows you to automatically infer the schema of the table from the data on IBM Cloud Object Storage. You then just have to specify the location of the data:

Further resources

Was this article helpful?
YesNo

More from Cloud

Attention new clients: exciting financial incentives for VMware Cloud Foundation on IBM Cloud

4 min read - New client specials: Get up to 50% off when you commit to a 1- or 3-year term contract on new VCF-as-a-Service offerings, plus an additional value of up to USD 200K in credits through 30 June 2025 when you migrate your VMware workloads to IBM Cloud®.1 Low starting prices: On-demand VCF-as-a-Service deployments begin under USD 200 per month.2 The IBM Cloud benefit: See the potential for a 201%3 return on investment (ROI) over 3 years with reduced downtime, cost and…

24 IBM offerings winning TrustRadius 2024 Top Rated Awards

2 min read - TrustRadius is a buyer intelligence platform for business technology. Comprehensive product information, in-depth customer insights and peer conversations enable buyers to make confident decisions. “Earning a Top Rated Award means the vendor has excellent customer satisfaction and proven credibility. It’s based entirely on reviews and customer sentiment,” said Becky Susko, TrustRadius, Marketing Program Manager of Awards. Top Rated Awards have to be earned: Gain 10+ new reviews in the past 12 months Earn a trScore of 7.5 or higher from…

IBM Tech Now: April 8, 2024

< 1 min read - ​Welcome IBM Tech Now, our video web series featuring the latest and greatest news and announcements in the world of technology. Make sure you subscribe to our YouTube channel to be notified every time a new IBM Tech Now video is published. IBM Tech Now: Episode 96 On this episode, we're covering the following topics: IBM Cloud Logs A collaboration with IBM watsonx.ai and Anaconda IBM offerings in the G2 Spring Reports Stay plugged in You can check out the…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters