IBM Cloud Pak for Data now comes with an integrated connector to data lakes on IBM Cloud Object Storage.
IBM Cloud provides a fully managed data lake platform for data stored on IBM Cloud Object Storage. The central service of this platform is IBM Cloud SQL Query, which provides streaming-based data ingestion, data preparation, data analytics and an integrated relational table catalog.
IBM Cloud Pak® for Data as a Service, a managed platform for a broad range of data analytics and artificial intelligence (AI) workloads, now comes with a built-in connector to the IBM Cloud SQL Query service. This allows you to set up connections to data lakes on IBM Cloud Object Storage and import the data lake's tables as assets into projects and catalogs in Cloud Pak for Data.
You can create and use these cloud data lake connections like any other data source connection in Cloud Pak for Data: in your own projects, in a shared catalog or in a deployment space. This post walks through creating a cloud data lake connection in a project.
Adding a cloud data lake connection
Navigate to one of your projects and click Add to project:
Select Connection as the asset type to add:
For connection type, select SQL Query and then click Select:
Fill out the connection properties for your SQL Query service instance. See below for details on how to identify the values for your account:
Finding the CRN of the IBM Cloud SQL Query service
Navigate to the IBM Cloud console for your services and select the SQL Query instance to which you want to connect. In the service dashboard, you’ll find the CRN of the instance, as seen in the following screenshot:
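For orientation, a CRN for a SQL Query instance has the general shape sketched below. This is illustrative only; every segment is a placeholder for the value shown in your own service dashboard:

```python
# Illustrative CRN shape only -- region, account ID and instance ID
# are placeholders for the values from your service dashboard.
instance_crn = "crn:v1:bluemix:public:sql-query:us-south:a/<account-id>:<instance-id>::"
```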
Identifying the target Cloud Object Storage
IBM Cloud SQL Query persists query results to a defined IBM Cloud Object Storage (COS) location, which is why you must specify this location now. You can select any COS bucket to which you have write access. A recommended best practice is to use the COS bucket that is created automatically when you launch the SQL Query web console for the first time. Click Launch SQL Query UI. In the SQL Query UI, you can copy the value in Target location to your clipboard:
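The value you copy is a cos:// URL that names a COS endpoint (or its region alias), a bucket and an optional prefix. An illustrative sketch, with all parts as placeholders:

```python
# Illustrative target location only -- region, bucket name and
# prefix are placeholders for the values from your own account.
target_cos_url = "cos://us-south/sql-<instance-id>/result/"
```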
Getting an API key
In IBM Cloud console, navigate to Manage > Access (IAM):
Select API keys and then click Create an IBM Cloud API key:
Specify a descriptive name for your new API key and click Create:
The API key is created and can immediately be copied to the clipboard or downloaded to your local disk. Note that you will not be able to retrieve the value of the API key after you close this dialog, so make sure to copy or download it at this point:
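With the CRN, target location and API key in hand, you can optionally verify them outside of Cloud Pak for Data. The following is a minimal sketch using the ibmcloudsql Python client (installable with pip install ibmcloudsql); all credential values are placeholders to be replaced with the values you collected above:

```python
import ibmcloudsql

# Placeholders -- substitute the three values collected in the steps above.
api_key = "<your-ibm-cloud-api-key>"
instance_crn = "crn:v1:bluemix:public:sql-query:us-south:a/<account-id>:<instance-id>::"
target_cos_url = "cos://us-south/sql-<instance-id>/result/"

# Create a client for the SQL Query instance and authenticate.
sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)
sql_client.logon()

# A trivial query: if this prints a one-row pandas DataFrame, the
# CRN, target location and API key are all working.
print(sql_client.run_sql("SELECT 1 AS connectivity_check"))
```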
Importing a single data asset from a Cloud Data Lake
In your project, click Add to project and select asset type Connected data:
In the following dialog, click Select source, select your new cloud data lake connection from the list of connections, and select a table that you want to import. After a few seconds (once a sample query that runs in the background has finished), you can click Select.
Note that the table catalog of a fresh SQL Query instance is empty. In this case, no tables are available to import; you first need to populate the catalog with table definitions for the data that makes up your data lake. See the section Creating tables in a Cloud Data Lake at the end of this blog post:
Provide a name for the new data asset and click Create:
After the data asset is created successfully, you can navigate to it and, for instance, preview its content:
You can also generate a data profile in order to get a statistical overview of the content of a data asset:
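If you prefer to work programmatically, you can reproduce a similar preview and a rough profile directly against SQL Query. A sketch, continuing with the sql_client from the connectivity check above; the table name customer_data is hypothetical:

```python
# Assumes the sql_client from the earlier sketch; customer_data is a
# hypothetical table registered in this instance's catalog.
preview_df = sql_client.run_sql("SELECT * FROM customer_data LIMIT 100")
print(preview_df.head(10))               # quick preview of the content

# A rough pandas analogue of the data profile's statistical overview:
print(preview_df.describe(include="all"))
```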
Batch importing multiple data assets from a Cloud Data Lake
If you already have a cloud data lake set up with a larger data model of tables, you can import all of its content in a single operation using the Metadata Import function.
In your project, click Add to project and select asset type Metadata Import:
Provide a name for the import job and select your connection to the Cloud Data Lake as source and then click Next:
Select all tables that you want to import. Note that there is a background query submitted for each table that you select. Then, click Choose:
Review the batch import details before clicking Create to run the import job:
After a short while, you will see all selected Cloud Data Lake tables as data assets in your project:
Creating tables in a Cloud Data Lake
Each instance of IBM Cloud SQL Query provides its own table catalog. You can create your own tables with SQL DDL statements, much as you would in any relational database system. You can find more details about the metastore catalog in the SQL Query documentation, and you will find a few ready-to-use DDL examples right inside the SQL Query UI in the Samples menu. You can create tables by explicitly specifying the list of columns and their data types, or you can let SQL Query infer the table schema automatically from the data on IBM Cloud Object Storage; in the latter case, you only have to specify the location of the data:
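As a hedged illustration of the two variants, here are sketches submitted through the same Python client as above. Table names, columns and the bucket are all placeholders, and the DDL follows the Spark SQL-style syntax used in the SQL Query samples; consult the documentation for the exact options your data needs:

```python
# Variant 1: explicit schema -- list the columns and their types,
# then point the table at its data on Cloud Object Storage.
job_id = sql_client.submit_sql("""
    CREATE TABLE customer_data (
        customer_id INT,
        name        STRING,
        signup_date DATE
    ) USING CSV
    LOCATION cos://us-south/<data-bucket>/customers/
""")
print(sql_client.wait_for_job(job_id))   # 'completed' on success

# Variant 2: inferred schema -- for self-describing formats such as
# Parquet, omit the column list and only specify the data location.
job_id = sql_client.submit_sql("""
    CREATE TABLE order_data
    USING PARQUET
    LOCATION cos://us-south/<data-bucket>/orders/
""")
print(sql_client.wait_for_job(job_id))
```

Once these DDL jobs complete, the new tables appear in the instance's catalog and can be imported into your Cloud Pak for Data project as described above.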
Further resources
- IBM Cloud SQL Query
- IBM Cloud Object Storage
- IBM Cloud Pak for Data as a Service
- Tutorial notebook in Watson Studio for working with IBM Cloud SQL Query
- Stream data ingestion from Kafka into Cloud Data Lake
- Serverless Cloud Data Lakes at Subsurface Winter 2021 Cloud Data Lake conference
- Big Data The Cloud-Native Way at IBM Cloud Native Day conference 2021
- JDBC Connectivity to IBM Cloud SQL Query