
How To Consolidate PostgreSQL Data to IBM Cloud Object Storage for IBM Cloud SQL Query


Data that is infrequently accessed but kept in a database for long-term storage can become costly, whether the database is on premises or in the cloud. Consolidating this data into a durable and highly reliable storage service like IBM Cloud Object Storage (COS) not only reduces costs, but also provides a flexible, durable, and scalable solution for storing all sorts of unstructured data.

With the introduction of IBM Cloud SQL Query, you no longer have to worry about restoring data into your database when you need access to it. Using the SQL Query UI or the client libraries for Python or Node.js, you can use ANSI SQL, Spark SQL functions, and a growing geospatial toolkit to query, explore, and analyze CSV, JSON, and Apache Parquet data stored in COS for a fraction of what it costs to keep it in cold storage within a database.

This tutorial focuses on PostgreSQL, but the workflow applies to any database that can export CSV. It can be used to cut on-premises and cloud hardware/license costs, as well as semi-retire expensive databases. By leveraging SQL Query and COS, you still have quick, serverless SQL access to data for historical analysis, auditing, and data science at a commodity price point.

This tutorial covers the following steps:

  • Copy PostgreSQL tables into CSV files
  • Create the COS service, a COS bucket, and upload CSV files to COS
  • Create COS service keys to use with other applications
  • Transform COS CSV objects to Parquet using the csvtoparquet command-line tool
  • Transform COS CSV objects to Parquet using IBM Cloud Functions

Copying Data to CSV from PostgreSQL

The first step to consolidating data is figuring out what sort of data you want to extract. I’ll leave that up to you. But, for this example, let’s say that you want to extract all of the data from a particular table in PostgreSQL to CSV. You can do this with PostgreSQL’s psql client. If you don’t already have PostgreSQL installed, you can install it with brew install postgresql if you are running macOS with Homebrew. For other platforms, check out the PostgreSQL documentation.

With PostgreSQL installed, use psql to connect to the database and copy data from your tables with the COPY command. A SELECT statement inside COPY chooses the data you want to export, and the output is written from standard output (stdout) to a new CSV file on your machine.

psql -U user -d db_name -h host -p port -c "COPY (SELECT * FROM your_table) TO STDOUT WITH CSV HEADER DELIMITER ',';" > my_table.csv

If you’re using a cloud provider like IBM Compose for PostgreSQL, it’s a similar command, except you’re connecting to the database using your Compose credentials like this:

PGPASSWORD=NOTAPASSWORD1234 psql "sslmode=require host=sl-us-south-1-portal.99.dblayer.com port=99999 dbname=db_name user=user_name" -c "COPY (SELECT * FROM your_table) TO STDOUT WITH CSV HEADER DELIMITER ',';" > your_table.csv

This SELECT * statement copies everything from a table. If you only want a portion of a table, or data from several tables, you can run any SQL query and it will create a CSV file from the output of that query.
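
For example, a query that exports only part of a table might look like the following; the orders table, order_date column, and date filter are placeholders for your own schema:

psql -U user -d db_name -h host -p port -c "COPY (SELECT * FROM orders WHERE order_date < '2018-01-01') TO STDOUT WITH CSV HEADER DELIMITER ',';" > orders_archive.csv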

The other COPY options used here are HEADER (which writes the column names as the first row of the CSV) and DELIMITER (which separates each column’s data with a comma ,).

Once you have the CSV files, you can then start storing them in COS.

Setting up IBM Cloud Object Storage

Creating a COS service on IBM Cloud can be done either in the UI or using the ibmcloud CLI. We’ll use the CLI for our example. The instructions for downloading the CLI and logging in to your IBM Cloud account are found in the documentation.

Now that you are logged in, let’s provision the COS service by using the following command:

ibmcloud resource service-instance-create <instance-name> cloud-object-storage <plan> global

There are a number of plans to choose from, but use lite here because it’s free. The instance-name can be any name you choose, and it will be the name of the COS service instance that you’ll use for the rest of the tutorial.
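
For example, with the instance name my-cos (which appears in the output below) and the lite plan, the command looks like this:

ibmcloud resource service-instance-create my-cos cloud-object-storage lite global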

Running the command to create COS, you’ll get something like this:

Service instance my-cos was created.
Name      Location   State    Type               Tags   
my-cos   global     active   service_instance    

Now that COS is provisioned, we can start putting the CSV files that you’ve saved into a COS bucket. Using the COS UI, you can easily drag and drop files into a bucket. If you’re logged in to your IBM Cloud account in your browser, click on the COS service you provisioned from your Cloud dashboard, then click the Create bucket button to create a bucket.

COS UI

Use the default settings for the bucket and type in a new name. The bucket name must be unique across all COS buckets on IBM Cloud. When the bucket is created, you can start dragging and dropping your CSV files into the bucket. When they have been uploaded, they’ll be objects in your COS bucket which can then be transformed into Parquet.

COS UI Create bucket
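
If you’d rather script the upload than drag and drop files, here is a minimal sketch using the ibm-cos-sdk Python package (ibm_boto3); the API key, service instance CRN, bucket name, and file names are placeholders you would replace with your own values:

import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials -- replace with your own values
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<api_key>",
    ibm_service_instance_id="<cos_instance_crn>",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3-api.us-geo.objectstorage.softlayer.net",
)

# Upload each CSV export as an object in the bucket
for csv_file in ["table1.csv", "table2.csv", "table3.csv"]:
    cos.upload_file(Filename=csv_file, Bucket="new-bucket-2018", Key=csv_file)
    print("Uploaded", csv_file)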

Transforming CSV objects to Parquet

There are two ways to transform your CSV objects to Parquet using your existing COS bucket. The first is the Python csvtoparquet CLI tool that I’ve developed, which you can install with pip install csvtoparquet. csvtoparquet uses an IBM Cloud API key or a service IAM key, the COS service endpoint URI, the COS bucket name where the CSV objects are located, the names of the objects to be converted, and the new names of the objects as Parquet objects.

To use the tool, run it like this in your terminal:

csvtoparquet -a api-key -e s3-api.us-geo.objectstorage.softlayer.net -b new-bucket-2018 -f original-object.csv -n new-object

For a complete description of the options available, type csvtoparquet --help.

The endpoint for -e can be found in the COS UI: select Endpoint from the left-side menu to see the endpoints for the region where your COS service is located. If you used us-geo as your COS location, the endpoint is s3-api.us-geo.objectstorage.softlayer.net.

-b is the bucket name that contains the CSV objects, -f takes the names of the objects to convert, and -n the names the transformed objects will have. .parquet will be appended automatically to the new object names.

To get a cloud service key (Cloud IAM key) to interact with COS and serve as the API key needed for csvtoparquet, you can provision one in the COS UI or use the terminal. From the terminal, run the following:

ibmcloud resource service-key-create key-example <role> --instance-name my-cos 

For the role, you can assign one of Writer, Reader, Manager, Administrator, Operator, Viewer, or Editor (which correspond to IAM roles). You can read about the different user roles in the IBM Cloud IAM access documentation. For now, assign the Writer role, which will be sufficient.
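
With the Writer role and the my-cos instance from earlier, the full command looks like this:

ibmcloud resource service-key-create key-example Writer --instance-name my-cos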

Running that command, you’ll get the following information:

Name:          key-example   
ID:            ...   
Created At:    Mon Aug 20 10:00:53 UTC 2018   
State:         active   
Credentials:                                   
               iam_apikey_description:   ...      
               iam_apikey_name:          ...               
               iam_role_crn:             crn:v1:bluemix:public:iam::::serviceRole:Writer      
               iam_serviceid_crn:        ...      
               resource_instance_id:     ...      
               apikey:                   <api_key>      
               endpoints:                ....

The important piece of information here is the apikey, which you will use for csvtoparquet. Insert this key after the -a flag, as in the example above. After that, when you run the tool and it finishes processing the objects, you will see the new objects in your COS bucket.

There are a couple of caveats to using this tool, however. First, the tool is still in beta and we are working out issues as they arise (and we welcome feedback). Second, it uses your machine’s memory to do the processing. If you have large files, it will consume local resources to fetch the objects from object storage, transform them to Parquet on your machine, and stream the data back to object storage as Parquet objects. Nothing is stored on your machine; it only performs the processing there.

There is, however, another way. I’ve been working on a serverless solution that lets all the transformation happen using IBM Cloud resources for free. My Docker image abdullahalger/csvtoparquet-convert uses IBM Cloud Functions to run the CSV-to-Parquet conversion of your data in COS without tying up your machine’s resources.

Let’s look at setting up IBM Cloud Functions.

Using IBM Cloud Functions to transform CSV to Parquet
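
If you haven’t used IBM Cloud Functions from the CLI before, you may first need to install the Cloud Functions plugin (a one-time step):

ibmcloud plugin install cloud-functions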

In order to isolate various cloud functions from others you might create later, we’ll create a package, which is basically a namespace for your serverless functions:

ibmcloud fn package create <package_name>   

Once that’s done, we bind the credentials from our COS service (which include our API key) to the package so that we don’t need to insert the key manually as a parameter, as we did with the csvtoparquet tool.

ibmcloud fn service bind cloud-object-storage <package_name> --instance my-cos --keyname key-example   

In the above, --instance is your COS service instance name, and --keyname is the COS service key you created earlier that contains the API key. Where it says package_name, substitute the cloud function package name you just created.

Then, you can check to see if these credentials are bound to the package using the following:

ibmcloud fn package get <package_name>

If all goes well, you won’t need to supply your API key when you run the function to transform your objects. Next, we’ll create a cloud function action, which will run the conversion. I’m setting a generous memory limit and timeout for the action so that we get the best performance when converting large objects. In addition, we’ll use the Docker image abdullahalger/csvtoparquet-convert so that the necessary code and libraries are configured and used.

ibmcloud fn action create cos-objects/convert --docker abdullahalger/csvtoparquet-convert --timeout 600000 --memory 512

In order to run the action, we’ll need to specify the parameters for converting objects. These include the COS endpoint, the COS bucket name, the names of the objects to convert, and their new names after they’re stored as Parquet objects. You can do this using a JSON file like the following, which I’ve named params.json:

{
    "endpoint": "s3-api.us-geo.objectstorage.softlayer.net",
    "bucket": "new-bucket-2018",
    "convert_objs": ["table1.csv", "table2.csv", "table3.csv"],
    "new_names": ["new/table1", "new/table2", "new/table3"]
}

The JSON is relatively straightforward. For new_names, I am using the prefix new/ to place the Parquet objects in a new namespace, which will help me if I want to query the entire dataset in SQL Query later and separate the Parquet files from others in the bucket. This is optional, of course, but useful.

Now, let’s use the JSON file as a parameter to run the action:

ibmcloud fn action invoke cos-objects/convert -r -P params.json 

Once the conversion is finished, you will see the following response and the new Parquet objects in your bucket:

{
    "converted_objects": "done"
}

The caveat of using the cloud function is that, with the action’s memory set to 512 MB, it will only allow you to convert objects that are less than 50 MB each. So, if your objects are over 50 MB each, I recommend using the csvtoparquet tool instead to convert to Parquet.

Start querying

With the Parquet objects in your bucket, you can use SQL Query to start querying them, either through the SQL Query UI or programmatically using Python or Node.js. We have put together several resources that cover the different types of queries you can run with SQL Query. Check them out and try using SQL Query and COS to consolidate and query your data in the cloud.
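
As a quick sketch of the programmatic route, the ibmcloudsql Python package can run a query against the new Parquet objects. The API key, SQL Query instance CRN, and result prefix below are placeholders; the bucket and new/ prefix match the earlier examples:

import ibmcloudsql

# Placeholder credentials -- replace with your own values
api_key = "<api_key>"
instance_crn = "<sql_query_instance_crn>"
target_cos_url = "cos://us-geo/new-bucket-2018/results/"

sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)
sql_client.logon()

# Query the consolidated Parquet objects under the new/ prefix
result = sql_client.run_sql(
    "SELECT * FROM cos://us-geo/new-bucket-2018/new/ STORED AS PARQUET LIMIT 10"
)
print(result)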

Software Engineer and Evangelist - IBM Cloud Databases
