September 17, 2018 | Written by: Dr. Abdullah Alger
Categorized: Compute Infrastructure | How-tos
Consolidate PostgreSQL data to IBM Cloud Object Storage for IBM Cloud SQL Query
Infrequently accessed data kept in a database for long-term storage can become costly, whether the database is on premises or in the cloud. Consolidating this data into a durable and highly reliable data storage service like IBM Cloud Object Storage (COS) will not only reduce costs, but also provide a flexible, durable, and scalable solution for storing all sorts of unstructured data.
With the introduction of IBM Cloud SQL Query, you will no longer have to worry about restoring data back into your database if you need access to it. Using the SQL Query UI or libraries for Python or Node.js, you can use ANSI SQL, Spark SQL functions, and a growing geospatial toolkit to query, explore, and analyze CSV, JSON, and Apache Parquet data stored in COS for a fraction of what it costs to maintain it in cold storage within a database.
This tutorial focuses on PostgreSQL, but it works for any database technology that allows for CSV outputs. The workflow can be used to consolidate on-premises and cloud hardware/license costs, as well as semi-retire expensive databases. By leveraging SQL Query and COS, you will still have quick serverless SQL access to data for historical analysis, auditing, and data science at a commodity price point.
This tutorial covers the following steps:
- Copy PostgreSQL tables into CSV files
- Create the COS service, a COS bucket, and upload CSV files to COS
- Create COS service keys to use with other applications
- Transform COS CSV objects to Parquet using the csvtoparquet command-line tool
- Transform COS CSV objects to Parquet using IBM Cloud Functions
Copying Data to CSV from PostgreSQL
The first step to consolidating data is figuring out what sort of data you want to extract. I'll leave that up to you. But for this example, let's say that you want to extract all of the data from a particular table in PostgreSQL to CSV. You can do this with PostgreSQL's psql client. You'll need PostgreSQL installed to use it; if you don't already have it, on macOS you can install it with brew install postgresql using Homebrew. For other platforms, check out the PostgreSQL documentation.
With PostgreSQL installed, use psql to copy the data from your tables by connecting to the database and running the COPY command. Wrap a SELECT statement inside COPY to pick the data you want from a particular table, and redirect standard output (stdout) to a new CSV file on your machine:
psql -U user -d db_name -h host -p port -c "COPY (SELECT * FROM your_table) TO STDOUT WITH CSV HEADER DELIMITER ',';" > my_table.csv
If you’re using a cloud provider like IBM Compose for PostgreSQL, it’s a similar command, except you’re connecting to the database using your Compose credentials like this:
PGPASSWORD=NOTAPASSWORD1234 psql "sslmode=require host=sl-us-south-1-portal.99.dblayer.com port=99999 dbname=db_name user=user_name" -c "COPY (SELECT * FROM your_table) TO STDOUT WITH CSV HEADER DELIMITER ',';" > your_table.csv
The SELECT * statement copies everything from the table. If you want only a portion of a table, or data from several tables, you can run any SQL query here, and the CSV file will be created from the output of that query. The other options used are HEADER, which writes the column names as the first row of the CSV, and DELIMITER, which separates each column's data with a comma.
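If you'd rather drive the export from a script than from the shell, the same COPY ... TO STDOUT approach works from Python with the psycopg2 library. Here's a minimal sketch; the connection details and table name are placeholders to replace with your own:

```python
# Minimal sketch: export a table to CSV using psycopg2's COPY support.
# All connection details and the table name below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="sl-us-south-1-portal.99.dblayer.com",  # placeholder host
    port=99999,
    dbname="db_name",
    user="user_name",
    password="NOTAPASSWORD1234",
    sslmode="require",
)

with conn, conn.cursor() as cur, open("your_table.csv", "w") as f:
    # copy_expert streams the query result as CSV (with a header row) to the file
    cur.copy_expert(
        "COPY (SELECT * FROM your_table) TO STDOUT WITH CSV HEADER DELIMITER ','",
        f,
    )
conn.close()
```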
Once you have the CSV files, you can then start storing them in COS.
Setting up IBM Cloud Object Storage
Creating a COS service on IBM Cloud can be done either in the UI or using the ibmcloud CLI. We'll use the CLI for our example. The instructions to download the CLI and log in to your IBM Cloud account are found in the documentation.
Now that you are logged in, let’s provision the COS service by using the following command:
ibmcloud resource service-instance-create <instance-name> cloud-object-storage <plan> global
For <plan>, there are a number of plans to choose from, but use lite here because it's free. The <instance-name> can be any name you choose; it will be the name of the COS service instance that you'll use for the rest of the tutorial.
Running the command to create COS, you’ll get something like this:
Service instance my-cos was created.
Name     Location   State    Type               Tags
my-cos   global     active   service_instance
Now that COS is provisioned, you can start putting the CSV files you saved into a COS bucket. Using the COS UI, you can easily drag and drop files into a bucket. If you're logged in to your IBM Cloud account in your browser, click on the COS service you provisioned on your Cloud dashboard, then click the Create bucket button to create a bucket.
Use the default settings for the bucket and type in a new name. The bucket name must be unique across all COS buckets on IBM Cloud. When the bucket is created, you can start dragging and dropping your CSV files into the bucket. When they have been uploaded, they’ll be objects in your COS bucket which can then be transformed into Parquet.
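Drag-and-drop works well for a handful of files, but the uploads can also be scripted. The sketch below uses the ibm-cos-sdk Python package (pip install ibm-cos-sdk); the API key, service instance CRN, endpoint, and bucket name are placeholders you'd swap for your own values:

```python
# Sketch: upload exported CSV files to a COS bucket with ibm-cos-sdk.
# All credential values below are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="api-key",                       # IAM API key (placeholder)
    ibm_service_instance_id="crn:v1:...:my-cos::",  # COS instance CRN (placeholder)
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3-api.us-geo.objectstorage.softlayer.net",
)

for filename in ["table1.csv", "table2.csv", "table3.csv"]:
    cos.upload_file(filename, "new-bucket-2018", filename)  # file, bucket, object key
```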
Transforming CSV objects to Parquet
There are two ways to transform your CSV objects to Parquet using your existing COS bucket. The first is to install the Python csvtoparquet CLI tool that I've developed, using pip install csvtoparquet.
csvtoparquet uses an IBM Cloud API key or a service IAM key, the COS service endpoint URI, the COS bucket name where the CSV objects are located, the names of the objects to be converted, and the new names of the objects as Parquet objects.
To use the tool, run it like this in your terminal:
csvtoparquet -a api-key -e s3-api.us-geo.objectstorage.softlayer.net -b new-bucket-2018 -f original-object.csv -n new-object
For a complete description of the available options, run the tool with its --help flag.
Getting the endpoint for -e can be done through the COS UI: select Endpoint on the left-side menu to see the endpoints for the region that your COS service is located in. If you used us-geo as your COS location, the endpoint is s3-api.us-geo.objectstorage.softlayer.net. As for the remaining flags, -b is the bucket name that contains the CSV objects, -f is the names of the objects to convert, and -n is the names that the transformed objects will have; .parquet will be appended automatically to the new objects.
To get a cloud service key (Cloud IAM key) to interact with COS and serve as the API key needed for csvtoparquet, you can provision one in the COS UI or use the terminal. From the terminal, run the following:
ibmcloud resource service-key-create key-example <role> --instance-name my-cos
For <role>, you can assign Writer, Reader, Manager, Administrator, Operator, Viewer, or Editor (these pertain to IAM roles); you can read about the different user roles in the IBM Cloud IAM access documentation. For now, assign the Writer role, which will be sufficient.
Running that command, you'll get the new key's details and credentials, including output like this:
Created At: Mon Aug 20 10:00:53 UTC 2018
The important piece of information in those credentials is the apikey, which you will use for csvtoparquet. Insert this key after the -a flag, as in the example above. When you run the tool and let it finish processing the objects, you will see the new objects in your COS bucket.
There are a couple of caveats to using this tool, however. First, the tool is in beta, and we are working out issues as they arise (we welcome feedback). Second, it uses your machine's memory to do the processing: for large files, it will consume local resources to fetch the objects from object storage, transform them to Parquet, and stream the data back to object storage as Parquet objects. Nothing is stored on your machine; it only does the processing there.
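If you're curious what that transformation boils down to, or want to roll your own for a one-off file, the core CSV-to-Parquet step is just a couple of lines of pandas with a Parquet engine. A stripped-down local sketch (assuming pip install pandas pyarrow), separate from the object-storage plumbing csvtoparquet handles for you:

```python
# Stripped-down sketch of the core conversion: read a CSV into a DataFrame,
# then write it back out as a columnar Parquet file.
import pandas as pd

df = pd.read_csv("my_table.csv")
df.to_parquet("my_table.parquet", engine="pyarrow")
```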
There is, however, another way. I've been working on a serverless solution that lets all of the transformation happen using IBM Cloud resources for free. My Docker image abdullahalger/csvtoparquet-convert uses IBM Cloud Functions to run the CSV-to-Parquet conversion of your data in COS without tying up your machine's resources.
Let’s look at setting up IBM Cloud Functions.
Using IBM Cloud Functions to transform CSV to Parquet
In order to isolate various cloud functions from others you might create later, we’ll create a package, which is basically a namespace for your serverless functions:
ibmcloud fn package create <package_name>
Once that's done, we bind the credentials from our COS service (our API key is included) to the package so that we don't need to insert it manually as a parameter, as we did with the csvtoparquet tool:
ibmcloud fn service bind cloud-object-storage <package_name> --instance my-cos --keyname key-example
In the above, --instance is your COS service instance name, and --keyname is the name of the COS service key you created above that holds the API key. Where it says <package_name>, substitute the cloud function package name you just created.
Then, you can check to see if these credentials are bound to the package using the following:
ibmcloud fn package get <package_name>
If all goes well, you won’t have to remember your API key to run the function to transform your objects now. Next, we’ll need to create a cloud function action. This will run the conversion. I am maximizing the memory and timeout for the action so that we can get the best performance when converting large objects. In addition, we’ll make sure to use the Docker image abdullahalger/csvtoparquet-convert so that the necessary code and libraries will be configured and used.
ibmcloud fn action create cos-objects/convert --docker abdullahalger/csvtoparquet-convert --timeout 600000 --memory 512
In order to run the action, we'll need to specify the parameters for converting objects. These include the COS endpoint, the COS bucket name, the names of the objects to convert, and their new names after they're stored as Parquet objects. You can do this using a JSON file like the following, which I've named params.json:
"convert_objs": ["table1.csv", "table2.csv", "table3.csv"],
"new_names": ["new/table1", "new/table2", "new/table3"]
The JSON is relatively straightforward. For new_names, I am using the prefix new/ to place the Parquet objects in a new namespace, which will help if I later want to query the entire dataset in SQL Query and keep the Parquet files separate from the other objects in the bucket. This is optional, of course, but useful.
Now, let’s use the JSON file as a parameter to run the action:
ibmcloud fn action invoke cos-objects/convert -r -P params.json
Once the conversion is finished, the action returns a success response, and you will see the new Parquet objects in your bucket.
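If you'd rather verify from a script than from the UI, you could list the converted objects with the same ibm-cos-sdk client sketched earlier, filtering on the new/ prefix used in params.json:

```python
# Sketch: list the converted Parquet objects, reusing the `cos` client
# from the upload sketch above and the new/ prefix from params.json.
resp = cos.list_objects_v2(Bucket="new-bucket-2018", Prefix="new/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```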
The caveat to using the cloud function is that, with the maximum memory resource set to 512 MB as above, it will only convert objects that are less than about 50 MB each. So, if your objects are over 50 MB each, I recommend using the csvtoparquet tool instead to convert to Parquet.
With the Parquet objects in your bucket, you can use SQL Query to start querying them, either through the SQL Query UI or programmatically using Python or Node.js. We have put together several resources on making all types of queries with SQL Query. Go check them out, and try using SQL Query and COS to consolidate and query your data in the cloud.
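As a taste of the programmatic route, here is roughly what a query against the new Parquet objects could look like with the ibmcloudsql Python package (pip install ibmcloudsql); the API key, SQL Query instance CRN, and COS URLs below are placeholders you would replace with your own values:

```python
# Sketch: query the converted Parquet objects with the ibmcloudsql package.
# The API key, instance CRN, and COS URLs below are placeholders.
import ibmcloudsql

sql_client = ibmcloudsql.SQLQuery(
    "api-key",                              # IAM API key (placeholder)
    "crn:v1:bluemix:public:sql-query:...",  # SQL Query instance CRN (placeholder)
    target_cos_url="cos://us-geo/new-bucket-2018/results/",  # where results land
)
sql_client.logon()

# Query one of the converted Parquet objects and get back a pandas DataFrame
df = sql_client.run_sql(
    "SELECT * FROM cos://us-geo/new-bucket-2018/new/table1.parquet "
    "STORED AS PARQUET LIMIT 10"
)
print(df.head())
```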