IBM Cloudant and PostgreSQL: Combining Databases for App Speed and Business Intelligence Insights

5 min read

How to move data from multiple IBM Cloudant tables to a PostgreSQL "data warehouse" so that you can query the data in more creative ways.

In previous posts, we have looked at combining various IBM services to create applications that address real-life requirements faced by enterprises. For example, combining IBM Cloudant and IBM Code Engine or IBM Cloudant and IBM Databases for Redis.

One important requirement in most organisations is the need to make your data work hard for different purposes. On the one hand, you may need a highly available, multi-region, lightning-fast store of data so that your e-commerce application can quickly sell your items and process the orders. Meanwhile, your business intelligence team may want to interrogate that sales data and join it with other sources to produce reports, dashboards or machine learning models that drive business and logistical decisions. Some of that data will likely not even be in your operational data store, but originate somewhere else — inside or outside your organisation — like clickstream data, historical weather reports or application performance monitoring. So how do we start to deal with this deluge of data to drive insights? 

For the first requirement, something like IBM Cloudant would do a great job. IBM Cloudant is a fully managed, distributed database optimized for heavy workloads and fast-growing web and mobile apps.

But Cloudant would not be optimal for the second requirement because "joining" data from different documents is not something that document stores do well. For that, you want a more relational type of database, like IBM Cloud Databases for PostgreSQL. A relational database like PostgreSQL is great for operational or analytical use cases, and it has robust JDBC support for all your BI tooling. 

Joining forces

In this tutorial, we will go through a simple NodeJS example of how you can move data from multiple IBM Cloudant tables to a PostgreSQL "data warehouse" so that you can query the data in more creative ways.

This will not be cost-free because PostgreSQL does not offer a free tier. But if you set it up and de-provision the infrastructure straight after you are done, it should not cost more than a few dollars. Cloudant offers a generous free tier that you can use for this tutorial.

The tutorial itself should take between an hour and two hours to complete. It is a bit harder than the ones mentioned above because it requires more terminal work and command-line applications to move data around, but hopefully, it will be worth the extra effort. If you like the command line, this one is for you.

Prerequisites

  • Access to a terminal on your computer/laptop (Linux/macOS)
  • An IBM Cloud pay-go account 
  • Git 
  • Terraform: We will be using Terraform to simplify the creation of the infrastructure
  • Couchwarehouse: A command-line facility that lets you export data from Couch-like document databases like Cloudant into relational databases like PostgreSQL
  • jq: Used to manipulate JSON structures
  • base64 (this is normally part of your laptop distribution)
  • psql
  • couchimport: A command-line facility to import datasets into Cloudant

Scenario

Your Cloudant database stores all the orders for your company. Every order has the following data points:

        "_id": "R96Z8DFQ6JQ6IEMX",
        "type": "order",
        "customerId": "1791",
        "customerName": "Keely Greenberg",
        "email": "keely_greenberg@gmail.com",
        "orderId": "R96Z8DFQ6JQ6IEMX",
        "date": "2021-03-22T08:54:52.421Z",
        "total": "178.21",
        "tax": "8.74",
        "address": {
                "street": "1249 Tanners Circle",
                "state": "Arizona",
                "zip": "55942"
        },
        "basket": ["DORINA", "BHAKTI", "PANCA"]

In a separate Cloudant database, you have generic data about all 50 US states:

        "name": "Wyoming",
        "abbreviation": "WY",
        "capital": "Cheyenne",
        "city": "Cheyenne",
        "population": 582658,
        "area": 253335,
        "waterarea": 1865,
        "landarea": 251470,
        "houseseats": 1,
        "statehood": "1890-07-10"

You could easily figure out the sales per state using only Cloudant (because every order document has the state in the delivery address). But what happens when your marketing team asks for state sales per head of population to try to figure out which are their more lucrative locations? You have the data, but in separate documents they are going to be hard to combine.

Step-by-step instructions

For this tutorial we will do the following:

  • Install the cloud databases
  • Import some pre-made orders and geo data into Cloudant using couchimport
  • Export it from Cloudant to PostgreSQL using couchwarehouse
  • Query the PostgreSQL data joining different tables

Step 1: Obtain an API key to deploy infrastructure to your account

Follow the steps in this document to create an API key and make a note of it for Step 2.

Step 2: Clone the repo and cd into the Terraform directory

git clone https://github.com/danmermel/cloudant-postgres-warehouse.git
cd cloudant-postgres-warehouse/terraform

Create a document called terraform.tfvars with the following fields: 

ibmcloud_api_key = "<your_api_key>"
region = "eu-gb"
postgres_password = "<make_up_a_password>"

The terraform.tfvars document contains variables that you may want to keep secret, so it is ignored by the GitHub repository.

Step 3: Create the databases

TL;DR - Run the Terraform script:

terraform init 
terraform apply --auto-approve

A bit more info: The Terraform folder contains a small number of simple scripts: 

  • main.tf tells Terraform to use the IBM Cloud 
  • variables.tf contains the variable definitions whose values will be populated from terraform.tfvars
  • cloudant.tf creates a free tier Cloudant DB and some credentials that we will use later to access it
  • postgres.tf creates the PostgreSQL DB 

It will take several minutes for the databases to be ready, but you should now you have a PostgreSQL DB and a Cloudant DB in your account.

Step 4: Get access to PostgreSQL

First, you need Postgres to be accessible to you and to couchwarehouse — we will use the account credentials we got from a previous step by copying them to a file:

terraform output -json > ../creds.json
cd ..

Then, manipulate the output to create environment variables that couchwarehouse and Postgres can use:

cat creds.json | jq '.postgres_credentials.value[].certbase64' | sed 's/"//g' | base64 --decode > psql.cert
export PGSSLROOTCERT=psql.cert
export PGHOST=`cat creds.json | jq '.postgres_credentials.value[].hosts[].hostname' | sed 's/"//g'`
export PGPORT=`cat creds.json | jq '.postgres_credentials.value[].hosts[].port' | sed 's/"//g'`
export PGDATABASE=ibmclouddb
export PGUSER=admin
export PGPASSWORD="<the password you created in step 2>"

Step 5: Upload data to Cloudant

You will need some more environment variables to access the Cloudant database and put some data into it:

export COUCH_URL=`cat creds.json | jq '.cloudant_credentials.value.url' | sed 's/"//g'`
export COUCH_DATABASE=orders

First, import the order data:

//Create the database
curl -X PUT "$COUCH_URL/$COUCH_DATABASE"
//{"ok":true}

//import the data into Cloudant
cat orderdata.json | couchimport --type jsonl

//run couchwarehouse to copy the data to postgres
couchwarehouse --databaseType postgresql

Now, repeat the process for the geo data. You'll first have to CTRL+C on the terminal because couchwarehouse continues to listen for changes to import:

export COUCH_DATABASE=geodata

//Create the database
curl -X PUT "$COUCH_URL/$COUCH_DATABASE"
//{"ok":true}

// Import the data into Cloudant
cat geodata.json | couchimport --type jsonl

//run couchwarehouse to copy the data to postgres

couchwarehouse --databaseType postgresql

Now you should be able to see all the data in PostgreSQL by running the psql client (which uses the credentials from Step 4). In the terminal, type the following:

psql

You should be in the psql client:

psql (13.4, server 12.7)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.
ibmcloud=>

You should now be able to see some order data. For example, to see all the orders per state:

select orders.address_state, count(*)  from orders group by 1 order by 2 desc;

More importantly, you can now join the tables to see dollar sales per head of population:

select name, (sum(cast(orders.total as float))/geodata.population)  from orders join geodata on orders.address_state = geodata.name group by 1,geodata.population order by 2 asc;

The results in this example are predictable because it is a randomly generated set of orders so the more populous states will end up with the smaller orders per head. It illustrates the power of combining different technologies to do the jobs at which they are best.

Remember to de-provision your resources so you don't incur extra costs. To do that, go into your project's Terraform directory and type the following:

terraform destroy

You will be prompted to confirm the action and Terraform will de-provision the resources. 

In a real-life scenario, a lot of this shifting of data would be done by automated background processes, and you would be using each database for what it does best.

Learn more

Remember that if you are new to these services you can get free usage credits to get you started.

Be the first to hear about news, product updates, and innovation from IBM Cloud