How to move data from multiple IBM Cloudant tables to a PostgreSQL "data warehouse" so that you can query the data in more creative ways.
In previous posts, we have looked at combining various IBM services to create applications that address real-life requirements faced by enterprises. For example, combining IBM Cloudant and IBM Code Engine or IBM Cloudant and IBM Databases for Redis.
One important requirement in most organisations is the need to make your data work hard for different purposes. On the one hand, you may need a highly available, multi-region, lightning-fast store of data so that your e-commerce application can quickly sell your items and process the orders. Meanwhile, your business intelligence team may want to interrogate that sales data and join it with other sources to produce reports, dashboards or machine learning models that drive business and logistical decisions. Some of that data will likely not even be in your operational data store, but originate somewhere else — inside or outside your organisation — like clickstream data, historical weather reports or application performance monitoring. So how do we start to deal with this deluge of data to drive insights?
For the first requirement, something like IBM Cloudant would do a great job. IBM Cloudant is a fully managed, distributed database optimized for heavy workloads and fast-growing web and mobile apps.
But Cloudant would not be optimal for the second requirement because "joining" data from different documents is not something that document stores do well. For that, you want a more relational type of database, like IBM Cloud Databases for PostgreSQL. A relational database like PostgreSQL is great for operational or analytical use cases, and it has robust JDBC support for all your BI tooling.
In this tutorial, we will go through a simple NodeJS example of how you can move data from multiple IBM Cloudant tables to a PostgreSQL "data warehouse" so that you can query the data in more creative ways.
This will not be cost-free because PostgreSQL does not offer a free tier. But if you set it up and de-provision the infrastructure straight after you are done, it should not cost more than a few dollars. Cloudant offers a generous free tier that you can use for this tutorial.
The tutorial itself should take between an hour and two hours to complete. It is a bit harder than the ones mentioned above because it requires more terminal work and command-line applications to move data around, but hopefully, it will be worth the extra effort. If you like the command line, this one is for you.
- Access to a terminal on your computer/laptop (Linux/macOS)
- An IBM Cloud pay-go account
- Terraform: We will be using Terraform to simplify the creation of the infrastructure
- Couchwarehouse: A command-line facility that lets you export data from Couch-like document databases like Cloudant into relational databases like PostgreSQL
- jq: Used to manipulate JSON structures
- base64 (this is normally part of your laptop distribution)
- couchimport: A command-line facility to import datasets into Cloudant
Your Cloudant database stores all the orders for your company. Every order has the following data points:
In a separate Cloudant database, you have generic data about all 50 US states:
You could easily figure out the sales per state using only Cloudant (because every order document has the state in the delivery address). But what happens when your marketing team asks for state sales per head of population to try to figure out which are their more lucrative locations? You have the data, but in separate documents they are going to be hard to combine.
For this tutorial we will do the following:
- Install the cloud databases
- Import some pre-made orders and geo data into Cloudant using couchimport
- Export it from Cloudant to PostgreSQL using couchwarehouse
- Query the PostgreSQL data joining different tables
Step 1: Obtain an API key to deploy infrastructure to your account
Follow the steps in this document to create an API key and make a note of it for Step 2.
Step 2: Clone the repo and cd into the Terraform directory
Create a document called terraform.tfvars with the following fields:
The terraform.tfvars document contains variables that you may want to keep secret, so it is ignored by the GitHub repository.
Step 3: Create the databases
TL;DR - Run the Terraform script:
A bit more info: The Terraform folder contains a small number of simple scripts:
main.tftells Terraform to use the IBM Cloud
variables.tfcontains the variable definitions whose values will be populated from terraform.tfvars
cloudant.tfcreates a free tier Cloudant DB and some credentials that we will use later to access it
postgres.tfcreates the PostgreSQL DB
It will take several minutes for the databases to be ready, but you should now you have a PostgreSQL DB and a Cloudant DB in your account.
Step 4: Get access to PostgreSQL
First, you need Postgres to be accessible to you and to couchwarehouse — we will use the account credentials we got from a previous step by copying them to a file:
Then, manipulate the output to create environment variables that couchwarehouse and Postgres can use:
Step 5: Upload data to Cloudant
You will need some more environment variables to access the Cloudant database and put some data into it:
First, import the order data:
Now, repeat the process for the geo data. You'll first have to CTRL+C on the terminal because couchwarehouse continues to listen for changes to import:
Now you should be able to see all the data in PostgreSQL by running the psql client (which uses the credentials from Step 4). In the terminal, type the following:
You should be in the psql client:
You should now be able to see some order data. For example, to see all the orders per state:
More importantly, you can now join the tables to see dollar sales per head of population:
The results in this example are predictable because it is a randomly generated set of orders so the more populous states will end up with the smaller orders per head. It illustrates the power of combining different technologies to do the jobs at which they are best.
Remember to de-provision your resources so you don't incur extra costs. To do that, go into your project's Terraform directory and type the following:
You will be prompted to confirm the action and Terraform will de-provision the resources.
In a real-life scenario, a lot of this shifting of data would be done by automated background processes, and you would be using each database for what it does best.
Remember that if you are new to these services you can get free usage credits to get you started.