Test Your Database with GBs of Data


How you can use IBM Cloud to generate and upload 50GB of data to a Redis database to test scenarios in your app design.

It is a bit of a truism that most databases work just fine with small amounts of data. It is only when your database has a lot of data in it (and a lot of users) that the unexpected consequences of your design decisions become apparent and things start to go wrong or get complicated. But, by then, it may be too late to make fundamental architectural changes to your database schema or access patterns.

In this tutorial, we will show you how you can use IBM Cloud to generate and upload 50GB of data to a Redis database in a matter of hours. You will then be ready to run test scenarios against your application design using realistic amounts of data. You could also use the setup to make realistic estimates of the Recovery Time Objective (RTO) in a disaster-recovery scenario.

The principles in this tutorial can be extended to other databases and larger or smaller datasets.

Two challenges

There are two main challenges when dealing with large amounts of data:

  1. How to generate it quickly.
  2. How to feed it into a database as fast as possible.

For the first challenge, we will use datamaker, a NodeJS utility for generating realistic-looking data in any format (based on a given template).

For the second challenge, we will deploy a virtual machine (VM) in the cloud that is in the same Availability Zone as our database. In that way, not only can you use as powerful a virtual machine as you want, but the round trip of loading data from your machine into the database is minimised.

This tutorial will not be cost-free because you will need database and VM resources that are not in the IBM Cloud free tier. But if you deprovision the resources after completing the tutorial, it should not cost more than a few dollars.

The tutorial will take several hours to complete, although most of that will be waiting time while data is created and then shipped into the database. It is not a beginner tutorial as it may require some knowledge of more advanced command-line features if you need to troubleshoot.

Prerequisites

You will need the following:

  • An IBM Cloud account.
  • git, to clone the tutorial repository.
  • The Terraform CLI, to provision the infrastructure.
  • The ibmcloud CLI, to fetch the database certificate.
  • A terminal with ssh-keygen, ssh and scp (standard on macOS and Linux).

(datamaker, Node.js and npm are also required, but they do not need to be installed on your machine because they will be installed in the VM in the cloud).

Step 1: Obtain an API key to deploy infrastructure to your account

Follow the steps in this document to create an API key and make a note of it.
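If you prefer the command line, you can create a key with the ibmcloud CLI instead (the key name big-data-key here is just an example):

ibmcloud iam api-key-create big-data-key -d "API key for the big data tutorial"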

Step 2: Create a secure key to log into your virtual machine

In this step, you will create an ssh key that will allow you to log in from your computer into the virtual machine that you are going to create later.

In a terminal window, type the following:

ssh-keygen -t rsa        # generate an RSA key pair; accept the default file location
cat ~/.ssh/id_rsa.pub    # print the public key so you can copy it

The ssh-keygen facility generates various types of authentication keys for ssh. In our case, we are creating one of type RSA. The keys are stored by default in the .ssh folder of your user directory. If they are not there, check whether your operating system uses different defaults.

Make a note of the string that is printed. 

Step 3: Clone the repo and cd into the Terraform directory

git clone https://github.com/danmermel/bigdata.git
cd bigdata/terraform

Create a file called terraform.tfvars with the following fields:

ibmcloud_api_key = "<your_api_key_from_step_1>"
region = "eu-gb"
redis_password  = "<make_up_a_password>"
public_key = "<your_ssh_key_from_step_2>"

The terraform.tfvars file contains variables that you may want to keep secret, so the repository is set up to ignore it.

Step 4: Create the infrastructure

In this step, you will create all the infrastructure that you need.

TL;DR: Run the Terraform script:

terraform init 
terraform apply --auto-approve

The terraform folder contains a number of simple configuration files:

  • main.tf tells Terraform to use the IBM Cloud provider (see the sketch after this list).
  • variables.tf contains the variable definitions whose values will be populated from terraform.tfvars.
  • vpc.tf creates the VPC (virtual private cloud) infrastructure where your virtual machine will be deployed.
  • vm.tf creates a virtual machine inside the VPC and gives it the ssh key that allows you to log into it later.
  • redis.tf creates the Redis instance with 60GB of RAM.
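For orientation, the provider wiring in main.tf looks something like the sketch below (the file in the repo is authoritative and may pin provider versions):

terraform {
  required_providers {
    ibm = {
      source = "IBM-Cloud/ibm"
    }
  }
}

provider "ibm" {
  ibmcloud_api_key = var.ibmcloud_api_key
  region           = var.region
}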

It will take several minutes for the resources to be ready, but you should now have an IBM Cloud Databases for Redis instance and a VM that you can access. You can check by visiting the Resources section of your IBM Cloud account.

The Terraform script will output several bits of information that you will need for the next steps.
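You can re-display those values at any time by running terraform output from the terraform directory (the exact output names are defined in the repo):

terraform output
#e.g. the Redis host and port, the instance CRN and the VM IP address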

Step 5: Prepare files for your virtual machine

In this step, you will prepare a few files that allow you to install all the software you will need on your virtual machine and give it the access it needs.

In the root of your project, there is a file called build.sh. This file will do all the heavy lifting of installing the software needed, creating the data and pushing it into Redis.

You need to edit this file and replace the redis password parameter with the one you created in Step 3.

In the root of the folder, there is also a file called stunnel.conf. stunnel is a facility that allows redis-cli (which does not support TLS connections) to connect to the IBM Cloud Redis instance (which accepts only TLS connections).

You need to edit this file and replace the redis_host and redis_port variables with those in the output from the Terraform script (Step 4).
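For reference, a client-mode stunnel configuration of this shape is what you should end up with (the option names are standard stunnel settings; the file in the repo is authoritative):

[redis]
client = yes
accept = 127.0.0.1:6830
connect = <redis_host>:<redis_port>
CAfile = /root/redis.cert

Port 6830 is the local port you will later point redis-cli at.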

Finally, you need to obtain the certificate file that is required to access the Redis instance. We will do this using the ibmcloud CLI. The CRN of your Redis instance is in the output of the Terraform script.
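If this is your first time using the cloud-databases commands, log in and install the plugin first:

ibmcloud login --apikey "<your_api_key_from_step_1>"
ibmcloud plugin install cloud-databases

Then fetch the certificate: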

ibmcloud cdb cacert "<your_crn_here>"

Now copy everything from (and including) -----BEGIN CERTIFICATE----- to (and including) -----END CERTIFICATE----- into a new file called redis.cert in the root of your project.

Step 6: Copy files to your virtual machine

Now you want to copy all these files to your virtual machine. We will do this using the scp (secure copy) utility. Your VM IP address is in the output of the Terraform script. Be mindful of the colon at the end of the following commands.

In a terminal window in the root of the project, type the following:

scp stunnel.conf root@<vm_ip_address>: 
#stunnel.conf                                                                                  100%  135     7.0KB/s   00:00
scp build.sh root@<vm_ip_address>: 
#build.sh                                                                                  100%  135     7.0KB/s   00:00
scp redis.cert root@<vm_ip_address>: 
#redis.cert                                                                                  100%  135     7.0KB/s   00:00

If your ssh key is in the right place and your files are in the right folder, then they should be copied up to the virtual machine.

Step 7: Log into your virtual machine and set it up

In this step, you will remotely access your virtual machine and then run the script that will install all the required software, create 50GB of data and feed it into Redis:

ssh root@<vm_ip_address>

#Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
#permitted by applicable law.
#root@big-data-vm:~#

You can check that all the files you just copied up are there. The following command should list them:

ls -al

The build.sh script

This script will do the following:

  • Install NodeJS
  • Install datamaker to create data
  • Install redis-server, which provides the redis-cli client used to access our Redis instance
  • Install stunnel so that redis-cli can make a TLS connection to the instance
  • Generate 50GB worth of data
  • Pipe that data into Redis

All you have to do is run it!

#first, make sure the script is executable
chmod +x build.sh
./build.sh

Generating the data will take a while (perhaps an hour) and pumping it into Redis will take a few minutes.

How data is generated

The datamaker facility can generate realistic-looking data of many types. Essentially, you give it a template and it will replace anything it finds inside double curly brackets {{ }} with data of the given type. In our case, our template is one line of a Redis instruction that inserts a key-value pair (SET). The key will be a random UUID string. The value will be an object consisting of an integer, 10 words and an email address. So a typical line generated by datamaker will be:

SET YA7NIQR37T7BSFZH '{"a":16,"txt":"nickel circumstances slightly virginia rapidly stuart diagnostic understanding envelope without","email":"janeen.chapin@gmail.com"}'

When that line gets pumped into Redis, it will create one key-value pair in memory.
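For illustration, a template along the following lines would produce such SET commands (the tag names here are assumptions based on datamaker's documentation; build.sh contains the actual template used):

SET {{uuid}} '{"a":{{integer 1 100}},"txt":"{{words 10}}","email":"{{email}}"}'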

We calculated that each key-value pair occupies about 150 bytes of data in memory, so to reach 50GB you have to create around 350 million of those lines (350,000,000 × 150 bytes ≈ 52GB).

That is exactly what the script does, and it puts the data in a file called batch.txt.
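In outline, the generate-and-load portion of build.sh amounts to something like this sketch (flag names follow datamaker's documentation; port 6830 is the local stunnel endpoint, and the script in the repo is authoritative):

datamaker --template template.txt --format none --iterations 350000000 > batch.txt
redis-cli -p 6830 --user admin -a '<your_redis_password>' --pipe < batch.txt

The redis-cli --pipe mode streams the commands to the server in bulk, which is much faster than replaying them one at a time.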

Did your data go in?

Once the script has finished running, you will want to check that your data went in. 

You will have a massive batch.txt file that you don't want to open. But you can get the first 10 lines of it by doing the following:

head batch.txt
#you will get data like this:
#SET QU2OTGK1YG5G2ZQC '{"a":58,"txt":"dimensional field reseller confidentiality occupied remix sixth partition notification albums","email":"esta.wilbur35@behind.com"}'
#SET PUPG10J7I4QHB8JY '{"a":23,"txt":"ids difficulties pizza ferry upload test amino anticipated craft josh","email":"kenton0@richards.com"}'
#...etc

You can then launch the Redis CLI and retrieve one of those keys:

redis-cli -p 6830
127.0.0.1:6830> auth admin <your_redis_password_from_terraform.tfvars>
#OK
127.0.0.1:6830> get QU2OTGK1YG5G2ZQC
#'{"a":58,"txt":"dimensional field reseller confidentiality occupied remix sixth partition notification albums","email":"esta.wilbur35@behind.com"}'

You can also check that all the keys have gone in by issuing the INFO command to Redis:

127.0.0.1:6830> INFO

You will see a long list of information, but at the bottom you will see the following:

# Keyspace
db0:keys=350000000,expires=0,avg_ttl=0
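If you just want the key count, the DBSIZE command returns it directly:

127.0.0.1:6830> DBSIZE
#(integer) 350000000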

Summary

In this tutorial you have learned how to generate large amounts of realistic-looking data and insert it into Redis.

The principles of this tutorial can be extended to other databases. For example, datamaker could create a data file for a MySQL database with lines such as the one below to insert values into a products table:

INSERT INTO products VALUES ('{{autoinc}}','{{words 5}}', {{price}}, '{{address}}', {{boolean 0.5}});

You can then use a bulk-upload facility such as MySQL's LOAD DATA to insert the data at speed.
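A minimal sketch of that route, assuming a CSV-shaped datamaker template called products.txt (a hypothetical file) and a MySQL server with local_infile enabled:

datamaker --template products.txt --format csv --iterations 350000000 > products.csv
mysql mydb -e "LOAD DATA LOCAL INFILE 'products.csv' INTO TABLE products FIELDS TERMINATED BY ','"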

If you followed this tutorial, remember to de-provision your infrastructure to stop incurring charges. On your terminal, do the following:

cd terraform/
terraform destroy

If you want to take the next step in your developer journey, check out some of our trial offers.

Learn more about IBM Cloud Databases for Redis.
