How to upload data with IBM Aspera and run SQL queries across it
IBM Cloud SQL Query lets you use standard SQL and Apache Spark SQL functions to query your structured and semi-structured data stored in Cloud Object Storage (COS). It’s a serverless solution that makes it easy to analyze large amounts of data in COS simply by pointing to the COS bucket that stores your data. You don’t have to create schemas or do any extra setup: create a COS bucket, add the data, click a button to set up the SQL Query service, then start analyzing your data using SQL.
In this post, we’ll quickly set up COS and SQL Query, upload some police call logs to COS using IBM Aspera high-speed data transfer, and then run some SQL queries over multiple log files to gain insights into police calls spanning nearly a decade. Let’s get started.
Speaking of structure and clouds, here’s the Coney Island Parachute Jump. Photo by S. on Unsplash
Setting up COS and SQL Query
To create the SQL Query service, you’ll need to have Cloud Object Storage (COS) first. In my introduction to analyzing data with SQL Query, I’ve outlined how to set up both COS and SQL Query from the IBM Cloud UI. To give you another flavor of how to set up COS and SQL Query, we’ll create the services using the IBM Cloud Developer Tools CLI.
If you haven’t installed the command line tool, check out the documentation for installation instructions to get started. Once it’s installed, log into your account with the following:
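With the CLI installed, the login command looks like this (federated IDs may need the `--sso` flag instead):

```shell
ibmcloud login
```
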
You’ll be prompted to enter your email address and password for authentication. Next, target your Cloud Foundry account, which connects you to your space on IBM Cloud:
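Targeting your Cloud Foundry org and space can be done interactively:

```shell
# Prompts you to choose a Cloud Foundry org and space
ibmcloud target --cf
```
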
Next, create the COS and SQL Query services using the following ibmcloud commands. If you already have COS, you can skip the first command:
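A sketch of the COS creation command, assuming the catalog service name `cloud-object-storage` and the Lite plan (the instance name `my-cos` is a placeholder — use whatever name you like):

```shell
# Create a Cloud Object Storage instance on the Lite plan (global resource group location)
ibmcloud resource service-instance-create my-cos cloud-object-storage lite global
```
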
Now, add SQL Query:
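Similarly for SQL Query, assuming the catalog service name `sql-query`, the Lite plan, and the `us-south` region (the instance name is again a placeholder):

```shell
# Create an SQL Query instance in us-south on the Lite plan
ibmcloud resource service-instance-create my-sql-query sql-query lite us-south
```
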
To view all the services that we just created, run:
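Listing the instances can be done with:

```shell
# List all service instances in the currently targeted resource group
ibmcloud resource service-instances
```
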
This will provide you with a list of services tied to your account:
Now that we have COS and SQL Query set up, let’s create a COS bucket in the UI to store the data.
Once in the COS UI, click the Create Bucket button, which will open a window for you to choose a name, select the bucket resiliency, and select its location. An important point to remember when naming buckets is that they must be unique across all existing bucket names in COS:
Let’s now move on to the data.
Uploading the data
The logs that we’ll use for this example are publicly available Seattle Police Department call logs from 2009 to 2018. It’s nine years of data in a 653 MB file. At the moment, COS buckets in every region limit the uploading of files to 200 MB. However, buckets with “Regional” resiliency that are located in “us-south” can leverage Aspera, which will allow you to exceed the file upload limit.
We could load all of the data as a single file with Aspera, but in this example, I’ve split the 653 MB file into smaller files based on the month and year of the calls. This is to simulate how you might store your log files in COS, and it will also make your queries a lot faster if you don’t have to scan the entire dataset when querying it. Also, it provides us with the opportunity to show how to query over numerous files in COS with SQL Query.
First, download the file. Once it’s downloaded, we’ll split it up using a command-line tool called csvkit, which comes with a variety of utilities for working with CSV data and transforming it into other formats. You can install it using Python’s pip package manager:
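The install is a single pip command:

```shell
pip install csvkit
```
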
Since the Call_Data.csv file includes the dates of each event, I’ve created a folder for each year that will store twelve CSV files according to the month of each call event. To do that, we’ll use csvkit’s csvgrep command. This command will search in column ORIGINAL TIME QUEUED, which is the time a call was logged, and look for the year and the month for each call. Using a regular expression, we’ll select the month and year from that column and store the results by month according to the year. Each file will be saved with a format like 1-2018.csv, which represents the call log for January 2018.
To automate this process, we’ll use bash as well as the csvgrep command. This will iterate over our CSV file, create the appropriate files, and place them in the correct folder by year.
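A sketch of that bash loop, assuming the date column is named exactly `ORIGINAL TIME QUEUED` and formatted like `12/25/2017 08:44:10 PM` (adjust the column name and regex if your copy of the dataset differs):

```shell
#!/bin/bash
# Split Call_Data.csv into per-month files, grouped into per-year folders,
# e.g. 2018/1-2018.csv for January 2018.
for year in $(seq 2009 2018); do
  mkdir -p "$year"
  for month in $(seq 1 12); do
    # Match rows whose ORIGINAL TIME QUEUED starts with this month/year;
    # the "0?" tolerates both zero-padded ("01/") and bare ("1/") months.
    csvgrep -c "ORIGINAL TIME QUEUED" \
            -r "^0?${month}/[0-9]{1,2}/${year} " \
            Call_Data.csv > "${year}/${month}-${year}.csv"
  done
done
```

csvgrep keeps the header row in every output file, which is what lets SQL Query infer a schema for each file later on.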
This will take a few minutes to run.
Once the files have been saved in their folders, we can move them to the COS bucket with Aspera. In your COS bucket, click on Add Objects then select Aspera high-speed transfer. You need Aspera Connect installed to transfer files:
There is a Download Aspera Installer link to do that. Also, you’ll need to set Aspera to upload files. You can set that by clicking the Transfer Options link and turning on Uploads.
Once you’ve saved these options, click Add Objects again. With Aspera selected, click Add folder. From there, select all your folders from 2009 to 2018. Then, just wait until you see the “Completed” message after all your files have been uploaded.
Querying the log files
When querying the files in SQL Query, you’ll need the URL for the COS bucket that contains your data. From the COS UI, you can get this URL by clicking on the kebab icon on the right side of the bucket name:
Select the Bucket SQL URL option:
This will open up a window to select the bucket’s URL. Copy it:
Next, you’ll need to go back to your services page and select the SQL Query service. Once selected, you’ll be taken to the service’s page. Click the Open UI button, which will take you to the SQL Query data browser.
One of the simplest ways to test queries and get a snapshot of the results before running more complex SQL queries is to use the data browser.
In the data browser, use your COS bucket’s URL to create a simple query, like the following, that will give you the headers and a snapshot of the data within the January 2018 log file:
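A minimal query of that shape might look like the following, where `cos://us-south/your-bucket` is a placeholder for your own bucket’s SQL URL:

```sql
-- Preview the January 2018 log file
SELECT *
FROM cos://us-south/your-bucket/2018/1-2018.csv STORED AS CSV
LIMIT 10
```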
It’s pretty simple to run a query because you don’t have to set up tables beforehand. All you need to do is dump your data into COS, and SQL Query will create the schema and scan the data behind the scenes. If we wanted to run the same query over all of the 2018 data, we just need to use the 2018 prefix, and SQL Query will scan all the files from that year:
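Pointing at the `2018/` prefix instead of a single file makes SQL Query scan every object under it (bucket name again a placeholder):

```sql
-- Scan all twelve monthly files for 2018
SELECT *
FROM cos://us-south/your-bucket/2018/ STORED AS CSV
LIMIT 10
```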
To combine log files stored under different prefixes, you can use the UNION ALL SQL clause. For example, if we want the logs for January 2017 and January 2018, we’d write:
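A sketch of that UNION ALL query, with the bucket path as a placeholder:

```sql
-- Stack the January 2017 and January 2018 logs into one result set
SELECT * FROM cos://us-south/your-bucket/2017/1-2017.csv STORED AS CSV
UNION ALL
SELECT * FROM cos://us-south/your-bucket/2018/1-2018.csv STORED AS CSV
```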
For more complex queries, we can use some of Spark SQL’s built-in functions to get the call logs for Christmas 2016 and 2017. We’ll use a common table expression (CTE) to combine the data for December 2016 and 2017. Then, we’ll take the column ORIGINAL TIME QUEUED, which is a string containing a call’s date and time, convert it to a date and time type that we can use in our queries, and place it in a column we define as EVENT TIME.
The unix_timestamp function converts a date and time like “12/25/2017 08:44:10 PM” into the Unix timestamp 1514234650. The function from_unixtime then converts the Unix timestamp into an ISO-8601-formatted date with 24-hour clock time, “2017-12-25 20:44:10”.
Next, we’ll use the xmas_calls CTE to list the types of calls that occurred on Christmas for 2016 and 2017 and group them together by call priority to give us a total for each call type. Using the HAVING SQL clause, we can filter the results to give us only totals over five.
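Putting those steps together, the query might look like this — the bucket path, the exact column names (`ORIGINAL TIME QUEUED`, `INITIAL CALL TYPE`, `PRIORITY`), and the date pattern are assumptions based on the description above:

```sql
WITH xmas_calls AS (
  -- Convert the raw date string into an ISO timestamp for December 2016
  SELECT from_unixtime(unix_timestamp(`ORIGINAL TIME QUEUED`, 'MM/dd/yyyy hh:mm:ss a')) AS `EVENT TIME`,
         `INITIAL CALL TYPE`, `PRIORITY`
  FROM cos://us-south/your-bucket/2016/12-2016.csv STORED AS CSV
  UNION ALL
  -- ...and the same for December 2017
  SELECT from_unixtime(unix_timestamp(`ORIGINAL TIME QUEUED`, 'MM/dd/yyyy hh:mm:ss a')) AS `EVENT TIME`,
         `INITIAL CALL TYPE`, `PRIORITY`
  FROM cos://us-south/your-bucket/2017/12-2017.csv STORED AS CSV
)
-- Keep only calls on the 25th, count them per call type and priority,
-- and filter out totals of five or fewer
SELECT `INITIAL CALL TYPE`, `PRIORITY`, COUNT(*) AS TOTAL
FROM xmas_calls
WHERE dayofmonth(`EVENT TIME`) = 25
GROUP BY `INITIAL CALL TYPE`, `PRIORITY`
HAVING COUNT(*) > 5
ORDER BY TOTAL DESC
```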
If we wanted to only look for call logs pertaining to traffic incidents, we could also include INITIAL CALL TYPE like “%TRAFFIC%” within the query like:
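The traffic-only variant just adds that LIKE condition to the final SELECT over the xmas_calls CTE described above (column names again assumed):

```sql
SELECT `INITIAL CALL TYPE`, `PRIORITY`, COUNT(*) AS TOTAL
FROM xmas_calls
WHERE dayofmonth(`EVENT TIME`) = 25
  AND `INITIAL CALL TYPE` LIKE '%TRAFFIC%'
GROUP BY `INITIAL CALL TYPE`, `PRIORITY`
HAVING COUNT(*) > 5
ORDER BY TOTAL DESC
```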
Another interesting query to run is the ranking of police call logs related to traffic incidents in 2016. You could use the RANK() OVER() syntax in SQL Query to accomplish this.
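One way that ranking query might look, scanning the whole 2016 prefix (bucket path and column names are placeholders/assumptions):

```sql
-- Rank 2016 traffic-related call types by how often they occurred
SELECT `INITIAL CALL TYPE`,
       COUNT(*) AS TOTAL,
       RANK() OVER (ORDER BY COUNT(*) DESC) AS CALL_RANK
FROM cos://us-south/your-bucket/2016/ STORED AS CSV
WHERE `INITIAL CALL TYPE` LIKE '%TRAFFIC%'
GROUP BY `INITIAL CALL TYPE`
```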
The only limit you have with SQL Query is your ability to use SQL. Standard SQL will work for any query, and you can also use Spark SQL functions to fine-tune queries even more.
Analyzing data over multiple files in COS is easy using SQL Query. All that’s required is for you to point to the files or COS bucket prefix, and then you can start using standard SQL to gain insights into your data.
For a tutorial on how you could use SQL Query to analyze logs in real-time, take a look at the “Big data logs with streaming analytics” tutorial in the IBM Cloud docs.
To learn more, visit the SQL Query product page.