Begin analyzing data with SQL Query
At IBM Think 2018, IBM Cloud SQL Query (SQL Query) was introduced. SQL Query provides a serverless, no-ETL solution to easily query data stored in IBM Cloud Object Storage (COS). SQL Query uses Apache Spark SQL as its underlying query engine, and users can compose queries using standard SQL to analyze and transform structured and semi-structured data in object storage.
No additional setup is involved. Once you have SQL Query running on IBM Cloud, you can immediately start querying your data using the SQL Query user interface, or you can query programmatically using either the REST API or the Python ibmcloudsql library.
In this article, I’ll show you how to set up and start using SQL Query from the UI. After that, we’ll look at a practical example using ibmcloudsql to query traffic accident data around Los Angeles that’s stored in Cloud Object Storage and then visualize accidents in a Jupyter Notebook with PixieDust using Mapbox.
Getting set up
In order to get started using SQL Query, you’ll need to have a Cloud Object Storage service first. If you don’t have an object storage service, you can provision a Lite plan instance in the IBM Cloud catalog. It’s a limited plan that’s free, but it will provide more than enough storage for what we need.
You’ll need to create a bucket in COS to store the data that we’ll use for our example. To create a bucket, click Create Bucket and a window will appear asking you to select a name and region for the bucket. If you’re interested in what regions and endpoints are available or curious about each region in particular, there’s some documentation explaining regions and endpoints for Cloud Object Storage:
You can create any name for your bucket, but keep the rest of the default settings as they appear. I’ve called this bucket “data-bucket-032018.” Since bucket names have to be unique across the IBM Cloud platform, create a unique name for your bucket.
A note on universally unique buckets: If you plan on reproducing the queries below using your own IBM Cloud Object Storage bucket, be sure to replace where it appears in my variable and clause with your own unique bucket name.
The data we’ll use for this example is a CSV file of Los Angeles Traffic Collision Data from 2010 to Present. Download the file and upload it to the object storage bucket you just created.
Setting up SQL Query
Select the SQL Query service from the IBM Cloud catalog. It’s currently in beta and free to use:
Currently, SQL Query can run queries on data that are stored as CSV, Parquet, or JSON in Cloud Object Storage. To interact with the SQL Query, you can write SQL queries using its UI, write programmatically using the REST API or the ibmcloudsql Python library, or write a serverless function using IBM Cloud Functions. After provisioning SQL Query from the catalog, click on the Manage tab in the left sidebar. Click Open UI from that page to access the SQL Query UI to start making queries:
Start querying your data
When you click Open UI, the SQL Query service will automatically generate a unique Cloud Object Storage bucket that will store all of the results as CSV files from your SQL queries. You can also specify other buckets to store results in, but to keep things simple, we’ll use the generated object storage bucket. The SQL Query UI is an editor that lets you immediately start composing SQL queries. Since SQL Query uses Spark SQL, you can use Spark SQL functions and ANSI SQL to compose both simple and complex queries that involve large amounts of data:
At the bottom of the editor, you’ll see the target Cloud Object Storage URL. This URL points to the object storage bucket generated by SQL Query to store your results:
/resultsuffix is added by default as the suffix of the URL.
result will be attached to each file name the contains the results of a SQL query inside this bucket. To try out the editor, let’s compose a query that gets data from the traffic collision data. The query selects the traffic collision ID, date and time the collision occurred, the area where the collision happened, victim age and sex, and location coordinates. Additionally, we’ll filter the data so that it only provides us with victims whose age is between 20–35 and an incident time between 5pm to 8pm. The SQL code is below:
Since SQL Query creates a table from CSV, Parquet, or JSON files behind the scenes, in the FROM statement you have to provide a table unique resource identifier in the form of:
That points to the file(s) you’re querying—similar to defining the table you’re querying in SQL. The identifier comprises the bucket endpoint, bucket name, and the object file name. SQL Query accepts wildcards; so, for example, if you have lots of files in a bucket with the same name and distinguished by a month and year (e.g.,
example-032018), you could combine them all into a single table by selecting
Now, click Run to start the SQL query. Once it runs, you’ll see two columns in the results panel appear. The Jobs column lists all the SQL queries that have been completed or failed. To the right of that, you’ll have two tabs: Result and Query details. Result shows the result of the query as a table:
Query details will include a copy of the SQL query, the query status, the run time (or time that it took to get the result), the user who made the query, and the URL location of the results in the object storage bucket:
If you click on the square icon next to the Result Location URL, it will take you to the Cloud Object Storage bucket and provide you with the CSV file with the results, which you can download or use as the table unique resource identifier for another query.
That’s all you have to do to get started using SQL Query! Now that we have the basics of running a query, let’s look at how we can use SQL Query and PixieDust to create visualizations for the data we have in Cloud Object Storage.
Using SQL Query in Watson Studio with PixieDust
Querying data within the SQL Query UI is a great way to explore what your data looks like and to experiment whether your SQL query generates the right results. However, if you want to visualize the results, you’ll have to interact with SQL Query programmatically. One way to do this is using Jupyter Notebooks to run the code for the SQL queries and PixieDust to visualize the results. In this part, we’ll use a Jupyter Notebook running Python 3.x and PixieDust to visualize data provided by SQL Query.
We’ll modify a version of the SQL query above to divide the Location column into longitude and latitude coordinates. Then, we’ll use PixieDust to visualize these points using Mapbox to see where most of the traffic accidents occur in Los Angeles. For this example, I am using a Jupyter Notebook on Watson Studio. However, if you have a local instance of a Jupyter Notebook, that should be fine as well, but I can’t guarantee you’ll have all the updated packages that come with Watson Studio. If you want to follow along with the pre-built example, you can download the notebook on Github. PixieDust is an open-source tool that is an extension to Jupyter Notebooks. It adds a wide range of functionality to easily create customized visualizations from your data sets with little code involved. The Welcome to PixieDust notebook on Watson Studio is a great primer to help you become acquainted with the tool and its capabilities.
To interact with SQL Query, we’ll use ibmcloudsql, an open-source Python library that sends SQL queries to SQL Cloud and retrieves the results from your object storage bucket. A starter notebook for SQL Query is also available, which includes the steps to get started, set up the appropriate authentication configuration you’ll need, and run PixieDust to create charts based on the data in that notebook. When you open up a notebook, you’ll need to install and import PixieDust and ibmcloudsql. To install them, you can run the following in the first cell:
!pip install --upgrade ibmcloudsql !conda install -c conda-forge pyarrow !pip install --upgrade pixiedust
Now, import the libraries in the next cell:
You’ll need to get an API key from IBM Cloud to have access to Cloud Object Storage and SQL Query. This is needed to write SQL Query results back to object storage. Go to Manage > Security > Platform API Keys at the top of your IBM Cloud account. Click the Create button and a window will appear asking you to enter a custom name for the API key:
Click Create at the bottom of that window. After that, you can choose to Download or Show the API key. You’ll need to save it so that you can use it in the notebook:
To save the Cloud API Key securely in the notebook, we’ll use the getpass module. Import the module and create a variable to store the API Key. When the cell runs, it will give you the prompt, and you can just paste in your API Key and click return on your keyboard:
The next requirement is getting the SQL Query service instance cloud resource number (CRN). You can select that from the Manage panel in your SQL Query service page. Click on the Instance CRN button under RESTAPI. It will copy the CRN for you:
In another notebook cell, create a variable and add your copied CRN number:
Now, all you need is the full URL of the generated object storage bucket that will store and retrieve the results of the queries. An easy way to do this is to go back to your Cloud Object Storage service and find the bucket that was created with your SQL Query service. Click on the kebab menu icon and select Bucket SQL URL:
A window will open up with the bucket’s URL. Copy that URL:
Create a variable in your notebook to store that URL, like the following:
Note that I’ve added the suffix
/accidents at the end of my URL. This is so we can distinguish these queries from others that we might have made that will appear in this bucket. The suffix
/accidents will become the prefix for the results generated by SQL Query. With the API Key, CRN, and Cloud Object Storage endpoint added, all we need to do is run SQL Query. To do that, we’ll invoke the ibmcloudsql library’s
SQLQueryfunction and provide the API Key, CRN, and object storage endpoint:
Just copy the code and run it with your object storage bucket that contains the CSV file. You’ll get the same results that you got in the UI, but as a data frame in a notebook:
Since we want to get the location of each accident, we’ll need to refine the data a little more so that the latitude and longitude coordinates are in separate columns, rather than grouped together in the locationcolumn. To do that, I’ve written a CTE (common table expression). Copy and paste the following code and run it in a cell:
The CTE splits the coordinates into two rows (latitude and longitude) and uses the accident ID for reference. Using a table called location that’s generated from the CTE, we can then join the CSV data and that table together using the accident ID. We’ll then filter the accident time (5pm-8pm) and age of victims (20–35), as well as remove accidents that have 0.000 as coordinates. A sample of the results of that query should look like the following:
Now that we have the coordinates in separate rows, we can view the locations of each accident on a map inside another cell by using PixieDust. To do that, we’ll run PixieDust’s
display() function with our
By running this function, PixieDust will start and all of the results will show as a table:
Click the chart button at the top and select Map:
Once Map is selected, select the id and age as the values and latitude and longitude as the keys. By default, PixieDust will show only 100 results on the map, but you can add all 30,000+ results without any issues:
Once you click OK, you should see a map showing each accident:
If there is an error, make sure that the renderer is mapbox. PixieDust comes with a Mapbox API Key built in that you can use. Once you see the map, you can then zoom into it and view individual points which will, for instance, give you:
We began this article with a brief introduction to getting set up with IBM Cloud SQL Query and using its UI. Then, we composed and ran queries in a Jupyter Notebook on Watson Studio with the SQL Query Python library and PixieDust. If you want to download the notebook to follow along or play with the SQL queries we’ve made here, click here for the Github repository. Once you get these services working together, lots of possibilities open up. For example, in my sample notebook, you could keep experimenting with the
querystring to fully explore the dataset. Granted, we’re not building predictive models or crunching petabytes of data, but for situations where you need to share a sizable amount of data and test some SQL queries, I can see it being very useful.