Currently, SQL Query can run queries on data stored as CSV, Parquet, or JSON in Cloud Object Storage. To interact with SQL Query, you can write SQL queries in its UI, work programmatically using the REST API or the ibmcloudsql Python library, or write a serverless function using IBM Cloud Functions. After provisioning SQL Query from the catalog, click the Manage tab in the left sidebar, then click Open UI on that page to access the SQL Query UI and start making queries:
Start querying your data
When you click Open UI, the SQL Query service automatically generates a unique Cloud Object Storage bucket that stores the results of all your SQL queries as CSV files. You can also specify other buckets to store results in, but to keep things simple, we’ll use the generated bucket. The SQL Query UI is an editor that lets you immediately start composing SQL queries. Since SQL Query uses Spark SQL, you can use Spark SQL functions and ANSI SQL to compose both simple and complex queries over large amounts of data:
At the bottom of the editor, you’ll see the target Cloud Object Storage URL. This URL points to the object storage bucket generated by SQL Query to store your results:
A /result suffix is added to the URL by default, and result will be attached to the name of each file inside this bucket that contains the results of a SQL query. To try out the editor, let’s compose a query against the traffic collision data. The query selects the traffic collision ID, the date and time the collision occurred, the area where the collision happened, the victim’s age and sex, and the location coordinates. Additionally, we’ll filter the data so that it only includes victims aged 20–35 and incidents that occurred between 5pm and 8pm. The SQL code is below:
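A query along these lines should do the trick. The column names (DR Number, Date Occurred, Time Occurred, Area Name, Victim Age, Victim Sex, Location) and the bucket URL are assumptions based on the publicly available Los Angeles traffic collision dataset, so adjust them to match your own file:

SELECT `DR Number` AS id,
       `Date Occurred` AS date,
       `Time Occurred` AS time,
       `Area Name` AS area,
       `Victim Age` AS age,
       `Victim Sex` AS sex,
       Location AS location
FROM cos://us-geo/<your-bucket>/Traffic_Collision_Data.csv STORED AS CSV
-- Time Occurred is assumed to be in 24-hour HHMM form, so 1700-2000 covers 5pm-8pm
WHERE `Victim Age` BETWEEN 20 AND 35
  AND `Time Occurred` BETWEEN 1700 AND 2000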
Since SQL Query creates a table from CSV, Parquet, or JSON files behind the scenes, in the FROM statement you have to provide a table unique resource identifier in the form of:
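cos://&lt;bucket endpoint&gt;/&lt;bucket name&gt;/&lt;object name&gt;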
That identifier points to the file(s) you’re querying, similar to naming the table you’re querying in SQL. It comprises the bucket endpoint, the bucket name, and the object file name. SQL Query also accepts wildcards; so, for example, if you have lots of files in a bucket that share the same name and are distinguished by a month and year (e.g., example-032018), you could combine them all into a single table by selecting example-* as the object name.
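Assuming a us-geo endpoint and a bucket called mybucket (both placeholders), such a FROM clause might look like this:

SELECT * FROM cos://us-geo/mybucket/example-* STORED AS CSV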
Now, click Run to start the SQL query. Once it runs, two columns appear in the results panel. The Jobs column lists all the SQL queries that have completed or failed. To the right of that, you’ll have two tabs: Result and Query details. Result shows the result of the query as a table:
Query details will include a copy of the SQL query, the query status, the run time (or time that it took to get the result), the user who made the query, and the URL location of the results in the object storage bucket:
If you click on the square icon next to the Result Location URL, it will take you to the Cloud Object Storage bucket and provide you with the CSV file with the results, which you can download or use as the table unique resource identifier for another query.
That’s all you have to do to get started using SQL Query! Now that we have the basics of running a query, let’s look at how we can use SQL Query and PixieDust to create visualizations for the data we have in Cloud Object Storage.
Using SQL Query in Watson Studio with PixieDust
Querying data within the SQL Query UI is a great way to explore what your data looks like and to check whether your SQL query generates the right results. However, if you want to visualize the results, you’ll have to interact with SQL Query programmatically. One way to do this is to use a Jupyter Notebook to run the code for the SQL queries and PixieDust to visualize the results. In this part, we’ll use a Jupyter Notebook running Python 3.x and PixieDust to visualize data provided by SQL Query.
We’ll modify a version of the SQL query above to divide the Location column into longitude and latitude coordinates. Then, we’ll use PixieDust to visualize these points with Mapbox to see where most of the traffic accidents occur in Los Angeles. For this example, I am using a Jupyter Notebook on Watson Studio. A local Jupyter Notebook instance works as well, but I can’t guarantee you’ll have all the up-to-date packages that come with Watson Studio. If you want to follow along with the pre-built example, you can download the notebook on GitHub. PixieDust is an open-source extension to Jupyter Notebooks that adds a wide range of functionality for creating customized visualizations from your data sets with very little code. The Welcome to PixieDust notebook on Watson Studio is a great primer to help you become acquainted with the tool and its capabilities.
To interact with SQL Query, we’ll use ibmcloudsql, an open-source Python library that sends SQL queries to the SQL Query service and retrieves the results from your object storage bucket. A starter notebook for SQL Query is also available; it walks through getting started, setting up the authentication configuration you’ll need, and running PixieDust to create charts from the data in that notebook. When you open up a notebook, you’ll need to install and import PixieDust and ibmcloudsql. To install them, you can run the following in the first cell:
!pip install --upgrade ibmcloudsql   # Python client for SQL Query
!conda install --yes -c conda-forge pyarrow   # --yes skips the confirmation prompt
!pip install --upgrade pixiedust   # PixieDust for notebook visualizations
Now, import the libraries in the next cell:
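That cell just needs the two imports:

import ibmcloudsql
import pixiedust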
You’ll need to get an API key from IBM Cloud to have access to Cloud Object Storage and SQL Query. This is needed to write SQL Query results back to object storage. Go to Manage > Security > Platform API Keys at the top of your IBM Cloud account. Click the Create button and a window will appear asking you to enter a custom name for the API key:
Click Create at the bottom of that window. After that, you can choose to Download or Show the API key. You’ll need to save it so that you can use it in the notebook:
To save the Cloud API Key securely in the notebook, we’ll use the getpass module. Import the module and create a variable to store the API Key. When the cell runs, it will display a prompt where you can paste in your API Key and press Return:
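A minimal version of that cell, assuming we store the key in a variable named cloud_api_key (the name is arbitrary), looks like this:

from getpass import getpass

# Prompt for the IBM Cloud API Key instead of hard-coding it in the notebook
cloud_api_key = getpass('Enter your IBM Cloud API Key: ')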
The next requirement is the SQL Query service instance Cloud Resource Name (CRN). You can get that from the Manage panel on your SQL Query service page. Click the Instance CRN button under REST API, and it will copy the CRN for you:
In another notebook cell, create a variable and add your copied CRN:
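For example, with a hypothetical variable name sql_crn (paste your own CRN between the quotes):

# CRN copied from the SQL Query instance page (placeholder value below)
sql_crn = '<your-sql-query-instance-crn>'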
Now, all you need is the full URL of the generated object storage bucket that will store and retrieve the results of the queries. An easy way to do this is to go back to your Cloud Object Storage service and find the bucket that was created with your SQL Query service. Click on the kebab menu icon and select Bucket SQL URL:
A window will open up with the bucket’s URL. Copy that URL:
Create a variable in your notebook to store that URL, like the following:
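Something like this, where the endpoint and bucket name are placeholders you replace with the Bucket SQL URL you just copied (the variable name sql_cos_endpoint is arbitrary):

# Bucket SQL URL copied from Cloud Object Storage, plus a custom /accidents suffix
sql_cos_endpoint = 'cos://us-south/<your-bucket-name>/accidents'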
Note that I’ve added the suffix /accidents at the end of my URL. This lets us distinguish these queries from any others that might end up in this bucket; the /accidents suffix will become the prefix of the result files generated by SQL Query. With the API Key, CRN, and Cloud Object Storage endpoint added, all we need to do is run SQL Query. To do that, we’ll invoke the ibmcloudsql library’s
SQLQuery function and provide the API Key, CRN, and object storage endpoint:
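Here’s a sketch of that cell. It assumes the variables defined above and reuses a trimmed-down version of the collision query from the UI section; as before, the column names and bucket URL are assumptions you should adapt:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)
sqlClient.logon()

collision_sql = """
SELECT `DR Number` AS id, `Date Occurred` AS date, `Time Occurred` AS time,
       `Area Name` AS area, `Victim Age` AS age, `Victim Sex` AS sex, Location AS location
FROM cos://us-geo/<your-bucket>/Traffic_Collision_Data.csv STORED AS CSV
WHERE `Victim Age` BETWEEN 20 AND 35
  AND `Time Occurred` BETWEEN 1700 AND 2000
"""

# run_sql submits the job and returns the results as a pandas DataFrame
collisions_df = sqlClient.run_sql(collision_sql)
collisions_df.head()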
Just copy the code, point it at the object storage bucket that contains the CSV file, and run it. You’ll get the same results that you got in the UI, but as a data frame in the notebook:
Since we want to get the location of each accident, we’ll need to refine the data a little more so that the latitude and longitude coordinates are in separate columns, rather than grouped together in the location column. To do that, I’ve written a CTE (common table expression). Copy and paste the following code and run it in a cell:
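The following is a sketch of that cell; the regexp-based splitting of the Location column, the column names, and the bucket URL are assumptions you should adapt, and traffic_location is just a name chosen here for the resulting data frame:

coords_sql = """
WITH location AS (
    SELECT `DR Number` AS id,
           CAST(split(regexp_replace(Location, '[()]', ''), ',')[0] AS DOUBLE) AS latitude,
           CAST(split(regexp_replace(Location, '[()]', ''), ',')[1] AS DOUBLE) AS longitude
    FROM cos://us-geo/<your-bucket>/Traffic_Collision_Data.csv STORED AS CSV
)
SELECT d.`DR Number` AS id,
       d.`Date Occurred` AS date,
       d.`Time Occurred` AS time,
       d.`Area Name` AS area,
       d.`Victim Age` AS age,
       d.`Victim Sex` AS sex,
       l.latitude,
       l.longitude
FROM cos://us-geo/<your-bucket>/Traffic_Collision_Data.csv STORED AS CSV d
JOIN location l ON l.id = d.`DR Number`
WHERE d.`Victim Age` BETWEEN 20 AND 35
  AND d.`Time Occurred` BETWEEN 1700 AND 2000
  AND l.latitude != 0.0
  AND l.longitude != 0.0
"""

# Reuse the sqlClient created earlier; the result is another pandas DataFrame
traffic_location = sqlClient.run_sql(coords_sql)
traffic_location.head()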
The CTE splits the coordinates into two columns (latitude and longitude) and uses the accident ID for reference. Using a table called location that’s generated from the CTE, we can then join the CSV data and that table together on the accident ID. We’ll then filter by accident time (5pm–8pm) and victim age (20–35), as well as remove accidents that have 0.000 as coordinates. A sample of the results of that query should look like the following:
Now that we have the coordinates in separate columns, we can view the location of each accident on a map in another cell by using PixieDust. To do that, we’ll run PixieDust’s display() function with our query results data frame:
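Assuming the results from the previous cell are in the data frame named traffic_location (the hypothetical name used above), that cell is simply:

# PixieDust's display() renders an interactive table/chart widget for the data frame
display(traffic_location)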
By running this function, PixieDust will start and all of the results will show as a table: