May 18, 2020 By Henrik Loeser 3 min read

How advanced SQL helps to dig through the numbers.

All of us have been impacted by COVID-19 in one way or another. Looking at daily case numbers, the basic reproduction number, or mortality rates has become a routine. But what is behind those numbers? How are they computed and how does SQL help? 

In this post, I want to discuss how a few SQL analytics clauses can help to dig through the data. When I teach database systems, I always try to show how they apply to life. Here, SQL may not directly help to cure the disease, but it can help to compute and understand the statistics.

SQL for working with data

SQL, the Structured Query Language, is used as query language by most database systems (e.g., Db2, MySQL, Oracle, PostgreSQL, SQL Server, SQLite, etc.). Moreover, using programming libraries, SQL expressions can be applied to analyze in-memory data in apps or Jupyter notebooks, like in IBM Watson Studio. There are also services like the SQL Query service to analyze data stored in Cloud Object Storage (COS). To conclude, SQL is ubiquitous and, as we will see, versatile.

SQL analytics

Most of us know how to write a simple SELECT … FROM … WHERE statement. But how do you easily put data from different days side by side? How do you compute rolling averages to smooth out reported values (e.g., to account for “administrative weekends”)? How do you make sure that all countries are represented, even if they did not report at the latest available date (“MAX(date)”)?

The answer is to apply SQL window functions and to utilize common table expressions. They allow you to partition and order data, then apply additional functions on top. 

Suppose we have two tables: STATISTICS, which holds the daily case data (confirmed cases, deaths, the reporting date, etc.) from around the world, and DEMOGRAPHICS, that has country-specific data on population, area, population density, age groups, and more. The following SQL statements (among others) can be used to enrich and to analyze the data.

Delta from previous day

The following SQL statement uses LAG to access data from the previous day. The data is PARTITIONed BY country to refer to the same country and is sorted by the reporting date:

SELECT
   confirmed_cases - LAG(confirmed_cases,1) OVER (PARTITION BY country_id ORDER BY dt) AS confirmed_cases_delta,
   confirmed_cases,
   country_id,
   dt
FROM statistics

Smoothed rolling average

The SQL clause AVG() OVER with PRECEDING and FOLLOWING rows can be applied to smooth out reported numbers. This is necessary to account for “administrative weekends” with fewer reports or missing data which may be added later on:

SELECT
   AVG(confirmed_cases) OVER (PARTITION BY country_id ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed5,
   confirmed_cases,
   country_id,
   dt
FROM STATISTICS

Computed rolling average can be used in figures like this for Germany. Zooming in, you see daily numbers zig-zagging over smoothed averages:

Confirmed cases and deaths reported for Germany.

Latest available per country data

Not all countries may have reported on the latest overall date, so if you are using MAX(date), those countries could be missed. By utilizing ROW_NUMBER() to enumerate available data in a DESCending way by date and the selecting only the first row, the most recent available data is returned. 

WITH latest_deaths_data AS
   ( SELECT country_id,
            dt,
            deaths,
            confirmed_cases,
            ROW_NUMBER() OVER (PARTITION BY country_id ORDER BY dt DESC) as rn
     FROM STATISTICS)
SELECT country_id,
       dt,
       deaths,
       confirmed_cases,
       rn
FROM latest_deaths_data
WHERE rn=1

A query like the one above can be part of a more complex query to compute the mortality per 100,000 population as discussed by Johns Hopkins University. Taking into account all countries (not just those with the latest numbers from today), here is my generated table:

Conclusions

SQL is ubiquitous and quite versatile, and many of us know how to write simple SQL statements. More advanced SQL helps to analyze the daily COVID-19 data from around the world. I have shown few statements in the blog above. 

In addition, I have created a gist on GitHub with sample SQL statements to dig into COVID-19 data. Feel free to comment on GitHub or to add to it. The two screenshots in this post are based on data processed with such SQL statements.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn

Was this article helpful?
YesNo

More from Cloud

How digital solutions increase efficiency in warehouse management

3 min read - In the evolving landscape of modern business, the significance of robust maintenance, repair and operations (MRO) systems cannot be overstated. Efficient warehouse management helps businesses to operate seamlessly, ensure precision and drive productivity to new heights. In our increasingly digital world, bar coding stands out as a cornerstone technology, revolutionizing warehouses by enabling meticulous data tracking and streamlined workflows. With this knowledge, A3J Group is focused on using IBM® Maximo® Application Suite and the Red Hat® Marketplace to help bring…

How fintechs are helping banks accelerate innovation while navigating global regulations

4 min read - Financial institutions are partnering with technology firms—from cloud providers to fintechs—to adopt innovations that help them stay competitive, remain agile and improve the customer experience. However, the biggest hurdle to adopting new technologies is security and regulatory compliance. While third and fourth parties have the potential to introduce risk, they can also be the solution. As enterprises undergo their modernization journeys, fintechs are redefining digital transformation in ways that have never been seen before. This includes using hybrid cloud and…

IBM Cloud expands its VPC operations in Dallas, Texas

3 min read - Everything is bigger in Texas—including the IBM Cloud® Network footprint. Today, IBM Cloud opened its 10th data center in Dallas, Texas, in support of their virtual private cloud (VPC) operations. DAL14, the new addition, is the fourth availability zone in the IBM Cloud area of Dallas, Texas. It complements the existing setup, which includes two network points of presence (PoPs), one federal data center, and one single-zone region (SZR). The facility is designed to help customers use technology such as…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters