How advanced SQL helps to dig through the numbers.
All of us have been impacted by COVID-19 in one way or another. Looking at daily case numbers, the basic reproduction number, or mortality rates has become a routine. But what is behind those numbers? How are they computed and how does SQL help?
In this post, I want to discuss how a few SQL analytics clauses can help to dig through the data. When I teach database systems, I always try to show how they apply to life. Here, SQL may not directly help to cure the disease, but it can help to compute and understand the statistics.
SQL for working with data
SQL, the Structured Query Language, is used as query language by most database systems (e.g., Db2, MySQL, Oracle, PostgreSQL, SQL Server, SQLite, etc.). Moreover, using programming libraries, SQL expressions can be applied to analyze in-memory data in apps or Jupyter notebooks, like in IBM Watson Studio. There are also services like the SQL Query service to analyze data stored in Cloud Object Storage (COS). To conclude, SQL is ubiquitous and, as we will see, versatile.
SQL analytics
Most of us know how to write a simple SELECT … FROM … WHERE statement. But how do you easily put data from different days side by side? How do you compute rolling averages to smooth out reported values (e.g., to account for “administrative weekends”)? How do you make sure that all countries are represented, even if they did not report at the latest available date (“MAX(date)”)?
The answer is to apply SQL window functions and to utilize common table expressions. They allow you to partition and order data, then apply additional functions on top.
Suppose we have two tables: STATISTICS, which holds the daily case data (confirmed cases, deaths, the reporting date, etc.) from around the world, and DEMOGRAPHICS, that has country-specific data on population, area, population density, age groups, and more. The following SQL statements (among others) can be used to enrich and to analyze the data.
Delta from previous day
The following SQL statement uses LAG to access data from the previous day. The data is PARTITIONed BY country to refer to the same country and is sorted by the reporting date:
Smoothed rolling average
The SQL clause AVG() OVER with PRECEDING and FOLLOWING rows can be applied to smooth out reported numbers. This is necessary to account for “administrative weekends” with fewer reports or missing data which may be added later on:
Computed rolling average can be used in figures like this for Germany. Zooming in, you see daily numbers zig-zagging over smoothed averages:
Latest available per country data
Not all countries may have reported on the latest overall date, so if you are using MAX(date), those countries could be missed. By utilizing ROW_NUMBER() to enumerate available data in a DESCending way by date and the selecting only the first row, the most recent available data is returned.
A query like the one above can be part of a more complex query to compute the mortality per 100,000 population as discussed by Johns Hopkins University. Taking into account all countries (not just those with the latest numbers from today), here is my generated table:
Conclusions
SQL is ubiquitous and quite versatile, and many of us know how to write simple SQL statements. More advanced SQL helps to analyze the daily COVID-19 data from around the world. I have shown few statements in the blog above.
In addition, I have created a gist on GitHub with sample SQL statements to dig into COVID-19 data. Feel free to comment on GitHub or to add to it. The two screenshots in this post are based on data processed with such SQL statements.
If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.