September 6, 2018 | Written by: IBM Cloud Geospatial Team
Categorized: Data Science | Storage
Share this post:
Geospatial functions of IBM Cloud SQL Query: Part 1
This is the first in a series of articles explaining the key highlights of the geospatial functions of IBM Cloud SQL Query. We will cover the Full Earth feature of the geospatial functions in this article—i.e., every topological (contains, intersects, union, etc.) and metric (distance, area) function operating on the Full Earth or the ellipsoidal Earth model. These operations are performed without using projections, resulting in accurate answers and without loss of performance when compared to the traditional Flat Earth models.
SQL Query provides serverless, interactive querying for analyzing data in IBM Cloud Object Storage (COS). It allows users to query data—such as CSV, JSON, Parquet, and ORC—directly where it is stored. There’s no ETL, no databases, and no infrastructure to manage.
What are projections?
Projections are perhaps the most common term for anyone using geospatial libraries; for those who are not aware, a map projection is a systematic transformation of the latitudes and longitudes from the surface of a sphere or an ellipsoid onto xy coordinates on a plane.
The mathematics of projection do not permit any particular map projection to be “best” for everything. Depending on the nature of the use case, one will need to choose the right projection. Projections come in different types—area, shape, orientation (angles), distance, and scale preserving are some examples. Each of these projections, as the name suggests, preserves certain properties while distorting the others. For example, a shape-preserving projection will distort the distance (a classic example is that of why Greenland appears larger than Africa on a map). The choice of projection gets harder as the projections have to be parameterized correctly depending on the location in which one is operating. Further, certain regions on the Earth (e.g., anti-meridian, poles) will require specialized projections.
Why should I care as a geospatial developer?
As a geospatial developer, there are several libraries (e.g., Python, Java, C++) to choose from. These libraries fall into one of two categories: one that does not use projections and one that uses projections. In my experience, Google S2 is the only open source library that provides topological operations on the Spherical/Ellipsoidal Earth. Libraries such as ESRI Geometry API and GeoTools use projections with planar algorithms.
As a developer, when we make the choice of a planar library, we need to be acutely aware of the projection and the region of operation or else we run into errors, some which throw an exception and some that will silently return incorrect results. The correct choice of a projection requires a deep understanding of the GIS science. The interested reader can refer to the famous publication by John Snyder available at the USGS website. A very practical example of projections and their failures if used improperly (along with the correct choice) is described in this blog article.
A world without projections
So, why are there two types of libraries in the GIS world? Our answer to this question is that it is historical: topological operations date back to several decades (ESRI was founded in 1969), when computational power was not as abundant as today. Also, data was localized (e.g., water bodies in New York’s Westchester County, housing parcels in California’s Orange County). Geospatial analytics was the domain of few GIS experts. A planar library was sufficient to address the world’s problems then.
However, the same approach does not work well in the current time as data becomes increasingly global and readily available cloud resources are available to process such large global datasets. Furthermore, with the pervasive availability of location data, it is quite appealing for the non-GIS expert to consume such data, whether to do basic statistical analysis or to create new deep learning models.
Our approach to this problem has been to develop a library that leaves the legacy behind and addresses the needs of large-scale geospatial analytics and machine learning. It does not make use of projections and accurately performs all the operations on the Ellipsoidal Earth. The library is designed to provide performance guarantees on par with the planar geometries so that developers can economically process large datasets. The non-GIS developer can then realize their use cases without having to ever think of which projection is right or whether some calculation might inadvertently be performed near the poles or across the anti-meridian. They can focus on the solution instead of having to deal with such quirks.
IBM Cloud SQL Query using an example
SQL Query provides SQL/MM style geospatial functions using the Full Earth functionality. You can find the full documentation of the functions here.
Let us examine a query that was described in the blog. The question was to find the intersection of two polygons, and it so happens that these polygons are very close to the pole. For the sake of completeness, here are the two polygons in their WKT form:
POLYGON((170 50,170 72,-130 72,-130 50,170 50))
POLYGON((-170 68,-170 90,-141 90,-141 68,-170 68))
Let us run this query on the SQL Query UI as below:
ST_WKTToSQL(‘POLYGON((170 50,170 72,-130 72,-130 50,170 50))’),
ST_WKTToSQL(‘POLYGON((-170 68,-170 90,-141 90,-141 68,-170 68))’)
)) FROM cos://us-geo/sql/hospitals.parquet STORED AS PARQUET h
The result is a polygon, and the reader can use their imagination to check the result (plotting on Google maps will not be accurate as you will run into projection issues):
POLYGON ((-170.0 74.05379364533803, -170.0 68.0, -141.0 68.0, -141.0 73.42686213789044, -170.0 74.05379364533803))
In conclusion, this example highlights one of the key advantages provided by IBM Cloud SQL Query for performing geospatial analysis—a non-GIS expert developer can use familiar constructs and get the job done quickly and confidently.
You can get started with SQL Query today for free on the IBM Cloud. Simply deploy the “Lite” plan, which allows you to query up to 30 Gigabytes/day, or select the Standard plan for unlimited access at $5/Terabyte scanned per month.