In the last few years, the volume of remote sensing data has increased significantly.
This has primarily been driven by an increase in the number of satellite launches and drastic improvements in sensing technology (e.g., better cameras with hyper-spectral imaging, synthetic aperture radar, and lidar). With increased quality in resolution and spatiotemporal coverage, there is a demand for consuming insights derived in business-critical systems across a wide range of industries, including insurance, urban planning, agriculture, climate change, and flood prevention. This creates a need for a spatiotemporal technology that can seamlessly ingest, store, and query terabytes to petabytes of remote sensed data in a cost-efficient and developer-friendly manner.
A cost-efficient and developer-friendly solution
Although several IBM products (e.g., IBM Event Streams, IBM Db2, and IBM Db2 Warehouse) provide spatial and time series functionalities, the cloud poses a unique challenge due to the disaggregation of compute and (low-cost) storage layers. Typical cloud data lakes support indexing and queries on single-dimensional data. However, this falls short of supporting multi-dimensional data indexing and queries, as is the case with geospatial data.
Satellite data: Background
Remote sensing data comes in two flavors: (a) raster data and (b) vector data. Raster data is "image" and vector data is geometries (where each geometry — typically a point, line segment, or a polygon — has a set of values associated with it). An example raster image from the popular MODIS Terra product is shown below, which depicts the NDVI (Normalized Difference Vegetation Index, a simple graphical indicator for green vegetation) values for the New York region:
Raster data is consumed in the form of "images" in downstream applications. For example, a deep learning pipeline that extracts building footprints from RGB (visual) imagery from satellites.
Vector data is extremely useful when performing analytic queries on the geometries and their values. For example, aligning the MODIS Aqua (water) layer with the cloud cover layer, with the goal of enhancing data quality. Such operations require the ability to evaluate functions like interpolation and forecasting on these point geometries (pixel) values.
Both these queries are necessary to support the consumption of satellite data on a cloud platform.
We design a uniform serverless pipeline for ingesting, storing, and querying remote sensed satellite data using IBM Cloud. This approach utilizes the cheapest form of cloud storage and compute, while providing a simple developer experience to manage terabytes to petabytes of remote sensing data. The proposed architecture is depicted below:
Satellite data is a continuous data stream collected periodically through ground stations. After the signals are processed, they are typically in a standard scientific data format (e.g., HDF, netCDF, and GeoTIFF). As satellite imagery is global, they are usually in WGS84 coordinate system and timestamped. Storing this raw data in object storage is not amenable for fast and efficient querying. The data needs to be transformed into a queryable format, such as Parquet or cloud-optimized GeoTIFF.
Our approach is to use IBM Cloud Object Storage for storing the cloud-optimized GeoTIFF and Parquet data such that it is amenable to efficient queries downstream. The transformation from raw data formats to these big data formats is performed using IBM Cloud Code Engine to accommodate for elastic scaling as well as the periodic "spiky" activity.
IBM Cloud Code Engine is IBM's Knative serverless platform for executing massively parallel jobs. Once the data is transformed, a cloud function can be invoked to query raster imagery, given a region and time-range. This leverages the ability of Cloud Object Storage to do byte range reads and return the relevant data (on both cloud-optimized GeoTIFF and Parquet), thus reducing download costs as well as the query times.
Finally, in order to perform analytic queries like aligning the layers and aggregations, we use IBM Cloud SQL Query and its geospatial data skipping features. SQL Query is a serverless ETL tool for analytic queries on big data at rest. SQL Query supports geospatial and time series analytic queries natively and provides the ability to read only the relevant spatial and temporal data based on the query, which also reduces the download costs as well as the query times.
Below, we highlight code snippets that demonstrate the various steps of the above workflow.
Transform raw data to COG
Transform raw data to Parquet
Create spatial index on Parquet
When layers are naturally aligned, we can join multiple layers directly:
When layers are NOT naturally aligned, we can join multiple layers by utilizing geohash. The geohash bit depth gives us full flexibility on the spatial resolution that we want to align with:
In this post, we have covered how to efficiently ingest, query, and analyze remote sensing data in a cloud native way on IBM Cloud. Remote sensing data is playing a more and more important role in industries on business insights, and we believe leveraging cloud is the way of processing and analyzing remote sensing data, as we could easily ingest and query an arbitrary amount of data in a serverless way.
Ingesting and analyzing na arbitrary volume is crucial because remote sensing data can be extremely large and grow significantly on a daily basis, while querying in a serverless fashion is also vitally important; you pay only the data you need, no matter how much underlying data you have, which is very cost efficient (e.g., $5/TB for SQL Query).