IBM Cloud for Serverless Remote Sensing Data


In the last few years, the volume of remote sensing data has increased significantly.

This has primarily been driven by an increase in the number of satellite launches and drastic improvements in sensing technology (e.g., better cameras with hyperspectral imaging, synthetic aperture radar, and lidar). With improved resolution and spatiotemporal coverage comes demand to consume the derived insights in business-critical systems across a wide range of industries, including insurance, urban planning, agriculture, climate change, and flood prevention. This creates a need for a spatiotemporal technology that can seamlessly ingest, store, and query terabytes to petabytes of remotely sensed data in a cost-efficient and developer-friendly manner.

A cost-efficient and developer-friendly solution

Although several IBM products (e.g., IBM Event Streams, IBM Db2, and IBM Db2 Warehouse) provide spatial and time series functionality, the cloud poses a unique challenge due to the disaggregation of compute and (low-cost) storage layers. Typical cloud data lakes support indexing and queries on single-dimensional data, but this falls short for multi-dimensional data, such as geospatial data, which requires multi-dimensional indexing and queries.

Satellite data: Background

Remote sensing data comes in two flavors: (a) raster data and (b) vector data. Raster data is imagery, while vector data consists of geometries (typically points, line segments, or polygons), each with a set of associated values. An example raster image from the popular MODIS Terra product is shown below, which depicts the NDVI (Normalized Difference Vegetation Index, a simple graphical indicator for green vegetation) values for the New York region:


Raster data is consumed in the form of images in downstream applications; for example, a deep learning pipeline that extracts building footprints from RGB (visual) satellite imagery.

Vector data is extremely useful when performing analytic queries on the geometries and their values. For example, the MODIS Aqua (water) layer can be aligned with the cloud cover layer to enhance data quality. Such operations require the ability to evaluate functions like interpolation and forecasting on the values of these point geometries (pixels).
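
As a toy illustration of this kind of vector operation (using pandas, with made-up points and values), two point layers can be aligned on their shared coordinates and the cloudy gaps filled by interpolation:

import pandas as pd

# two hypothetical point layers sampled at the same pixel centers
water = pd.DataFrame({'lat': [40.0, 40.0, 40.1, 40.1],
                      'lon': [-74.0, -73.9, -74.0, -73.9],
                      'value': [0.82, 0.75, 0.61, 0.58]})
cloud = pd.DataFrame({'lat': [40.0, 40.0, 40.1, 40.1],
                      'lon': [-74.0, -73.9, -74.0, -73.9],
                      'cloud': [0, 1, 0, 0]})

# align the two layers on their shared point geometries
aligned = water.merge(cloud, on=['lat', 'lon'])

# mask out cloudy pixels, then fill the gaps by linear interpolation
aligned.loc[aligned['cloud'] == 1, 'value'] = None
aligned['value'] = aligned['value'].interpolate()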

Both types of queries are necessary to support the consumption of satellite data on a cloud platform.

Serverless architecture

We designed a uniform serverless pipeline for ingesting, storing, and querying remotely sensed satellite data using IBM Cloud. This approach utilizes the cheapest form of cloud storage and compute, while providing a simple developer experience for managing terabytes to petabytes of remote sensing data. The proposed architecture is depicted below:


Satellite data is a continuous data stream collected periodically through ground stations. After the signals are processed, they are typically stored in a standard scientific data format (e.g., HDF, netCDF, or GeoTIFF). Because satellite imagery is global, it is usually timestamped and in the WGS84 coordinate system. Storing this raw data directly in object storage is not amenable to fast and efficient querying; the data needs to be transformed into a queryable format, such as Parquet or cloud-optimized GeoTIFF.

Our approach is to use IBM Cloud Object Storage for storing the cloud-optimized GeoTIFF and Parquet data such that it is amenable to efficient queries downstream. The transformation from raw data formats to these big data formats is performed using IBM Cloud Code Engine, IBM's Knative-based serverless platform for executing massively parallel jobs, which accommodates elastic scaling as well as the periodic "spiky" activity.
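
As a minimal sketch of what such a transformation job could look like (the manifest and file names here are hypothetical; Code Engine injects a JOB_INDEX environment variable into each instance of an array job):

import os

# hypothetical manifest of raw scenes to convert; in practice this might be
# listed from a Cloud Object Storage bucket or passed in as job configuration
RAW_FILES = ['MOD13Q1.A2020161.h12v04.hdf', 'MOD13Q1.A2020161.h12v05.hdf']

# Code Engine sets JOB_INDEX for each instance of an array job, so every
# instance can pick up its own shard of the work
job_index = int(os.environ.get('JOB_INDEX', '0'))
raw_file = RAW_FILES[job_index]

# ...then run the transformation pipeline from the Examples section below on raw_file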

Once the data is transformed, a cloud function can be invoked to query raster imagery given a region and time range. This leverages the ability of Cloud Object Storage to serve byte-range reads and return only the relevant data (for both cloud-optimized GeoTIFF and Parquet), reducing download costs as well as query times.
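
For instance, here is a sketch of a byte-range read against Cloud Object Storage using the ibm_boto3 SDK (the bucket, key, credentials, and endpoint are placeholders):

import ibm_boto3
from ibm_botocore.client import Config

# placeholder credentials and endpoint for the COS service instance
cos = ibm_boto3.client('s3',
                       ibm_api_key_id='<api-key>',
                       ibm_service_instance_id='<service-instance-crn>',
                       config=Config(signature_version='oauth'),
                       endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud')

# fetch only the first 64 KB of a cloud-optimized GeoTIFF (enough for the
# header and overview metadata) instead of downloading the whole object
resp = cos.get_object(Bucket='<bucket>',
                      Key='layer=MOD13Q1_250m_16_days_NDVI/date=20200609/ndvi.tif',
                      Range='bytes=0-65535')
header_bytes = resp['Body'].read()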

Finally, in order to perform analytic queries like layer alignment and aggregations, we use IBM Cloud SQL Query and its geospatial data skipping features. SQL Query is a serverless ETL tool for analytic queries on big data at rest. It natively supports geospatial and time series analytic queries and reads only the spatial and temporal data relevant to a query, which likewise reduces download costs and query times.

Examples

Below, we highlight code snippets that demonstrate the various steps of the above workflow.

Transform raw data to COG

# import libraries (reproject is assumed to live in pyrip.transform alongside hdf_to_tif)
from pyrip.transform import hdf_to_tif, reproject
from pyrip.cogeo import optimize

# various raster formats (GeoTIFF, HDF, netCDF, SAFE, etc.) are supported;
# raw files are converted to GeoTIFF first (raw_file is the path to a downloaded raw scene)
raw_tif_files = hdf_to_tif(raw_file, match_substrs=['NDVI', 'EVI'])
for raw_tif_file in raw_tif_files:
    # reproject image to EPSG:4326 (WGS84)
    reprojected_tif_file = reproject(raw_tif_file)
    # optimize the GeoTIFF into a cloud-optimized GeoTIFF
    optimized_tif_file = optimize(reprojected_tif_file)
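
To make the optimized files addressable by the imagery query shown later, they can then be uploaded to Cloud Object Storage under a layer=/date= key layout (a sketch with placeholder bucket, prefix, and credentials, reusing an ibm_boto3 client configured as in the byte-range example above):

import ibm_boto3
from ibm_botocore.client import Config

# COS client configured as in the byte-range sketch above (placeholders)
cos = ibm_boto3.client('s3',
                       ibm_api_key_id='<api-key>',
                       ibm_service_instance_id='<service-instance-crn>',
                       config=Config(signature_version='oauth'),
                       endpoint_url='<cos-endpoint>')

# the key layout mirrors the layer=/date= scheme the imagery query expects
cos.upload_file(optimized_tif_file, '<bucket>',
                'myprefix/layer=MOD13Q1_250m_16_days_NDVI/date=20200609/ndvi.tif')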

Transform raw data to Parquet

# import libraries
from pyrip.transform import tif_to_df

# easy transformation between raster and vector formats using tif_to_df and df_to_tif
df = tif_to_df(reprojected_tif_file)

# save to a Parquet file
df.to_parquet(parquet_file, engine='pyarrow', compression='snappy', index=False)
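
Optionally, many such DataFrames can be written as a single Parquet dataset partitioned by layer and date, so that downstream SQL queries only read the partitions they need (the layer and date values below are hypothetical):

# tag the rows with their layer and acquisition date (hypothetical values)
df['layer'] = 'MOD13Q1_250m_16_days_NDVI'
df['date'] = '20200609'

# write a Hive-style partitioned Parquet dataset (layer=.../date=.../part-*.parquet)
df.to_parquet('ndvi_dataset', engine='pyarrow', compression='snappy',
              index=False, partition_cols=['layer', 'date'])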

Create spatial index on Parquet

CREATE METAINDEX
MINMAX FOR lat,
MINMAX FOR lon
ON TABLE ndvi

Imagery query

# import libraries (query is assumed to be importable from the pyrip package)
from pyrip import query

# specify layer, date (or date range), and region of interest
layer = 'MOD13Q1_250m_16_days_NDVI'
date = '20200609'
bbox = (-74.4, 40.6, -73.6, 41.4)
result_tif = 'result.tif'

# build the GDAL /vsis3/ path to the VRT for this layer/date partition
vrt_file = '/vsis3/{}/{}/layer={}/date={}/index.vrt'.format(bucket, prefix, layer, date)
query(vrt_file, result_tif, bbox, access_key_id=access_key_id, secret_access_key=secret_access_key, endpoint=endpoint)

Analytical query

When layers are naturally aligned, we can join multiple layers directly:

# specify date range and region of interest
start_date = '20180915'
end_date = '20180930'
bbox = (-112, 41.5, -111, 42.5)

# this query can be run with either Spark SQL or IBM Cloud SQL Query
query = """
SELECT intr.lat, intr.lon, intr.date, 
CASE
    WHEN mask.value = 0 THEN intr.value
    WHEN mask.value > 7 THEN 0
    ELSE 9
END AS value
FROM 
    (SELECT *
    FROM landsat
    WHERE layer = 'INTR' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) intr, 
    (SELECT *
    FROM landsat
    WHERE layer = 'MASK' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) mask 
WHERE intr.lat = mask.lat AND intr.lon = mask.lon AND intr.date = mask.date
""".format(start_date, end_date, *bbox)

When layers are NOT naturally aligned, we can join multiple layers by utilizing geohashes. The geohash bit depth gives us full flexibility over the spatial resolution at which the layers are aligned; a quick calculation of the cell size implied by a given bit depth is shown below, followed by the alignment query itself.
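
As a sanity check on what a given bit depth means spatially (assuming the standard geohash bit interleaving, where longitude takes the first bit), the implied cell size can be computed as follows:

def geohash_cell_degrees(bit_depth):
    # longitude and latitude bits alternate, longitude first, so an n-bit
    # prefix has ceil(n/2) longitude bits and floor(n/2) latitude bits
    lon_bits = (bit_depth + 1) // 2
    lat_bits = bit_depth // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits

# bit_depth = 35 -> roughly 0.0014 x 0.0014 degrees, i.e., ~150 m at the equator
print(geohash_cell_degrees(35))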

# specify geohash bit depth, which controls the spatial resolution of layer merging
bit_depth = 35

# this query can be run with either Spark SQL or IBM Cloud SQL Query
query = """
SELECT
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lat,
ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lon,
CASE
    WHEN scl.is_vegetation THEN (nir.value - red.value) / (nir.value + red.value)
    ELSE -1
END AS ndvi
FROM
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, array_contains(collect_list(value), 4) as is_vegetation
    FROM scl
    GROUP BY substr(geohash_binary_str, 0, {0})) scl,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM nir
    GROUP BY substr(geohash_binary_str, 0, {0})) nir,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM red
    GROUP BY substr(geohash_binary_str, 0, {0})) red
WHERE scl.geohash = nir.geohash AND nir.geohash = red.geohash
""".format(bit_depth)

Conclusion

In this post, we covered how to efficiently ingest, query, and analyze remote sensing data in a cloud-native way on IBM Cloud. Remote sensing data plays an increasingly important role in deriving business insights across industries, and we believe the cloud is the natural place to process and analyze it, since an arbitrary amount of data can be ingested and queried in a serverless way.

Ingesting and analyzing an arbitrary volume is crucial because remote sensing data can be extremely large and grows significantly on a daily basis. Querying in a serverless fashion is also vitally important: you pay only for the data you actually query, no matter how much underlying data you have, which is very cost efficient (e.g., $5/TB for SQL Query).
