In the last few years, the volume of remote sensing data has increased significantly. This has primarily been driven by an increase in the number of satellite launches and drastic improvements in sensing technology (e.g., better cameras with hyperspectral imaging, synthetic aperture radar, and lidar). With this increase in resolution and spatiotemporal coverage comes a demand to consume insights derived from the data in business-critical systems across a wide range of industries, including insurance, urban planning, agriculture, climate change, and flood prevention. This creates a need for a spatiotemporal technology that can seamlessly ingest, store, and query terabytes to petabytes of remotely sensed data in a cost-efficient and developer-friendly manner.

A cost-efficient and developer-friendly solution

Although several IBM products (e.g., IBM Event Streams, IBM Db2, and IBM Db2 Warehouse) provide spatial and time series functionality, the cloud poses a unique challenge due to the disaggregation of the compute and (low-cost) storage layers. Typical cloud data lakes support indexing and queries on single-dimensional data, which falls short for multi-dimensional data such as geospatial data.

Satellite data: Background

Remote sensing data comes in two flavors: (a) raster data and (b) vector data. Raster data is imagery, while vector data is a collection of geometries (where each geometry, typically a point, line segment, or polygon, has a set of values associated with it). An example raster image from the popular MODIS Terra product is shown below; it depicts NDVI (Normalized Difference Vegetation Index, a simple graphical indicator of green vegetation) values for the New York region:

Raster data is consumed in the form of images in downstream applications; for example, a deep learning pipeline might extract building footprints from RGB (visual) satellite imagery.

Vector data is extremely useful when performing analytic queries on the geometries and their values. For example, the MODIS Aqua (water) layer can be aligned with the cloud cover layer to enhance data quality. Such operations require the ability to evaluate functions like interpolation and forecasting on the values of these point geometries (pixels).
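To make the vector representation concrete, here is a minimal, self-contained sketch (using pandas with made-up values; not part of the pipeline described later) that treats pixel values as point geometries and interpolates a cloud-masked observation in time:

import pandas as pd
import numpy as np

# made-up example: one pixel (point geometry) observed on three dates,
# with the middle observation masked out by cloud cover (NaN)
df = pd.DataFrame({
    'lat': [40.7, 40.7, 40.7],
    'lon': [-74.0, -74.0, -74.0],
    'date': pd.to_datetime(['2020-06-01', '2020-06-09', '2020-06-17']),
    'value': [0.52, np.nan, 0.61],
})

# fill the missing NDVI value by interpolating over time
df = df.sort_values('date')
df['value'] = df.set_index('date')['value'].interpolate(method='time').values
print(df)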

Both kinds of queries are necessary to support the consumption of satellite data on a cloud platform.

Serverless architecture

We designed a uniform serverless pipeline for ingesting, storing, and querying remotely sensed satellite data using IBM Cloud. This approach utilizes the cheapest forms of cloud storage and compute, while providing a simple developer experience for managing terabytes to petabytes of remote sensing data. The proposed architecture is depicted below:

Satellite data is a continuous data stream collected periodically through ground stations. After the signals are processed, the data is typically in a standard scientific format (e.g., HDF, netCDF, or GeoTIFF). Because satellite imagery is global, it is usually in the WGS84 coordinate system and timestamped. Storing this raw data in object storage as-is is not amenable to fast and efficient querying; the data needs to be transformed into a queryable format, such as Parquet or cloud-optimized GeoTIFF.

Our approach is to use IBM Cloud Object Storage to store the cloud-optimized GeoTIFF and Parquet data so that it is amenable to efficient queries downstream. The transformation from raw data formats to these big data formats is performed using IBM Cloud Code Engine to accommodate elastic scaling as well as the periodic, “spiky” nature of the ingestion workload.

IBM Cloud Code Engine is IBM’s Knative-based serverless platform for executing massively parallel jobs. Once the data is transformed, a cloud function can be invoked to query raster imagery, given a region and time range. This leverages the ability of Cloud Object Storage to do byte-range reads and return only the relevant data (for both cloud-optimized GeoTIFF and Parquet), thus reducing download costs as well as query times.
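To illustrate what a byte-range read looks like, here is a minimal sketch using the ibm_boto3 (IBM COS SDK) client; the bucket, object key, byte range, and HMAC credentials below are placeholders, and the /vsis3/ access used by the imagery query later in this post relies on the same capability:

import ibm_boto3

# placeholders: HMAC credentials and endpoint of your Cloud Object Storage instance
cos = ibm_boto3.client(
    's3',
    aws_access_key_id='<ACCESS_KEY_ID>',
    aws_secret_access_key='<SECRET_ACCESS_KEY>',
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud',
)

# read only the first 64 KB of a cloud-optimized GeoTIFF instead of downloading the whole object
resp = cos.get_object(
    Bucket='<bucket>',
    Key='layer=MOD13Q1_250m_16_days_NDVI/date=20200609/part-0001.tif',
    Range='bytes=0-65535',
)
header_bytes = resp['Body'].read()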

Finally, in order to perform analytic queries like layer alignment and aggregations, we use IBM Cloud SQL Query and its geospatial data skipping features. SQL Query is a serverless ETL tool for analytic queries on big data at rest. It supports geospatial and time series analytic queries natively and can read only the spatial and temporal data relevant to a query, which likewise reduces download costs and query times.

Examples

Below, we highlight code snippets that demonstrate the various steps of the above workflow.

Transform raw data to COG

# import libraries (reproject is assumed to live in pyrip.transform alongside hdf_to_tif)
from pyrip.transform import hdf_to_tif, reproject
from pyrip.cogeo import optimize
# various raster formats (GeoTIFF, HDF, netCDF, SAFE, etc.) are supported; raw files are converted to GeoTIFF first
raw_tif_files = hdf_to_tif(raw_file, match_substrs=['NDVI', 'EVI'])
for raw_tif_file in raw_tif_files:
    # reproject image to EPSG:4326 (WGS84)
    reprojected_tif_file = reproject(raw_tif_file)
    # optimize GeoTIFF to Cloud Optimized GeoTIFF (COG)
    optimized_tif_file = optimize(reprojected_tif_file)
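The optimized files then need to land in Cloud Object Storage so that the imagery query below can find them. Here is a minimal upload sketch, assuming ibm_boto3 with HMAC credentials; the bucket name is a placeholder, and the layer=/date= prefix mirrors the path layout used by the imagery query:

import os
import ibm_boto3

# placeholders: HMAC credentials and endpoint for your Cloud Object Storage bucket
cos = ibm_boto3.client(
    's3',
    aws_access_key_id='<ACCESS_KEY_ID>',
    aws_secret_access_key='<SECRET_ACCESS_KEY>',
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud',
)

# store each COG under a layer=/date= prefix so the imagery query can address it by layer and date
key = 'layer={}/date={}/{}'.format('MOD13Q1_250m_16_days_NDVI', '20200609', os.path.basename(optimized_tif_file))
cos.upload_file(optimized_tif_file, '<bucket>', key)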

Transform raw data to Parquet

# import libraries
from pyrip.transform import tif_to_df
# easy transformation between raster and vector formats using tif_to_df and df_to_tif;
# each row of the resulting DataFrame is one pixel with its lat, lon, and value
df = tif_to_df(reprojected_tif_file)
# save to a Parquet file
df.to_parquet(parquet_file, engine='pyarrow', compression='snappy', index=False)

Create spatial index on Parquet

CREATE METAINDEX
MINMAX FOR lat,
MINMAX FOR lon
ON TABLE ndvi
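To show how such a table is then queried serverlessly, here is a minimal sketch using the ibmcloudsql Python SDK; the API key, instance CRN, and result location below are placeholders, and exact SDK parameters may differ across versions. With the MINMAX metaindex in place, only the Parquet objects whose lat/lon ranges overlap the filter are read:

import ibmcloudsql

# placeholders: IAM API key, SQL Query instance CRN, and a COS location for query results
api_key = '<IBM_CLOUD_API_KEY>'
instance_crn = '<SQL_QUERY_INSTANCE_CRN>'
target_cos_url = 'cos://us-south/<results-bucket>/'

sql_client = ibmcloudsql.SQLQuery(api_key, instance_crn, target_cos_url)
sql_client.logon()

# simple spatial filter: geospatial data skipping prunes objects outside the bounding box
df = sql_client.run_sql("""
SELECT lat, lon, value
FROM ndvi
WHERE lon > -74.4 AND lat > 40.6 AND lon < -73.6 AND lat < 41.4
""")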

Imagery query

# specify layer, date (or date range), and region of interest (lon/lat bounding box)
layer = 'MOD13Q1_250m_16_days_NDVI'
date = '20200609'
bbox = (-74.4, 40.6, -73.6, 41.4)
result_tif = 'result.tif'

# the VRT file indexes the cloud-optimized GeoTIFF tiles for this layer and date in Cloud Object Storage;
# query (assumed to be imported from pyrip) reads only the byte ranges covering the bounding box
vrt_file = '/vsis3/{}/{}/layer={}/date={}/index.vrt'.format(bucket, prefix, layer, date)
query(vrt_file, result_tif, bbox, access_key_id=access_key_id, secret_access_key=secret_access_key, endpoint=endpoint)

Analytical query

When layers are naturally aligned, we can join multiple layers directly:

# specify date range and region of interest
start_date = '20180915'
end_date = '20180930'
bbox = (-112, 41.5, -111, 42.5)

# this query can be run either with Spark SQL or IBM Cloud SQL Query
query = """
SELECT intr.lat, intr.lon, intr.date, 
CASE
    WHEN mask.value == 0 THEN intr.value
    WHEN mask.value > 7 THEN 0
    ELSE 9
END AS value
FROM 
    (SELECT *
    FROM landsat
    WHERE layer = 'INTR' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) intr, 
    (SELECT *
    FROM landsat
    WHERE layer = 'MASK' AND date >= {0} AND date <= {1} AND lon > {2} AND lat > {3} AND lon < {4} AND lat < {5}) mask 
WHERE intr.lat = mask.lat AND intr.lon = mask.lon AND intr.date = mask.date
""".format(start_date, end_date, *bbox)

When layers are NOT naturally aligned, we can join multiple layers by utilizing geohashes. The geohash bit depth gives us full flexibility over the spatial resolution at which the layers are aligned (for example, a bit depth of 35 corresponds to cells of roughly 150 m at the equator):

# specify geohash bit depth, which controls the spatial resolution of layer merging
bit_depth = 35

# this query can be run either with Spark SQL or IBM Cloud SQL Query
query = """
SELECT
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lat,
ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(scl.geohash)))) AS lon,
CASE
    WHEN scl.is_vegetation THEN (nir.value - red.value) / (nir.value + red.value)
    ELSE -1
END AS ndvi
FROM
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, array_contains(collect_list(value), 4) as is_vegetation
    FROM scl
    GROUP BY substr(geohash_binary_str, 0, {0})) scl,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM nir
    GROUP BY substr(geohash_binary_str, 0, {0})) nir,
    
    (SELECT substr(geohash_binary_str, 0, {0}) as geohash, avg(value) as value
    FROM red
    GROUP BY substr(geohash_binary_str, 0, {0})) red
WHERE scl.geohash = nir.geohash AND nir.geohash = red.geohash
""".format(bit_depth)

Conclusion

In this post, we have covered how to efficiently ingest, query, and analyze remote sensing data in a cloud-native way on IBM Cloud. Remote sensing data plays an increasingly important role in deriving business insights across industries, and we believe the cloud is the right place to process and analyze it, as it lets us ingest and query an arbitrary amount of data in a serverless way.

Ingesting and analyzing an arbitrary volume is crucial because remote sensing data can be extremely large and grows significantly on a daily basis. Querying in a serverless fashion is also vitally important: you pay only for the data you need, no matter how much underlying data you have, which is very cost-efficient (e.g., $5/TB for SQL Query).
