A look at IBM's approach to a challenge on using satellite imagery to identify buildings and rate the amount of damage they suffered after a natural disaster.
The xView2 Data Challenge was hosted by the Defense Innovation Unit (DIU) with the goal of identifying buildings and rating the amount of damage they sustained from a natural disaster by using satellite images that were taken before and then after the disaster.
Currently, after a natural disaster, satellite and aerial images are annotated for building damage manually in a time-intensive and laborious process that takes weeks. Using computer vision and machine learning algorithms offers the possibility of cutting this process down—possibly from taking weeks to taking hours—thus greatly expediting recovery responses.
The data used for the challenge was the xBD dataset. The xBD dataset contains 850,736 annotated buildings and spans 45,362 square km of satellite imagery. For training models, there are 9,168 pre-disaster/post-disaster 1024x1024 high-resolution color images.
The images capture 19 natural disasters of 5 different types from all over the world.
In the post-disaster images, each building is assigned a class label based on how badly it was damaged.
One of the reasons this challenge was difficult was the high amount of variability in the data. The dataset, in general, is highly biased towards the "no damage" class, as can be seen here.
The number of images that contain at least one damaged building also varies greatly from one natural disaster to another.
Another area of variance in the dataset is the number of buildings affected by each disaster.
Further adding to the difficulty of the task, the visual differences between damage levels (say, between minor damage and major damage) can be quite subtle, making it difficult for models to distinguish between the two.
For each pre- and post-disaster image pair, we had to produce two resulting PNG files. The first PNG file needed to contain the localization predictions (i.e., where the buildings are in the image) by having a 1 in a pixel if there is a building in the corresponding pre-disaster image and a 0 if there is no building in that pixel.
The second PNG file needed to contain the damage classification predictions, where each pixel has an integer value between 0 and 4, reflecting the damage level prediction for the corresponding pixel in the post-disaster image.
The evaluation metric for the challenge was a weighted F1-score of the localization and damage classification predictions.
The F1 score measures a balance between the model's precision and its recall. This was a more appropriate evaluation metric than, say, just the accuracy since a model that predicted "no building" at every pixel would still be around 80% accurate.
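To make the accuracy-versus-F1 point concrete, here is a minimal sketch in plain Python. The toy mask (80 background pixels and 20 building pixels) and the helper names are illustrative assumptions, not the challenge's actual scoring code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (1 = building)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical flattened mask: 80% background, 20% building.
y_true = [0] * 80 + [1] * 20
# A degenerate model that predicts "no building" everywhere.
all_background = [0] * 100

acc = accuracy(y_true, all_background)
_, _, f1 = precision_recall_f1(y_true, all_background)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The degenerate model looks strong on accuracy but finds no buildings at all, which the F1 score exposes.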
All of our analysis, model training, and model selection were performed on an IBM Cloud Bare Metal Server with 16 CPU cores, 128GB of RAM, and 2 NVIDIA Tesla V100 GPUs. For this challenge, we utilized the 2.0 release of the TensorFlow machine learning framework, which had only just been released when we began our work. The major differences we noticed in TensorFlow 2.0 versus previous versions are as follows:
- Eager execution is enabled by default. This provides for a more intuitive workflow that is easier to debug and reason about.
- Keras is the preferred method for defining models. Keras is a high-level, declarative DSL for defining models that allows for easy and fast prototyping.
- tf.data is the recommended path for building data pipelines. Using the tf.data API lets TensorFlow optimize data pipelines to improve training performance.
Our main algorithm of choice during the competition was the U-Net convolutional neural network (CNN) architecture, which is a popular algorithm for semantic image segmentation.
The U-Net architecture builds on the fully convolutional network by having a series of contracting layers that capture context followed by a symmetric group of expanding layers that allow the model to learn accurate segmentation boundaries. The contracting layers are called the down-sampler or encoder, and the expanding layers are called the up-sampler or decoder.
Our initial baseline approach was to try to do both the building localization and damage classification tasks in a single U-Net model. Not too surprisingly, this approach did not perform well, because the model had to express both the coarse-grained task of separating buildings from the background and the finer-grained task of rating the building damage. Thus, we decided to split the problem into the two subproblems of localization and damage classification.
For the localization problem, our goal was to classify each pixel in the pre-disaster image as either "building" or "no building."
First, we used the tf.image libraries to create a data pipeline to load each pair of pre-/post-disaster images as a pair of tensors, each with dimension (1024, 1024, 3), that contained the RGB values for each pixel in the image.
Next, in our pipeline, we concatenated the two tensors into one tensor of dimension (1024, 1024, 6), with the idea being that even though the post-disaster image can be dramatically different from the pre-disaster image, it still adds information that helps decide what is a building and what is not. We then applied several data augmentation techniques, such as rotations and reflections, at random, which allowed us to expand the number of training examples we had for each epoch.
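The concatenation and augmentation steps can be sketched as follows. This is a minimal illustration using NumPy rather than the tf.image pipeline we actually used. The function names are hypothetical, the sketch limits rotations to multiples of 90 degrees, and it assumes the label mask must receive the same transform as the image so the two stay aligned.

```python
import numpy as np


def stack_pair(pre, post):
    """Concatenate pre- and post-disaster RGB arrays along the channel axis."""
    return np.concatenate([pre, post], axis=-1)


def random_augment(image, mask, rng):
    """Apply one random 90-degree rotation and an optional horizontal
    reflection, using the same transform for the image and its label mask."""
    k = int(rng.integers(0, 4))  # number of quarter turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)
        mask = np.flip(mask, axis=1)
    return image, mask


rng = np.random.default_rng(0)
# Random stand-ins for a real image pair and building mask.
pre = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
post = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
mask = rng.integers(0, 2, size=(1024, 1024), dtype=np.uint8)

stacked = stack_pair(pre, post)
aug_image, aug_mask = random_augment(stacked, mask, rng)
print(stacked.shape, aug_image.shape, aug_mask.shape)
```

Because a new random transform is drawn each time an image is read, every epoch sees a slightly different version of the same training pair.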
The localization model was a single U-Net model set up to do binary semantic image segmentation. The optimal model had nine down-sampling layers and nine up-sampling layers. The model was trained using cross entropy as the loss function, and training took approximately five days to complete.
For the damage classification problem, we once again utilized the tf.image libraries to create tensors of dimension (1024, 1024, 6), representing the concatenated pre-/post-disaster images. Then, we used the output of the localization model to mask the tensor so that all non-building pixels had a zero value. Next, the tensors were randomly cropped to be of dimension (256, 256, 6), and only crops that contained at least 20% of non-zero pixels were used in training.
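The masking and crop-filtering steps might look roughly like the following NumPy sketch. The helper names are hypothetical, and the sketch assumes that a pixel counts as non-zero when any of its six channels is non-zero, which is one possible reading of the 20% rule.

```python
import numpy as np


def mask_non_buildings(stacked, building_mask):
    """Zero out every pixel that the localization mask marks as non-building."""
    return stacked * building_mask[..., np.newaxis]


def random_crop(image, size, rng):
    """Take a random size x size spatial crop, keeping all channels."""
    top = int(rng.integers(0, image.shape[0] - size + 1))
    left = int(rng.integers(0, image.shape[1] - size + 1))
    return image[top:top + size, left:left + size]


def has_enough_signal(crop, min_fraction=0.2):
    """True if at least min_fraction of the pixels have a non-zero channel."""
    nonzero_pixels = np.any(crop != 0, axis=-1)
    return bool(nonzero_pixels.mean() >= min_fraction)


rng = np.random.default_rng(0)
# Toy inputs: a constant image whose top half is marked as "building".
stacked = np.ones((1024, 1024, 6), dtype=np.uint8)
building_mask = np.zeros((1024, 1024), dtype=np.uint8)
building_mask[:512, :] = 1

masked = mask_non_buildings(stacked, building_mask)
candidates = [random_crop(masked, 256, rng) for _ in range(20)]
kept = [crop for crop in candidates if has_enough_signal(crop)]
print(f"kept {len(kept)} of {len(candidates)} crops")
```

Filtering out mostly empty crops keeps the classifier from spending most of its training time on masked-out background.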
The damage classes are ordered by severity, so a model should be penalized more when the predicted class is further from the actual class. As an example, a prediction of "minor damage" when the ground truth class is "destroyed" should be penalized more than a prediction of "major damage."
This is an example of an ordinal regression problem, and we explored several techniques for using neural networks to solve it. Our optimal solution ended up using an ensemble of U-Net models to perform the task.
To do this, we trained three U-Net binary classifiers such that the first model predicts the probability that the class label is greater than 1 (i.e., P(class>1)), the second model predicts P(class>2), and the third model predicts P(class>3). Instead of encoding the class labels using the usual one-hot encoder, we encoded the target values into vectors according to the following ordinal scheme:
The output layer of the ensemble uses a sigmoid activation function to produce a vector of length 4, and in order to make a prediction from this vector, we scan the values and stop when the value is below a threshold (0.5 in our case) or there are no more values in the vector. The index i of the last value that is bigger than the threshold is the predicted damage class.
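As an illustration of the thresholded scan, here is a minimal sketch in plain Python. It assumes damage classes are numbered 1 through 4 and that there is one output per binary classifier, ordered as P(class>1), P(class>2), P(class>3). The encoding shown is a common cumulative ordinal scheme and may not match our exact vectors, and the function names are hypothetical.

```python
def encode_ordinal(label, num_classes=4):
    """Encode class k (1..num_classes) as [k>1, k>2, ..., k>num_classes-1].

    Assumed cumulative scheme: 1 -> [0, 0, 0], 2 -> [1, 0, 0],
    3 -> [1, 1, 0], 4 -> [1, 1, 1].
    """
    return [1 if label > t else 0 for t in range(1, num_classes)]


def decode_ordinal(probs, threshold=0.5):
    """Scan the probabilities in order and stop at the first value below
    the threshold; each value that clears it raises the class by one."""
    label = 1
    for p in probs:
        if p < threshold:
            break
        label += 1
    return label


for k in range(1, 5):
    print(k, encode_ordinal(k))

# Hypothetical sigmoid outputs for a single building pixel.
predicted = decode_ordinal([0.9, 0.7, 0.2])
print(predicted)
```

Stopping at the first value below the threshold means that a later high value, such as in [0.2, 0.9, 0.9], does not raise the predicted class.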
As noted previously, the metric used for this competition was a combination of the localization and damage classification F1 scores. Our best submission received the following F1 scores on the competition validation dataset:
When looking at our internal testing set, the majority of our localization errors came from images that had areas of high building density. On the damage classification problem, our U-Net ensemble model had the most difficulty in distinguishing minor damage from major damage.