The primary difference between instance segmentation and conventional object detection is that instance segmentation predicts the pixel-level boundaries of each object, while object detection predicts only an object’s approximate location.
Conventional object detection methods are an evolved combination of image classification and object localization. An object detection model is trained, with various machine learning algorithms, to recognize the visual patterns of relevant categories of objects—for example, an autonomous driving model might be trained to recognize things like “car” or “pedestrian.” Given an input image, the model analyzes its visual data to identify any relevant object instances and generates rectangular regions, called “bounding boxes,” in which each instance is located.
Instance segmentation systems likewise detect objects in an image, but in much greater detail: rather than a bounding box that approximates the location of an object instance, instance segmentation algorithms generate a pixel-by-pixel “segmentation mask” of the precise shape and area of each instance.
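The relationship between the two outputs can be made concrete: a bounding box is fully determined by a segmentation mask, but not vice versa. The short NumPy sketch below (illustrative only; the function name `mask_to_bbox` is my own, not from any particular library) derives the tightest box from a binary mask and shows how much area the box over-covers for a non-rectangular object.

```python
import numpy as np

def mask_to_bbox(mask):
    """Derive the tightest bounding box (x_min, y_min, x_max, y_max)
    from a binary segmentation mask (H x W array of 0s and 1s)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A 6x6 mask containing an L-shaped object: the mask records the exact
# pixels the object occupies, while the box only records the rectangle
# enclosing them.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[1:5, 2] = 1   # vertical stroke of the "L"
mask[4, 2:5] = 1   # horizontal stroke of the "L"

box = mask_to_bbox(mask)
print(box)              # (2, 1, 4, 4)
print(int(mask.sum()))  # 6 object pixels, vs 12 pixels inside the box
```

The mask covers 6 pixels here, while its bounding box spans 12; everything the box adds is background, which is exactly the imprecision instance segmentation removes.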
Many leading instance segmentation model architectures, like Mask R-CNN, perform conventional object detection as a preliminary step in the process of generating segmentation masks. Such “two-stage” models typically provide state-of-the-art accuracy, albeit with a tradeoff in speed.
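The two-stage pattern can be sketched in miniature. The toy pipeline below is not Mask R-CNN (which uses region proposal networks and learned mask heads); it is a deliberately simplified stand-in, with hypothetical function names, showing the structural idea: stage one localizes the object coarsely with a box, and stage two predicts a per-pixel mask only within that box.

```python
import numpy as np

def detect_box(image, threshold=0.5):
    """Stage 1 (stand-in for a detector): return a coarse bounding box
    (x_min, y_min, x_max, y_max) around above-threshold pixels."""
    ys, xs = np.nonzero(image > threshold)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def segment_in_box(image, box, threshold=0.5):
    """Stage 2 (stand-in for a mask head): predict a per-pixel mask,
    restricted to the region the detector proposed."""
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape, dtype=np.uint8)
    crop = image[y0:y1 + 1, x0:x1 + 1]
    mask[y0:y1 + 1, x0:x1 + 1] = (crop > threshold).astype(np.uint8)
    return mask

# Synthetic "image": a bright diagonal blob on a dark background.
image = np.zeros((8, 8))
for i in range(2, 6):
    image[i, i] = 0.9

box = detect_box(image)            # stage 1: coarse localization
mask = segment_in_box(image, box)  # stage 2: pixel-level mask inside the box
print(box)             # (2, 2, 5, 5)
print(int(mask.sum())) # 4 object pixels, far fewer than the 16 in the box
```

The design point this illustrates is why the two stages trade speed for accuracy: the expensive pixel-level work is confined to the small regions the detector proposes, but the pipeline must still run both stages sequentially for every detected instance.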