Deep-MAC

Deep-MAC is an anchor-free instance segmentation model built on CenterNet. It targets the "partially supervised" instance segmentation problem, in which every class has bounding box annotations but only a subset of classes has mask annotations.

Box Prediction in CenterNet

CenterNet predicts bounding boxes from three output tensors. First, a class-specific heatmap gives the probability that a bounding box center is present at each location. Second, a class-agnostic 2-channel tensor gives the height and width of the bounding box at each center pixel. Third, because the output feature map has a lower resolution than the image, an x and y offset is predicted for every center pixel to correct the resulting discretization error.
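
The way the three tensors combine into boxes can be sketched in NumPy as follows. This is a minimal illustration, not the reference decoder: the function name decode_boxes, the output stride of 4, the channel order (height, width for sizes; x, y for offsets), and the convention that sizes are given in image pixels are all assumptions made for the example. A full decoder would typically also keep only local maxima of the heatmap, which is omitted here.

```python
import numpy as np

def decode_boxes(heatmap, size, offset, stride=4, top_k=10):
    """Decode boxes from CenterNet-style outputs (illustrative sketch).

    heatmap: (H, W, C) per-class center probabilities.
    size:    (H, W, 2) box height and width in image pixels (assumed order).
    offset:  (H, W, 2) sub-pixel x and y offsets in feature-map units (assumed order).
    Returns boxes as (ymin, xmin, ymax, xmax) in image pixels, scores, classes.
    """
    H, W, C = heatmap.shape
    flat = heatmap.reshape(-1)
    # Indices of the top_k highest center scores over all locations and classes.
    idx = np.argsort(flat)[::-1][:top_k]
    ys, xs, classes = np.unravel_index(idx, (H, W, C))
    scores = flat[idx]

    # Refine the integer center with the predicted offset, then map to image scale.
    cx = (xs + offset[ys, xs, 0]) * stride
    cy = (ys + offset[ys, xs, 1]) * stride

    h = size[ys, xs, 0]
    w = size[ys, xs, 1]
    boxes = np.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], axis=1)
    return boxes, scores, classes
```

Because the size tensor is class-agnostic, the class label comes only from the heatmap channel in which the peak was found.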

Adding Pixel Embedding Branch and Mask Prediction in Deep-MAC

Deep-MAC adds a pixel embedding branch P for mask prediction. For each bounding box b, the model crops the corresponding region P_b from P and feeds it to a mask-head, which outputs a class-agnostic 32 x 32 tensor. A sigmoid converts this output to per-pixel probabilities. Training uses a per-pixel cross-entropy loss, and post-processing resizes the predicted mask according to the predicted bounding box.
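
A rough NumPy sketch of this pipeline is given below. It is only an illustration under stated assumptions: crop_and_resize uses nearest-neighbour sampling as a simple stand-in for whatever cropping operator the model actually uses, toy_mask_head is a hypothetical single linear projection standing in for a real mask-head, and the box is assumed to be given as (ymin, xmin, ymax, xmax) in feature-map coordinates.

```python
import numpy as np

MASK_SIZE = 32  # mask-head output resolution stated in the text

def crop_and_resize(P, box, out_size=MASK_SIZE):
    """Crop P (H, W, D) to the box and resample it to out_size x out_size.

    Nearest-neighbour sampling at bin centers; an assumption for illustration.
    """
    H, W, _ = P.shape
    ymin, xmin, ymax, xmax = box
    ys = ymin + (np.arange(out_size) + 0.5) * (ymax - ymin) / out_size
    xs = xmin + (np.arange(out_size) + 0.5) * (xmax - xmin) / out_size
    yi = np.clip(np.floor(ys).astype(int), 0, H - 1)
    xi = np.clip(np.floor(xs).astype(int), 0, W - 1)
    return P[yi[:, None], xi[None, :], :]

def toy_mask_head(P_b, weights):
    """Hypothetical mask-head: one linear projection per pixel, (32, 32, D) -> (32, 32)."""
    return P_b @ weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between sigmoid(logits) and a 0/1 target."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))
```

Since the prediction is class-agnostic, the same mask-head weights are shared by all classes, so the loss can be computed for any box that has a ground-truth mask regardless of its class.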

Improving Mask-Head Stability

Two additional inputs improve the stability of some mask-heads: an instance embedding and a coordinate embedding. An instance embedding head predicts a per-pixel embedding, from which one embedding is extracted for each bounding box. This embedding is tiled to 32 x 32 and concatenated with the pixel embedding crop, which conditions the mask-head on a specific instance and helps it distinguish that instance from others. The coordinate embedding, inspired by CoordConv, is a 32 x 32 x 2 tensor holding x and y coordinates normalized relative to the bounding box b.
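
The assembly of the mask-head input can be sketched as below. The helper names are hypothetical, and several details are assumptions made for the example: the instance embedding is read at the center pixel of the box, coordinates are normalized to the range [-1, 1], and the channels are concatenated in the order pixel crop, instance embedding, coordinates.

```python
import numpy as np

MASK_SIZE = 32

def coordinate_embedding(size=MASK_SIZE):
    """(size, size, 2) grid of x and y coordinates relative to the box.

    Normalization to [-1, 1] is an assumption for illustration.
    """
    lin = np.linspace(-1.0, 1.0, size)
    ys, xs = np.meshgrid(lin, lin, indexing="ij")
    return np.stack([xs, ys], axis=-1)

def mask_head_input(P_b, instance_map, box):
    """Concatenate the pixel embedding crop with the two extra inputs.

    P_b:          (32, 32, D) pixel embedding crop for the box.
    instance_map: (H, W, E) per-pixel instance embeddings.
    box:          (ymin, xmin, ymax, xmax) in feature-map coordinates.
    """
    H, W, E = instance_map.shape
    ymin, xmin, ymax, xmax = box
    # Assumed: take the embedding at the box center pixel.
    cy = int(np.clip((ymin + ymax) / 2, 0, H - 1))
    cx = int(np.clip((xmin + xmax) / 2, 0, W - 1))
    embedding = instance_map[cy, cx]
    # Tile the single embedding vector over the 32 x 32 grid.
    tiled = np.broadcast_to(embedding, (MASK_SIZE, MASK_SIZE, E))
    return np.concatenate([P_b, tiled, coordinate_embedding()], axis=-1)
```

With a 16-channel pixel crop and an 8-channel instance embedding, the mask-head would receive a 32 x 32 x 26 input under these assumptions.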

In summary, Deep-MAC addresses partially supervised instance segmentation, which reduces annotation cost because masks are needed for only a subset of classes. The model builds on CenterNet by adding a pixel embedding branch for mask prediction, along with two extra inputs, the instance embedding and the coordinate embedding, that improve the stability of some mask-heads.