PointRend: Image Segmentation as Rendering (CVPR 2020)

Jul 2, 2021

Introduction

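By refining only selected regions, PointRend avoids the heavy computation and memory that CNNs need for high-resolution segmentation. Regions like the red box below only need coarse resolution, and any pixel-level task can be implemented with a similar idea.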

A high-frequency region like this one cannot be recovered from a lower-resolution output, so it needs a higher-resolution output.

The architecture and its steps are as follows:
(i) A point selection strategy chooses a small number of real-value points to make predictions on, avoiding excessive computation for all pixels in the high-resolution output grid.
(ii) For each selected point, a point-wise feature representation is extracted. Features for a real-value point are computed by bilinear interpolation of f, using the point’s 4 nearest neighbors that are on the regular grid of f. As a result, it is able to utilize sub-pixel information encoded in the channel dimension of f to predict a segmentation that has higher resolution than f.
(iii) A point head: a small neural network trained to predict a label from this point-wise feature representation, independently for each point.
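
To make step (ii) concrete, here is a minimal pure-Python sketch of sampling a feature vector at a real-valued point by bilinear interpolation over its 4 nearest grid neighbors. The function name bilinear_sample, the nested-list C x H x W layout, and the clamping of out-of-range coordinates are assumptions made for illustration; they are not taken from the paper or from any library.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a C-channel feature vector at the real-valued point (x, y).

    feature_map[c][row][col] is a nested list of shape C x H x W.
    Assumption: grid values sit at integer coordinates, and (x, y) is
    clamped to the valid range [0, W-1] x [0, H-1].
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)

    # The 4 nearest neighbors on the regular grid.
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0  # fractional offsets inside the cell

    sampled = []
    for channel in feature_map:
        top = (1 - wx) * channel[y0][x0] + wx * channel[y0][x1]
        bottom = (1 - wx) * channel[y1][x0] + wx * channel[y1][x1]
        sampled.append((1 - wy) * top + wy * bottom)
    return sampled


# One channel on a 2x2 grid; the center of the cell averages all 4 values.
f = [[[0.0, 1.0], [2.0, 3.0]]]
center = bilinear_sample(f, 0.5, 0.5)
```

Because the point does not have to lie on the grid, the same feature map f can be queried at any output resolution.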

Point Selection for Inference and Training

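Find the locations that are most likely to lie on a boundary, that is, where the value differs sharply from the neighboring pixels; everywhere else, plain interpolation is enough.
Once a CNN-based segmentation model has produced a low-resolution mask, upsample it with bilinear interpolation and then pick the N most uncertain points (e.g., those with probabilities closest to 0.5 for a binary mask).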
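
The selection step can be sketched as follows, assuming the coarse mask has already been upsampled into a grid of foreground probabilities. The function name most_uncertain_points and the toy values are hypothetical, and the sketch covers only the binary-mask case mentioned above, where uncertainty is the distance to 0.5.

```python
def most_uncertain_points(prob_map, n):
    """Return the n (row, col) positions whose probability is closest to 0.5.

    prob_map is a nested list (H x W) of foreground probabilities,
    assumed to be the bilinearly upsampled coarse mask.
    """
    scored = [
        (abs(p - 0.5), r, c)
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    # Smallest distance to 0.5 first; ties are broken by (row, col).
    scored.sort()
    return [(r, c) for _, r, c in scored[:n]]


# Toy upsampled mask: confident background top-left, confident
# foreground bottom-right, uncertain values along the diagonal.
upsampled = [
    [0.02, 0.10, 0.48],
    [0.05, 0.55, 0.90],
    [0.40, 0.97, 0.99],
]
selected = most_uncertain_points(upsampled, 3)
```

In this toy mask the selected points fall on the diagonal between the confident background and foreground areas, which is where a boundary would be.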

Point-wise Representation and Point Head

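Fine-grained features: the feature values at the point's location across the channels of the feature map. On their own they carry no region-specific information, so the global context and label from the coarse prediction are also needed; otherwise there is no way to tell which object an overlapping region belongs to, or which object is in front.
Coarse prediction: a K-channel instance label prediction.
Point head: an MLP that shares weights across all points. Since the MLP predicts a segmentation label for each point, it can be trained by standard task-specific segmentation losses.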
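
As a rough illustration of how the pieces fit together, the sketch below concatenates each point's fine-grained features with its K-channel coarse prediction and passes the result through one small MLP whose weights are reused for every point. The layer sizes, the random initialization, and the helper names (make_layer, linear, relu, point_head) are assumptions chosen for the example, not the configuration used in the paper.

```python
import random


def make_layer(rng, n_in, n_out):
    """Create (weights, bias) for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def point_head(layers, fine_features, coarse_prediction):
    """Predict K logits for a single point.

    The input is the concatenation of the point's fine-grained features
    (C values) and its coarse prediction (K values).
    """
    x = fine_features + coarse_prediction
    for weights, bias in layers[:-1]:
        x = relu(linear(weights, bias, x))
    weights, bias = layers[-1]
    return linear(weights, bias, x)


rng = random.Random(0)
C, K, HIDDEN = 4, 3, 8  # illustrative sizes only
layers = [make_layer(rng, C + K, HIDDEN), make_layer(rng, HIDDEN, K)]

# Two points, each with C fine-grained features and a K-channel coarse prediction.
points = [
    ([0.2, 0.1, 0.9, 0.4], [0.7, 0.2, 0.1]),
    ([0.5, 0.3, 0.0, 0.8], [0.1, 0.1, 0.8]),
]
# The same `layers` are applied to every point independently.
logits = [point_head(layers, fine, coarse) for fine, coarse in points]
```

Since each point yields an ordinary K-way prediction, the per-point logits can be fed into whatever segmentation loss the task already uses.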