ReSTR: Convolution-free Referring Image Segmentation Using Transformers

Segmenting objects in images is a fundamental step toward scene understanding and plays a crucial role in numerous vision systems. The most common approach in this line of research is to classify individual pixels into predefined classes (e.g., car or person). Although this approach has recently achieved remarkable success, its applicability to real-world downstream tasks is limited, since users often need to segment classes outside the predefined set or specific objects of their interest (e.g., a red Ferrari or a man wearing a blue hat). Referring image segmentation resolves this limitation by segmenting the image region corresponding to a natural language expression given as a query. As the task is no longer restricted to predefined classes, it enables a wide variety of applications such as human-robot interaction and interactive photo editing. Referring image segmentation is, however, more challenging, since it requires comprehending the individual objects and relations expressed in the language query (e.g., a car behind the taxi next to the building) and fully exploiting such structured and relational information in the segmentation process. For this reason, models for the task should be capable of capturing relations between objects within each modality as well as jointly reasoning over the two different modalities.

Existing methods for referring image segmentation have adopted convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract visual and linguistic features, respectively. In general, these features are integrated into a multimodal feature map by applying convolution layers to a concatenation of the two features, the so-called concatenation-convolution operation; the resulting multimodal features are then fed into a segmentation module. These methods share two limitations. First, they have trouble handling long-range interactions between objects within each modality. Referring image segmentation requires capturing such interactions, since language expressions often describe complicated relations between objects to precisely indicate the target. In this respect, both CNNs and RNNs are limited by the locality of their basic building blocks. Second, existing models have difficulty modeling sophisticated interactions between the two modalities. They aggregate visual and linguistic features through the concatenation-convolution operation, a fixed, handcrafted form of feature fusion that is not flexible or expressive enough to handle the wide variety of referring image segmentation scenarios.
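To make the concatenation-convolution operation concrete, here is a minimal PyTorch sketch of this style of fusion: a sentence embedding is tiled over the spatial grid of a CNN feature map, concatenated channel-wise, and mixed by 1x1 convolutions. The module name, dimensions, and layer choices are illustrative assumptions rather than any particular prior method's implementation.

```python
import torch
import torch.nn as nn

class ConcatConvFusion(nn.Module):
    """Concatenation-convolution fusion: tile a sentence embedding over the
    spatial grid of a CNN feature map, concatenate channel-wise, and mix
    with 1x1 convolutions. Dimensions here are illustrative only."""

    def __init__(self, vis_dim=512, lang_dim=256, out_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_dim + lang_dim, out_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_dim, out_dim, kernel_size=1),
        )

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, vis_dim, H, W) from a CNN backbone
        # lang_feat: (B, lang_dim) from an RNN sentence encoder
        B, _, H, W = vis_feat.shape
        lang_map = lang_feat[:, :, None, None].expand(-1, -1, H, W)   # tile over space
        fused = torch.cat([vis_feat, lang_map], dim=1)                # (B, vis_dim + lang_dim, H, W)
        return self.fuse(fused)                                       # (B, out_dim, H, W)


# Example: fuse a 14x14 visual feature map with a 256-d sentence embedding.
fusion = ConcatConvFusion()
out = fusion(torch.randn(2, 512, 14, 14), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 512, 14, 14])
```

Because the fusion is a fixed stack of convolutions over a tiled language vector, every image location interacts with the expression in the same handcrafted way, which is the rigidity the second limitation above refers to.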

To overcome these limitations, Professor Suha Kwak at POSTECH in South Korea and his collaborators, Cuiling Lan and Wenjun Zeng of Microsoft Research Asia, together with POSTECH PhD students Namyup Kim and Dongwon Kim, have proposed a convolution-free model for referring image segmentation using transformers, dubbed ReSTR. Its architecture is illustrated in Figure 1. ReSTR first extracts visual and linguistic features through two encoders, namely the vision encoder and the language encoder, which compute features of image patches and of individual words, respectively, while modeling long-range interactions within each modality via transformers. These encoders allow ReSTR to capture global context from the very beginning of feature extraction and unify the network topology across the two modalities. Next, a self-attention fusion encoder aggregates the visual and linguistic features into patch-wise multimodal features; thanks to its self-attention layers, it enables sophisticated and flexible interactions between the features of the two modalities. In addition, it adaptively transforms a learnable embedding into a classifier for the target object described by the language expression. Given the patch-wise multimodal features and the target classifier, a segmentation decoder finally predicts a segmentation map in a coarse-to-fine manner: the classifier is first applied to each multimodal feature to decide whether the corresponding image patch contains part of the target object, and this patch-level prediction is then converted into a pixel-level segmentation map through a series of upsampling and linear layers.
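The following PyTorch sketch traces this pipeline end to end: transformer encoders over image patches and word tokens, joint self-attention fusion with a learnable embedding that is turned into the target classifier, patch-level classification by that classifier, and upsampling to a pixel-level map. All names, dimensions, layer counts, and the upsampling scheme are placeholder assumptions for illustration; they do not reproduce the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ReSTRSketch(nn.Module):
    """Rough, illustrative approximation of the pipeline described above."""

    def __init__(self, dim=256, vocab_size=10000, patch_size=16):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.vision_enc = make_encoder()    # long-range interactions among image patches
        self.language_enc = make_encoder()  # long-range interactions among words
        self.fusion_enc = make_encoder()    # self-attention across both modalities
        self.cls_seed = nn.Parameter(torch.randn(1, 1, dim))  # adapted into the target classifier
        self.to_pixels = nn.Sequential(     # coarse patch-level map -> pixel-level map
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
        )

    def forward(self, image, tokens):
        # image: (B, 3, 224, 224); tokens: (B, L) word indices
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)      # (B, N, dim)
        v = self.vision_enc(patches)
        w = self.language_enc(self.word_embed(tokens))
        B, N, _ = v.shape
        seed = self.cls_seed.expand(B, -1, -1)
        fused = self.fusion_enc(torch.cat([seed, v, w], dim=1))           # joint self-attention
        classifier, patch_feats = fused[:, :1], fused[:, 1:1 + N]
        coarse = (patch_feats @ classifier.transpose(1, 2)).squeeze(-1)   # patch-level logits
        side = int(N ** 0.5)
        return self.to_pixels(coarse.view(B, 1, side, side))              # pixel-level logits


# Example: segment a 224x224 image given a 12-word expression.
model = ReSTRSketch()
mask_logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(mask_logits.shape)  # torch.Size([2, 1, 224, 224])
```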

Paper link: https://www.microsoft.com/en-us/research/publication/restr-convolution-free-referring-image-segmentation-using-transformers/

Figure 1. Overall architecture of ReSTR. (a) The two transformer encoders extracting features of the two modalities. (b) The feature fusion encoder integrating features of the two modalities and generating a target classifier adaptively. (c) The coarse-to-fine decoder for the final segmentation prediction. The fusion encoder and the segmentation decoder are trained respectively for patch-level classification and pixel-level classification.

Figure 2. Results on the Gref dataset. (a) Input image. (b) Patch-level prediction. (c) ReSTR. (d) Ground-truth.

Figure 3. Segmentation results of an image with different language expressions on the Gref dataset.

On top of being the first convolution-free architecture for referring image segmentation, ReSTR has shown outstanding performance, achieving state-of-the-art results on four public benchmarks without bells and whistles such as computationally heavy post-processing. Its qualitative results are presented in Figures 2 and 3. The researchers also demonstrated that ReSTR is not a simple extension of vision transformers: all of its components are carefully designed for referring segmentation, and each contributes meaningfully to the overall performance. More details on ReSTR and its results are presented in the paper “ReSTR: Convolution-free Referring Image Segmentation Using Transformers,” which will appear at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) this year.