{"id":666120,"date":"2020-06-17T10:00:50","date_gmt":"2020-06-17T17:00:50","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=666120"},"modified":"2020-07-15T09:30:02","modified_gmt":"2020-07-15T16:30:02","slug":"high-resolution-network-a-universal-neural-architecture-for-visual-recognition","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/high-resolution-network-a-universal-neural-architecture-for-visual-recognition\/","title":{"rendered":"High-Resolution Network: A universal neural architecture for visual recognition"},"content":{"rendered":"
\"A

Figure 1: Milestone network architectures (2012 \u2013 present)<\/p><\/div>\n

Since AlexNet was introduced in 2012, convolutional neural network architectures for computer vision have developed rapidly. Representative architectures (Figure 1) include GoogLeNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016), all of which were initially developed for image classification. It has become a rule of thumb that a classification architecture serves as the backbone for other computer vision tasks.<\/p>\n

What\u2019s next for a new architecture that is broadly applicable to general computer vision tasks? Can we design a universal architecture from general computer vision tasks rather than from classification tasks?<\/p>\n

We pursued these questions and developed HRNet, a network that is designed from general vision tasks and performs strongly on many fronts of computer vision, including semantic segmentation, human pose estimation, and object detection. We\u2019ve also released the code for HRNet on GitHub, and the paper on an extension of HRNet, \u201cHigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation,\u201d has been published at CVPR 2020.<\/p>\n

What\u2019s essential for tasks beyond classification? The typical tasks, such as those mentioned in the paragraph above, require spatially fine representations. Before HRNet, most techniques extended classification networks: they either added an extra stage to raise the spatial granularity (Figure 2) or used dilated convolutions.<\/p>\n

\"Structure

Figure 2: The structure of recovering high resolution from low resolution. (a) A low-resolution representation learning subnetwork (such as AlexNet, GoogLeNet, VGGNet, ResNet, DenseNet), which is formed by connecting high-to-low convolutions in series. (b) A high-resolution representation recovering subnetwork, which is formed by connecting low-to-high convolutions in series. Representative examples include SegNet, DeconvNet, U-Net, Hourglass, encoder-decoder, and SimpleBaseline.<\/p><\/div>\n

How does HRNet obtain spatially fine representations? It is conceptually different from the classification architecture. HRNet is designed from scratch, rather than from the classification architecture, and it breaks the dominant design rule of connecting convolutions in series from high resolution to low resolution, a rule that goes back to LeNet-5 (LeCun et al., 1998).<\/p>\n

High-Resolution Network: Design and its four stages<\/h3>\n

The HRNet maintains high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (four in the current design) stages, as depicted in Figure 3, and the nth stage contains n streams corresponding to n resolutions. We conduct repeated multi-resolution fusions by exchanging information across the parallel streams over and over.<\/p>\n
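The stage and stream layout described above can be sketched in a few lines of Python. This is only an illustration: the function name stage_layout and the assumption that each added stream is 2x coarser than the previous one are hypothetical and not taken from the text.

```python
# Illustrative sketch only: the 2x resolution step between neighbouring
# streams is an assumption, not something stated in the text above.

def stage_layout(num_stages=4, step=2):
    '''Return, for each stage n (1-based), the relative resolution of its n streams.

    1.0 stands for the high-resolution stream; each extra stream is assumed
    to be step times coarser than the previous one.
    '''
    layout = {}
    for n in range(1, num_stages + 1):
        layout[n] = [1.0 / step ** k for k in range(n)]
    return layout


layout = stage_layout()
for n, streams in layout.items():
    print('stage', n, 'streams at relative resolution', streams)
```

Every stage keeps the stream at relative resolution 1.0, which is the point of the design: the high-resolution stream is never dropped, and coarser streams are only added alongside it.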

The high-resolution representations learned from HRNet are not only semantically strong, but also spatially precise. This comes from two aspects. First, our approach connects high-to-low resolution convolution streams in parallel rather than in series. Therefore, our approach is able to maintain the high resolution instead of recovering high resolution from low resolution, and the learned representation is spatially more precise accordingly. Second, most existing fusion schemes aggregate high-resolution low-level and upsampled low-resolution high-level representations. Instead, we repeat multi-resolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically stronger.<\/p>\n
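A toy version of one multi-resolution fusion step might look like the sketch below. It works on 1-D lists rather than image feature maps, and it swaps in nearest-neighbour upsampling, average-pool downsampling, and plain summation as stand-ins for the learned operators; those choices, and all function names, are assumptions made for illustration.

```python
# Toy multi-resolution fusion on 1-D feature lists (pure Python).
# Assumptions: nearest-neighbour upsampling, average-pool downsampling, and
# plain summation stand in for the learned operators of the real network.
# Stream lengths are assumed to be integer multiples of one another.

def upsample(x, factor):
    # Repeat every value factor times.
    return [v for v in x for _ in range(factor)]


def downsample(x, factor):
    # Average consecutive groups of factor values.
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def resize(x, target_len):
    # Bring a stream to the target length by up- or downsampling.
    if len(x) == target_len:
        return list(x)
    if len(x) < target_len:
        return upsample(x, target_len // len(x))
    return downsample(x, len(x) // target_len)


def fuse(streams):
    # Each output stream is the sum of all input streams resized to its length.
    fused = []
    for target in streams:
        resized = [resize(s, len(target)) for s in streams]
        fused.append([sum(vals) for vals in zip(*resized)])
    return fused


high = [1.0, 2.0, 3.0, 4.0]   # high-resolution stream
low = [10.0, 20.0]            # stream at half the resolution
fused_high, fused_low = fuse([high, low])
print(fused_high)  # [11.0, 12.0, 23.0, 24.0]
print(fused_low)   # [11.5, 23.5]
```

Both outputs receive information from both inputs, which mirrors the idea that high-resolution representations are boosted by low-resolution ones and vice versa. Repeating fuse after further per-stream processing would correspond to the repeated fusions described above.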

 <\/p>\n

\"Arrows

Figure 3: An example HRNet. Only the main body is illustrated, and the stem is not included. There are four stages. The 1st stage consists of high-resolution convolutions. The 2nd (3rd, 4th) stage repeats two-resolution (three-resolution, four-resolution) blocks several (that is, 1, 4, 3) times.<\/p><\/div>\n

Applications<\/h3>\n

The HRNet is a universal architecture for visual recognition. It has become a standard for human pose estimation since the paper was published at CVPR 2019, and it has been receiving increasing attention in semantic segmentation due to its high performance. HRNet shows superior or competitive performance on a wide range of position-sensitive tasks, including object detection, face detection, and facial landmark detection, as elaborated in the paper published in IEEE TPAMI 2020. In our CVPR 2020 paper, we recently extended it to learn higher-resolution multi-scale representations for handling the scale diversity in bottom-up pose estimation and obtained a state-of-the-art result.<\/p>\n

\"(A)

Figure 4: (a) HRNetV1: outputs only the representation from the high-resolution convolution stream. (b) HRNetV2: concatenates the (upsampled) representations from all the resolutions. (c) HRNetV2p: forms a feature pyramid from the representation produced by HRNetV2. The four-resolution representations at the bottom of each sub-figure are the outputs of the network in Figure 3. The gray box indicates how the output representation is obtained from the input four-resolution representations.<\/p><\/div>\n
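The three heads in Figure 4 can be compared by tracking tensor shapes alone. The sketch below is a shape-only illustration under stated assumptions: stream k is assumed to have base_width * 2**k channels and to be 2**k times coarser than the first stream, and the HRNetV2p pyramid is assumed to come from repeated 2x downsampling. None of these numbers, nor the function names or the example input size, come from the text above.

```python
# Shape-only sketch of the three heads in Figure 4.
# Assumptions: stream k has width base_width * 2**k and is 2**k times coarser
# than the first stream; the pyramid in HRNetV2p is built by repeated 2x
# downsampling. Shapes are (channels, height, width) tuples.

def stream_shapes(base_width, height, width, num_streams=4):
    return [(base_width * 2 ** k, height // 2 ** k, width // 2 ** k)
            for k in range(num_streams)]


def hrnet_v1(shapes):
    # (a) keep only the high-resolution stream.
    return shapes[0]


def hrnet_v2(shapes):
    # (b) upsample every stream to the highest resolution, then concatenate channels.
    _, h, w = shapes[0]
    return (sum(c for c, _, _ in shapes), h, w)


def hrnet_v2p(shapes, levels=4):
    # (c) downsample the HRNetV2 output into a feature pyramid.
    c, h, w = hrnet_v2(shapes)
    return [(c, h // 2 ** k, w // 2 ** k) for k in range(levels)]


shapes = stream_shapes(base_width=32, height=64, width=48)
print(hrnet_v1(shapes))   # (32, 64, 48)
print(hrnet_v2(shapes))   # (480, 64, 48)
print(hrnet_v2p(shapes))  # [(480, 64, 48), (480, 32, 24), (480, 16, 12), (480, 8, 6)]
```

Under these assumptions, HRNetV1 discards the coarser streams at the output, HRNetV2 keeps all of them at the highest resolution, and HRNetV2p re-expands that single representation into multiple levels for detection-style frameworks.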

Human Pose Estimation<\/h3>\n

Human pose estimation, also known as keypoint detection, aims to detect the locations of keypoints or parts (for example, elbow, wrist, and so on) from an image. The HRNet applied to human pose estimation uses the representation head shown in Figure 4(a), called HRNetV1. Visual example results are shown in Figure 5.<\/p>\n

The comparison with ResNet-based methods is shown in Figure 6. We can see that HRNet outperforms ResNet in terms of estimation performance (AP), parameter complexity (#parameters), and computation complexity (GFLOPS). The detailed comparison is given in Table 1.<\/p>\n

\"4

Figure 5: Qualitative COCO human pose estimation results on representative images with various human sizes, different poses, or cluttered backgrounds.<\/p><\/div>\n

 <\/p>\n

\"ResNet

Figure 6: Comparison on COCO human pose estimation between ResNet and HRNet under the same setting. HRNet performs better in terms of AP, #parameters, and computation complexity. The 32 in W32 (48 in W48) is the width of the high-resolution convolution.<\/p><\/div>\n

 <\/p>\n

\"

Table 1: Comparison with state-of-the-art methods on COCO test-dev.<\/p><\/div>\n

Semantic Segmentation<\/h3>\n

Semantic segmentation is a problem of assigning a class label to each pixel. The HRNet applied to semantic segmentation uses the representation head shown in Figure 4(b), called HRNetV2. Some visual example results are given in Figure 7.<\/p>\n

\"4

Figure 7: Qualitative segmentation on Cityscapes images.<\/p><\/div>\n

A comparison of HRNet with representative methods (U-Net++, DeepLab, and PSPNet) on the Cityscapes validation data is given in Table 2. HRNet achieves better results with lower parameter and computation complexity. A comparison with existing state-of-the-art methods on the Cityscapes test set is provided in Table 3. The results on other datasets can be found in the IEEE TPAMI 2020 paper.<\/p>\n

\"Table

Table 2: Comparison with representative segmentation methods on Cityscapes validation. HRNet is superior in terms of parameter complexity, computation complexity, and segmentation quality.<\/p><\/div>\n

\"

Table 3: Comparison with existing state-of-the-art methods on Cityscapes test. OCR stands for object-contextual representation, which we proposed.<\/p><\/div>\n

Object Detection and Instance Segmentation<\/h3>\n

Object detection aims to identify the bounding box of each object instance in an image, and instance segmentation aims to identify the pixels belonging to each object instance. Examples are shown in Figure 8. We apply the multi-level representations, HRNetV2p, shown in Figure 4(c) to object detection and instance segmentation. The comparisons in Tables 4 and 5 show that HRNet outperforms ResNet and ResNeXt.<\/p>\n

\"6

Figure 8: Qualitative examples for COCO object detection (left three images) and instance segmentation (right three images).<\/p><\/div>\n

\"HRNet

Table 4: Object detection comparison with ResNet and ResNeXt with similar parameter and computation complexities under the Faster R-CNN and Cascade R-CNN frameworks on COCO test-dev without multi-scale training and testing. This shows that HRNet performs better than ResNet and ResNeXt.<\/p><\/div>\n

 <\/p>\n

\"HRNet

Table 5: Object detection (bbox) and instance segmentation (mask) comparison with ResNet with similar parameter and computation complexities under the Mask R-CNN framework on COCO val without multi-scale training and testing. This shows that HRNet performs better than ResNet and ResNeXt.<\/p><\/div>\n

Runtime Cost<\/h3>\n

What about runtime costs for HRNet? Is HRNet expensive in terms of memory and computation complexity? The answer is no. Tables 6.1 and 6.2 compare runtime costs on the PyTorch 1.0 platform. In human pose estimation, HRNet achieves a superior estimation score with much lower training and inference memory cost and slightly higher training and inference time cost. In semantic segmentation, HRNet outperforms PSPNet and DeepLabV3 on all the metrics, and its inference time is less than half that of PSPNet and DeepLabV3. In object detection, HRNet is also better than ResNet and ResNeXt.<\/p>\n

\"

Table 6.1: Memory and time cost for human pose estimation on COCO val and semantic segmentation on Cityscapes val.<\/p><\/div>\n

We report inference time for pose estimation on MXNet 1.5.1, which supports static graph inference, from which the multi-branch convolutions used in HRNet benefit. The training numbers are obtained on a machine with 4 V100 GPU cards. During training, the input sizes are 256\u00d7192, 512\u00d71024, and 800\u00d71333, and the batch sizes are 128, 8, and 8 for pose estimation, segmentation, and detection, respectively. The inference numbers are obtained on a single V100 GPU card, with input sizes of 256\u00d7192, 1024\u00d72048, and 800\u00d71333, respectively. The score means AP for pose estimation and detection on COCO val, and mIoU for segmentation on Cityscapes val. PSPNet and DeepLabV3 use dilated ResNet-101 as the backbone. (See Tables 6.1 and 6.2.)<\/p>\n

\"HRNet

Table 6.2: Training and inference memory and time cost for object detection on COCO val.<\/p><\/div>\n

ImageNet Pretraining<\/h3>\n

We pretrain HRNet, augmented by the classification head shown in Figure 9, on ImageNet. We do not aim to push the state-of-the-art result for ImageNet classification, so we do not use additional tricks to improve training. The pretraining results and the comparison with ResNet are given in Table 7. The results are similar to, and slightly better than, those of ResNet.<\/p>\n

\"Two

Figure 9: Representation for ImageNet classification. The input to the box consists of the representations at four resolutions.<\/p><\/div>\n

 <\/p>\n

\"HRNet-W44-C

Table 7: ImageNet Classification results of HRNet and ResNet. The proposed method is named HRNet-Wx-C. In this case, x means the width.<\/p><\/div>\n

Conclusions<\/h3>\n

The high-resolution network (HRNet) is a universal architecture for visual recognition.\u00a0The applications of the HRNet are not limited to what we have shown above; it is also suitable for other position-sensitive vision applications, such as face alignment, face detection, super-resolution, optical flow estimation, depth estimation, and so on. There are already follow-up works that look into using HRNet for image stylization, inpainting, image enhancement, image dehazing, temporal pose estimation, and drone object detection.<\/p>\n

It is reported in this paper that a slightly modified HRNet combined with ASPP achieved the best performance for Mapillary panoptic segmentation in the single-model case. In the COCO and Mapillary Joint Recognition Challenge Workshop at ICCV 2019, the COCO Dense Pose challenge winner and almost all the COCO keypoint detection challenge participants adopted the HRNet. The OpenImage instance segmentation challenge winner (ICCV 2019) also used the HRNet.<\/p>\n","protected":false},"excerpt":{"rendered":"

Since AlexNet was invented in 2012, there has been rapid development in convolutional neural network architectures in computer vision. Representative architectures (Figure 1) include GoogleNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016), which are developed initially from image classification. It\u2019s a golden rule that classification architecture is the backbone for other computer vision tasks. […]<\/p>\n","protected":false},"author":38838,"featured_media":667812,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13562],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-666120","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-computer-vision","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[661083],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"\"\"","byline":"","formattedDate":"June 17, 2020","formattedExcerpt":"Since AlexNet was invented in 2012, there has been rapid development in convolutional neural network architectures in computer vision. Representative architectures (Figure 1) include GoogleNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016), which are developed initially from image classification. 
It\u2019s a golden rule that…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/666120"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=666120"}],"version-history":[{"count":17,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/666120\/revisions"}],"predecessor-version":[{"id":695151,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/666120\/revisions\/695151"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/667812"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=666120"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=666120"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=666120"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=666120"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=666120"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=666120"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post
=666120"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=666120"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=666120"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=666120"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=666120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}