{"id":894768,"date":"2022-11-01T19:05:34","date_gmt":"2022-11-02T02:05:34","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=894768"},"modified":"2022-11-04T21:32:57","modified_gmt":"2022-11-05T04:32:57","slug":"focalnets-focusing-the-eyes-with-focal-modulation","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/focalnets-focusing-the-eyes-with-focal-modulation\/","title":{"rendered":"FocalNets: Focus Eyes with Focal Modulation"},"content":{"rendered":"\n
Human eyes have a dynamic focusing system that adjusts the focal regions in order to see the surroundings at all distances. When we look far away, up close, and back again, our eyes change focus rapidly to allow us to perceive things both finely and coarsely. In computer vision (CV), it remains an open question how to build a neural network that can mimic this behavior and flexibly focus on visual inputs at different granularities for different tasks.
In the past few years, Transformers and Vision Transformers have led to unprecedented AI breakthroughs in NLP and vision, respectively. For vision in particular, what makes Transformers stand out is arguably the self-attention (SA) mechanism, which enables each query token to adaptively gather information from the others. It learns the dependencies across different visual tokens, which induces better generalization than canonical convolution layers with static kernels. In the visual world, however, the input signal is often continuous and comes with arbitrary granularity and scope. Nevertheless, SA is typically used to model a fixed set of predetermined tokens at a specific scope and granularity, and the interactions among individual tokens are usually dense and heavy, which limits its usability in understanding the complicated visual world.

In this blog, we introduce our recent efforts on building neural networks with focal modulation, leading to the new architecture family: FocalNets. The highlights include:

We also released the paper on arXiv, the PyTorch codebase on the project GitHub page, and a HuggingFace demo. Feel free to give it a try.
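To make the contrast with self-attention concrete, below is a minimal PyTorch sketch of a focal modulation layer: instead of dense token-to-token attention, each query token is modulated element-wise by a summary of its surrounding context, gathered through a small stack of depth-wise convolutions followed by a gated aggregation. The kernel sizes, gate layout, and module names here are our own simplifications for illustration, not the official implementation; please refer to the paper and the GitHub codebase for the exact design.

```python
# A minimal, simplified sketch of a focal modulation layer (PyTorch).
# Not the official implementation: kernel sizes, the gating scheme, and
# layer names are assumptions for illustration only.
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    def __init__(self, dim, focal_levels=3, base_kernel=3):
        super().__init__()
        self.focal_levels = focal_levels
        # Project input to query, context, and per-level gates (+1 global gate).
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Hierarchical contextualization: depth-wise convs with growing kernels.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # modulator projection
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)       # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)   # (B, L+1, H, W)

        # Gated aggregation over focal levels with progressively larger receptive fields.
        ctx_all = 0
        for l, layer in enumerate(self.layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        # Global context from average pooling, weighted by the last gate.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]

        # Element-wise modulation of the query by the aggregated context.
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)  # (B, H, W, C)
        return self.proj_out(q * modulator)


# Usage: a 4x4 feature map with 32 channels.
layer = FocalModulation(dim=32)
out = layer(torch.randn(2, 4, 4, 32))  # -> (2, 4, 4, 32)
```

Note that the cost of gathering context here grows with the number of focal levels rather than quadratically with the number of tokens, which is the key difference from dense self-attention highlighted above.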
(Left) Comparison with SoTA on COCO object detection. Circle size indicates the model size. (Right) Modulation focus maps at the early, middle, and final stages of visual perception with our FocalNet.