Scattering Vision Transformer: Spectral Mixing Matters

  • Badri Patro,
  • Vijay Agneeswaran

NeurIPS 2023

Several adaptations of transformers have been proposed for computer vision tasks such as image classification, instance segmentation, and object detection. Open issues in vision transformers include attention complexity and the ability to capture fine-grained information in an image. Existing solutions use down-sampling operations (such as pooling) to reduce computational cost, but these operations are not invertible and can lead to loss of information. We propose the Scattering Vision Transformer (SVT), which uses a spectral scattering network to capture fine-grained information about an image. SVT separates the low-frequency and high-frequency components of an image to address the invertibility issue. SVT also uses a novel spectral mixing technique based on Einstein multiplication for token and channel mixing, reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, SVT-H-B reaches 85.2% (state-of-the-art among base versions), and SVT-H-L reaches 85.7% (state-of-the-art among large versions). SVT also shows comparable results in other vision tasks such as instance segmentation, and outperforms other transformers in transfer learning on standard datasets such as CIFAR-10, CIFAR-100, Flowers, and Cars.
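To illustrate the complexity argument behind Einstein multiplication, the following is a minimal sketch (not the paper's actual implementation; the function names, shapes, and blockwise layout are assumptions for illustration). The idea is that if the channel dimension is reshaped into blocks and each block is mixed independently with an `einsum`, the cost drops from O(N·C²) for a dense channel mix to O(N·C·b) for block size b:

```python
import numpy as np

def einstein_channel_mix(x, w):
    """Hypothetical blockwise channel mixing via Einstein multiplication.

    x: (tokens, blocks, block_dim) -- channels reshaped into blocks
    w: (blocks, block_dim, block_dim) -- per-block mixing weights
    Cost is O(tokens * blocks * block_dim^2), versus O(tokens * C^2)
    for a dense mix over all C = blocks * block_dim channels.
    """
    # y[n, b, e] = sum_d x[n, b, d] * w[b, d, e]
    return np.einsum('nbd,bde->nbe', x, w)

# Toy usage: identity per-block weights leave the input unchanged.
tokens, blocks, d = 4, 2, 3
x = np.arange(tokens * blocks * d, dtype=float).reshape(tokens, blocks, d)
w = np.stack([np.eye(d)] * blocks)
y = einstein_channel_mix(x, w)
assert np.allclose(y, x)
```

An analogous `einsum` over the token dimension would give the token-mixing counterpart; the key design choice is trading a single large dense multiplication for many small blockwise ones.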