Learning to ignore: rethinking attention in CNNs

by Firas Laakom, et al.

Recently, there has been increasing interest in applying attention mechanisms in Convolutional Neural Networks (CNNs) to solve computer vision tasks. Most of these methods learn to explicitly identify and highlight relevant parts of the scene and pass the attended image to further layers of the network. In this paper, we argue that such an approach might not be optimal. Arguably, explicitly learning which parts of the image are relevant is typically harder than learning which parts are less relevant and should therefore be ignored. In fact, in the vision domain there are many easy-to-identify patterns of irrelevant features. For example, image regions close to the borders are less likely to contain useful information for a classification task. Based on this idea, we propose to reformulate the attention mechanism in CNNs to learn to ignore instead of learning to attend. Specifically, we propose to explicitly learn the irrelevant information in the scene and suppress it in the produced representation, keeping only the important attributes. This implicit attention scheme can be incorporated into any existing attention mechanism. In this work, we validate the idea using two recent attention methods: the Squeeze-and-Excitation (SE) block and the Convolutional Block Attention Module (CBAM). Experimental results on different datasets and model architectures show that learning to ignore, i.e., implicit attention, yields superior performance compared to the standard approaches.
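To make the idea concrete, here is a minimal PyTorch sketch of how an SE-style block could be turned into a "learning to ignore" block. The abstract does not give the exact formulation, so this is an illustrative assumption: where a standard SE block learns per-channel attention weights `a` and computes `x * a`, the variant below learns per-channel *irrelevance* logits `r` and suppresses them via `x * (1 - sigmoid(r))`. The class name `IgnoreSE` and the `reduction` hyperparameter are hypothetical.

```python
import torch
import torch.nn as nn


class IgnoreSE(nn.Module):
    """SE-style block that learns channel *irrelevance* instead of relevance.

    Hypothetical sketch of the 'learning to ignore' idea: the excitation
    MLP produces irrelevance logits r, and the input is rescaled by the
    complementary gate (1 - sigmoid(r)), suppressing irrelevant channels.
    The paper's actual formulation may differ.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        r = self.fc(self.pool(x).view(b, c))   # per-channel irrelevance logits
        keep = 1.0 - torch.sigmoid(r)          # gate in (0, 1): suppress, don't highlight
        return x * keep.view(b, c, 1, 1)
```

Because the gate lies in (0, 1), the block can only attenuate channels, never amplify them, which matches the stated goal of suppressing irrelevant information while keeping the output shape identical to the input, so it can drop into a CNN wherever an SE block would go.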


ConvFormer: Closing the Gap Between CNN and Vision Transformers

Vision transformers have shown excellent performance in computer vision ...

Hyperspectral image classification based on multi-scale residual network with attention mechanism

Compared with traditional machine learning methods, deep learning method...

Parameter-Free Average Attention Improves Convolutional Neural Network Performance (Almost) Free of Charge

Visual perception is driven by the focus on relevant aspects in the surr...

Irrelevant Pixels are Everywhere: Find and Exclude Them for More Efficient Computer Vision

Computer vision is often performed using Convolutional Neural Networks (...

Self-Supervised Implicit Attention: Guided Attention by The Model Itself

We propose Self-Supervised Implicit Attention (SSIA), a new approach tha...

Learning Frame Level Attention for Environmental Sound Classification

Environmental sound classification (ESC) is a challenging problem due to...

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Vision transformers (ViTs) have significantly changed the computer visio...
