FER-former: Multi-modal Transformer for Facial Expression Recognition

by Yande Li, et al.

The ever-increasing demand for intuitive interaction in Virtual Reality has triggered a boom in Facial Expression Recognition (FER). To address the limitations of existing approaches (e.g., narrow receptive fields and homogeneous supervisory signals) and further strengthen the capacity of FER tools, this paper proposes a novel multifarious supervision-steering Transformer for FER in the wild. Referred to as FER-former, our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision. Specifically, to exploit the combination of features provided by prevailing CNNs and Transformers, a hybrid stem is designed to cascade the two types of learning paradigms. Within this design, a FER-specific Transformer mechanism is devised to characterize conventional hard one-hot label-focusing tokens and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domain-steering supervision module is proposed that gives image features text-space semantic correlations by supervising the similarity between image features and text features. Through the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering strong learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
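The abstract does not give a formula for the image-text similarity supervision, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each expression class has one text feature vector (for example, from a CLIP text encoder), that similarity is measured by cosine similarity, and that the supervision is a softmax cross-entropy over those similarities. The function names, the `temperature` parameter, and the toy feature values are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_loss(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over image-to-text cosine similarities.

    image_feat: feature vector of one image (assumed already projected
                into the same space as the text features).
    text_feats: one feature vector per expression class (assumption).
    label:      index of the annotated class.
    temperature: hypothetical scaling factor for the similarities.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_feats]
    # Numerically stable log-sum-exp for the softmax normaliser.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


# Toy example: two classes with orthogonal text features.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feat = [0.9, 0.1]  # lies close to the class-0 text feature

loss_matching = similarity_loss(image_feat, text_feats, label=0)
loss_mismatch = similarity_loss(image_feat, text_feats, label=1)
print(loss_matching < loss_mismatch)
```

Under these assumptions, the loss is small when the image feature points toward the text feature of its annotated class and large otherwise, which is one way image features could acquire text-space semantic correlations.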




Multi-Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling

Facial expression recognition is important for various purpose such as e...

Facial Expression Recognition with Swin Transformer

The task of recognizing human facial expressions plays a vital role in v...

Multi-modal Affect Analysis using standardized data within subjects in the Wild

Human affective recognition is an important factor in human-computer int...

MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition

Vision transformer (ViT) has been widely applied in many areas due to it...

Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task

When it comes to wild conditions, Facial Expression Recognition is often...

Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition

Due to the subjective annotation and the inherent interclass similarity ...

Quaternion Orthogonal Transformer for Facial Expression Recognition in the Wild

Facial expression recognition (FER) is a challenging topic in artificial...
