Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation

by Md Mahedi Hasan, et al.

We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition (FR) systems. Rather than combining soft biometrics (e.g., facial marks, gender, and age) with face images, this work uses facial descriptions provided by face examiners as auxiliary information. However, because the two modalities are heterogeneous and lie in different embedding spaces, improving performance by directly fusing the textual and facial features is very challenging. In this paper, we propose a contextual feature aggregation module (CFAM) that addresses this issue by exploiting both the fine-grained word-region interaction and the global image-caption association. Specifically, CFAM adopts a self-attention scheme to improve the intra-modality relationships within the image and textual features, and a cross-attention scheme to improve the inter-modality relationships between them. Additionally, we design a textual feature refinement module (TFRM) that refines the textual features of the pre-trained BERT encoder by updating the contextual embeddings. This module enhances the discriminative power of the textual features with a cross-modal projection loss and realigns the word and caption embeddings with the visual features by incorporating a visual-semantic alignment loss. We implemented the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluated its performance on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the performance of ArcFace in both the 1:1 verification and 1:N identification protocols.
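The self-attention and cross-attention scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CFAM implementation: it uses plain scaled dot-product attention without learned projections, and the feature sizes (49 image regions, 12 caption words, dimension 64) and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (attended output, attention weights)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 64                                   # assumed shared feature dimension
regions = rng.standard_normal((49, d))   # hypothetical image region features (e.g. a 7x7 grid)
words = rng.standard_normal((12, d))     # hypothetical word features, already projected to d

# Self-attention (intra-modality): each modality attends to itself.
regions_sa, _ = attention(regions, regions, regions)
words_sa, _ = attention(words, words, words)

# Cross-attention (inter-modality): each image region queries the caption words,
# giving a word-conditioned feature per region.
fused, cross_weights = attention(regions_sa, words_sa, words_sa)
```

Here `cross_weights[i, j]` is the weight that region `i` places on word `j`, which is one simple way to picture the word-region interaction; a trained module would add learned query, key, and value projections.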


Domain Private and Agnostic Feature for Modality Adaptive Face Recognition

Heterogeneous face recognition is a challenging task due to the large mo...

CLIP-Driven Fine-grained Text-Image Person Re-identification

TIReID aims to retrieve the image corresponding to the given text query ...

Heterogeneous Visible-Thermal and Visible-Infrared Face Recognition using Unit-Class Loss and Cross-Modality Discriminator

Visible-to-thermal face image matching is a challenging variate of cross...

Fine-grained Attention-based Video Face Recognition

This paper aims to learn a compact representation of a video for video f...

Dual-path CNN with Max Gated block for Text-Based Person Re-identification

Text-based person re-identification (Re-id) is an important task in video...

Improving Heterogeneous Face Recognition with Conditional Adversarial Networks

Heterogeneous face recognition between color image and depth image is a ...

Relational Deep Feature Learning for Heterogeneous Face Recognition

Heterogeneous Face Recognition (HFR) is a task that matches faces across...
