Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA

11/28/2019
by Mikhail Fain, et al.

We propose a novel non-parametric method for cross-modal retrieval that is applied on top of precomputed image and text embeddings. By combining our method with standard approaches for building image and text encoders, trained independently with a self-supervised classification objective, we create a baseline model that outperforms most existing methods on a challenging image-to-recipe task. We also use our method to compare image and text encoders trained with different modern approaches, thereby addressing the issues hindering the development of novel methods for cross-modal recipe retrieval. We show how to use the insights from this model comparison to extend our baseline with a standard triplet loss, improving SoTA on the Recipe1M dataset by a large margin, while using only precomputed features and with far less complexity than existing methods.
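The abstract does not spell out the non-parametric method itself, so as an illustrative point of reference only, here is a minimal nearest-neighbours retrieval sketch over precomputed embeddings. It assumes image and text embeddings have already been mapped into a shared space, and the function names (`retrieve`, `median_rank`) and the paired-index evaluation convention are our own assumptions, not the paper's API.

```python
import numpy as np

def retrieve(query_embs, gallery_embs, k=10):
    """Rank gallery items for each query by cosine similarity.

    query_embs:   (n_queries, d) precomputed embeddings (e.g. images)
    gallery_embs: (n_gallery, d) precomputed embeddings (e.g. recipes)
    Returns indices of the top-k gallery items per query.
    """
    # L2-normalise so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = q @ g.T                       # (n_queries, n_gallery)
    return np.argsort(-sims, axis=1)[:, :k]

def median_rank(query_embs, gallery_embs):
    """Median rank of the true match, assuming query i pairs with gallery item i."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    order = np.argsort(-(q @ g.T), axis=1)
    # Position of the ground-truth index in each query's ranking (1-based).
    ranks = np.argmax(order == np.arange(len(q))[:, None], axis=1) + 1
    return float(np.median(ranks))
```

On Recipe1M, retrieval quality is commonly reported as median rank and recall@K over sampled galleries; the sketch above computes the former under the paired-index assumption.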
