Cross-modal Retrieval and Synthesis (X-MRS): Closing the modality gap in shared subspace

by Ricardo Guerrero, et al.

Computational food analysis (CFA), a broad set of methods that attempt to automate food understanding, naturally requires analysis of multi-modal evidence about a particular food or dish, e.g. images, recipe text, preparation video, nutrition labels, etc. A key enabler of CFA is multi-modal shared subspace learning, which in turn can be used for cross-modal retrieval and/or synthesis, particularly between food images and their corresponding textual recipes. In this work we propose a simple yet novel architecture for shared subspace learning, which is used to tackle the food image-to-recipe retrieval problem. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Experimental analysis on the public Recipe1M dataset shows that the subspace learned via the proposed method outperforms the current state of the art (SoTA) in food retrieval by a large margin, obtaining a recall@1 (R@1) of 0.64. Furthermore, in order to demonstrate the representational power of the learned subspace, we propose a generative food image synthesis model conditioned on the embeddings of recipes. Synthesized images can effectively reproduce the visual appearance of paired samples, achieving an R@1 of 0.68 in the image-to-recipe retrieval experiment, thus effectively capturing the semantics of the textual recipe.
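For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how recall@k could be computed for image-to-recipe retrieval in a shared subspace. It is not the authors' evaluation code: the function name `recall_at_k`, the use of cosine similarity, and the convention that row i of each array forms a matched image-recipe pair are all assumptions made for illustration.

```python
import numpy as np

def recall_at_k(image_emb, recipe_emb, k=1):
    """Fraction of image queries whose paired recipe ranks in the top k.

    Assumption: row i of image_emb and row i of recipe_emb are a true pair,
    and both arrays already live in the same (shared) embedding space.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rec = recipe_emb / np.linalg.norm(recipe_emb, axis=1, keepdims=True)

    # sim[i, j] is the similarity between image i and recipe j.
    sim = img @ rec.T

    # Similarity of each image to its own paired recipe, as a column vector.
    true_sim = np.diag(sim)[:, None]

    # Rank of the true recipe = number of recipes scoring strictly higher.
    ranks = (sim > true_sim).sum(axis=1)

    return float((ranks < k).mean())

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
recipes = rng.normal(size=(100, 16))
images = recipes + 0.5 * rng.normal(size=(100, 16))
print(recall_at_k(images, recipes, k=1))
```

Under this definition, an R@1 of 0.64 would mean that for 64% of query images the paired recipe is the single nearest neighbour among the candidates.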



CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval

Despite the abundance of multi-modal data, such as image-text pairs, the...

Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism

Cross-modal food retrieval is an important task to perform analysis of f...

A Rich Recipe Representation as Plan to Support Expressive Multi Modal Queries on Recipe Content and Preparation Process

Food is not only a basic human necessity but also a key factor driving a...

Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images

Increased awareness of the impact of food consumption on health and life...

Multi-modal Cooking Workflow Construction for Food Recipes

Understanding food recipe requires anticipating the implicit causal effe...

The Art of Food: Meal Image Synthesis from Ingredients

In this work we propose a new computational framework, based on generati...
