Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

08/06/2023
by Zheng Ma, et al.

Vision-language models (VLMs) have shown impressive performance on a wide range of downstream multimodal tasks. However, comparing only fine-tuned performance on downstream tasks yields poor interpretability of VLMs, which hinders their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news, so the generalization of VLMs to specific domains deserves more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in one specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images across 494 categories. Each image is accompanied by a detailed caption that includes fine-grained attributes of the food, such as its ingredients, shape, and color. We also provide a culinary culture taxonomy that classifies each food category by its geographic origin, in order to better analyze how VLM performance differs across regions. Experiments on our proposed dataset demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs with different architectures to verify these observations. We hope that our study will draw researchers' attention to the limitations of VLMs when they are applied to the domain of food or culinary cultures, and spur further investigations to address this issue.


research
07/01/2022

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Vision-Language Pretraining (VLP) models have recently successfully faci...
research
07/14/2019

FoodX-251: A Dataset for Fine-grained Food Classification

Food classification is a challenging problem due to the large number of ...
research
03/30/2021

Large Scale Visual Food Recognition

Food recognition plays an important role in food choice and intake, whic...
research
05/12/2021

A Large-Scale Benchmark for Food Image Segmentation

Food image segmentation is a critical and indispensable task for develop...
research
08/28/2023

FIRE: Food Image to REcipe generation

Food computing has emerged as a prominent multidisciplinary field of res...
research
04/12/2023

NutritionVerse-Thin: An Optimized Strategy for Enabling Improved Rendering of 3D Thin Food Models

With the growth in capabilities of generative models, there has been gro...
research
11/17/2021

Fine-grained prediction of food insecurity using news streams

Anticipating the outbreak of a food crisis is crucial to efficiently all...
