Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

by   Jianwei Yang, et al.

In an open-world setting, it is inevitable that an intelligent agent (e.g., a robot) will encounter visual objects, attributes or relationships it does not recognize. In this work, we develop an agent empowered with visual curiosity, i.e. the ability to ask questions to an Oracle (e.g., human) about the contents in images (e.g., What is the object on the left side of the red cube?) and build visual recognition model based on the answers received (e.g., Cylinder). In order to do this, the agent must (1) understand what it recognizes and what it does not, (2) formulate a valid, unambiguous and informative language query (a question) to ask the Oracle, (3) derive the parameters of visual classifiers from the Oracle response and (4) leverage the updated visual classifiers to ask more clarified questions. Specifically, we propose a novel framework and formulate the learning of visual curiosity as a reinforcement learning problem. In this framework, all components of our agent, visual recognition module (to see), question generation policy (to ask), answer digestion module (to understand) and graph memory module (to memorize), are learned entirely end-to-end to maximize the reward derived from the scene graph obtained by the agent as a consequence of the dialog with the Oracle. Importantly, the question generation policy is disentangled from the visual recognition system and specifics of the environment. Consequently, we demonstrate a sort of double generalization. Our question generation policy generalizes to new environments and a new pair of eyes, i.e., new visual system. Trained on a synthetic dataset, our results show that our agent learns new visual concepts significantly faster than several heuristic baselines, even when tested on synthetic environments with novel objects, as well as in a realistic environment.


page 6

page 14

page 17

page 18


Ask Before You Act: Generalising to Novel Environments by Asking Questions

Solving temporally-extended tasks is a challenge for most reinforcement ...

Active Sparse Conversations for Improved Audio-Visual Embodied Navigation

Efficient navigation towards an audio-goal necessitates an embodied agen...

Improve the efficiency of deep reinforcement learning through semantic exploration guided by natural language

Reinforcement learning is a powerful technique for learning from trial a...

Modeling question asking using neural program generation

People ask questions that are far richer, more informative, and more cre...

Jointly Learning to See, Ask, and GuessWhat

We are interested in understanding how the ability to ground language in...

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

GuessWhat?! is a two-player visual dialog guessing game where player A a...

Visual Concept-Metaconcept Learning

Humans reason with concepts and metaconcepts: we recognize red and green...

Please sign up or login with your details

Forgot password? Click here to reset