Humans and deep networks largely agree on which kinds of variation make object recognition harder

by Saeed Reza Kheradpisheh, et al.

View-invariant object recognition is a challenging problem, which has attracted much attention from the psychology, neuroscience, and computer vision communities. Humans are notoriously good at it, even if some variations are presumably more difficult to handle than others (e.g. 3D rotations). Humans are thought to solve the problem through hierarchical processing along the ventral stream, which progressively extracts more and more invariant visual features. This feed-forward architecture has inspired a new generation of bio-inspired computer vision systems called deep convolutional neural networks (DCNNs), which are currently the best algorithms for object recognition in natural images. Here, for the first time, we systematically compared human feed-forward vision and DCNNs at view-invariant object recognition using the same images and controlling for both the kind of transformation and its magnitude. We used four object categories, with images rendered from 3D computer models. In total, 89 human subjects participated in 10 experiments in which they had to discriminate between two or four categories after rapid presentation with backward masking. We also tested two recent DCNNs on the same tasks. We found that humans and DCNNs largely agreed on the relative difficulty of each kind of variation: rotation in depth is by far the hardest transformation to handle, followed by scale, then rotation in plane, and finally position. This suggests that humans recognize objects mainly through 2D template matching, rather than by constructing 3D object models, and that DCNNs are not too unreasonable models of human feed-forward vision. Also, our results show that the variation levels in rotation in depth and scale strongly modulate both humans' and DCNNs' recognition performance. We thus argue that these variations should be controlled in the image datasets used in vision research.




Object Recognition in Deep Convolutional Neural Networks is Fundamentally Different to That in Humans

Object recognition is a primary function of the human visual system. It ...

CIFAR10 to Compare Visual Recognition Performance between Deep Neural Networks and Humans

Visual object recognition plays an essential role in human daily life. T...

What takes the brain so long: Object recognition at the level of minimal images develops for up to seconds of presentation time

Rich empirical evidence has shown that visual object recognition in the ...

CortexNet: a Generic Network Family for Robust Visual Temporal Representations

In the past five years we have observed the rise of incredibly well perf...

Reconstruction-guided attention improves the robustness and shape processing of neural networks

Many visual phenomena suggest that humans use top-down generative or rec...

New Graph-based Features For Shape Recognition

Shape recognition is the main challenging problem in computer vision. Di...

Compensating for Large In-Plane Rotations in Natural Images

Rotation invariance has been studied in the computer vision community pr...
