Object-centric Learning with Cyclic Walks between Parts and Whole
Learning object-centric representations from complex natural environments enables both humans and machines to reason from low-level perceptual features. To capture the compositional entities of a scene, we propose cyclic walks between perceptual features extracted from CNNs or transformers and object entities. First, a slot-attention module interfaces with these perceptual features and produces a finite set of slot representations. These slots can bind to any object entity in the scene via inter-slot competition for attention. Next, we establish entity-feature correspondence with cyclic walks along high-transition-probability paths, where transition probabilities are based on the pairwise similarity between perceptual features (the "parts") and slot-bound object representations (the "whole"). The whole is greater than its parts, and the parts constitute the whole. These part-whole interactions form cycle consistencies that serve as supervisory signals for training the slot-attention module. We empirically demonstrate that networks trained with our cyclic walks can extract object-centric representations on seven image datasets across three unsupervised learning tasks. In contrast to object-centric models that attach a decoder for image or feature reconstruction, our cyclic walks provide strong supervision signals while avoiding computational overhead and improving memory efficiency.
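To make the cycle-consistency objective concrete, below is a minimal PyTorch sketch of a part-whole cyclic walk loss. It assumes transition probabilities given by a temperature-scaled softmax over cosine similarities, in the style of contrastive random walks; the function name `cyclic_walk_loss`, the temperature value, and the normalization choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def cyclic_walk_loss(features: torch.Tensor, slots: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Cycle-consistency loss from walks between parts and whole.

    features: (N, D) perceptual features from a CNN/transformer ("parts").
    slots:    (K, D) slot representations from slot attention ("whole").
    Hypothetical sketch: details may differ from the paper.
    """
    f = F.normalize(features, dim=-1)
    s = F.normalize(slots, dim=-1)

    # Transition probabilities along pairwise similarity:
    # parts -> whole, then whole -> parts.
    part_to_whole = F.softmax(f @ s.t() / temperature, dim=-1)  # (N, K)
    whole_to_part = F.softmax(s @ f.t() / temperature, dim=-1)  # (K, N)

    # Round-trip transition matrix (row-stochastic). Cycle consistency
    # asks each part to walk to a slot and return to itself.
    round_trip = part_to_whole @ whole_to_part                  # (N, N)
    targets = torch.arange(features.size(0), device=features.device)
    return F.nll_loss(torch.log(round_trip + 1e-8), targets)
```

Because the supervision comes from the round-trip transition matrix alone, no pixel- or feature-level decoder is needed, which is the source of the memory and compute savings claimed above.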