Substance or Style: What Does Your Image Embedding Know?

by Cyrus Rashtchian, et al.

Probes are small networks that predict properties of the underlying data from embeddings, and they provide a targeted, effective way to illuminate the information contained in those embeddings. While probe-based analysis has become standard in NLP, it has been much less explored in vision, where image foundation models have primarily been evaluated for semantic content. Better understanding the non-semantic information in popular embeddings (e.g., MAE, SimCLR, or CLIP) sheds new light both on the training algorithms and on the uses for these foundation models. We design a systematic transformation prediction task and measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE). Overall, our results suggest that the choice of pre-training algorithm impacts the types of information in the embedding, and certain models are better than others for non-semantic downstream tasks.
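The probing setup described above can be illustrated with a minimal sketch. This is not the authors' code: the embeddings here are synthetic stand-ins (random clusters, one per hypothetical transformation such as blur, JPEG compression, or style transfer), and the probe is a plain softmax linear classifier trained with gradient descent, which is the simplest form a probe can take.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen image embeddings: each transformation
# shifts the embedding distribution by a class-specific offset.
n_per_class, dim, n_classes = 200, 64, 3
means = rng.normal(scale=2.0, size=(n_classes, dim))
X = np.concatenate([rng.normal(loc=m, size=(n_per_class, dim)) for m in means])
y = np.repeat(np.arange(n_classes), n_per_class)

# Shuffle and split into train/test.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
split = int(0.8 * len(y))
Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]

# Linear softmax probe trained with batch gradient descent.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
for _ in range(200):
    logits = Xtr @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(ytr)), ytr] -= 1.0            # grad of cross-entropy
    W -= 0.1 * (Xtr.T @ p) / len(ytr)
    b -= 0.1 * p.mean(axis=0)

acc = float((np.argmax(Xte @ W + b, axis=1) == yte).mean())
print(f"probe accuracy: {acc:.2f}")
```

In the actual study, `X` would be embeddings from a frozen model (e.g., SimCLR or CLIP) computed on transformed images, and the probe's test accuracy measures how much information about the applied transformation survives in the embedding.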




