Linear Guardedness and its Implications

10/18/2022
by Shauli Ravfogel, et al.

Previous work on concept identification in neural representations has focused on linear concept subspaces and their neutralization. In this work, we formulate the notion of linear guardedness, the inability to directly predict a given concept from the representation, and study its implications. We show that, in the binary case, the neutralized concept cannot be recovered by an additional linear layer. However, we point out that, contrary to what was implicitly argued in previous works, multiclass softmax classifiers can be constructed that indirectly recover the concept. Linear guardedness therefore does not prevent linear classifiers from utilizing the neutralized concept, which sheds light on the theoretical limitations of linear information-removal methods.
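
To make the failure mode concrete, here is a minimal, self-contained sketch, not the paper's actual construction, using synthetic XOR-style data and scikit-learn. Four Gaussian clusters sit at the corners of the unit square, and the binary concept is the XOR of the corner coordinates: by symmetry, a logistic (log-linear) probe cannot beat the trivial constant predictor on the concept, yet a multiclass softmax classifier trained to identify the four clusters recovers the concept almost perfectly once its predictions are mapped back through the XOR. The data, labels, and model choices below are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): four Gaussian clusters at
# the corners of the unit square. The binary concept is the XOR of the
# corner coordinates, so by symmetry no logistic (log-linear) probe can
# beat the trivial constant predictor on it.
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
cluster = rng.integers(0, 4, size=4000)
X = corners[cluster] + 0.05 * rng.standard_normal((4000, 2))
concept = (corners[cluster].sum(axis=1) % 2).astype(int)  # XOR concept

# Direct binary linear probe on the concept: accuracy stays near chance (~0.5).
probe = LogisticRegression().fit(X, concept)
print("binary linear probe accuracy:", probe.score(X, concept))

# Multiclass softmax over the four clusters: near-perfect, since each
# cluster occupies its own convex argmax region. Mapping the predicted
# cluster back through the XOR recovers the concept almost exactly.
multi = LogisticRegression(max_iter=1000).fit(X, cluster)
recovered = (corners[multi.predict(X)].sum(axis=1) % 2).astype(int)
print("concept recovered via multiclass argmax:", (recovered == concept).mean())
```

The multiclass model succeeds because its argmax carves the plane into four convex regions, one per corner; the union of two diagonally opposite regions is not a half-space, which is exactly the nonlinearity a binary linear probe lacks.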
