On approximating ∇ f with neural networks
Consider a feedforward neural network ψ: R^d → R^d such that ψ ≈ ∇f, where f: R^d → R is a smooth function; ψ must therefore satisfy ∂_j ψ_i = ∂_i ψ_j pointwise. We prove a theorem stating that for any such ψ network of depth L > 2, all the input weights must be parallel to each other. In other words, ψ can only represent a single feature in its first hidden layer. The proof of the theorem is straightforward: two backward paths (from i to j and from j to i) and a weight-tying matrix (connecting the last and first hidden layers) play the key roles. We thus make a strong theoretical case for the implicit parametrization, in which the neural network is ϕ: R^d → R and ∇ϕ ≈ ∇f. Throughout, we revisit two recent unnormalized probabilistic models that are formulated as ψ ≈ ∇f, and we discuss denoising autoencoders at the end.
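To make the implicit parametrization concrete, the following is a minimal sketch (in JAX; the library choice, network architecture, and all names are ours, not from the paper) of a scalar network ϕ whose gradient ∇ϕ is obtained by automatic differentiation. Because the resulting vector field is an exact gradient, its Jacobian is the Hessian of ϕ and the symmetry constraint ∂_j ψ_i = ∂_i ψ_j holds by construction.

```python
# Sketch (assumptions ours): a scalar MLP phi: R^d -> R whose gradient
# serves as the vector field psi = grad(phi) ~ grad(f). Since psi is an
# exact gradient, its Jacobian equals the Hessian of phi, which is
# symmetric, so d_j psi_i = d_i psi_j is satisfied automatically.
import jax
import jax.numpy as jnp

def init_params(key, d, hidden=64):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (hidden, d)) / jnp.sqrt(d),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden,)) / jnp.sqrt(hidden),
    }

def phi(params, x):
    # Scalar-valued network phi: R^d -> R (one hidden layer for brevity).
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

# The implicit parametrization: psi = grad_x phi.
psi = jax.grad(phi, argnums=1)

key = jax.random.PRNGKey(0)
d = 5
params = init_params(key, d)
x = jax.random.normal(key, (d,))

# The Jacobian of psi is the Hessian of phi, hence symmetric.
J = jax.jacfwd(psi, argnums=1)(params, x)
print(jnp.allclose(J, J.T, atol=1e-5))  # True
```

By contrast, a direct ψ: R^d → R^d network would have to satisfy the symmetry constraint through its weights, which is where the theorem's restriction on the first-layer weights enters.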