Depth Separations in Neural Networks: What is Actually Being Separated?
Existing depth separation results for constant-depth networks essentially show that certain radial functions in R^d, which can be easily approximated with depth 3 networks, cannot be approximated by depth 2 networks, even up to constant accuracy, unless their size is exponential in d. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension d (or equivalently, by rescaling the function, the hardness result applies to O(1)-Lipschitz functions only when the target accuracy ϵ is at most poly(1/d)). In this paper, we study whether such depth separations might still hold in the natural setting of O(1)-Lipschitz radial functions, when ϵ does not scale with d. Perhaps surprisingly, we show that the answer is negative: in contrast to the intuition suggested by previous work, it is possible to approximate O(1)-Lipschitz radial functions with depth 2, size poly(d) networks, for every constant ϵ. We complement this by showing that approximating such functions is also possible with depth 2, size poly(1/ϵ) networks, for every constant d. Finally, we show that it is not possible to have polynomial dependence on both d and 1/ϵ simultaneously. Overall, our results indicate that in order to show depth separations for expressing O(1)-Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature.
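To make the setting concrete, the following toy script is a purely illustrative sketch (not the paper's construction or proof): it builds a depth 2 (single-hidden-layer) ReLU network with random unit-norm first-layer directions, fits only the output layer by least squares to a 1-Lipschitz radial target g(||x||) under standard Gaussian inputs, and reports the resulting squared error. The dimension, width, bias range, and target profile below are all assumptions chosen for the demo.

```python
# Illustrative toy: depth-2 ReLU approximation of an O(1)-Lipschitz radial target.
# This is NOT the paper's construction; all hyperparameters are demo assumptions.
import numpy as np

rng = np.random.default_rng(0)

d = 20        # input dimension (assumed; any moderate d works for the demo)
width = 500   # hidden-layer size of the depth-2 network (assumed)
n_train, n_test = 5000, 2000

def target(x):
    # 1-Lipschitz radial profile: a unit "bump" around the typical norm sqrt(d)
    r = np.linalg.norm(x, axis=1)
    return np.maximum(0.0, 1.0 - np.abs(r - np.sqrt(d)))

# Random first layer: unit-norm directions w_i and uniform biases b_i.
W = rng.standard_normal((width, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
b = rng.uniform(-3.0, 3.0, size=width)

def hidden(x):
    # ReLU(w_i . x + b_i) for all hidden units
    return np.maximum(x @ W.T + b, 0.0)

# Fit only the output layer by least squares (a random-features surrogate
# for training the full depth-2 network end to end).
X_tr = rng.standard_normal((n_train, d))
coef, *_ = np.linalg.lstsq(hidden(X_tr), target(X_tr), rcond=None)

X_te = rng.standard_normal((n_test, d))
mse = np.mean((hidden(X_te) @ coef - target(X_te)) ** 2)
print(f"d={d}, width={width}, test MSE: {mse:.4f}")
```

How the hidden-layer width in such depth 2 approximations must scale with d and 1/ϵ is precisely the question the abstract addresses: poly(d) suffices for constant ϵ, poly(1/ϵ) suffices for constant d, but polynomial dependence on both simultaneously is impossible.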