Pragmatic Constraint on Distributional Semantics

11/20/2022
by Elizaveta Zhemchuzhina, et al.

This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in their frequency and in their semantics. Namely, tokens that have a one-to-one correspondence with a single semantic concept have different statistical properties than those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
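As an illustration of the first claim, the minimal sketch below (not the authors' code) fits a Zipf exponent to token rank-frequency counts under two simple tokenizations; the corpus file name and the choice of tokenizers are placeholder assumptions, and the least-squares log-log fit is only a rough diagnostic of Zipf-like behavior.

```python
# Rough check of Zipf-like rank-frequency behavior under two tokenizations.
# "corpus.txt" is a placeholder; substitute any sufficiently large text.
from collections import Counter
import numpy as np

corpus = open("corpus.txt", encoding="utf-8").read().lower()

tokenizations = {
    "whitespace": corpus.split(),                                   # word-level tokens
    "char_bigrams": [corpus[i:i + 2] for i in range(len(corpus) - 1)],  # character bigrams
}

for name, tokens in tokenizations.items():
    # Sort token counts in descending order to get the rank-frequency curve.
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    # Zipf's law predicts frequency ~ rank^(-alpha); estimate alpha on log-log scale.
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    print(f"{name}: {len(counts)} token types, fitted exponent alpha ~ {-slope:.2f}")
```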
