Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis
This paper proposes a deep multi-speaker text-to-speech (TTS) model for spoofing speaker verification (SV) systems. The proposed model employs one network to synthesize time-downsampled mel-spectrograms from text input and another network to convert them to linear-frequency spectrograms, which are further converted to the time domain using the Griffin-Lim algorithm. Both networks are trained separately under the generative adversarial networks (GAN) framework. Spoofing experiments on two state-of-the-art SV systems (i-vectors and Google's GE2E) show that the proposed system can successfully spoof these systems with a high success rate. Spoofing experiments on anti-spoofing systems (i.e., binary classifiers for discriminating real and synthetic speech) also show a high spoof success rate when such anti-spoofing systems' structures are exposed to the proposed TTS system.
READ FULL TEXT