Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
We present a unified and hardware-efficient architecture for two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can be falsely activated by audio segments that are acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post-trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices, which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture that progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% … for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% … Finally, on-device measurements show 32% … reduction in inference time compared to the non-streaming version of the model.
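To make the idea of chunk-wise processing with retained audio context concrete, the following is a minimal sketch, not the architecture from the paper. It assumes a single-head self-attention layer in NumPy with randomly initialized weights, and every name in it (`StreamingSelfAttention`, `process_chunk`, `run_stream`, `max_context`) is hypothetical. Each incoming chunk attends to itself plus a bounded cache of keys and values from earlier chunks, so the full utterance never has to be held in memory at once.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StreamingSelfAttention:
    """Single-head self-attention with a bounded cache of past frames.

    Illustrative only: weights are random and there are no VTD/FTM
    output heads, feed-forward blocks, or layer normalization.
    """

    def __init__(self, dim, max_context, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.dim = dim
        self.max_context = max_context  # number of past frames to keep
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def process_chunk(self, chunk):
        """Process one chunk of shape (frames, dim) and update the cache."""
        q = chunk @ self.wq
        # Keys and values cover the cached past frames plus the new chunk.
        k = np.concatenate([self.k_cache, chunk @ self.wk])
        v = np.concatenate([self.v_cache, chunk @ self.wv])
        weights = softmax(q @ k.T / np.sqrt(self.dim))
        out = weights @ v
        # Retain only the most recent max_context frames for the next chunk.
        start = max(0, len(k) - self.max_context)
        self.k_cache = k[start:]
        self.v_cache = v[start:]
        return out


def run_stream(layer, features, chunk_size):
    """Feed features of shape (frames, dim) to the layer chunk by chunk."""
    outputs = [
        layer.process_chunk(features[i:i + chunk_size])
        for i in range(0, len(features), chunk_size)
    ]
    return np.concatenate(outputs)
```

In this sketch the memory held between chunks is fixed by `max_context` rather than by the utterance length, which is the general property that makes streaming attractive on device. How the actual model sizes its chunks and context is not stated in the abstract.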