Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

08/08/2020
by Takuya Higuchi, et al.

We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small-footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application that lets users activate their devices by simply saying a keyword or phrase. For privacy and latency reasons, a voice trigger detection system should run on an always-on processor on the device, so low memory and compute cost is crucial. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. An SVDF approximates a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, we propose the S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improves the FRR up to 12.2%
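The layer structure described in the abstract (a 1D convolution followed by a depth-wise 1D convolution) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the list-based tensor layout, the absence of bias, padding, and non-linearity, and the toy weights are all assumptions made for illustration. The sketch also assumes one plausible reading of the "special case" claim, namely that an SVDF-like layer corresponds to a kernel size of 1 in the first convolution.

```python
def conv1d(x, w):
    """Plain 1D convolution over time (no padding, stride 1).

    x: list of T frames, each a list of D_in features.
    w: list of K taps, each a D_out x D_in matrix (list of rows).
    Returns T - K + 1 frames, each with D_out features.
    """
    K, d_out = len(w), len(w[0])
    out = []
    for t in range(len(x) - K + 1):
        frame = []
        for o in range(d_out):
            s = 0.0
            for k in range(K):
                s += sum(wi * xi for wi, xi in zip(w[k][o], x[t + k]))
            frame.append(s)
        out.append(frame)
    return out


def depthwise_conv1d(x, w):
    """Depth-wise 1D convolution: each channel has its own temporal filter.

    x: list of T frames, each a list of C channels.
    w: list of C filters, each a list of K taps.
    Returns T - K + 1 frames, each with C channels.
    """
    C, K = len(w), len(w[0])
    return [
        [sum(w[c][k] * x[t + k][c] for k in range(K)) for c in range(C)]
        for t in range(len(x) - K + 1)
    ]


def s1dcnn_layer(x, w_conv, w_depth):
    """Hypothetical S1DCNN-style layer: conv1d followed by depthwise_conv1d."""
    return depthwise_conv1d(conv1d(x, w_conv), w_depth)


# Toy input: 4 frames with 2 features each.
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
identity = [[1, 0], [0, 1]]
# Depth-wise filters: channel 0 sums two frames, channel 1 differences them.
w_depth = [[1, 1], [1, -1]]

# First convolution with kernel size 1 (assumed SVDF-like special case).
y_k1 = s1dcnn_layer(x, [identity], w_depth)
# First convolution with kernel size 2 (the more general case).
y_k2 = s1dcnn_layer(x, [identity, identity], w_depth)
```

With a kernel size of 1, the first stage mixes features within a single frame and only the depth-wise stage looks across time. A larger kernel in the first stage lets both stages use temporal context, at the cost of more parameters and a longer receptive field.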

research
10/09/2021

Streaming on-device detection of device directed speech from voice and touch-based invocation

When interacting with smart devices such as mobile phones or wearables, ...
research
11/28/2016

An End-to-End Architecture for Keyword Spotting and Voice Activity Detection

We propose a single neural network architecture for two tasks: on-line k...
research
10/26/2020

MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection

We present MarbleNet, an end-to-end neural network for Voice Activity De...
research
08/29/2022

Streaming Intended Query Detection using E2E Modeling for Continued Conversation

In voice-enabled applications, a predetermined hotword is usually used to...
research
08/09/2020

Accurate Detection of Wake Word Start and End Using a CNN

Small footprint embedded devices require keyword spotters (KWS) with sma...
research
10/29/2020

Progressive Voice Trigger Detection: Accuracy vs Latency

We present an architecture for voice trigger detection for virtual assis...
research
10/06/2022

WakeUpNet: A Mobile-Transformer based Framework for End-to-End Streaming Voice Trigger

End-to-end models have gradually become the main technical stream for vo...
