A Multi-tasking Model of Speaker-Keyword Classification for Keeping Human in the Loop of Drone-assisted Inspection

by Yu Li, et al.

Audio commands are a preferred communication medium for keeping inspectors in the loop of civil infrastructure inspection performed by a semi-autonomous drone. To understand job-specific commands from a heterogeneous and dynamic group of inspectors, a model must be developed cost-effectively for the group and adapted easily when the group changes. This paper builds a multi-tasking deep learning model with a Share-Split-Collaborate architecture. The architecture lets the two classification tasks (speaker and keyword) share the feature extractor and then, through feature projection and collaborative training, split the subject-specific and keyword-specific features that are intertwined in the extracted features. A base model for a group of five authorized subjects is trained and tested on the inspection keyword dataset collected in this study. The model achieved 95.3% [...] of any authorized inspectors. Its mean accuracy in speaker classification is 99.2%. [...] pooled training data, adapting the base model to a new inspector requires only a little training data from that inspector, such as five utterances per keyword. Using the speaker classification scores for inspector verification can achieve a success rate of at least 93.9% [...] in detecting unauthorized ones. Further, the paper demonstrates the applicability of the proposed model to larger groups on a public dataset. The paper thus offers a solution to challenges facing AI-assisted human-robot interaction, including worker heterogeneity, worker dynamics, and job heterogeneity.
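
The idea of using speaker classification scores for inspector verification can be illustrated with a minimal sketch. It is not the paper's implementation: the softmax normalization, the 0.9 acceptance threshold, the speaker labels, and the function names `softmax` and `verify_speaker` are all assumptions made for illustration. The sketch accepts a command only when the top speaker probability is high enough and otherwise treats the speaker as unauthorized.

```python
import math

def softmax(scores):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def verify_speaker(scores, speakers, threshold=0.9):
    """Return (speaker, prob) if the top probability meets the threshold,
    else (None, prob). The threshold value is a hypothetical choice."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return speakers[best], probs[best]
    return None, probs[best]

# Hypothetical group of five authorized inspectors.
speakers = ["A", "B", "C", "D", "E"]

# One score clearly dominates: the utterance is attributed to inspector "A".
confident = [8.0, 1.0, 0.5, 0.2, 0.1]
# No score stands out: the speaker is rejected as unauthorized.
ambiguous = [1.0, 1.0, 1.0, 1.0, 1.0]

print(verify_speaker(confident, speakers))
print(verify_speaker(ambiguous, speakers))
```

In practice the threshold would be tuned on held-out data, since it trades off accepting authorized inspectors against rejecting unauthorized ones.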


