Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image
Ultrasound tongue imaging is widely used in speech production research, and it has attracted increasing attention because of its potential applications in many fields, such as visual biofeedback for second language acquisition and silent speech interfaces. Unlike previous studies, here we explore the feasibility of estimating a speaker's age from ultrasound tongue images. Motivated by the success of deep learning, we apply it to this task and train a deep convolutional neural network on the UltraSuite dataset. The model achieves a mean absolute error (MAE) of 2.03 on data from typically developing children, whereas the MAE is 4.87 on data from children with speech sound disorders, which suggests that age estimation from ultrasound is more challenging for children with speech sound disorders. The developed method can be used as a tool to evaluate the performance of speech therapy sessions. It is also worth noting that, although we use ultrasound tongue imaging in this study, the proposed method may also be extended to other imaging modalities (e.g., MRI) to assist studies of speech production.
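For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean absolute error is computed for age predictions. The function name and the toy ages are hypothetical and are not taken from the paper or from the UltraSuite dataset; the abstract does not state the unit of the reported MAE, so none is assumed here.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average of the absolute differences between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("both sequences must have the same length")
    if len(true_ages) == 0:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical toy values, for illustration only.
true_ages = [6.0, 8.0, 10.0, 12.0]
predicted_ages = [7.0, 7.0, 12.0, 12.0]

# Absolute errors are 1, 1, 2 and 0, so the mean is 4 / 4.
mae = mean_absolute_error(true_ages, predicted_ages)
print(mae)  # prints 1.0
```

A lower MAE means the predicted ages lie closer to the true ages on average, which is how the two figures in the abstract (2.03 versus 4.87) should be compared.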