Language Aided Speaker Diarization Using Speaker Role Information
Speaker diarization relies on the assumption that acoustic embeddings from speech segments corresponding to a particular speaker share common characteristics. Thus, they are concentrated in a specific region of the speaker space, a region that represents that speaker's identity. Those identities, however, are not known a priori, so a clustering algorithm is employed, which is typically based solely on audio. In this work we explore conversational scenarios in which the speakers play distinct roles and are expected to follow different linguistic patterns. We aim to exploit this distinct linguistic variability and build a language-based segmenter and a role recognizer, which can be used to construct the speaker identities. In this way, we are able to boost diarization performance by converting the clustering task into a classification one. The proposed method is applied to real-world dyadic psychotherapy interactions between a provider and a patient and is shown to yield improved results.
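To make the idea concrete, below is a minimal sketch, not the authors' implementation, of how role information could turn the usual unsupervised clustering step into classification. It assumes a dyadic setting with "provider" and "patient" roles, a hypothetical `Segment` type holding an ASR transcript and an acoustic speaker embedding, and a simple TF-IDF plus logistic-regression text model as a stand-in for the role recognizer; the centroid-based acoustic assignment is an illustrative simplification.

```python
# Hedged sketch: role-informed diarization as classification rather than clustering.
# All names, models, and data structures here are hypothetical illustrations.
from dataclasses import dataclass
from typing import List
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

@dataclass
class Segment:
    text: str                # ASR transcript of the speech segment
    embedding: np.ndarray    # acoustic speaker embedding (e.g., an x-vector)

def build_role_recognizer(train_texts: List[str], train_roles: List[str]):
    """Train a text-based role classifier (stand-in for the paper's role recognizer)."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return clf.fit(train_texts, train_roles)

def diarize_by_role(segments: List[Segment], role_clf) -> List[str]:
    """Label segments by predicted role, build one acoustic identity (centroid)
    per role, then assign each segment to its nearest identity: a classification
    step that replaces unsupervised speaker clustering."""
    roles = role_clf.predict([s.text for s in segments])
    centroids = {
        r: np.mean([s.embedding for s, pr in zip(segments, roles) if pr == r], axis=0)
        for r in set(roles)
    }
    labels = []
    for s in segments:
        # cosine similarity between the segment embedding and each role centroid
        sims = {
            r: float(s.embedding @ c) / (np.linalg.norm(s.embedding) * np.linalg.norm(c))
            for r, c in centroids.items()
        }
        labels.append(max(sims, key=sims.get))
    return labels
```

In this toy formulation the role recognizer supplies the speaker identities directly, so the acoustic step only has to decide which known identity each segment matches, rather than discover the identities from scratch.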