PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Hexin Liu, Leibny Paola Garcia Perera, Andy W.H. Khong, Suzy J. Styles, Sanjeev Khudanpur

Research output: Contribution to journal, Conference article, peer-review

12 Citations (Scopus)

Abstract

We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of “phonotactic” embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multitask optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.
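The "over 40% relative improvement in terms of average cost" can be read as a fractional reduction in a lower-is-better metric. The sketch below only illustrates that arithmetic. The function name `relative_improvement` and both cost values are hypothetical, since the abstract does not state the underlying numbers.

```python
def relative_improvement(baseline_cost: float, new_cost: float) -> float:
    """Fractional reduction in cost relative to the baseline (lower cost is better)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost


# Hypothetical average-cost values, chosen only to illustrate the calculation;
# they are NOT results reported in the paper.
baseline = 0.10   # assumed cost of a model trained for LID only
proposed = 0.06   # assumed cost of a model trained with multitask optimization

print(f"Relative improvement: {relative_improvement(baseline, proposed):.0%}")
```

Under these assumed values the reduction is 40% of the baseline cost, which is the kind of comparison the abstract describes between PHO-LID and the LID-only CNN-Trans model.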

Original language: English
Pages (from-to): 2233-2237
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
Publication status: Published - 2022
Externally published: Yes
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sept 18 2022 to Sept 22 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Keywords

  • acoustic phonetics
  • Language identification
  • phoneme segmentation
  • phonotactics
  • self-supervised learning
