PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Hexin Liu; Leibny Paola Garcia Perera; Andy W.H. Khong; Suzy J. Styles; Sanjeev Khudanpur

doi:10.21437/Interspeech.2022-354

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of “phonotactic” embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multitask optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.
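The "over 40% relative improvement in terms of average cost" reported above can be read as a fractional reduction of a lower-is-better cost relative to the baseline. The short sketch below shows that arithmetic only; the function name `relative_improvement` and both cost values are hypothetical, chosen for illustration, and are not taken from the paper.

```python
def relative_improvement(baseline_cost: float, proposed_cost: float) -> float:
    """Fractional reduction of a lower-is-better cost, relative to the baseline."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - proposed_cost) / baseline_cost


# Hypothetical average-cost values for illustration only (NOT results from the paper).
baseline = 0.10   # assumed cost of a model optimized only for LID
proposed = 0.055  # assumed cost of a multitask model

improvement = relative_improvement(baseline, proposed)
print(f"relative improvement: {improvement:.1%}")
```

Under this reading, a relative improvement above 40% means the proposed model's average cost is below 60% of the baseline's.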

Original language	English
Pages (from-to)	2233-2237
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2022-September
DOIs	https://doi.org/10.21437/Interspeech.2022-354
Publication status	Published - 2022
Externally published	Yes
Event	23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration	Sept 18 2022 → Sept 22 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modelling and Simulation

Keywords

acoustic phonetics
Language identification
phoneme segmentation
phonotactics
self-supervised learning

Cite this

Liu, H., Perera, L. P. G., Khong, A. W. H., Styles, S. J., & Khudanpur, S. (2022). PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 2233-2237. https://doi.org/10.21437/Interspeech.2022-354

@article{5ae99f974539459480a50678621ff68d,
title = "PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification",
abstract = "We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of “phonotactic” embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multitask optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.",
keywords = "acoustic phonetics, Language identification, phoneme segmentation, phonotactics, self-supervised learning",
author = "Hexin Liu and Perera, {Leibny Paola Garcia} and Khong, {Andy W.H.} and Styles, {Suzy J.} and Sanjeev Khudanpur",
note = "Publisher Copyright: Copyright {\textcopyright} 2022 ISCA.; 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 ; Conference date: 18-09-2022 Through 22-09-2022",
year = "2022",
doi = "10.21437/Interspeech.2022-354",
language = "English",
volume = "2022-September",
pages = "2233--2237",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification. / Liu, Hexin; Perera, Leibny Paola Garcia; Khong, Andy W.H. et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2022-September, 2022, p. 2233-2237.

TY - JOUR
T1 - PHO-LID
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Liu, Hexin
AU - Perera, Leibny Paola Garcia
AU - Khong, Andy W.H.
AU - Styles, Suzy J.
AU - Khudanpur, Sanjeev
PY - 2022
Y1 - 2022

AB - We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of “phonotactic” embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multitask optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.

KW - acoustic phonetics
KW - Language identification
KW - phoneme segmentation
KW - phonotactics
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85140086967&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140086967&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-354
DO - 10.21437/Interspeech.2022-354
M3 - Conference article
AN - SCOPUS:85140086967
SN - 2308-457X
VL - 2022-September
SP - 2233
EP - 2237
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 18 September 2022 through 22 September 2022
ER -
