Efficient Self-Supervised Learning Representations for Spoken Language Identification

Hexin Liu*, Leibny Paola Garcia Perera, Andy W.H. Khong, Eng Siong Chng, Suzy J. Styles, Sanjeev Khudanpur

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

25 Citations (Scopus)

Abstract

Self-supervised learning has been widely exploited to learn powerful speech representations. The premise of this paper is that these learned self-supervised representations contain information irrelevant to a particular downstream task. Hence, we investigate efficient methods to compute reliable representations and discard redundant information for language identification (LID) using a pre-trained multilingual wav2vec 2.0 model. To determine an optimal baseline system, we compare the performance of wav2vec features extracted from different inner layers of the context network. For this approach, the x-vector self-attention LID (XSA-LID) model forms the backbone used to discriminate between distinct languages. We then propose two mechanisms to reduce irrelevant information in the representations for LID. The first is the attentive squeeze-and-excitation (SE) block for dimension-wise scaling, and the second is the linear bottleneck (LBN) block, which reduces irrelevant information through nonlinear dimension reduction. We incorporate these two methods into the XSA-LID model and conduct experiments on the AP19-OLR data and the MLS14 data in NIST LRE 2017. By replacing the previous input features with wav2vec 2.0 features, the XSA-LID model achieves a 63.79% relative improvement in terms of average cost on the AP19-OLR data, and 40.42%, 41.54% and 18.97% relative improvements on 3 s, 10 s and 30 s test speech in the MLS14 data in NIST LRE 2017, respectively. In addition, the proposed LBN-XSA model achieves a 9.85% relative improvement on the AP19-OLR data and over 10% overall improvement on the MLS14 data with a modest number of additional parameters compared to the XSA-LID model. Finally, in terms of average cost and accuracy, the proposed LBN-XSA model outperforms the XSA-LID model that adopts the fine-tuned features on the AP19-OLR data.
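The dimension-wise scaling described in the abstract follows the squeeze-and-excitation pattern: pool the frame-level features over time, pass the pooled vector through a small bottleneck MLP, and use a sigmoid gate to rescale each feature dimension. The sketch below is a minimal standard SE block applied to wav2vec 2.0-style frame features; the paper's attentive SE variant and its exact hyperparameters (reduction ratio, feature dimension) are not given in the abstract, so those details here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block for (batch, time, dim) features.

    Squeeze: average-pool over time to a per-utterance summary vector.
    Excite: a bottleneck MLP with sigmoid output produces a per-dimension
    scale in (0, 1) that reweights every frame's features, suppressing
    dimensions deemed irrelevant for the downstream task.
    """

    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),  # bottleneck down-projection
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),  # up-projection back to dim
            nn.Sigmoid(),                      # per-dimension gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level representations
        scale = self.fc(x.mean(dim=1))     # squeeze over time -> (batch, dim)
        return x * scale.unsqueeze(1)      # broadcast gate over all frames


# Illustrative usage: dim=1024 assumes a wav2vec 2.0 large-style hidden size.
se = SEBlock(dim=1024)
feats = torch.randn(2, 50, 1024)  # 2 utterances, 50 frames each
out = se(feats)
print(out.shape)  # torch.Size([2, 50, 1024])
```

The LBN block mentioned in the abstract plays a complementary role: instead of gating dimensions, it projects the features through a lower-dimensional nonlinear bottleneck so that redundant directions are discarded outright.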

Original language: English
Pages (from-to): 1296-1307
Number of pages: 12
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 16
Issue number: 6
DOIs
Publication status: Published - Oct 1 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2007-2012 IEEE.

ASJC Scopus Subject Areas

  • Signal Processing
  • Electrical and Electronic Engineering

Keywords

  • Downstream
  • language identification
  • representation
  • self-supervised learning
