Efficient Self-Supervised Learning Representations for Spoken Language Identification

Hexin Liu*, Leibny Paola Garcia Perera, Andy W.H. Khong, Eng Siong Chng, Suzy J. Styles, Sanjeev Khudanpur

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

25 Citations (Scopus)

Abstract

Self-supervised learning has been widely exploited to learn powerful speech representations. The premise of this paper is that these learned self-supervised representations contain information irrelevant to a particular downstream task. Hence, we investigate efficient methods to compute reliable representations and discard redundant information for language identification (LID) using a pre-trained multilingual wav2vec 2.0 model. To determine an optimal baseline system, we compare the performance of wav2vec features extracted from different inner layers of the context network. For this approach, the x-vector self-attention LID (XSA-LID) model forms the backbone used to discriminate between distinct languages. We then propose two mechanisms to reduce irrelevant information in the representations for LID. The first is the attentive squeeze-and-excitation (SE) block for dimension-wise scaling, and the second is the linear bottleneck (LBN) block, which reduces irrelevant information through nonlinear dimension reduction. We incorporate these two methods into the XSA-LID model and conduct experiments on the AP19-OLR data and the MLS14 data in NIST LRE 2017. By replacing the previous input features with wav2vec 2.0 features, the XSA-LID model achieves a 63.79% relative improvement in terms of average cost on the AP19-OLR data, and 40.42%, 41.54% and 18.97% relative improvements on 3 s, 10 s and 30 s test speech in the MLS14 data in NIST LRE 2017, respectively. In addition, the proposed LBN-XSA model achieves a 9.85% relative improvement on the AP19-OLR data and over 10% overall improvement on the MLS14 data with a modest number of additional parameters compared to the XSA-LID model. Finally, in terms of average cost and accuracy, the proposed LBN-XSA model outperforms the XSA-LID model that adopts the fine-tuned features on the AP19-OLR data.
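The dimension-wise scaling described in the abstract follows the squeeze-and-excitation pattern: pool the frame-level features over time, pass the pooled vector through a small bottleneck MLP, and use a sigmoid gate to rescale each feature dimension. The sketch below is a minimal standard SE block applied to wav2vec 2.0-style frame features; the paper's attentive SE variant and its exact hyperparameters (reduction ratio, feature dimension) are not given in the abstract, so those details here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block for (batch, time, dim) features.

    Squeeze: average-pool over time to a per-utterance summary vector.
    Excite: a bottleneck MLP with sigmoid output produces a per-dimension
    scale in (0, 1) that reweights every frame's features, suppressing
    dimensions deemed irrelevant for the downstream task.
    """

    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),  # bottleneck down-projection
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),  # up-projection back to dim
            nn.Sigmoid(),                      # per-dimension gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level representations
        scale = self.fc(x.mean(dim=1))     # squeeze over time -> (batch, dim)
        return x * scale.unsqueeze(1)      # broadcast gate over all frames


# Illustrative usage: dim=1024 assumes a wav2vec 2.0 large-style hidden size.
se = SEBlock(dim=1024)
feats = torch.randn(2, 50, 1024)  # 2 utterances, 50 frames each
out = se(feats)
print(out.shape)  # torch.Size([2, 50, 1024])
```

The LBN block mentioned in the abstract plays a complementary role: instead of gating dimensions, it projects the features through a lower-dimensional nonlinear bottleneck so that redundant directions are discarded outright.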

Original language: English
Pages (from-to): 1296-1307
Number of pages: 12
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 16
Issue number: 6
DOIs
Publication status: Published - Oct 1 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2007-2012 IEEE.

ASJC Scopus Subject Areas

  • Signal Processing
  • Electrical and Electronic Engineering

Keywords

  • Downstream
  • language identification
  • representation
  • self-supervised learning
