MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

Victoria Y.H. Chua; Hexin Liu; Leibny Paola Garcia Perera; Fei Ting Woon; Jinyi Wong; Xiangyu Zhang; Sanjeev Khudanpur; Andy W.H. Khong; Justin Dauwels; Suzy J. Styles

doi:10.21437/Interspeech.2023-1446

Research output: Contribution to journal › Conference article › peer-review

6 Citations (Scopus)

Abstract

To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.

Original language	English
Pages (from-to)	4109-4113
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
DOIs	https://doi.org/10.21437/Interspeech.2023-1446
Publication status	Published - 2023
Externally published	Yes
Event	24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration	Aug 20 2023 → Aug 24 2023

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

ASJC Scopus Subject Areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modelling and Simulation

Keywords

child-directed speech
code-switching
language diarization
language identification

Cite this

Chua, V. Y. H., Liu, H., Perera, L. P. G., Woon, F. T., Wong, J., Zhang, X., Khudanpur, S., Khong, A. W. H., Dauwels, J., & Styles, S. J. (2023). MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4109-4113. https://doi.org/10.21437/Interspeech.2023-1446

@article{6e8fc5fba9ac4ec7a8f31a1a351507e2,

title = "MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization",

abstract = "To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.",

keywords = "child-directed speech, code-switching, language diarization, language identification",

author = "Chua, {Victoria Y.H.} and Hexin Liu and Perera, {Leibny Paola Garcia} and Woon, {Fei Ting} and Jinyi Wong and Xiangyu Zhang and Sanjeev Khudanpur and Khong, {Andy W.H.} and Justin Dauwels and Styles, {Suzy J.}",

note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; 24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

year = "2023",

doi = "10.21437/Interspeech.2023-1446",

language = "English",

volume = "2023-August",

pages = "4109--4113",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

Chua, VYH, Liu, H, Perera, LPG, Woon, FT, Wong, J, Zhang, X, Khudanpur, S, Khong, AWH, Dauwels, J & Styles, SJ 2023, 'MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4109-4113. https://doi.org/10.21437/Interspeech.2023-1446

MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization. / Chua, Victoria Y.H.; Liu, Hexin; Perera, Leibny Paola Garcia et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 4109-4113.

TY - JOUR

T1 - MERLIon CCS Challenge

T2 - 24th International Speech Communication Association, Interspeech 2023

AU - Chua, Victoria Y.H.

AU - Liu, Hexin

AU - Perera, Leibny Paola Garcia

AU - Woon, Fei Ting

AU - Wong, Jinyi

AU - Zhang, Xiangyu

AU - Khudanpur, Sanjeev

AU - Khong, Andy W.H.

AU - Dauwels, Justin

AU - Styles, Suzy J.

PY - 2023

Y1 - 2023

N2 - To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.

KW - child-directed speech

KW - code-switching

KW - language diarization

KW - language identification

UR - http://www.scopus.com/inward/record.url?scp=85162819207&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85162819207&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-1446

DO - 10.21437/Interspeech.2023-1446

M3 - Conference article

AN - SCOPUS:85162819207

SN - 2308-457X

VL - 2023-August

SP - 4109

EP - 4113

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 20 August 2023 through 24 August 2023

ER -
