MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

Victoria Y.H. Chua, Hexin Liu, Leibny Paola Garcia Perera, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy W.H. Khong, Justin Dauwels, Suzy J. Styles

Research output: Contribution to journal > Conference article > peer-review

6 Citations (Scopus)

Abstract

To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.

Original language: English
Pages (from-to): 4109-4113
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Externally published: Yes
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: Aug 20 2023 to Aug 24 2023

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

ASJC Scopus Subject Areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Keywords

  • child-directed speech
  • code-switching
  • language diarization
  • language identification
