End-to-end speech emotion recognition using multi-scale convolution networks

Tatinati Sivanagaraja, Mun Kit Ho, Andy W.H. Khong, Yubo Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

Automatic speech emotion recognition is a challenging task for the machine learning community, mainly because individuals vary significantly in how they express the same emotion cue. The success of emotion recognition with machine learning techniques depends primarily on the feature set chosen for learning. Formulating appropriate features that cater for all variations in emotion cues, however, is not a trivial task. Recent work on emotion recognition with deep learning techniques thus focuses on the end-to-end learning scheme, which identifies features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, do not take into account that speech signals often exhibit, at different time scales and frequencies, features distinct from those in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages a multi-branch input layer and tunable convolution layers to learn the identified features, which are subsequently employed to recognize the emotion cues. As a proof of concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results show that MCNN improves emotion recognition performance compared to existing methods, which underscores the necessity of learning features at different time scales.
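
The abstract does not specify the transformations, filter sizes, or layer counts used in MCNN, so the following is only an illustrative sketch of the general idea: a fixed transformation stage produces several views of the raw signal, and each view feeds its own convolution branch. The choice of down-sampling and moving-average smoothing as transformations, the function names, and all numeric settings are assumptions made for this example, not details from the paper. The filters here are random rather than learned.

```python
import numpy as np

def downsample(x, factor):
    """Keep every `factor`-th sample (assumed way to get a coarser time scale)."""
    return x[::factor]

def moving_average(x, window):
    """Smooth with a box filter (assumed way to get a lower-frequency view)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def multi_scale_views(x, factors=(2, 4), windows=(5, 11)):
    """Fixed transformation stage: the raw signal plus coarser and smoother copies."""
    views = [x]
    views += [downsample(x, f) for f in factors]
    views += [moving_average(x, w) for w in windows]
    return views

def conv_branch(view, filters):
    """One branch: 1-D cross-correlation, ReLU, then global max pooling per filter."""
    # filters has shape (n_filters, kernel_size); reversing turns convolve into correlation
    responses = np.stack(
        [np.convolve(view, f[::-1], mode="valid") for f in filters]
    )
    return np.maximum(responses, 0.0).max(axis=1)

def mcnn_features(x, n_filters=8, kernel_size=16, seed=0):
    """Concatenate pooled features from one (untrained) branch per view."""
    rng = np.random.default_rng(seed)
    feats = []
    for view in multi_scale_views(x):
        # Hypothetical random filters; a trained model would learn these.
        filters = rng.standard_normal((n_filters, kernel_size))
        feats.append(conv_branch(view, filters))
    return np.concatenate(feats)
```

With the defaults above, a one-dimensional signal yields five views and a 40-dimensional feature vector (5 branches × 8 filters) that a classifier layer could map to emotion labels. In a trainable version the per-branch filters would be optimized jointly with that classifier.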

Original language: English
Title of host publication: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 189-192
Number of pages: 4
ISBN (Electronic): 9781538615423
Publication status: Published - Jul 2 2017
Externally published: Yes
Event: 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 - Kuala Lumpur, Malaysia
Duration: Dec 12 2017 to Dec 15 2017

Publication series

Name: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Volume: 2018-February

Conference

Conference: 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 12/12/17 to 12/15/17

Bibliographical note

Publisher Copyright:
© 2017 IEEE.

ASJC Scopus Subject Areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Information Systems
  • Signal Processing
