End-to-end speech emotion recognition using multi-scale convolution networks

Tatinati Sivanagaraja; Mun Kit Ho; Andy W.H. Khong; Yubo Wang

doi:10.1109/APSIPA.2017.8282026

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

Automatic speech emotion recognition is a challenging task in the machine learning community, mainly because individuals vary significantly in how they express the same emotion cue. The success of emotion recognition with machine learning techniques depends primarily on the feature set chosen for learning. Formulating features that cater for all variations in emotion cues, however, is not a trivial task. Recent works on emotion recognition with deep learning techniques therefore focus on end-to-end learning schemes, which identify features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, do not take into account that speech signals often exhibit distinct features at time scales and frequencies different from those of the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages a multi-branch input layer and tunable convolution layers to learn the identified features, which are subsequently employed to recognize the emotion cues. As a proof of concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results show that MCNN improves emotion recognition performance compared to existing methods, which underscores the necessity of learning features at different time scales.
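
The abstract does not specify what the fixed transformation stage computes. As a rough illustration only, the sketch below assumes two common choices for producing multi-scale views of a raw signal: down-sampling (coarser time scales) and moving-average smoothing (lower-frequency content). The function names, the factors, and the window lengths are hypothetical and are not taken from the paper; each resulting view would feed one branch of a multi-branch input layer.

```python
import numpy as np


def downsample(signal, factor):
    """Keep every `factor`-th sample to obtain a coarser time scale."""
    return signal[::factor]


def moving_average(signal, window):
    """Smooth the signal with a length-`window` moving average."""
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the window fully overlaps the signal.
    return np.convolve(signal, kernel, mode="valid")


def multi_scale_branches(signal, factors=(2, 4), windows=(3, 5)):
    """Build one view of the raw signal per branch (assumed transformations)."""
    branches = {"raw": signal}
    for k in factors:
        branches[f"down_{k}"] = downsample(signal, k)
    for w in windows:
        branches[f"smooth_{w}"] = moving_average(signal, w)
    return branches


if __name__ == "__main__":
    # A short ramp stands in for a raw speech waveform.
    raw = np.arange(16, dtype=float)
    for name, view in multi_scale_branches(raw).items():
        print(name, view.shape)
```

Because the views differ in length, a model consuming them would typically apply a separate convolution stack per branch before merging the learned features.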

Original language	English
Title of host publication	Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	189-192
Number of pages	4
ISBN (Electronic)	9781538615423
DOIs	https://doi.org/10.1109/APSIPA.2017.8282026
Publication status	Published - Jul 2 2017
Externally published	Yes
Event	9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 - Kuala Lumpur, Malaysia; Duration: Dec 12 2017 → Dec 15 2017

Publication series

Name	Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Volume	2018-February

Conference

Conference	9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Country/Territory	Malaysia
City	Kuala Lumpur
Period	12/12/17 → 12/15/17

Bibliographical note

Publisher Copyright:
© 2017 IEEE.

ASJC Scopus Subject Areas

Artificial Intelligence
Human-Computer Interaction
Information Systems
Signal Processing

Access to Document

10.1109/APSIPA.2017.8282026

Cite this

Sivanagaraja, T., Ho, M. K., Khong, A. W. H., & Wang, Y. (2017). End-to-end speech emotion recognition using multi-scale convolution networks. In Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 (pp. 189-192). (Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017; Vol. 2018-February). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2017.8282026

Sivanagaraja, Tatinati ; Ho, Mun Kit ; Khong, Andy W.H. et al. / End-to-end speech emotion recognition using multi-scale convolution networks. Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 189-192 (Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017).

@inproceedings{b0851a583ecf43ed9e75529341799d28,

title = "End-to-end speech emotion recognition using multi-scale convolution networks",

abstract = "Automatic speech emotion recognition is one of the challenging tasks in machine learning community mainly due to the significant variations across individuals while expressing the same emotion cue. The success of emotion recognition with machine learning techniques primarily depends on the feature set chosen to learn. Formulation of appropriate features that cater for all variations in emotion cues however is not a trivial task. Recent works on emotion recognition with deep learning techniques thus focus on the end-to-end learning scheme which identifies the features directly from the raw speech signal instead of relying on hand-crafted feature set. Existing methods in this scheme however did not take into account the fact that speech signals often exhibit distinct features at different time scales and frequencies than in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages on the multi-branch input layer and tunable convolution layers to learn the identified features which are subsequently employed to recognize the emotion cues accordingly. As a proof-of-concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results showed that MCNN improves the emotion recognition performance when compared to existing methods, which underpins the necessity of learning features at different time scales.",

author = "Tatinati Sivanagaraja and Ho, {Mun Kit} and Khong, {Andy W.H.} and Yubo Wang",

note = "Publisher Copyright: {\textcopyright} 2017 IEEE.; 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 ; Conference date: 12-12-2017 Through 15-12-2017",

year = "2017",

month = jul,

day = "2",

doi = "10.1109/APSIPA.2017.8282026",

language = "English",

series = "Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "189--192",

booktitle = "Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017",

address = "United States",

}

Sivanagaraja, T, Ho, MK, Khong, AWH & Wang, Y 2017, End-to-end speech emotion recognition using multi-scale convolution networks. in Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017. Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017, vol. 2018-February, Institute of Electrical and Electronics Engineers Inc., pp. 189-192, 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017, Kuala Lumpur, Malaysia, 12/12/17. https://doi.org/10.1109/APSIPA.2017.8282026

End-to-end speech emotion recognition using multi-scale convolution networks. / Sivanagaraja, Tatinati; Ho, Mun Kit; Khong, Andy W.H. et al.
Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 189-192 (Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017; Vol. 2018-February).

TY - GEN

T1 - End-to-end speech emotion recognition using multi-scale convolution networks

AU - Sivanagaraja, Tatinati

AU - Ho, Mun Kit

AU - Khong, Andy W.H.

AU - Wang, Yubo

PY - 2017/7/2

Y1 - 2017/7/2

AB - Automatic speech emotion recognition is one of the challenging tasks in machine learning community mainly due to the significant variations across individuals while expressing the same emotion cue. The success of emotion recognition with machine learning techniques primarily depends on the feature set chosen to learn. Formulation of appropriate features that cater for all variations in emotion cues however is not a trivial task. Recent works on emotion recognition with deep learning techniques thus focus on the end-to-end learning scheme which identifies the features directly from the raw speech signal instead of relying on hand-crafted feature set. Existing methods in this scheme however did not take into account the fact that speech signals often exhibit distinct features at different time scales and frequencies than in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages on the multi-branch input layer and tunable convolution layers to learn the identified features which are subsequently employed to recognize the emotion cues accordingly. As a proof-of-concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results showed that MCNN improves the emotion recognition performance when compared to existing methods, which underpins the necessity of learning features at different time scales.

UR - http://www.scopus.com/inward/record.url?scp=85050391358&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050391358&partnerID=8YFLogxK

U2 - 10.1109/APSIPA.2017.8282026

DO - 10.1109/APSIPA.2017.8282026

M3 - Conference contribution

AN - SCOPUS:85050391358

T3 - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017

SP - 189

EP - 192

BT - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017

Y2 - 12 December 2017 through 15 December 2017

ER -

Sivanagaraja T, Ho MK, Khong AWH, Wang Y. End-to-end speech emotion recognition using multi-scale convolution networks. In Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 189-192. (Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017). doi: 10.1109/APSIPA.2017.8282026
