Abstract
Bone-conducted speech is not susceptible to background noise but suffers from poor speech quality and intelligibility due to its limited bandwidth. This paper proposes a two-stage approach to restoring the quality of bone-conducted speech, consisting of bandwidth extension followed by a speech vocoder. In the first stage, a deep neural network is trained to learn a mapping from a low-resolution representation of the bone-conducted speech, i.e., the log Mel-scale spectrogram, to that of the air-conducted speech, which extends the bandwidth of the bone-conducted speech. In the second stage, a speech vocoder transforms the extended log Mel-scale spectrogram of the bone-conducted speech back to a time-domain waveform. Because of the many-to-many correspondence between air-conducted and bone-conducted speech, supervised learning may not be the best training protocol for the bone-conducted/air-conducted feature mapping. We thus propose to leverage adversarial training to further improve the bandwidth extension performance in the first stage. The two stages are decoupled and can be trained independently. The vocoder is trained on a large multi-speaker dataset and generalizes well to unknown speakers. The vocoder also helps remedy the spectral artifacts introduced in the bandwidth extension stage. Objective and subjective evaluations on the ESMB dataset show that the proposed two-stage system substantially outperforms existing bone-conducted speech enhancement systems.
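As a rough illustration of why the log Mel-scale spectrogram counts as a low-resolution representation, the sketch below compresses one frame's 129 linear-frequency bins into 20 Mel bands. It is not the paper's feature pipeline: the sample rate, frame length, number of Mel bands, the HTK-style Mel formula, and all function names are assumptions chosen for the example, and the naive DFT is used only to avoid external dependencies.

```python
import cmath
import math


def hz_to_mel(hz):
    """Hz -> Mel using the HTK-style formula (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters as a nested list of shape [n_mels][n_fft // 2 + 1]."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive DFT (slow, dependency-free)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_frame(frame, bank, eps=1e-10):
    """Log Mel-band energies of a single frame; eps guards against log(0)."""
    power = power_spectrum(frame)
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps) for row in bank]


# Illustrative settings only; the paper's actual configuration is not given here.
SAMPLE_RATE = 16000
N_FFT = 256
N_MELS = 20

bank = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE)

# Synthetic test tones standing in for speech frames.
low_tone = [math.sin(2.0 * math.pi * 300.0 * i / SAMPLE_RATE) for i in range(N_FFT)]
high_tone = [math.sin(2.0 * math.pi * 5000.0 * i / SAMPLE_RATE) for i in range(N_FFT)]

low_feat = log_mel_frame(low_tone, bank)
high_feat = log_mel_frame(high_tone, bank)
```

In a two-stage system of the kind the abstract describes, frames like `low_feat` would be stacked over time to form the input to the bandwidth-extension network, and the network's output would be passed to the vocoder. In practice a library routine would replace the naive DFT.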
Original language | English |
---|---|
Pages (from-to) | 818-829 |
Number of pages | 12 |
Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
Volume | 32 |
Publication status | Published - 2024 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
ASJC Scopus Subject Areas
- Computer Science (miscellaneous)
- Acoustics and Ultrasonics
- Computational Mathematics
- Electrical and Electronic Engineering
Keywords
- adversarial training
- bandwidth extension
- bone conduction
- speech enhancement
- vocoder