Abstract
Bone-conducted speech is less susceptible to ambient noise interference than air-conducted speech, but it suffers from poor speech quality due to its limited bandwidth. In this letter, we propose a U-Net-like network for restoring bone-conducted speech in the time domain. The proposed network consists of residual-connected one-dimensional convolutions and shifted window-based attention modules, which can model the long-term dependencies crucial in speech processing. We find that the prevalent time-domain loss may be insufficient for generating the high-frequency information absent from bone-conducted speech. To address this issue, we propose using the generalized energy distance loss based on multi-scale Mel spectrograms as the objective function. Experimental results on the ESMB dataset validate the efficacy of the proposed method in restoring bone-conducted speech. The proposed approach significantly outperforms two recent time-domain benchmarks, DPT-EGNet and EBEN, in terms of the PESQ and STOI metrics.
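As an illustration of the objective named in the abstract, the following is a minimal NumPy sketch of a generalized energy distance computed over multi-scale log-Mel spectrograms. It is not the authors' implementation. The function names, the 16 kHz sample rate, the three window sizes and Mel band counts, the plain L1 distance between log-Mel spectrograms, and the simplified triangular filterbank are all assumptions made for this example. The sketch also assumes the generator can produce two independent restorations (`gen_a`, `gen_b`) of the same input, and it uses the symmetric form of the estimator, d(x, y) + d(x, y') - d(y, y').

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_fft, n_mels, sr):
    """Simplified triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def log_mel(x, n_fft, n_mels, sr):
    """Log-Mel spectrogram of a 1-D signal, shape (n_frames, n_mels)."""
    hop = n_fft // 4
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)          # windowed frames
    mag = np.abs(np.fft.rfft(frames, axis=-1))   # magnitude spectrum per frame
    mel = mag @ mel_filterbank(n_fft, n_mels, sr).T
    return np.log(mel + 1e-5)

# Assumed (n_fft, n_mels) pairs; the source does not list the scales used.
SCALES = ((512, 40), (1024, 80), (2048, 80))

def multiscale_mel_distance(a, b, sr=16000, scales=SCALES):
    """Sum over scales of the mean absolute log-Mel difference (assumed L1)."""
    total = 0.0
    for n_fft, n_mels in scales:
        diff = log_mel(a, n_fft, n_mels, sr) - log_mel(b, n_fft, n_mels, sr)
        total += np.mean(np.abs(diff))
    return total

def ged_loss(target, gen_a, gen_b, sr=16000):
    """Symmetric single-sample estimate: d(x, y) + d(x, y') - d(y, y')."""
    d = multiscale_mel_distance
    return d(target, gen_a, sr) + d(target, gen_b, sr) - d(gen_a, gen_b, sr)
```

The attractive terms pull both generated samples toward the target, while the subtracted term rewards diversity between the two samples. A single-pair estimate can be negative even though the expectation of the energy distance is not, so in practice the value would be averaged over a batch.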
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 166-170 |
Number of pages | 5 |
Journal | IEEE Signal Processing Letters |
Volume | 31 |
Publication status | Published - 2024 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 1994-2012 IEEE.
ASJC Scopus Subject Areas
- Signal Processing
- Electrical and Electronic Engineering
- Applied Mathematics
Keywords
- attention
- bone-conducted speech
- spectral energy distance
- speech enhancement
- speech synthesis