Abstract
The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features based on frequency masking. First, the linear filter bank (LFB) features of the input signal are masked along frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are computed to reduce the feature dimensionality and improve detection performance. Three detection systems, built on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are used to verify the effectiveness of the proposed method. Experiments on real-scenario datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.
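The two steps described in the abstract (masking LFB frequency channels, then computing LFCC from the masked features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of a single contiguous masked band with zero fill, the unnormalised DCT-II, and the sizes (40 filter-bank channels, 20 coefficients, mask width up to 8) are all assumptions, and random numbers stand in for real log filter-bank energies.

```python
import math
import random


def mask_frequency_channels(lfb, max_width, rng, fill=0.0):
    """Mask one random band of consecutive frequency channels in every frame.

    lfb is a list of frames, each a list of (assumed log) LFB energies.
    Returns a new feature matrix and the (start, width) of the masked band.
    The single-band, constant-fill policy is an assumption for illustration.
    """
    n_channels = len(lfb[0])
    width = rng.randint(0, min(max_width, n_channels))  # inclusive bounds
    start = rng.randint(0, n_channels - width)
    masked = [
        [fill if start <= k < start + width else v for k, v in enumerate(frame)]
        for frame in lfb
    ]
    return masked, (start, width)


def dct_ii(frame, n_coeffs):
    """Unnormalised DCT-II of one frame, keeping the first n_coeffs outputs."""
    n = len(frame)
    return [
        sum(v * math.cos(math.pi * c * (k + 0.5) / n) for k, v in enumerate(frame))
        for c in range(n_coeffs)
    ]


def lfcc_from_lfb(lfb, n_coeffs):
    """Cepstral coefficients per frame: DCT over the frequency-channel axis."""
    return [dct_ii(frame, n_coeffs) for frame in lfb]


# Placeholder input: 100 frames x 40 channels of random "log energies".
rng = random.Random(0)
lfb = [[rng.uniform(-5.0, 0.0) for _ in range(40)] for _ in range(100)]

masked, (start, width) = mask_frequency_channels(lfb, max_width=8, rng=rng)
lfcc = lfcc_from_lfb(masked, n_coeffs=20)
print(len(lfcc), len(lfcc[0]))  # 100 20
```

Under these assumptions, masking happens before the DCT, so the distortion is spread across all cepstral coefficients while the feature dimension per frame drops from 40 to 20.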
Translated title of the contribution | Real scene synthetic speech detection based on cepstral feature data augmentation |
---|---|
Original language | Chinese (Simplified) |
Pages (from-to) | 1013-1023 |
Number of pages | 11 |
Journal | Gaojishu Tongxin/High Technology Letters |
Volume | 34 |
Issue number | 10 |
Publication status | Published - Oct 2024 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2024 Inst. of Scientific and Technical Information of China. All rights reserved.
ASJC Scopus Subject Areas
- General Engineering
Keywords
- data augmentation
- frequency masking
- generalization ability
- real scenes
- synthetic speech detection