基于倒谱特征数据增强的真实场景合成语音检测

Translated title of the contribution: Real scene synthetic speech detection based on cepstral feature data augmentation

Yi Wan, Chunguo Li, Feiran Yang*, Jun Yang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features based on frequency masking. First, the linear filter bank (LFB) features of the input signal are masked along frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are calculated to reduce the feature dimensionality and improve detection performance. Three detection systems, built on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are used to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.
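The two steps described in the abstract (masking LFB channels, then computing LFCC from the masked features) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the choice of a single contiguous mask, the uniform sampling of its width and position, the fill value of 0.0, the orthonormal DCT-II, and the feature sizes (60 channels, 20 coefficients) are all assumptions, and the input is a random stand-in for real log filter-bank features.

```python
import numpy as np


def mask_frequency_channels(log_lfb, max_width, rng, fill_value=0.0):
    """Overwrite one random contiguous band of filter-bank channels.

    log_lfb: array of shape (n_frames, n_channels).
    The masking policy here (one band, uniform width and start) is an
    assumption for illustration; the abstract does not specify it.
    """
    n_channels = log_lfb.shape[1]
    if max_width > n_channels:
        raise ValueError("max_width must not exceed the number of channels")
    width = int(rng.integers(0, max_width + 1))           # 0..max_width channels
    start = int(rng.integers(0, n_channels - width + 1))  # band fits in range
    masked = log_lfb.copy()
    masked[:, start:start + width] = fill_value
    return masked, (start, width)


def lfcc_from_log_lfb(log_lfb, n_ceps):
    """Apply an orthonormal DCT-II across channels and keep n_ceps coefficients."""
    n_channels = log_lfb.shape[1]
    n = np.arange(n_channels)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / n_channels)  # (n_ceps, n_channels)
    basis *= np.sqrt(2.0 / n_channels)
    basis[0] *= np.sqrt(0.5)                            # scale the DC row
    return log_lfb @ basis.T                            # (n_frames, n_ceps)


# Hypothetical input: 100 frames of 60-channel log filter-bank features.
rng = np.random.default_rng(0)
log_lfb = rng.standard_normal((100, 60))

masked, (start, width) = mask_frequency_channels(log_lfb, max_width=12, rng=rng)
feats = lfcc_from_log_lfb(masked, n_ceps=20)
print(feats.shape)  # (100, 20)
```

Masking before the DCT means the distortion is applied to whole frequency channels, while the classifier still receives the lower-dimensional cepstral representation; in training, a fresh mask would typically be drawn for each utterance.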

Original language: Chinese (Simplified)
Pages (from-to): 1013-1023
Number of pages: 11
Journal: Gaojishu Tongxin/High Technology Letters
Volume: 34
Issue number: 10
DOIs
Publication status: Published - Oct 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 Inst. of Scientific and Technical Information of China. All rights reserved.

ASJC Scopus Subject Areas

  • General Engineering

Keywords

  • data augmentation
  • frequency masking
  • generalization ability
  • real scenes
  • synthetic speech detection
