基于倒谱特征数据增强的真实场景合成语音检测

Yi Wan; Chunguo Li; Feiran Yang; Jun Yang

doi:10.3772/j.issn.1002-0470.2024.10.001

Translated title of the contribution: Real scene synthetic speech detection based on cepstral feature data augmentation

Yi Wan, Chunguo Li, Feiran Yang^*, Jun Yang

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features via frequency masking. First, the linear filter banks (LFBs) of the input signal are masked along the frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are calculated to reduce the feature dimensionality and improve detection performance. Three detection systems, based on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are established to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.
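The two steps described in the abstract (masking frequency channels of the filter-bank features, then taking cepstral coefficients) can be illustrated with a minimal sketch. This is not the authors' implementation: the single contiguous mask, the zero fill value, the `max_width` parameter, the feature sizes, and the function names are all assumptions made for illustration, and random numbers stand in for real log filter-bank energies.

```python
import numpy as np


def mask_frequency_channels(log_fbank, max_width, rng):
    """Mask one random contiguous band of filter-bank channels.

    log_fbank: array of shape (n_frames, n_channels) holding log energies
               from a linear filter bank.
    max_width: largest number of adjacent channels to mask (hypothetical).
    The masked band is set to zero, which is an assumption of this sketch.
    """
    n_channels = log_fbank.shape[1]
    width = int(rng.integers(0, max_width + 1))           # 0..max_width
    start = int(rng.integers(0, n_channels - width + 1))  # band fits inside
    masked = log_fbank.copy()
    masked[:, start:start + width] = 0.0
    return masked, (start, width)


def dct_ii_matrix(n_channels, n_coeffs):
    """Orthonormal DCT-II basis of shape (n_coeffs, n_channels)."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_channels)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_channels))
    basis *= np.sqrt(2.0 / n_channels)
    basis[0] /= np.sqrt(2.0)  # scale the DC row so rows are orthonormal
    return basis


def lfcc_from_fbank(log_fbank, n_coeffs):
    """Project each frame onto the first n_coeffs DCT-II basis vectors."""
    basis = dct_ii_matrix(log_fbank.shape[1], n_coeffs)
    return log_fbank @ basis.T


# Stand-in features: 100 frames x 60 linear filter-bank channels.
rng = np.random.default_rng(0)
log_fbank = rng.standard_normal((100, 60))

masked, (start, width) = mask_frequency_channels(log_fbank, max_width=12, rng=rng)
lfcc = lfcc_from_fbank(masked, n_coeffs=20)
```

Keeping only the first 20 of 60 DCT coefficients is what reduces the feature dimensionality in this sketch; the actual channel count, coefficient count, and masking policy used in the paper are not given in the abstract.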

Translated title of the contribution	Real scene synthetic speech detection based on cepstral feature data augmentation
Original language	Chinese (Simplified)
Pages (from-to)	1013-1023
Number of pages	11
Journal	Gaojishu Tongxin/High Technology Letters
Volume	34
Issue number	10
DOIs	https://doi.org/10.3772/j.issn.1002-0470.2024.10.001
Publication status	Published - Oct 2024
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2024 Inst. of Scientific and Technical Information of China. All rights reserved.

ASJC Scopus Subject Areas

General Engineering

Keywords

data augmentation
frequency masking
generalization ability
real scenes
synthetic speech detection

Access to Document

10.3772/j.issn.1002-0470.2024.10.001

Cite this

@article{445412ec3fef42238d7cc45c4d8cd6c7,

title = "基于倒谱特征数据增强的真实场景合成语音检测",

abstract = "The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features via frequency masking. First, the linear filter banks (LFBs) of the input signal are masked along the frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are calculated to reduce the feature dimensionality and improve detection performance. Three detection systems, based on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are established to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39\% to 25.95\% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71\% to 42.47\% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.",

keywords = "data augmentation, frequency masking, generalization ability, real scenes, synthetic speech detection",

author = "Yi Wan and Chunguo Li and Feiran Yang and Jun Yang",

year = "2024",

month = oct,

doi = "10.3772/j.issn.1002-0470.2024.10.001",

language = "Chinese (Simplified)",

volume = "34",

pages = "1013--1023",

journal = "Gaojishu Tongxin/High Technology Letters",

issn = "1002-0470",

publisher = "Institute of Scientific and Technical Information of China",

number = "10",

}

TY - JOUR

T1 - 基于倒谱特征数据增强的真实场景合成语音检测

AU - Wan, Yi

AU - Li, Chunguo

AU - Yang, Feiran

AU - Yang, Jun

PY - 2024/10

Y1 - 2024/10

N2 - The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features via frequency masking. First, the linear filter banks (LFBs) of the input signal are masked along the frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are calculated to reduce the feature dimensionality and improve detection performance. Three detection systems, based on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are established to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.

AB - The performance of existing synthetic speech detection systems degrades significantly in real scenarios. This paper proposes a data augmentation method for cepstral features via frequency masking. First, the linear filter banks (LFBs) of the input signal are masked along the frequency channels to simulate realistic speech distortion. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are calculated to reduce the feature dimensionality and improve detection performance. Three detection systems, based on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are established to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. Combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the generalization ability of the systems in real scenarios and outperforms existing data augmentation methods.

KW - data augmentation

KW - frequency masking

KW - generalization ability

KW - real scenes

KW - synthetic speech detection

UR - http://www.scopus.com/inward/record.url?scp=85210140616&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85210140616&partnerID=8YFLogxK

U2 - 10.3772/j.issn.1002-0470.2024.10.001

DO - 10.3772/j.issn.1002-0470.2024.10.001

M3 - Article

AN - SCOPUS:85210140616

SN - 1002-0470

VL - 34

SP - 1013

EP - 1023

JO - Gaojishu Tongxin/High Technology Letters

JF - Gaojishu Tongxin/High Technology Letters

IS - 10

ER -
