Multi-scale Information Aggregation for Spoofing Detection

Changtao Li; Yi Wan; Feiran Yang; Jun Yang

doi:10.1186/s13636-024-00379-x

Changtao Li, Yi Wan, Feiran Yang*, Jun Yang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Synthesis artifacts that span scales from small to large are important cues for spoofing detection. However, few spoofing detection models leverage artifacts across different scales together. In this paper, we propose a spoofing detection system built on SincNet and Deep Layer Aggregation (DLA), which leverages speech representations at different levels to distinguish synthetic speech. DLA is fully convolutional with an iterative tree-like structure. The unique topology of DLA makes it possible to compound speech features from convolution layers at different depths, so that local and global speech representations can be incorporated simultaneously. Moreover, SincNet is employed as the frontend feature extractor to circumvent manual feature extraction and selection. SincNet can learn fine-grained features directly from the input speech waveform, thus making the proposed spoofing detection system end-to-end. The proposed system outperforms the baselines when tested on the ASVspoof LA and DF datasets. Notably, our single model surpasses all competing systems in the ASVspoof DF competition with an equal error rate (EER) of 13.99%, which demonstrates the importance of multi-scale information aggregation for synthetic speech detection.
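
To make the reported metric concrete, the following is a minimal sketch of how an equal error rate could be estimated from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the detector believes the utterance
    is bona fide (genuine) speech.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: spoofed utterances accepted at threshold t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide utterances rejected at threshold t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (hypothetical): one bona fide score falls below one spoof score.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

The sweep only approximates the crossing point of the two error curves. Evaluation toolkits typically interpolate between thresholds, so values computed this way may differ slightly from officially reported figures.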

Original language	English
Article number	57
Journal	Eurasip Journal on Audio, Speech, and Music Processing
Volume	2024
Issue number	1
DOIs	https://doi.org/10.1186/s13636-024-00379-x
Publication status	Published - Dec 2024
Externally published	Yes

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

ASJC Scopus Subject Areas

Acoustics and Ultrasonics
Electrical and Electronic Engineering

Keywords

Convolutional neural network
Deep fake detection
Deep learning
Information aggregation
Voice anti-spoofing

Access to Document

10.1186/s13636-024-00379-x

Cite this

@article{83ba14f52b3541a8a4885fad3ff12492,
  title = "Multi-scale Information Aggregation for Spoofing Detection",
  abstract = "Synthesis artifacts that span scales from small to large are important cues for spoofing detection. However, few spoofing detection models leverage artifacts across different scales together. In this paper, we propose a spoofing detection system built on SincNet and Deep Layer Aggregation (DLA), which leverages speech representations at different levels to distinguish synthetic speech. DLA is totally convolutional with an iterative tree-like structure. The unique topology of DLA makes possible compounding of speech features from convolution layers at different depths, and therefore the local and the global speech representations can be incorporated simultaneously. Moreover, SincNet is employed as the frontend feature extractor to circumvent manual feature extraction and selection. SincNet can learn fine-grained features directly from the input speech waveform, thus making the proposed spoofing detection system end-to-end. The proposed system outperforms the baselines when tested on ASVspoof LA and DF datasets. Notably, our single model surpasses all competing systems in ASVspoof DF competition with an equal error rate (EER) of 13.99\%, which demonstrates the importance of multi-scale information aggregation for synthetic speech detection.",
  keywords = "Convolutional neural network, Deep fake detection, Deep learning, Information aggregation, Voice anti-spoofing",
  author = "Changtao Li and Yi Wan and Feiran Yang and Jun Yang",
  note = "Publisher Copyright: {\textcopyright} The Author(s) 2024.",
  year = "2024",
  month = dec,
  doi = "10.1186/s13636-024-00379-x",
  language = "English",
  volume = "2024",
  number = "1",
  pages = "57",
  journal = "Eurasip Journal on Audio, Speech, and Music Processing",
  issn = "1687-4714",
  publisher = "Springer Publishing Company",
}

TY - JOUR

T1 - Multi-scale Information Aggregation for Spoofing Detection

AU - Li, Changtao

AU - Wan, Yi

AU - Yang, Feiran

AU - Yang, Jun

N1 - Publisher Copyright: © The Author(s) 2024.

PY - 2024/12

Y1 - 2024/12

AB - Synthesis artifacts that span scales from small to large are important cues for spoofing detection. However, few spoofing detection models leverage artifacts across different scales together. In this paper, we propose a spoofing detection system built on SincNet and Deep Layer Aggregation (DLA), which leverages speech representations at different levels to distinguish synthetic speech. DLA is totally convolutional with an iterative tree-like structure. The unique topology of DLA makes possible compounding of speech features from convolution layers at different depths, and therefore the local and the global speech representations can be incorporated simultaneously. Moreover, SincNet is employed as the frontend feature extractor to circumvent manual feature extraction and selection. SincNet can learn fine-grained features directly from the input speech waveform, thus making the proposed spoofing detection system end-to-end. The proposed system outperforms the baselines when tested on ASVspoof LA and DF datasets. Notably, our single model surpasses all competing systems in ASVspoof DF competition with an equal error rate (EER) of 13.99%, which demonstrates the importance of multi-scale information aggregation for synthetic speech detection.

KW - Convolutional neural network

KW - Deep fake detection

KW - Deep learning

KW - Information aggregation

KW - Voice anti-spoofing

UR - http://www.scopus.com/inward/record.url?scp=85208605375&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85208605375&partnerID=8YFLogxK

U2 - 10.1186/s13636-024-00379-x

DO - 10.1186/s13636-024-00379-x

M3 - Article

AN - SCOPUS:85208605375

SN - 1687-4714

VL - 2024

JO - Eurasip Journal on Audio, Speech, and Music Processing

JF - Eurasip Journal on Audio, Speech, and Music Processing

IS - 1

M1 - 57

ER -

Press/Media

Chinese Academy of Sciences Researchers Have Published New Data on Data Aggregation (Multi-scale Information Aggregation for Spoofing Detection)
