The Role of Long-Term Dependency in Synthetic Speech Detection

Changtao Li, Feiran Yang*, Jun Yang*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

20 Citations (Scopus)

Abstract

Although much progress has been made in synthetic speech detection, a comprehensive analysis of the essential differences between spoofed and genuine speech is still lacking. Here we use the supervised contrastive loss, which originates from contrastive learning, as an analytical tool to characterize the class similarity structure of the ASVspoof 2019 logical access (LA) dataset. The analysis shows that an ideal back-end classifier for synthetic speech detection should be able to capture long-term dependencies. The Transformer has recently been found to be very good at learning long-term dependencies in input data. We therefore propose a back-end classifier based on the Transformer encoder for synthetic speech detection. Convolution blocks are added before the Transformer encoder; they leverage inductive biases to improve generalization. Compared with two-dimensional convolution, one-dimensional convolution makes better architectural assumptions about the input speech features, which helps with modeling long-term dependencies and decreases the risk of overfitting. The proposed Transformer combined with one-dimensional convolution has fewer parameters than most existing back-end classifiers and achieves an equal error rate of 1.06% and a minimum tandem detection cost function metric of 0.0345 on the ASVspoof 2019 LA dataset, placing it among the best models reported in the literature.
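
The abstract names the supervised contrastive loss as its analytical tool but does not give the formula. As a rough illustration only, the sketch below implements the commonly used form of that loss on a small batch of embeddings; the function name, the temperature of 0.1, and the toy vectors are assumptions made for this example, not details taken from the paper.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    Hypothetical helper for illustration: embeddings is a list of non-zero
    vectors (lists of floats), labels is a list of class ids.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without same-class samples contributes nothing

        # Scaled similarity of the anchor to every other sample in the batch.
        logits = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp for the denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability of picking a positive.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

When the labels agree with the cluster structure the loss is small, and when they cut across it the loss is large. That is what makes the loss usable as a probe of class similarity structure.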

Original language: English
Pages (from-to): 1142-1146
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 29
Publication status: Published - 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1994-2012 IEEE.

ASJC Scopus Subject Areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Applied Mathematics

Keywords

  • ASVspoof 2019 LA
  • Generalization ability
  • Speaker verification
  • Transformer
  • Voice anti-spoofing
