SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition

Zhi Wei Tan; Andy W.H. Khong

doi:10.1109/APSIPAASC63619.2025.10848922

Zhi Wei Tan^*, Andy W.H. Khong

^*Corresponding author for this work

Nanyang Technological University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Speech enhancement for drone audition applications is challenging due to the low SNR with large spectral feature overlap and limited computing resources. We propose SMoLnet-T, a complex spectral mapping approach with a frame-wise CNN and newly formulated spectral combination transformers. SMoLnet-T incorporates a dilated CNN to extract spectral maps of high frequency resolution for its transformers. This allows it to focus on a higher level of abstraction and determine which combination of spectral maps is crucial for enhancement across a large temporal context. Experimental results with noise recorded from a hovering drone highlight the efficacy of SMoLnet-T over DPTNet, with significantly lower computational requirements and speech distortion while achieving improved speech intelligibility at SNR < −23 dB.

Original language	English
Title of host publication	APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9798350367331
DOIs	https://doi.org/10.1109/APSIPAASC63619.2025.10848922
Publication status	Published - 2024
Externally published	Yes
Event	2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024 - Macau, China Duration: Dec 3 2024 → Dec 6 2024

Publication series

Name	APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024

Conference

Conference	2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024
Country/Territory	China
City	Macau
Period	12/3/24 → 12/6/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

ASJC Scopus Subject Areas

Artificial Intelligence
Computer Science Applications
Hardware and Architecture
Signal Processing

Keywords

Convolution neural network
deep learning
drone audition
speech enhancement
transformer

Access to Document

10.1109/APSIPAASC63619.2025.10848922

Cite this

Tan, Z. W., & Khong, A. W. H. (2024). SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition. In APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024 (APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPAASC63619.2025.10848922

Tan, Zhi Wei ; Khong, Andy W.H. / SMoLnet-T : An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition. APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024. Institute of Electrical and Electronics Engineers Inc., 2024. (APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024).

@inproceedings{c6a003a0b08f4c92a1ad20390eeea3a6,

title = "SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition",

abstract = "Speech enhancement for drone audition applications is challenging due to the low SNR with large spectral feature overlap and limited computing resources. We propose SMoLnet-T, a complex spectral mapping approach with a frame-wise CNN and newly formulated spectral combination transformers. SMoLnet-T incorporates a dilated CNN to extract spectral maps of high frequency resolution for its transformers. This allows it to focus on a higher level of abstraction and determine which combination of spectral maps is crucial for enhancement across a large temporal context. Experimental results with noise recorded from a hovering drone highlight the efficacy of SMoLnet-T over DPTNet, with significantly lower computational requirements and speech distortion while achieving improved speech intelligibility at SNR < −23 dB.",

keywords = "Convolution neural network, deep learning, drone audition, speech enhancement, transformer",

author = "Tan, {Zhi Wei} and Khong, {Andy W.H.}",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024 ; Conference date: 03-12-2024 Through 06-12-2024",

year = "2024",

doi = "10.1109/APSIPAASC63619.2025.10848922",

language = "English",

series = "APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024",

address = "United States",

}

Tan, ZW & Khong, AWH 2024, SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition. in APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024. APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024, Institute of Electrical and Electronics Engineers Inc., 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024, Macau, China, 12/3/24. https://doi.org/10.1109/APSIPAASC63619.2025.10848922

SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition. / Tan, Zhi Wei; Khong, Andy W.H.
APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024. Institute of Electrical and Electronics Engineers Inc., 2024. (APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024).

TY - GEN

T1 - SMoLnet-T

T2 - 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024

AU - Tan, Zhi Wei

AU - Khong, Andy W.H.

PY - 2024

Y1 - 2024

N2 - Speech enhancement for drone audition applications is challenging due to the low SNR with large spectral feature overlap and limited computing resources. We propose SMoLnet-T, a complex spectral mapping approach with a frame-wise CNN and newly formulated spectral combination transformers. SMoLnet-T incorporates a dilated CNN to extract spectral maps of high frequency resolution for its transformers. This allows it to focus on a higher level of abstraction and determine which combination of spectral maps is crucial for enhancement across a large temporal context. Experimental results with noise recorded from a hovering drone highlight the efficacy of SMoLnet-T over DPTNet, with significantly lower computational requirements and speech distortion while achieving improved speech intelligibility at SNR < −23 dB.

AB - Speech enhancement for drone audition applications is challenging due to the low SNR with large spectral feature overlap and limited computing resources. We propose SMoLnet-T, a complex spectral mapping approach with a frame-wise CNN and newly formulated spectral combination transformers. SMoLnet-T incorporates a dilated CNN to extract spectral maps of high frequency resolution for its transformers. This allows it to focus on a higher level of abstraction and determine which combination of spectral maps is crucial for enhancement across a large temporal context. Experimental results with noise recorded from a hovering drone highlight the efficacy of SMoLnet-T over DPTNet, with significantly lower computational requirements and speech distortion while achieving improved speech intelligibility at SNR < −23 dB.

KW - Convolution neural network

KW - deep learning

KW - drone audition

KW - speech enhancement

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85218205548&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85218205548&partnerID=8YFLogxK

U2 - 10.1109/APSIPAASC63619.2025.10848922

DO - 10.1109/APSIPAASC63619.2025.10848922

M3 - Conference contribution

AN - SCOPUS:85218205548

T3 - APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024

BT - APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 3 December 2024 through 6 December 2024

ER -

Tan ZW, Khong AWH. SMoLnet-T: An Efficient Complex-spectral Mapping Speech Enhancement Approach with Frame-wise CNN and Spectral Combination Transformer for Drone Audition. In APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024. Institute of Electrical and Electronics Engineers Inc. 2024. (APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024). doi: 10.1109/APSIPAASC63619.2025.10848922
