A Multi-view Fusion Approach for Enhancing Speech Signals via Short-time Fractional Fourier Transform

Authors: Zikun Jin, Yuhua Qian, Xinyan Liang, Haijun Geng

Abstract:


Deep learning-based speech enhancement (SE) methods typically reconstruct speech in either the time domain or the frequency domain. However, neither domain alone provides enough information to accurately capture the dynamics of non-stationary signals. To enrich the available information, this work proposes a multi-view fusion SE method (MFSE). Specifically, MFSE uses the short-time fractional Fourier transform (STFrFT) to extend the representation space of speech to the dynamic domain (also called the fractional domain) that lies between the time and frequency domains. The model input combines two views: the primary short-time Fourier transform (STFT) spectrum and an auxiliary STFrFT spectrum. The optimal STFrFT spectrum is identified adaptively from the infinitely continuous fractional domain by leveraging the average spectral centroid. The framework extracts latent features through several purpose-designed convolutional modules and captures the correlation between different speech frequencies through multi-granularity attention. Experimental results show that the proposed method significantly improves performance on several metrics compared with existing single-channel SE methods based on the time and frequency domains. Furthermore, a generalizability evaluation shows that the multi-view method outperforms the single-view method under a wide range of SNR conditions.
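The abstract does not say how the average spectral centroid is computed or how it drives the choice of STFrFT spectrum. As a rough illustration of the quantity itself, the sketch below computes the average spectral centroid of an ordinary STFT magnitude spectrogram with NumPy. The function names, the frame length, the hop size, and the Hann window are assumptions made for this example and are not taken from the paper; the STFrFT and the selection rule are not implemented here.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window (illustrative helper, not from the paper).

    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def average_spectral_centroid(magnitude, sample_rate, frame_len):
    """Mean over frames of the magnitude-weighted average frequency, in Hz."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    energy = magnitude.sum(axis=1)
    valid = energy > 1e-10  # skip silent frames to avoid division by zero
    centroids = (magnitude[valid] * freqs).sum(axis=1) / energy[valid]
    return float(centroids.mean())


# Toy input (assumed for the example): one second of a 1 kHz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

mag = stft_magnitude(tone, frame_len=256, hop=128)
centroid_hz = average_spectral_centroid(mag, sample_rate, frame_len=256)
print(f"average spectral centroid: {centroid_hz:.1f} Hz")
```

For a pure tone the centroid sits close to the tone frequency; for speech it summarizes where the spectral energy is concentrated on average, which is the kind of statistic the abstract says is used to pick the auxiliary view.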

Keywords:

Thu Apr 03 15:39:00 CST 2025