Developing Robust Text Independent Speaker Recognition Using Deep Learning Models


Publisher

ASTU

Abstract

Speaker recognition is the process of identifying a person from others based on speech characteristics. It has crucial applications in security, surveillance, forensics, and financial transactions. Speaker recognition systems perform well on clean speech without mismatch, but their performance degrades under noisy and mismatched conditions. Several studies have applied machine learning methods to improve speaker recognition performance in noisy environments, and deep learning models have recently outperformed these machine learning methods. Moreover, hybrid models combining convolutional neural networks (CNN) with enhanced variants of recurrent neural networks (RNN) have shown better performance in image classification and natural language processing. However, only limited attempts have used hybrid CNN and RNN variants to enhance speaker recognition performance under noisy conditions. Features that perform well in machine-learning-based speaker recognition are not as effective as the spectrogram and cochleogram in deep learning-based speaker recognition; yet the noise robustness of the cochleogram and spectrogram had not been analyzed in deep learning-based speaker recognition to identify the more robust feature for noisy conditions. In this study, text-independent speaker recognition using deep learning models has been developed for noisy conditions. First, a noise robustness analysis of the cochleogram and spectrogram in deep learning-based speaker recognition was conducted to select the more robust feature. Then, speaker recognition models using hybrid CNN and enhanced RNN variants were developed to enhance performance under noisy conditions. The enhanced RNN variants employed in this study are long short-term memory (LSTM), bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), and bidirectional GRU (BiGRU).
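As a rough illustration of the hybrid CNN + BiGRU architecture described above, the following is a minimal sketch in PyTorch. All layer sizes, the module name, and the input dimensions are hypothetical assumptions for illustration, not taken from the thesis: a small CNN front-end extracts local patterns from a cochleogram-like time-frequency input, and a bidirectional GRU models the resulting frame sequence before a speaker classification layer.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """Hypothetical sketch: CNN front-end over a cochleogram, BiGRU back-end."""
    def __init__(self, n_bands=64, n_speakers=10):
        super().__init__()
        # Two conv blocks, each halving the frequency and time axes
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # BiGRU over the time axis of the CNN feature maps
        self.gru = nn.GRU(input_size=32 * (n_bands // 4), hidden_size=64,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_speakers)  # 2x for bidirectional

    def forward(self, x):            # x: (batch, 1, n_bands, time)
        f = self.cnn(x)              # (batch, 32, n_bands/4, time/4)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)  # time-major sequence
        out, _ = self.gru(f)
        return self.fc(out[:, -1])   # classify from the last timestep

model = CNNBiGRU()
logits = model(torch.randn(2, 1, 64, 100))  # 2 utterances, 64 bands, 100 frames
print(tuple(logits.shape))  # (2, 10)
```

In practice the cochleogram would be computed with a gammatone filterbank and the classifier trained with cross-entropy over the speaker labels; this sketch only shows how the CNN output is reshaped into a sequence for the BiGRU.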
The cochleogram showed better noise robustness at each signal-to-noise ratio (SNR) level and is used as the input to each of the speaker recognition models developed in this study. The experiments were conducted on the VoxCeleb1 audio dataset with real-world and white Gaussian noise at SNR levels of -5 dB to 20 dB, and without additive noise. Speaker recognition using a hybrid CNN and BiGRU on the cochleogram input is proposed for noisy conditions in this study because of its higher performance. The proposed model achieved speaker identification accuracy of 93.15% to 98.60% on the dataset with real-world noise at SNRs of -5 dB to 20 dB, respectively, and 98.85% on the dataset without additive noise. The equal error rate (EER) of the proposed model on the dataset with real-world noise at SNRs of -5 dB to 20 dB ranges from 10.55% to 0.47%, respectively, and is 0.37% on the dataset without additive noise. Comparison with existing works also confirmed that the proposed model outperforms them.
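The equal error rate reported above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal NumPy sketch of how EER can be computed from verification scores (the function name and threshold-sweep strategy are illustrative assumptions, not the thesis's evaluation code):

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: the point where false-accept rate ~ false-reject rate.
    scores: similarity scores; labels: 1 = same speaker, 0 = different speaker."""
    order = np.argsort(scores)[::-1]          # highest score first
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)                    # true accepts as threshold sweeps down
    fp = np.cumsum(1 - labels)                # false accepts
    frr = 1 - tp / n_pos                      # false rejection rate
    far = fp / n_neg                          # false acceptance rate
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2

# Perfectly separated scores give EER = 0
print(eer([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 0.0
```

With a finite trial list the FAR/FRR curves are step functions, so the crossing is approximated by the threshold minimizing |FAR - FRR|; interpolation between adjacent thresholds gives a smoother estimate on small trial sets.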
