Audiovisual Speech Recognition using Multi-View Lip Movement for Amharic Language using Deep Learning

Publisher

ASTU

Abstract

Speech is the most common form of interpersonal communication. Automatic speech recognition is the use of computers to transcribe natural speech automatically, enabling human-machine interaction. For such interaction to be effective, noise-robust speech recognition is essential. The performance of speech recognition systems can be improved, especially in noisy environments, by recognizing lip movements and modeling their correlation with speech sounds. Little research has been done on Amharic speech recognition, and most of it treats automatic speech recognition and visual speech recognition separately. Only Befikadu Belete has conducted research on audio-visual speech recognition for the Amharic language (Belete, 2017). However, that work uses only the frontal view, so it is not applicable in situations where only a side view is available. The aim of this study is to develop audio-visual speech recognition for the Amharic language using multi-view lip movement. Combining the possible view angles, the study uses a bidirectional LSTM RNN to learn representations from the audio data and a CNN followed by a bidirectional LSTM RNN to extract sequential features from the visual data; a multi-modal RNN then fuses the two streams and learns automatically from the training data. The data were collected from YouTube and other broadcasting websites, and include dramas, news programs presented by a single speaker, and interviews with multiple people, all in Amharic. The dataset contains 50 words with 5,000 utterances, divided so that around 80% of the data is used for training and around 20% for testing. The results demonstrate that using multiple views improves word recognition accuracy in audio-visual speech recognition for the Amharic language. Combining the frontal view with the left three-quarter view gives the highest accuracy of the combinations tested, 97.5%.
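
As a rough illustration of the 80/20 division described in the abstract, the following is a minimal sketch of a per-word (stratified) split. It assumes a balanced corpus of 100 utterances for each of the 50 words, which the abstract does not state; the function name `stratified_split`, the utterance identifiers, and the word labels are all hypothetical, and the thesis may have partitioned its data differently.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (utterance_id, word) pairs so every word keeps the same train/test ratio."""
    by_word = defaultdict(list)
    for utt_id, word in samples:
        by_word[word].append(utt_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for word, utt_ids in sorted(by_word.items()):
        rng.shuffle(utt_ids)
        cut = round(len(utt_ids) * train_fraction)
        train += [(u, word) for u in utt_ids[:cut]]
        test += [(u, word) for u in utt_ids[cut:]]
    return train, test


# Hypothetical balanced corpus: 50 words x 100 utterances = 5000 utterances.
samples = [
    (f"w{w:02d}_u{u:03d}", f"word_{w:02d}")
    for w in range(50)
    for u in range(100)
]

train, test = stratified_split(samples)
print(len(train), len(test))  # 4000 1000
```

Splitting within each word, instead of over the pooled utterances, keeps every word represented in both partitions, which matters for a word-level recognition task with a fixed vocabulary.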
