Audiovisual Speech Recognition using Multi-View Lip Movement for Amharic Language using Deep Learning
Publisher
ASTU
Abstract
Speech is the most common form of interpersonal communication. The use of computers to
automatically transcribe natural speech to enable human-machine interaction is known as
automatic speech recognition. Effective human-machine interaction requires
speech recognition that is robust to noise. The performance of speech recognition
systems can be improved by recognizing lip movements and describing their correlations
with speech sounds, especially when operating in noisy environments.
Little research has been done on Amharic speech recognition, and most of it treats
automatic speech recognition and visual speech recognition separately. Only Befikadu
Belete has studied audio-visual speech recognition for the Amharic language
(Belete, 2017). However, that work uses only the frontal view, so it is not
applicable when only a side view is available.
The aim of this study is to develop audio-visual speech recognition for the Amharic language
using multi-view lip movement. Combining the possible view angles, the study uses a
bidirectional LSTM RNN to learn representations from the audio data and a CNN followed by a
bidirectional LSTM RNN to extract sequential features from the visual data. A multi-modal RNN
then fuses the two streams, and the fusion is learned automatically from the training data.
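
The two-stream design described above can be pictured with a minimal PyTorch sketch. It is an illustration under stated assumptions, not the thesis implementation: the class names, layer sizes, 13-dimensional audio features, 32x32 grayscale lip crops, and the use of an LSTM as the fusion RNN are all hypothetical, and the abstract does not say how the view angles are combined, so a single video stream stands in for them here.

```python
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Bidirectional LSTM over per-frame audio features (feature size is assumed)."""

    def __init__(self, n_features=13, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, n_features)
        out, _ = self.lstm(x)                  # (batch, time, 2 * hidden)
        return out


class VisualEncoder(nn.Module):
    """Per-frame CNN followed by a bidirectional LSTM over the frame sequence."""

    def __init__(self, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # one 16-dim vector per frame
        )
        self.lstm = nn.LSTM(16, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)   # (batch * time, 16)
        out, _ = self.lstm(feats.reshape(b, t, -1))    # (batch, time, 2 * hidden)
        return out


class AVWordClassifier(nn.Module):
    """Concatenates both streams per time step and fuses them with an RNN."""

    def __init__(self, n_words=50, hidden=64):
        super().__init__()
        self.audio = AudioEncoder(hidden=hidden)
        self.visual = VisualEncoder(hidden=hidden)
        # Assumption: the fusion RNN is a unidirectional LSTM.
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_words)

    def forward(self, audio, video):
        # Assumption: both streams are already aligned to the same number of steps.
        joint = torch.cat([self.audio(audio), self.visual(video)], dim=-1)
        _, (h_n, _) = self.fusion(joint)       # h_n: (1, batch, hidden)
        return self.head(h_n[-1])              # (batch, n_words) word logits


model = AVWordClassifier()
logits = model(torch.randn(2, 20, 13), torch.randn(2, 20, 1, 32, 32))
print(logits.shape)
```

The only figure taken from the abstract is the 50-word output size; every other dimension is a placeholder chosen to keep the example small.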
To develop the proposed approach, data was collected from YouTube and other
broadcasting websites, including dramas, news programs presented by a single
speaker, and interviews with multiple people, all in Amharic. The dataset contains
5000 utterances of 50 words and is divided into a training set with around 80% of
the data and a testing set with around 20%.
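
As a rough illustration of the split described above, the following Python sketch divides 5000 utterance indices into training and testing sets. It assumes, purely for the example, that the 50 words are evenly represented (100 utterances each) and that the split is stratified per word; the abstract states neither, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split utterance indices so each word keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for idx, word in enumerate(labels):
        by_word[word].append(idx)

    train, test = [], []
    for word in sorted(by_word):
        idxs = by_word[word]
        rng.shuffle(idxs)                      # random assignment within each word
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


# Assumed layout: 50 word classes with 100 utterances each (5000 in total).
labels = [word for word in range(50) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))           # 4000 1000
```

Under these assumptions the 80/20 ratio gives 4000 training and 1000 testing utterances.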
Finally, the results demonstrate that using multiple views enhances the accuracy of word recognition in
audio-visual speech recognition of the Amharic language. Combining the frontal view with the left
three-quarter view gives higher accuracy than the other combinations, at 97.5%.
