Speech recognition using historian multimodal approach

Other Title(s)

التعرف على الكلام باستخدام النهج التاريخي المتعدد الوسائط

Joint Authors

al-Maghribi, Islam Id Ali Muhammad
Judi, Amr Muhammad Rifat
Faruq, Hisham Muhammad

Source

The Egyptian Journal of Language Engineering

Issue

Vol. 6, Issue 2 (30 Sep. 2019), pp.44-58, 15 p.

Publisher

Egyptian Society of Language Engineering

Publication Date

2019-09-30

Country of Publication

Egypt

No. of Pages

15

Main Subjects

Information Technology and Computer Science

Abstract EN

This paper proposes an Audio-Visual Speech Recognition (AVSR) model that uses both audio and visual speech information
to improve recognition accuracy in clean and noisy environments.

Mel-Frequency Cepstral Coefficients (MFCC) and the Discrete
Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively.
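
As a rough, hypothetical illustration of this kind of feature extraction (the paper does not publish code, and the sample rate, coefficient counts, and coefficient-selection scheme below are assumptions), MFCCs can be computed from the audio track with librosa and a 2-D DCT can be applied to a cropped grayscale mouth-region frame with SciPy:

```python
import librosa
from scipy.fftpack import dct

def audio_features(wav_path, n_mfcc=13):
    # MFCCs per audio frame; sample rate and coefficient count are illustrative
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # (n_frames, n_mfcc)

def visual_features(mouth_frame, n_coeffs=32):
    # 2-D DCT of a grayscale mouth-region image: DCT along rows, then along columns
    coeffs = dct(dct(mouth_frame, axis=0, norm="ortho"), axis=1, norm="ortho")
    # keep a low-frequency block of coefficients (selection scheme is an assumption)
    return coeffs[:8, :8].flatten()[:n_coeffs]
```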

The classification process is performed on the combined feature vector using a deep neural network (DNN)
architecture, the Bidirectional Long Short-Term Memory (BiLSTM) network, in contrast to the traditional Hidden Markov Models (HMMs).


The effectiveness of the proposed model is demonstrated on a multi-speaker AVSR benchmark dataset named GRID.

The
experimental results show that early integration of audio and visual features yields a clear improvement in
recognition accuracy and that BiLSTM is a more effective classification technique than HMM.
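
In this context, "early integration" means fusing the two modalities at the feature level before classification. A minimal sketch, under the assumption that the audio and visual frame rates have already been aligned:

```python
import numpy as np

def early_fusion(audio_feats, visual_feats):
    # audio_feats: (n_frames, n_mfcc), visual_feats: (n_frames, n_dct)
    # Concatenate per-frame vectors into one (n_frames, n_mfcc + n_dct) sequence
    assert audio_feats.shape[0] == visual_feats.shape[0], "frame counts must match"
    return np.concatenate([audio_feats, visual_feats], axis=1)
```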

The highest recognition accuracy obtained with integrated audio-visual features is 99.07%, an
enhancement of up to 9.28% over audio-only recognition on clean data.

For noisy data, the highest recognition accuracy with
integrated audio-visual features is 98.47%, an enhancement of up to 12.05% over audio-only recognition.

The main reason for the
effectiveness of BiLSTM is that it takes into account the sequential characteristics of the speech signal.

The obtained results show improved performance
compared with the highest previously reported audio-visual recognition accuracies on GRID and demonstrate the robustness of
the proposed AVSR model (BiLSTM-AVSR).

American Psychological Association (APA)

al-Maghribi, Islam Id Ali Muhammad, Judi, Amr Muhammad Rifat, & Faruq, Hisham Muhammad. 2019. Speech recognition using historian multimodal approach. The Egyptian Journal of Language Engineering, Vol. 6, no. 2, pp. 44-58.
https://search.emarefa.net/detail/BIM-1012009

Modern Language Association (MLA)

al-Maghribi, Islam Id Ali Muhammad, et al. Speech recognition using historian multimodal approach. The Egyptian Journal of Language Engineering, Vol. 6, no. 2 (Sep. 2019), pp. 44-58.
https://search.emarefa.net/detail/BIM-1012009

American Medical Association (AMA)

al-Maghribi, Islam Id Ali Muhammad, Judi, Amr Muhammad Rifat, Faruq, Hisham Muhammad. Speech recognition using historian multimodal approach. The Egyptian Journal of Language Engineering. 2019. Vol. 6, no. 2, pp. 44-58.
https://search.emarefa.net/detail/BIM-1012009

Data Type

Journal Articles

Language

English

Notes

-

Record ID

BIM-1012009