Open Source Audio-Visual Continuous Speech Recognition

Introduction

Audio-visual continuous speech recognition (AVCSR) is an attractive technology that combines audio and visual features to improve the accuracy of speech recognition in noisy acoustic environments. The AVCSR system is illustrated in Figure 1. First, the face and mouth of the speaker are detected and tracked in consecutive frames. Next, a set of visual features [Liang02] is extracted from the mouth region. The acoustic features obtained from the audio channel consist of Mel frequency cepstral coefficients (MFCC). The audio and visual observation sequences are modeled jointly using a coupled hidden Markov model (CHMM) [Nefian02, Liu02].
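The full CHMM couples the two streams at the state level, which is beyond a short sketch, but the basic intuition of audio-visual fusion can be illustrated with a simple stream-weighting scheme: combine the per-word audio and visual log-likelihoods with a weight that favors the more reliable stream. This is an illustrative simplification, not the package's CHMM; all names and numbers below are hypothetical.

```python
def fused_log_likelihood(audio_ll, visual_ll, audio_weight):
    """Combine per-word audio and visual log-likelihoods with a
    stream weight w: fused = w * audio + (1 - w) * visual."""
    return {w: audio_weight * audio_ll[w] + (1.0 - audio_weight) * visual_ll[w]
            for w in audio_ll}

# Toy per-digit log-likelihoods (hypothetical numbers).
audio_ll  = {"one": -12.0, "two": -15.5}   # the audio stream favors "one"
visual_ll = {"one": -20.0, "two": -14.0}   # the lip shapes favor "two"

# In clean audio, trust the acoustic stream more.
clean = fused_log_likelihood(audio_ll, visual_ll, audio_weight=0.9)
# In heavy acoustic noise, shift the weight toward the visual stream.
noisy = fused_log_likelihood(audio_ll, visual_ll, audio_weight=0.2)

best_clean = max(clean, key=clean.get)
best_noisy = max(noisy, key=noisy.get)
```

With these toy scores the fused decision follows the audio stream in the clean condition and the visual stream in the noisy one, which is exactly the behavior that makes AVCSR robust in noise.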


Figure 1. The AVCSR system

This demo is an offline system in which recorded audio-visual data are loaded from AVI files and processed. Besides the audio-visual speech recognition (AVCSR) decoder, two additional recognition engines are integrated into the system: the audio-only decoder (audio-only speech recognition engine, ASR) and the visual-only decoder (lip-reading engine, VSR). Users can run test data in the appropriate format (see Prepare the Data) to compare the performance of the different systems. In the current version, these decoders can process only A/V speech data consisting of numeric strings.
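All three decoders are HMM-based: each scores the observation sequence against word models and keeps the most likely path. A minimal log-domain Viterbi search over a toy discrete-observation HMM conveys the idea; the states, symbols, and probabilities below are invented for illustration and are unrelated to the shipped DLLs.

```python
import math

NEG_INF = float("-inf")

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state path for a discrete-observation HMM (log domain)."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best_p, best_s = NEG_INF, None
            for r in states:
                p = prev[r] + log_trans[r][s]
                if p > best_p:
                    best_p, best_s = p, r
            col[s] = best_p + log_emit[s][o]
            ptr[s] = best_s
        V.append(col)
        back.append(ptr)
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for ptr in reversed(back):          # trace back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path)), V[-1][last]

# Hypothetical 2-state model over two quantized feature symbols "a" and "b".
log = math.log
states = ["s1", "s2"]
log_start = {"s1": log(0.8), "s2": log(0.2)}
log_trans = {"s1": {"s1": log(0.6), "s2": log(0.4)},
             "s2": {"s1": log(0.3), "s2": log(0.7)}}
log_emit  = {"s1": {"a": log(0.9), "b": log(0.1)},
             "s2": {"a": log(0.2), "b": log(0.8)}}

path, score = viterbi(["a", "a", "b", "b"], states, log_start, log_trans, log_emit)
```

The real decoders work on continuous MFCC and lip-contour features with Gaussian mixture emissions rather than discrete symbols, but the dynamic-programming search is the same.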


System Requirements

CPU:                      Pentium 4, 1.7 GHz

Memory:                   512 MB (basic) / 1 GB (recommended)

HD free space:            1 GB

Peripherals:              sound card, speakers, video card, display (resolution 1024x768 or higher)

Platform:                 Windows 2000, Windows XP

Development tools:        MS Visual C++ 6.0

 


Installation

Run autorun.exe to install the package. The installer sets up all configuration files used by the AVCSR application.

 


The Key Features

The content of the AVCSR package is shown in Figure 2.

 

Figure 2. Content of the open source AVSR package

The face and mouth detection and tracking functions and the visual feature extraction functions (Figure 1) are released as open source (Visual C++). The use of the functions for face and mouth detection and tracking, visual feature extraction, principal component analysis (PCA) training, and linear discriminant analysis (LDA) training is described in MouthTrack.htm, VisualFeatures.htm, pca.htm, and lda.htm, respectively. The interfaces of the DLLs for HMM training, CHMM training, the audio-only HMM-based decoder, the visual-only HMM-based decoder, the audio-visual CHMM-based decoder, and the audio feature extraction module are described in hmm.htm, chmm.htm, and mfcc.htm. Table 1 summarizes the functionality of the package.
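The PCA and LDA toolkits reduce the dimensionality of the raw visual features before decoding. The core PCA operation, projecting centered samples onto the leading eigenvector of their covariance matrix, can be sketched for the 2-D case in pure Python; the function names and data below are hypothetical illustrations, not the toolkit's API.

```python
import math

def pca_axis_2d(points):
    """Principal axis (unit eigenvector of the larger eigenvalue) of the
    2x2 covariance matrix of a list of (x, y) samples."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Larger eigenvalue of [[sxx, sxy], [sxy, syy]] via trace/determinant.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # A corresponding (unnormalized) eigenvector.
    if abs(sxy) > 1e-12:
        v = (lam - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm), (mx, my)

def project(points, axis, mean):
    """1-D PCA coordinates: dot product of centered samples with the axis."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1]
            for p in points]

# Samples spread along the line y = x (hypothetical mouth-shape features).
pts = [(0, 0), (1, 1), (2, 2), (3, 3)]
axis, mean = pca_axis_2d(pts)
coords = project(pts, axis, mean)
```

The shipped toolkit works on high-dimensional lip-region features and finds many eigenvectors at once, but the projection step is the same dot product against each retained eigenvector.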

Table 1. Content of the open source AVSR package

Content | Location | Code Type | Document
Visual front-end (mouth tracking and visual speech feature extraction) | \cvaux | OpenCV-style source code | MouthTrack.htm, VisualFeatures.htm
Audio front-end (MFCC feature extraction) | \bin | Binary code (DLL) | mfcc.htm
HMM decoders for audio-only and visual-only speech recognition (ASR and VSR decoders) | \bin | Binary code (DLL) | hmm.htm
CHMM decoder for AVSR | \bin | Binary code (DLL) | chmm.htm
AVSR demo program | \apps\AvcsrDemoWin | Source code | avcsr.htm
PCA training and projection toolkit | \apps\pcademo | Source code | pca.htm
LDA training and projection toolkit | \apps\ldademo | Source code | lda.htm
HMM training programs for audio-only and visual-only speech recognition | \bin\HmmTrain | Binary code | hmm.htm
CHMM training programs for AVSR | \bin\ChmmTrain | Binary code | chmm.htm


Where to get AVCSR

Go to http://www.sourceforge.net/projects/opencvlibrary or http://www.intel.com/research/mrl/research/avcsr.htm


Reference Manual


If you have questions, corrections, or suggestions about these pages, please mail Lu.Hong.Liang@intel.com.