Open Source Audio-Visual Continuous Speech Recognition
Audio-visual continuous speech recognition (AVCSR) is an attractive technology that combines audio and visual features to improve the accuracy of speech recognition in noisy acoustic environments. The AVCSR system is illustrated in Figure 1. First, the speaker's face and mouth are detected and tracked across consecutive frames. Next, a set of visual features [Liang02] is extracted from the mouth region. The acoustic features obtained from the audio channel are Mel-frequency cepstral coefficients (MFCC). The audio and visual observation sequences are modeled jointly using a coupled hidden Markov model (CHMM) [Nefian02, Liu02].
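The decoding stage can be sketched as follows. This is a minimal illustration, not the package's actual CHMM decoder: it assumes per-frame log-likelihoods from the audio and visual streams are already available, combines them with a single exponent weight (a common simplification of coupled-HMM fusion), and runs a standard Viterbi search over the fused scores. All names and shapes here are illustrative.

```python
import numpy as np

def fuse_streams(ll_audio, ll_video, lam=0.7):
    """Weight audio vs. visual log-likelihoods; lam would normally be
    lowered as the acoustic SNR drops, shifting trust to the lips."""
    return lam * np.asarray(ll_audio) + (1.0 - lam) * np.asarray(ll_video)

def viterbi(log_trans, log_obs):
    """Standard Viterbi decoding.
    log_trans[i, j]: log transition probability from state i to state j.
    log_obs[t, j]:   fused log-likelihood of frame t under state j.
    Returns the most likely state sequence."""
    T, N = log_obs.shape
    delta = np.empty((T, N))          # best score ending in each state
    psi = np.zeros((T, N), dtype=int) # backpointers
    delta[0] = log_obs[0]             # flat initial prior (constant log 1/N omitted)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # N x N predecessor scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    state = int(delta[-1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(psi[t, state])
        path.append(state)
    return path[::-1]
```

In a digit-string recognizer such as this demo, the observation scores would come from the per-state output densities of the word models, and the state graph would chain the digit models together.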
Figure 1. The AVCSR system
This demo is an offline system in which recorded audio-visual data are loaded from AVI files and processed. Besides the audio-visual speech recognition (AVCSR) decoder, two more recognition engines are integrated into the system: the audio-only decoder (the audio-only speech recognition engine, ASR) and the visual-only decoder (the lip-reading engine, VSR). The user can supply test data in the appropriate format (see Prepare the Data) to compare the performance of the different systems. In the current version, these decoders can only process A/V speech data consisting of numeric strings.
System Requirements
CPU: Pentium 4, 1.7 GHz
Memory: 512 MB (basic) / 1 GB (recommended)
HD free space: 1 GB
Peripherals: sound card, speakers, video card, display (resolution 1024x768 or better)
Platform: Win2000, WinXP
Development tools: MS Visual C++ 6.0
Run autorun.exe to install the package. The installation program will set up all the configuration files used by the AVCSR application.
The Key Features
The content of the AVCSR package is shown in Figure 2.
Figure 2. Content of the open source AVSR package
The face and mouth detection and tracking and the visual feature extraction functions (Figure 1) are released as open source (Visual C++). The use of the functions for face and mouth detection and tracking, visual feature extraction, principal component analysis (PCA) training, and linear discriminant analysis (LDA) training is described in mouthTrack.doc, visualFeatures.doc, pca.doc, and lda.doc, respectively. The interfaces of the DLLs for the HMM training, CHMM training, audio-only HMM-based decoder, visual-only HMM-based decoder, audio-visual CHMM-based decoder, and audio feature extraction modules are described in hmm.doc, chmm.doc, and mfcc.doc. Table 1 summarizes the functionality of the package.
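As a rough illustration of what the PCA and LDA toolkits compute (the actual implementations and file formats are described in pca.doc and lda.doc), the two projections can be sketched as follows; the function names and array shapes here are illustrative, not the toolkits' API.

```python
import numpy as np

def pca_fit(X, k):
    """PCA via SVD: X holds one visual feature vector per row.
    Returns the data mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_project(X, mean, components):
    """Project feature vectors onto the principal subspace."""
    return (X - mean) @ components.T

def lda_fit(X, y, k):
    """LDA: find k directions maximizing between-class scatter
    relative to within-class scatter (k < number of classes)."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:k]]  # columns are the LDA directions
```

In the visual front-end of Figure 1, PCA typically reduces the high-dimensional mouth-region observations to a compact subspace, and LDA then projects those coefficients onto directions that best separate the speech classes.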
Table 1. Content of the open source AVSR package
Content | Location | Code Type | Document
Visual Front-end (including mouth tracking and visual speech feature extraction) | \cvaux | OpenCV-style source code |
Audio Front-end (MFCC feature extraction) | \bin | Binary code (dll) |
HMM decoders for audio-only and visual-only speech recognition (ASR decoder and VSR decoder) | \bin | Binary code (dll) |
CHMM decoder for AVSR | \bin | Binary code (dll) |
AVSR demo program | \apps\AvcsrDemoWin | Source code |
PCA training and projection toolkit | \apps\pcademo | Source code |
LDA training and projection toolkit | \apps\ldademo | Source code |
HMM training programs for audio-only and visual-only speech recognition | \bin\HmmTrain | Binary code |
CHMM training programs for AVSR | \bin\ChmmTrain | Binary code |
Go to http://www.sourceforge.net/projects/opencvlibrary or http://www.intel.com/research/mrl/research/avcsr.htm.
If you have questions, corrections, or suggestions about these pages, email Lu.Hong.Liang@intel.com.