Coupled Hidden Markov Model (CHMM) in AVSR

 

Content

1.    Fundamental Concepts of Coupled Hidden Markov Model (CHMM)
2.    Model Training and Recognition
3.    Running Environment and File Formats
       3.1    monotrain
       3.2    gengv
       3.3    format
       3.4    maptlist
       3.5    emtrain
       3.6    mixup
       3.7    Config file example
       3.8    CHMMDecoder
4.    Example
References

 

1.  Fundamental Concepts of Coupled Hidden Markov Model (CHMM)

A CHMM can be seen as a collection of Hidden Markov Models (HMMs), one for each data stream, where the hidden backbone nodes at time t for each HMM are conditioned on the backbone nodes at time t-1 for all the related HMMs. The following figure illustrates a continuous mixture two-stream coupled HMM in our audio-visual speech recognition system. The squares represent the hidden discrete nodes (backbone and mixture nodes), while the circles represent the continuous observable nodes. Unlike the independent HMMs used for audio-visual data, the CHMM can capture the interactions between the audio and video streams through the transition probabilities between the backbone nodes. The CHMM-based audio-visual continuous speech recognition (AVCSR) system is an extension of the decision fusion system at the phone level. The CHMM can model the audio-visual state asynchrony and preserve the natural audio-visual dependencies over time.

Figure 1: the audio-visual coupled HMM

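The parameters of a CHMM can be defined as below, following [Nefian02]:

    \pi_0^c(i) = P(q_0^c = i)

    b_t^c(i) = P(O_t^c \mid q_t^c = i) = \sum_{m=1}^{M_i^c} w_{i,m}^c N(O_t^c; \mu_{i,m}^c, U_{i,m}^c)

    a_{i|j,k}^c = P(q_t^c = i \mid q_{t-1}^0 = j, q_{t-1}^1 = k)

where q_t^c is the state of the couple node in the cth stream at time t, O_t^c is the observation vector of that stream, and \mu_{i,m}^c and U_{i,m}^c are the mean and covariance matrix of the mth mixture component of state i,

where w_{i,m}^c = P(p_t^c = m \mid q_t^c = i) and p_t^c is the component of the mixture node in the cth stream at time t.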
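The coupling described in this section, where the backbone node of each stream at time t depends on the backbone nodes of both streams at time t-1, can be illustrated with a forward recursion over joint (audio, video) state pairs. The following is only a minimal sketch under stated assumptions: the function name, the nested-list layout of the tables, and the use of plain probabilities instead of log values are illustrative choices, not the layout used by the tools in this package.

```python
from itertools import product


def chmm_forward_likelihood(pi, trans, obs_lik):
    """Likelihood of an observation sequence under a two-stream coupled HMM.

    pi[c][i]           initial probability of state i in stream c
    trans[c][j][k][i]  P(state i in stream c at t | audio state j, video state k at t-1)
    obs_lik[t][c][i]   observation likelihood of stream c at time t in state i
    """
    n_audio, n_video = len(pi[0]), len(pi[1])
    pairs = list(product(range(n_audio), range(n_video)))

    # Initialization: the two streams start independently.
    alpha = {
        (i, j): pi[0][i] * pi[1][j] * obs_lik[0][0][i] * obs_lik[0][1][j]
        for i, j in pairs
    }

    # Recursion: each stream's transition is conditioned on BOTH previous states.
    for t in range(1, len(obs_lik)):
        alpha = {
            (i, j): obs_lik[t][0][i] * obs_lik[t][1][j] * sum(
                alpha[(p, q)] * trans[0][p][q][i] * trans[1][p][q][j]
                for p, q in pairs
            )
            for i, j in pairs
        }
    return sum(alpha.values())
```

Because the recursion runs over state pairs, its cost grows with the product of the per-stream state counts, which is one reason the training tools described later rely on beam pruning.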

 


 

2.  Model Training and Recognition

The training of the CHMM parameters is performed in two stages. In the first stage, the CHMM parameters are estimated for isolated phoneme-viseme pairs. The parameters of the isolated phoneme-viseme CHMMs are estimated first using the Viterbi-based initialization described in [Nefian02], followed by the expectation-maximization (EM) algorithm [Jensen98]. In the second stage, the parameters of the CHMMs, estimated individually in the first stage, are refined through the embedded training of all CHMMs. In a way similar to the embedded training for HMMs [Young95], each of the models obtained in the first stage is extended with one entry and one exit non-emitting state.

Audio-visual speech recognition is carried out by a graph decoder applied to the word network consisting of all the words in the test dictionary. Each word in the network is stored as a sequence of phoneme-viseme CHMMs, and the best sequence of words is obtained through an extension of the token passing algorithm [Young95].

 


 

3.  Running Environment and File Formats

The training process of the CHMM involves the programs listed below, each followed by a short description.

monotrain.exe:    Initializes the CHMM parameters of isolated phoneme-viseme pairs using the Viterbi algorithm and the EM algorithm.

gengv.exe:          Generates the global variance file for the training data.

format.exe:         Binds all individual monophone models into a CHMM model set for embedded training.

maptlist.exe:       Converts the logical transcript file to a physical transcript using the CHMM mapping file.

emtrain.exe:       Performs embedded forward-backward model re-estimation and dumps statistics (occupation) files.

mixup.exe:          Splits mixtures.

Each executable needs one config file and other relevant files to function properly. The command line parameters and reference files of each program are described in detail below.

 

3.1  monotrain

  monotrain

 
   
  Command line parameters: monotrain -config avsr.cfg
   
  Parameters:  
   
      ChainNum Integer.
  Number of streams.
   
      LabelRootDir String, directory.
  Root directory for all label files.
   
      AFeatureRootDir String, directory.
  Root directory for all audio feature files.
   
      VFeatureRootDir String, directory.
  Root directory for all visual feature files.
   
      MonoPhnList String, input file.
  Monophone list file name.
   
      LabelFileList String, input file.
  Label list file. Contains the list of the names of label files used.
   
      AVecSize Integer.
  Vector size of audio feature.
   
      VVecSize Integer.
  Vector size of visual feature.
   
      CMS Integer.
  Number of cepstrum coefficients. If positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization is performed; if 0, no input feature transformation is done.
   
      adaptCMSnVNWinSize Integer.
  Window Size for adaptive CMS and VN on the audio feature. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN.
   
      LabelFileSuffix String, suffix of label files
   
      LabelFileListType Integer, list type of label files.
   
      VarFloor Float
  Variance floor value.
   
      WgtFloor Float.
  Mixture weight floor value.
   
      MinSegNum Integer.
  Minimum segment number used to train the model.
   
      MaxSegNum Integer.
  Maximum segment number used to train the model. If training corpus has more segments, use only MaxSegNum segments.
   
      LabCoefficient Float.
  Used to convert original label to the label in frame.
   
      AudioFeatureFileSuffix String, suffix of audio feature files.
   
      VideoFeatureFileSuffix String, suffix of visual feature files.
   
      Viterbi Boolean.
  If true, use Viterbi algorithm to initialize segmentation; if false, otherwise.
       
      Forward Boolean.
  If true, use Forward-Backward algorithm to train the model; if false, otherwise.
   
      TeeModel String, transcription of Tee model.
   
      FileSaveAfterViterbi Boolean.
  If true, save models after initial Viterbi iteration; if false, otherwise.
   
      MaxIterNumVTB Integer.
  Maximum iteration number of Viterbi re-estimation.
   
      OutputDirVTB String, directory.
  Output model directory for models created after running through Viterbi training and input directory for the forward-backward training stage.
   
      MaxIterNumFB Integer.
  Maximum iteration number of forward-backward re-estimation.
   
      OutputDirFB String, directory.
  Output model directory for models created after running through the forward-backward algorithm.

Discussion

The monotrain tool is used to initialize isolated CHMMs using the Viterbi algorithm and the forward-backward algorithm. Label and feature files are the input data. A label file consists of a number of lines, each of the form:

start    end    phone

where 'start' is the segment start time, 'end' is the segment end time, and 'phone' is the monophone corresponding to the segment. The time is measured in 1e-7 seconds and the segment time is related to the parameter 'LabCoefficient' (for a 16 kHz wave file at 100 frames per second, 'LabCoefficient' = 160). The initial and the final frame numbers N are found by the formula:

Below follows a fragment of the label file:

0 48 sil
48 70 z
70 88 ih
88 91 r
91 113 ow
113 137 sp
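To make the label format concrete, the sketch below parses 'start end phone' lines into frame-indexed segments. It rests on an assumption that is not confirmed by this manual: a label time is converted to a frame index by dividing it by LabCoefficient and truncating. The function name is hypothetical.

```python
def parse_label_file(text, lab_coefficient=1.0):
    """Parse 'start end phone' lines into (start_frame, end_frame, phone) tuples.

    Assumption: frame index = label time / LabCoefficient, truncated.
    The exact conversion used by monotrain may differ.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"malformed label line: {line!r}")
        start, end, phone = float(fields[0]), float(fields[1]), fields[2]
        segments.append(
            (int(start / lab_coefficient), int(end / lab_coefficient), phone)
        )
    return segments
```

With LabCoefficient=1.0, as in the example config file, the label values in the fragment above are used directly as frame indices.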

The monophone list file lists the transcriptions of the monophones, in the following format:

ah
ao
ax

The label list file contains the list of label files used, in the following format:

{
LabelFileName1
LabelFileName2
}

A simple Gaussian distribution with a diagonal covariance matrix is used to compute probabilities in the CHMM. Simple distribution means that there is only one Gaussian in a mixture and the Gaussian weight is equal to 1. For each state and for each segment, the mean vector of the mixture component, the covariance matrix of the mixture component, and the transition probabilities are initialized. All parameters are initialized by uniform segmentation.
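Uniform segmentation can be pictured as cutting a segment into equal runs of frames, one run per state, and estimating a single diagonal Gaussian from each run. The sketch below does this for one segment of one stream. The function name, the single-segment scope, and the way the variance floor is applied are assumptions made for illustration; monotrain pools statistics over many segments and its exact procedure is not given here.

```python
def uniform_segmentation_stats(frames, num_states, var_floor=1.0e-7):
    """Split one segment evenly across states; return [(mean, variance), ...].

    frames is a list of equal-length feature vectors. Variances are diagonal
    and clamped from below by var_floor (an assumed use of VarFloor).
    """
    total = len(frames)
    if total < num_states:
        raise ValueError("segment has fewer frames than states")
    stats = []
    for s in range(num_states):
        # Frames assigned to state s under an equal split.
        chunk = frames[s * total // num_states:(s + 1) * total // num_states]
        dim = len(chunk[0])
        mean = [sum(f[d] for f in chunk) / len(chunk) for d in range(dim)]
        var = [
            max(sum((f[d] - mean[d]) ** 2 for f in chunk) / len(chunk), var_floor)
            for d in range(dim)
        ]
        stats.append((mean, var))
    return stats
```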

The segments are read from the feature files. The Viterbi component of the monotrain tool then searches the input data for as many segments as indicated by the parameter MaxSegNum. If there are fewer segments for a given monophone than indicated by the parameter MinSegNum, initialization is canceled and no output files are created.

The next steps are the Viterbi state realignment and the best likelihood path selection for each segment. The Viterbi component of the monotrain tool executes up to MaxIterNumVTB iterations. However, the program may be terminated earlier if the best likelihood/frame value change becomes relatively insignificant.

The Viterbi part of the monotrain tool creates two files: a binary file with the extension *.hmm and a text file with the extension *.txt. The number of segments found for every monophone from the MonoPhnList must not be smaller than MinSegNum.

The forward-backward component of the monotrain tool initializes the CHMMs using the forward-backward algorithm for each monophone. The input data for that component are the CHMMs produced by the Viterbi part of the tool and the feature files. It operates on the same segments and uses the same Baum-Welch re-estimation algorithm as the Viterbi part.

The forward-backward component produces CHMM files for all monophones from the MonoPhnList except the 'tee' model. Note that no model is trained for the 'tee' model; the format tool creates the CHMM for the 'tee' model later.

The forward-backward component of the monotrain tool executes up to MaxIterNumFB iterations of Baum-Welch re-estimation. However, the program may be terminated earlier if the likelihood/frame value change becomes relatively insignificant.

Like the Viterbi part, the forward-backward component of the monotrain tool creates two files, a binary file with the extension *.hmm and a text file with the extension *.txt, for all monophones from the list MonoPhnList except the 'tee' model.

The CMS and adaptCMSnVNWinSize parameters apply to the audio feature only.
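As an illustration of the utterance-based case (adaptCMSnVNWinSize equal to 0 with a positive CMS value), the sketch below subtracts the per-utterance mean from the leading cepstral coefficients of every frame. That CMS counts the leading coefficients of the feature vector is an assumption, as is the function name; the windowed (adaptive) variant and variance normalization are not shown.

```python
def cepstral_mean_subtraction(frames, num_cepstra):
    """Utterance-based CMS on the first num_cepstra coefficients of each frame.

    frames is a list of equal-length feature vectors. Coefficients beyond
    num_cepstra are passed through unchanged (assumed behaviour).
    """
    if not frames:
        return []
    means = [
        sum(f[d] for f in frames) / len(frames) for d in range(num_cepstra)
    ]
    return [
        [x - means[d] if d < num_cepstra else x for d, x in enumerate(f)]
        for f in frames
    ]
```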

 


 

3.2  gengv

  gengv

 
   
  Command line parameters: gengv -config avsr.cfg
   
  Parameters:  
   
      AudioFeatureDir String, directory.
  Root directory of all audio feature files.
   
      VisualFeatureDir String, directory.
  Root directory of all visual feature files.
   
      LabelFileList String, file name.
  Label file name list with starting and ending frame index.
   
      AudioVarOutputFile String, file name.
  Output file name of audio feature variance.
   
      VisualVarOutputFile String, file name.
  Output file name of visual feature variance.
   
      AudioVecSize Integer.
  Vector size of audio feature.
   
      VisualVecSize Integer.
  Vector size of visual feature.
   
      CMS Integer.
  Number of cepstrum coefficients. If positive, Cepstrum Mean Subtraction is performed; if negative, Cepstrum Variance Normalization is performed; if 0, no input feature transformation is done.
   
      adaptCMSnVNWinSize Integer.
  Window Size for adaptive CMS and VN. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN.
   

Discussion

The gengv tool generates a global variance file for the training data.

Input data are feature files with the extension *.mfcc that are stored in the corresponding directories and listed in the file given by the parameter LabelFileList. LabelFileList has the following format:

filename (without extension),    start frame~end frame

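The gengv tool produces the global variance vector computed by the formula:

    v = \frac{1}{\sum_{n=1}^{N} T_n} \sum_{n=1}^{N} \sum_{t=1}^{T_n} (x_{n,t} - m)^2

where x_{n,t} is the t-th frame of the n-th feature file, v is the global variance vector, m is the global mean vector, N is the number of feature files, and T_n is the length of the n-th feature file. The square is taken element-wise.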
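The computation can be sketched as pooling all frames of all feature files, then taking the per-dimension mean and variance. This is a sketch under assumptions: the function name is hypothetical, the variance is normalized by the total frame count, and the start~end frame ranges from LabelFileList are taken to have been applied already.

```python
def global_variance(feature_files):
    """Return (mean, variance) vectors pooled over all frames of all files.

    feature_files is a list of utterances, each a list of equal-length
    feature vectors. Normalizing by the total frame count is an assumption.
    """
    frames = [frame for utterance in feature_files for frame in utterance]
    if not frames:
        raise ValueError("no frames to accumulate")
    count = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / count for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / count for d in range(dim)]
    return mean, var
```

The audio and visual streams would each be processed separately, matching the two output files AudioVarOutputFile and VisualVarOutputFile.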

 


 

3.3  format

  format

 
   
  Command line parameters: format -config avsr.cfg
   
  Parameters:  
   
      ChainNum Integer.
  Number of streams.
   
      MonoPhnList String, file name.
  Monophone list file name.
   
      InputDir String, directory.
  Root directory of the monophone model produced by monotrain tool.
   
      OutputDir String, directory.
  Directory of the output triphone model mapping, structure and parameter files.
   
      CHMMMapFile_W String, file name.
  Output CHMM mapping file name.
   
      CHMMPhysFile_W String, file name.
  Output CHMM physical file name.
   
      CHMMParamFile_W String, file name.
  Output CHMM parameter file name.
   
      TeeModel String.
  Transcription of tee model.
   
      SilenceModel String.
  Transcription of silence model.
   

Discussion

The format tool produces the CHMM mapping, structure and parameter files (*.map, *.phys and *.param) that are used for embedded training and decoding. It builds these model files from the individual monophone models (files with the extension *.hmm) produced by the monotrain tool. The middle state of the 'silence' CHMM is used to initialize the CHMM that is added for the 'sp' model.

 


 

3.4  maptlist

  maptlist

 
   
  Command line parameters: maptlist -config avsr.cfg
   
  Parameters:  
   
      MonoPhnList String, file name.
  Monophone list file name.
   
      InputDir String, directory.
  Directory in which the CHMM mapping and parameter files are placed.
   
      OutputDir String, directory.
  Directory for output physical transcript and CHMM physical files.
   
      CHMMMapFile String, file name.
  Input monophone CHMM mapping file name.
   
      CHMMPhysFile String, file name.
  Output CHMM physical file name (only for decision tree mapping). If the CHMM mapping file is a decision tree, then the physical CHMM file must be rebuilt with the help of the original CHMM parameter file and the decision tree. If the CHMM mapping file is not a decision tree, then the original physical CHMM file must be unchanged.
   
      CHMMParamFile String, file name.
  Input monophone CHMM parameter file name.
   
      TeeModel String.
  Transcription of tee model.
   
      LogicTListFile String, file name.
  Input logical transcript file.
   
      PhysTListFile String, file name.
  Output physical transcript file name.
   
      GroupSize Integer.
  Defines the number of tasks in each section of the output task list file.

Discussion

The maptlist tool transforms the logical task list file LogicTListFile, which contains the transcriptions of the utterances, into the physical task list file PhysTListFile used by the emtrain tool during embedded training. For each training utterance, the logical task list contains the monophone or triphone transcription of the utterance, the feature file name, and the frame range.

Below is a fragment of a monophone logical task list:

{
[transcript] sil z ih s r ah w ao ow ah t uw ah
trainData/003_1_1to2, 0~1189
}
 

The physical task list has the same structure, except that the names are replaced by the indices of the corresponding CHMMs in the hmm.phys file.

Below follows a fragment of a physical task list:

{
[transcript] 16 22 9 14 13 0 21 1 12 0 17 19 0
trainData/003_1_1to2, 0~1189
}
 

The hmm.map file can be of two types: DECISIONTREE or ONE2ONEMAPPING. If the ONE2ONEMAPPING type is used, the file hmm.phys already exists. If the DECISIONTREE type is used, the file hmm.phys is built simultaneously with the physical task list.
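For the one-to-one mapping case, the logical-to-physical conversion amounts to a table lookup on each name of the transcript line. The sketch below illustrates this. The function name and the PHYS_INDEX table are hypothetical: the indices are simply read off the two task list fragments in this section, whereas the real table comes from hmm.phys, whose file format is not described here.

```python
# Indices inferred from the logical and physical fragments in this section;
# the real name-to-index table is defined by hmm.phys.
PHYS_INDEX = {
    "ah": 0, "ao": 1, "ih": 9, "ow": 12, "r": 13, "s": 14,
    "sil": 16, "t": 17, "uw": 19, "w": 21, "z": 22,
}


def to_physical_transcript(logical_line, phys_index):
    """Replace each model name on a '[transcript]' line by its CHMM index."""
    prefix = "[transcript]"
    if not logical_line.startswith(prefix):
        raise ValueError("not a transcript line")
    names = logical_line[len(prefix):].split()
    return prefix + " " + " ".join(str(phys_index[name]) for name in names)
```

The feature file line of each task entry (file name and frame range) is copied unchanged, as the two fragments show.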

 


 

3.5  emtrain

  emtrain

 
   
  Command line parameters: emtrain -config avsr.cfg -group mono -pass mono11
  -group and -pass select which parameters in the config file apply to this run
   
  Parameters:  
   
      ChainNum Integer.
  Number of streams.
   
      GlobalVarFile0 String, file name.
  Input file name of global variance file of the audio feature (the result of gengv tool)
   
      GlobalVarFile1 String, file name.
  Input file name of global variance file of the visual feature (the result of gengv tool)
   
      AFeatureRootDir String, directory.
  Root directory of the audio feature files for embedded training.
   
      VFeatureRootDir String, directory.
  Root directory of the visual feature files for embedded training.
   
      PruneInit Float.
  Initial value of beam width for pruning in the backward calculation.
   
      PruneLimit Float.
  Upper limit of the backward beam width value. Once this number is exceeded, the current utterance is not used for training.
   
      PruneInc Float.
  Beam width value increment at every pruning failure.
   
      MinTrainSeg Integer.
  Minimal number of phone examples required to update the corresponding CHMM parameters.
   
      UpdateMode Integer.
  Update mode (4-bit encoded; for each bit, 0 disables and 1 enables the update):
  bit 0    transition matrix
  bit 1    Gaussian weights
  bit 2    mean vectors
  bit 3    variance vectors
   
      VarFloorCoeff Float.
  Variance floor value.
   
      CHMMPhysFile String, file name.
  Input CHMM physical file name (path from working directory)
   
      CHMMParamFile String, file name.
  Input CHMM parameter file name (path from InputDir directory)
   
      CHMMMapFile String, file name.
  Input CHMM mapping file name (path from working directory)
   
      PhysTListFile String, file name.
  Input physical list file. Contains the list of feature files used for training and physical transcription.
   
      TotalOcctFile String, file name.
  Output file name of the statistics file.
   
     adaptCMSnVNWinSize Integer.
  Window Size for adaptive CMS and VN on the audio feature. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN.
   
      CMS Integer.
  Number of cepstrum coefficients. If positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization is performed; if 0, no input feature transformation is done.
   
      StateOrModelOcct_B Boolean.
  Defines the type of the statistics (occupation) file. If StateOrModelOcct_B is true, the state-oriented statistics file is generated; this file is used by the mixture merging and splitting commands. If the value is false, the model-oriented statistics file is generated; this file is used by the decision and regression tree building tools. Currently StateOrModelOcct_B can only be set to true.
   
      MinMixOcc Float.
  Minimal CHMM state mixture occupation required to update parameters of its Gaussian mixture.
   
      NewHMMParamFile String, file name.
  Output new CHMM parameter file name (path from the OutputDir directory).
   
      InputDir String, directory.
  Directory in which the CHMM parameter file is placed.
   
      OutputDir String, directory.
  Directory for output CHMM parameter file and statistics files.
   

Discussion

The emtrain tool performs one training iteration of the input CHMM model set on a given set of input feature files. The training procedure is based on the expectation-maximization (EM) method, i.e., an iterative method to perform maximum likelihood (ML) estimation with incompletely observed data. The EM method consists of the expectation step (E-step), which obtains the Baum function, and the maximization step (M-step). The Baum-Welch (forward-backward, F-B) algorithm is used. The tool applies the F-B algorithm to each feature file and collects accumulators and occupation counters for each CHMM state in the model, which are then used to re-estimate the CHMM parameter values. If the number of training examples of a CHMM is less than the value of the parameter MinTrainSeg, the parameters of that model are not updated. The mean and variance vectors for which the occupation counter is less than the value of the parameter MinMixOcc are not updated either.

The parameter UpdateMode defines which parameters the emtrain tool updates. To make the F-B algorithm run faster, beam pruning is used. The initial value of the beam width is set by the parameter PruneInit. If either the forward or the backward calculation fails, the beam width is repeatedly increased by the value of the parameter PruneInc until it exceeds the value of the parameter PruneLimit. If pruning still fails at the maximum beam width, the feature file is not used for training during this iteration.
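Two of the parameters above lend themselves to a small illustration: the bit-encoded UpdateMode and the widening beam schedule. The sketch below decodes an UpdateMode value into the parameter groups it enables and lists the beam widths that would be tried for given PruneInit, PruneInc and PruneLimit values. Both function names are hypothetical, and the boundary handling (a width equal to PruneLimit is still tried) is an assumption.

```python
# Bit positions as listed for the UpdateMode parameter.
UPDATE_BITS = (
    "transition matrix",
    "Gaussian weights",
    "mean vectors",
    "variance vectors",
)


def decode_update_mode(update_mode):
    """Return the names of the parameter groups enabled by UpdateMode."""
    return [name for bit, name in enumerate(UPDATE_BITS) if update_mode & (1 << bit)]


def beam_schedule(prune_init, prune_inc, prune_limit):
    """List the beam widths tried before an utterance is skipped.

    Assumption: widths are tried while they do not exceed prune_limit.
    """
    widths = []
    width = prune_init
    while width <= prune_limit:
        widths.append(width)
        width += prune_inc
    return widths
```

With the values from the example config file (UpdateMode=15, PruneInit=800, PruneInc=1000, PruneLimit=2000), all four parameter groups are updated and, under the stated assumption, two beam widths (800 and 1800) are tried before the utterance is dropped for the iteration.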

The CMS and adaptCMSnVNWinSize parameters apply to the audio feature only.

 


 

3.6  mixup

  mixup

 
   
  Command line parameters: mixup -config avsr.cfg -group mono -pass mono14
  -group and -pass select which parameters in the config file apply to this run
   
  Parameters:  
   
      CHMMParamFile String, file name.
  Input and output CHMM parameter file name (path from the InputDir and OutputDir respectively).
   
      CHMMPhysFile String, file name.
  Input CHMM physical file name (path from the working directory).
   
      CHMMMapFile String, file name.
  Input CHMM mapping file name (path from the working directory).
   
      OcctFile String, file name.
  Input statistics file name (path from the InputDir directory).
   
      InputDir String, directory.
  Directory with input CHMM parameter files.
   
      OutputDir String, directory.
  Directory for output CHMM parameter files.
   
      MixNum Integer.
  The number of components into which the mixtures are split.
   
      MinMixOcc Float.
  Minimal CHMM state occupation to split mixture components.
   
      PertDepth Float.
  Perturb depth of the mean of split Gaussians. Split one Gaussian with mean m and variance v vectors to two new Gaussians whose means are m+PertDepth*v and m-PertDepth*v, variances are unchanged, i.e., equal to v.
   
      VarFloor Float.
  Variance floor value.
   
      SplitPenalty Float.
  Gaussian split penalty. Each time a Gaussian is split, this number is added to prevent the Gaussian from being split too many times. This parameter is usually set to a very large value.
   

Discussion

The mixup tool increases the number of Gaussians in all mixtures to the MixNum value by successively splitting the most extended Gaussians.

Accurate approximation of a continuous probability density function requires increasing the number of mixture densities (mixture coefficients) at the CHMM training stage. This program increases the number of components in all Gaussian mixtures of the CHMM by splitting the components with the biggest determinant of the variance matrix:

    |V| = \prod_{i=1}^{N} v_i

where v_i is the i-th diagonal element of the variance matrix and N is the vector length.
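One splitting step can be sketched as follows: pick the component whose diagonal variance has the largest determinant, then replace it with two components whose means are perturbed by PertDepth times the variance vector, as described for the PertDepth parameter. Halving the weight between the two new components is an assumption, as is the function name; MinMixOcc and SplitPenalty are not modelled.

```python
import math


def split_widest_component(mixture, pert_depth):
    """Split the component with the largest variance determinant in two.

    mixture is a list of (weight, mean, var) tuples with diagonal variances.
    The new means are m + pert_depth*v and m - pert_depth*v; variances are
    unchanged. Sharing the weight equally is an assumption.
    """
    def log_det(var):
        # Log-determinant of a diagonal matrix: sum of log diagonal entries.
        return sum(math.log(v) for v in var)

    widest = max(range(len(mixture)), key=lambda k: log_det(mixture[k][2]))
    weight, mean, var = mixture[widest]
    plus = [m + pert_depth * v for m, v in zip(mean, var)]
    minus = [m - pert_depth * v for m, v in zip(mean, var)]
    rest = mixture[:widest] + mixture[widest + 1:]
    return rest + [(weight / 2, plus, list(var)), (weight / 2, minus, list(var))]
```

Calling the function repeatedly until each mixture holds MixNum components mirrors the successive splitting described above.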

 


 

3.7  Config file Example

To help users run CHMM training more easily, we provide an example batch file that runs the tools described above in sequence. Below are fragments of the batch file:

gengv -config av.cfg
monotrain -config av.cfg
format -config av.cfg
emtrain -config av.cfg -group mono -pass mono11
emtrain -config av.cfg -group mono -pass mono12
emtrain -config av.cfg -group mono -pass mono13
mixup -config av.cfg -group mono -pass mono14
emtrain -config av.cfg -group mono -pass mono21
emtrain -config av.cfg -group mono -pass mono22
emtrain -config av.cfg -group mono -pass mono23
 

Fragments of the corresponding config file 'av.cfg' are shown below:

[monotrain]
ChainNum=2
LabelRootDir=label
AFeatureRootDir=AudioFeature
VFeatureRootDir=VisualFeature
MonoPhnList=mono.list
LabelFileList=lab.list
AVecSize=39
VVecSize=39
LabelFileListType=2
CMS=12
adaptCMSnVNWinSize=30000
LabelFileSuffix=label
VarFloor=1.0e-7
WgtFloor=1.0e-5
MinSegNum=1
MaxSegNum=500
LabCoefficient=1.0
AudioFeatureFileSuffix=mfcc
VideoFeatureFileSuffix=mfcc
Viterbi=true
Forward=true
TeeModel=sp
FileSaveAfterViterbi=false
MaxIterNumVTB=10
OutputDirVTB=.\result\output.vtbtrain
MaxIterNumFB=10
OutputDirFB=.\result\output.fbtrain

[format]
ChainNum=2
MonoPhnList=mono.list
InputHmmDir=.\result\output.fbtrain
InputDir=.\result\output.fbtrain
OutputDir=.\result\output.format
CHMMMapFile_W=hmm.map
CHMMPhysFile_W=hmm.phys
CHMMParamFile_W=hmm.param
TeeModel=sp
SilenceModel=sil

[gengv]
AudioFeatureDir=AudioFeature
VisualFeatureDir=VisualFeature
LabelFileList=feat.list
AudioVarOutputFile=.\result\output.gv\var_0.gv
VisualVarOutputFile=.\result\output.gv\var_1.gv
AudioVecSize=39
VisualVecSize=39
adaptCMSnVNWinSize=30000
CMS=12

[maptlist]
LogicTListFile=tlist.align
InputDir=.\
OutputDir=.\
MonoPhnList=mono.list
TeeModel=sp
CHMMMapFile=hmm.map
CHMMPhysFile=hmm.phys
CHMMParamFile=hmm.param
PhysTListFile=tlist.align.logic
GroupSize=1

[emtrain]
ChainNum = 2
GlobalVarFile0=.\result\output.gv\var_0.gv
GlobalVarFile1=.\result\output.gv\var_1.gv
AFeatureRootDir=AudioFeature
VFeatureRootDir=VisualFeature
PruneInit=800
PruneLimit=2000
PruneInc=1000
MinTrainSeg=1
UpdateMode=15
VarFloorCoeff=1e-04
CHMMPhysFile=.\result\output.format\hmm.phys
CHMMParamFile=hmm.param
CHMMMapFile=.\result\output.format\hmm.map
PhysTListFile=tlist.align.logic
TotalOcctFile=occt.total
adaptCMSnVNWinSize=30000
CMS=12
StateOrModelOcct_B=true
MinMixOcc=3.0
NewHMMParamFile=hmm.param

[emtrain.mono11]
InputDir=.\result\output.format
OutputDir=.\result\output.mono_hmm.11

[emtrain.mono12]
InputDir=.\result\output.mono_hmm.11
OutputDir=.\result\output.mono_hmm.12

[emtrain.mono13]
InputDir=.\result\output.mono_hmm.12
OutputDir=.\result\output.mono_hmm.13

[mixup.mono14]
CHMMParamFile=hmm.param
CHMMPhysFile=.\result\output.format\hmm.phys
CHMMMapFile=.\result\output.format\hmm.map
OcctFile=.\result\output.mono_hmm.13\occt.total
InputDir=.\result\output.mono_hmm.13
OutputDir=.\result\output.hmm_mixup14
MixNum=2
MinMixOcc=3.0
PertDepth=0.1
VarFloor=1.0e-7
SplitPenalty=1e+08

[emtrain.mono21]
InputDir=.\result\output.hmm_mixup14
OutputDir=.\result\output.mono_hmm.21

[emtrain.mono22]
InputDir=.\result\output.mono_hmm.21
OutputDir=.\result\output.mono_hmm.22

[emtrain.mono23]
InputDir=.\result\output.mono_hmm.22
OutputDir=.\result\output.mono_hmm.23
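The config fragments above suggest that a run such as 'emtrain -config av.cfg -group mono -pass mono11' combines the shared [emtrain] section with the per-pass section [emtrain.mono11]. The sketch below reproduces that reading with Python's standard configparser module. It is an interpretation, not the tools' actual parser: the function name is hypothetical, the merge order (pass section overrides tool section) is assumed, and the role of -group is not modelled.

```python
import configparser


def load_pass_settings(cfg_text, tool, pass_name):
    """Merge the [tool] section with the [tool.pass_name] section.

    Values from the pass section override those from the tool section
    (assumed precedence). All values are returned as strings.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of keys such as InputDir
    parser.read_string(cfg_text)

    settings = dict(parser[tool]) if parser.has_section(tool) else {}
    pass_section = f"{tool}.{pass_name}"
    if parser.has_section(pass_section):
        settings.update(parser[pass_section])
    return settings
```

Read this way, each pass only needs to name its own InputDir and OutputDir, which is how the example chains the output directory of one pass into the input directory of the next.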
 

Note that each tool needs some reference files to run properly. Examples of these files are also included in our package.
 


 

3.8  CHMMDecoder

The decoding (recognition) components are packed in the CHMMDecoder.dll, which provides functions for speech recognition using the CHMM parameter file obtained from the training stage. The functions in CHMMDecoder.dll are as follows.

  int chmm_decoder_init(char *cfgFileName)
  Function description: Initialize the decoder with the config file.
  Input:   cfgFileName:    config file name
  Return:  0, initialization successful
           1, initialization failed and error messages are output to the console

  int chmm_decoder_rec(float *mfcc, float *vmfc, int nAVecSize, int nVVecSize, int nFrame, char *Result, int ResLen)
  Function description: Main decoding procedure.
  Input:   mfcc:           input audio feature data
           vmfc:           input visual feature data
           nAVecSize:      vector size of the audio feature
           nVVecSize:      vector size of the visual feature
           nFrame:         number of frames
           Result:         decoding result in the format 'word [startframe endframe]'
           ResLen:         buffer size of Result, to prevent memory overflow
  Return:  0, decoding successful, result stored in the Result buffer

  void chmm_decoder_set_weight(float audioW, float VisualW)
  Function description: Set the decoding weights for the audio and visual features.
  Input:   audioW:         audio feature weight in decoding
           VisualW:        visual feature weight in decoding
  Return:  none

  void chmm_decoder_free()
  Function description: Free the resources.
  Return:  none
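The call sequence for chmm_decoder_rec can be sketched from Python with the standard ctypes module. This is a hedged illustration only: the helper name recognize is hypothetical, the features are assumed to be stored as one flat float array, frame after frame, any non-zero return value is treated as a failure, and the result text is assumed to be ASCII. None of these details are stated in this manual.

```python
import ctypes


def recognize(lib, audio, video, n_frames, a_vec_size, v_vec_size, buf_len=4096):
    """Decode one utterance and return the text left in the Result buffer.

    lib is a loaded and initialized decoder library handle, for example
    ctypes.CDLL("CHMMDecoder.dll") after chmm_decoder_init(b"<config file>")
    has returned 0 (the calling convention is an assumption).
    audio and video are flat float sequences, frame after frame (assumed).
    """
    if len(audio) != n_frames * a_vec_size or len(video) != n_frames * v_vec_size:
        raise ValueError("feature length does not match frame count and vector size")

    mfcc = (ctypes.c_float * len(audio))(*audio)   # float *mfcc
    vmfc = (ctypes.c_float * len(video))(*video)   # float *vmfc
    result = ctypes.create_string_buffer(buf_len)  # char *Result, ResLen bytes

    status = lib.chmm_decoder_rec(
        mfcc, vmfc, a_vec_size, v_vec_size, n_frames, result, buf_len
    )
    if status != 0:
        raise RuntimeError(f"chmm_decoder_rec returned {status}")
    return result.value.decode("ascii", errors="replace")
```

Passing the buffer size together with the buffer matches the ResLen argument, which exists to keep the decoder from writing past the end of Result. After the last utterance, chmm_decoder_free() releases the decoder resources.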

The config file used by chmm_decoder_init sets the decoding parameters, which are described below:

  REFS_PATH String, directory.
Directory where the CHMM parameter files and reference files are placed.
   
  REC_OutputPath String, directory.
Directory to output decoding result.
  REC_SilenceModel String.
Transcript for silence model.
  REC_TeeModel String.
Transcript for tee model.
  REC_StatePruneBeam Float.
Beam width for state and model level pruning.
  REC_WordEndBeam Float.
Beam width for word end pruning.
  REC_OutputType String
Output result file type.
  REC_NBest Integer.
The best hypothesis number.
  REC_NBestBeam Float.
Beam width for best hypothesis search.
  REC_NToken Integer.
Size of the stack of tokens.
  REC_PhonePenalty Float.
The penalty value that is added to token probability after decoding of a phone.
  REC_MonoListFile String, file name.
Input monophone list file name (path from REFS_PATH directory)
  REC_HMMMapFile String, file name.
Input CHMM mapping file name (path from REFS_PATH directory)
  REC_HMMParamFile String, file name.
Input CHMM parameter file name (path from REFS_PATH directory)
  REC_HMMPhysFile String, file name.
Input CHMM physical file name (path from REFS_PATH directory)
  PRELOAD_NETFILE0 String, file name.
Input file where the word lattice is placed (path from REFS_PATH directory)
  PRELOAD_NETNUM Integer.
Number of the lattice network loaded.
  TRN_WordDictFile String, file name.
Input word dictionary file (path from REFS_PATH directory).
  REC_ChainNum Integer.
Number of streams.
  SEG_LEN Integer.
Cache segment length in coupled feature.
  REC_TokenLimit Integer.
If the number of tokens exceeds this value, the soft token number pruning is activated.
  REC_BeamFactor Float.
If the soft token number pruning is active at the beginning of each input frame, the value of REC_StatePruneBeam is multiplied by REC_BeamFactor. This feature helps to accelerate decoding of poor speech fragments.
  REC_AUDIO_CEPSTRUMNUM Integer.
CMS length for the audio feature.
  REC_VIDEO_CEPSTRUMNUM Integer.
CMS length for the visual feature (0 by default).
  REC_AUDIO_ADAPTCMSWIN Integer.
adaptCMSWinSize for the audio feature.
  REC_VIDEO_ADAPTCMSWIN Integer.
adaptCMSWinSize for the visual feature (0 by default).
  REC_AUDIO_WEIGHT Float.
Decoding weight of the audio feature.
  REC_VIDEO_WEIGHT Float.
Decoding weight of the visual feature.
  REC_OUTPUT_SCORE_FORMAT Integer.
Indicate whether or not to output .score result file.

Examples of the config and reference files are included in the package.

 


 

4.  Example

An example of CHMM training is also included. Suppose the working directory is "C:\AvcsrDemo\bin\ChmmTrain", users need to copy the example directories "audiofeature", "visualfeature" and "label" to the current working directory ("C:\AvcsrDemo\bin\ChmmTrain"). 

The directory "C:\AvcsrDemo\bin\ChmmTrain" contains the executable tools and the reference files, whose functionalities can be found in this document.

"emtrain.exe", "format.exe", "gengv.exe", "maptlist.exe", "mixup.exe", "monotrain.exe": tools for CHMM AVSR training;

"tlist.align", "feat.list", "lab.list", "mono.list", "tlist.align.logic": reference files.

"av.cfg": config file for CHMM AVSR training

"train_av.bat": batch file to run under MS-DOS

The directory "AudioFeature" holds the audio features, "VisualFeature" holds the visual features, and "Label" holds the label files. Examples of the feature and label files are given in these directories.

After running the "train_av.bat", a new directory named "result" will be created in this directory ("C:\AvcsrDemo\bin\ChmmTrain"), which will contain the result model "hmm.param". Depending on the config file, the model is placed in the corresponding directories. Please refer to the config file for more details (in this example, the final CHMM parameter model file "hmm.param" is placed in "result\output.mono_hmm.66" from the current working directory, and the "hmm.map" and "hmm.phys" can be found in "result\output.format" directory).

 


 

References

[Nefian02] A. V. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao. A coupled hidden Markov model for audio-visual speech recognition. In International Conference on Acoustics, Speech and Signal Processing, 2002.

[Jensen98] Finn V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited, London, UK, 1998.

[Young95] S. Young et al. The HTK Book. Entropic Cambridge Research Laboratory, Cambridge, UK, 1995.

 
