Coupled Hidden Markov Model (CHMM) in AVSR
1. Fundamental Concepts of Coupled Hidden Markov Model (CHMM)
2. Model Training and Recognition
3. Running Environment and File Formats
3.1 monotrain
3.2 gengv
3.3 format
3.4 maptlist
3.5 emtrain
3.6 mixup
3.7 Config file example
3.8 CHMMDecoder
4. Example
References
1. Fundamental Concepts of Coupled Hidden Markov Model (CHMM)
A CHMM can be seen as a collection of Hidden Markov Models (HMMs), one for each data stream, where the hidden backbone nodes at time t for each HMM are conditioned on the backbone nodes at time t-1 of all the related HMMs. The following figure illustrates a continuous-mixture two-stream coupled HMM in our audio-visual speech recognition system. The squares represent the hidden discrete nodes (backbone and mixture nodes), while the circles represent the continuous observable nodes. Unlike the independent HMMs used for audio-visual data, the CHMM can capture the interactions between the audio and video streams through the transition probabilities between the backbone nodes. The CHMM-based audio-visual continuous speech recognition (AVCSR) system is an extension of the decision fusion system at the phone level. The CHMM can model the audio-visual state asynchrony and preserve the natural audio-visual dependencies over time.
Figure 1: the audio-visual coupled HMM
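Following the formulation in [Nefian02], the parameters of a CHMM can be defined as below:
$$\pi_0^c(i) = P(q_0^c = i)$$
$$b_t^c(i) = P(O_t^c \mid q_t^c = i)$$
$$a^c_{i \mid j,k} = P(q_t^c = i \mid q_{t-1}^0 = j,\ q_{t-1}^1 = k)$$
where $q_t^c$ is the state of the couple node in the cth stream at time t. For a continuous mixture with Gaussian components, the observation probabilities are
$$b_t^c(i) = \sum_{m=1}^{M_i^c} w_{i,m}^c \, N(O_t^c;\ \mu_{i,m}^c, U_{i,m}^c)$$
where $\mu_{i,m}^c$ and $U_{i,m}^c$ are the mean and covariance matrix of the mth mixture component of state i in the cth stream, $M_i^c$ is the number of mixture components of that state, $w_{i,m}^c = P(s_t^c = m \mid q_t^c = i)$ is the mixture weight,
and $s_t^c$ is the component of the mixture node in the cth stream at time t.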
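The coupling between streams can be illustrated with a small sketch. It assumes the usual CHMM factorization, in which the joint transition probability of the two backbone nodes is the product of the per-stream transition probabilities, each conditioned on the previous states of both streams. The table layout `a[c][j][k][i]`, the function name, and the toy numbers are all hypothetical and are not taken from the toolkit.

```python
from itertools import product

def joint_transition(a, prev, nxt):
    """Joint transition probability of a two-stream CHMM (illustrative only).

    a[c][j][k][i] is assumed to hold P(q_t^c = i | q_{t-1}^0 = j, q_{t-1}^1 = k).
    prev and nxt are (audio_state, visual_state) pairs.
    """
    j, k = prev
    p = 1.0
    for c, i in enumerate(nxt):
        p *= a[c][j][k][i]
    return p

# Toy tables with two states per stream (made-up values).
# Stream 0 is more likely to stay in state 0 when both streams agree.
a0 = [[[0.9, 0.1] if j == k else [0.6, 0.4] for k in range(2)] for j in range(2)]
# Stream 1 ignores the previous states in this toy example.
a1 = [[[0.7, 0.3] for k in range(2)] for j in range(2)]
a = [a0, a1]

p_example = joint_transition(a, (0, 1), (1, 0))  # 0.4 * 0.7
total = sum(joint_transition(a, (0, 1), nxt)
            for nxt in product(range(2), repeat=2))
```

Because each per-stream row sums to one, the joint probabilities over all next-state pairs also sum to one, which is a quick sanity check for hand-built tables.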
2. Model Training and Recognition
The training of the CHMM parameters is performed in two stages. In the first stage, the CHMM parameters are estimated for isolated phoneme-viseme pairs. The parameters of the isolated phoneme-viseme CHMMs are first estimated using the Viterbi-based initialization described in [Nefian02], followed by the expectation-maximization (EM) algorithm [Jensen98]. In the second stage, the parameters of the CHMMs, estimated individually in the first stage, are refined through the embedded training of all CHMMs. In a way similar to the embedded training for HMMs [Young95], each of the models obtained in the first stage is extended with one entry and one exit non-emitting state.
The audio-visual speech recognition is carried out via a graph decoder applied to the word network consisting of all the words in the test dictionary. Each word in the network is stored as a sequence of phoneme-viseme CHMMs, and the best sequence of words is obtained through an extension of the token passing algorithm [Young95].
3. Running Environment and File Formats
The training process of the CHMM involves the programs listed below, each followed by a short description.
monotrain.exe: Initializes the CHMM parameters of isolated phoneme-viseme pairs using the Viterbi algorithm and the EM algorithm.
gengv.exe: Generates the global variance file for the training data.
format.exe: Binds all individual monophone models into a CHMM model set for embedded training.
maptlist.exe: Converts a logical transcript file to a physical transcript using the CHMM mapping file.
emtrain.exe: Performs embedded forward-backward model re-estimation and writes statistics (occupation) files.
mixup.exe: Splits mixtures.
Each executable needs one config file and other relevant files to function properly. The command line parameters and reference files of each program are described in detail below.
3.1 monotrain
Command line: monotrain -config avsr.cfg
Parameters:
Parameter | Type | Description
ChainNum | Integer | Number of streams.
LabelRootDir | String, directory | Root directory for all label files.
AFeatureRootDir | String, directory | Root directory for all audio feature files.
VFeatureRootDir | String, directory | Root directory for all visual feature files.
MonoPhnList | String, input file | Monophone list file name.
LabelFileList | String, input file | Label list file. Contains the names of the label files used.
AVecSize | Integer | Vector size of the audio feature.
VVecSize | Integer | Vector size of the visual feature.
CMS | Integer | If the value (the number of cepstrum coefficients) is positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if 0, no input feature transformation is done.
adaptCMSnVNWinSize | Integer | Window size for adaptive CMS and VN on the audio feature. If greater than 0, adaptive CMS and VN are used; if less than 0, adaptive CMS and VN are not used; if equal to 0, utterance-based CMS and VN are used.
LabelFileSuffix | String | Suffix of label files.
LabelFileListType | Integer | List type of label files.
VarFloor | Float | Variance floor value.
WgtFloor | Float | Mixture weight floor value.
MinSegNum | Integer | Minimum number of segments used to train the model.
MaxSegNum | Integer | Maximum number of segments used to train the model. If the training corpus has more segments, only MaxSegNum segments are used.
LabCoefficient | Float | Used to convert the original label times to labels in frames.
AudioFeatureFileSuffix | String | Suffix of audio feature files.
VideoFeatureFileSuffix | String | Suffix of visual feature files.
Viterbi | Boolean | If true, use the Viterbi algorithm to initialize segmentation; if false, do not.
Forward | Boolean | If true, use the forward-backward algorithm to train the model; if false, do not.
TeeModel | String | Transcription of the tee model.
FileSaveAfterViterbi | Boolean | If true, save models after the initial Viterbi iteration; if false, do not.
MaxIterNumVTB | Integer | Maximum number of iterations of Viterbi re-estimation.
OutputDirVTB | String, directory | Output directory for models created by Viterbi training; also the input directory for the forward-backward training stage.
MaxIterNumFB | Integer | Maximum number of iterations of forward-backward re-estimation.
OutputDirFB | String, directory | Output directory for models created by the forward-backward algorithm.
Discussion
The monotrain tool is used to initialize isolated CHMMs using the Viterbi algorithm and the forward-backward algorithm. Label and feature files are the input data. A label file consists of a number of lines, each of the form:
start end phone
where 'start' is the segment start time, 'end' is the segment end time, and 'phone' is the monophone corresponding to the segment. The time is measured in 1e-7 seconds, and the segment times are converted to frames using the parameter 'LabCoefficient' (for a 16 kHz wave file at 100 frames per second, 'LabCoefficient' = 160). The initial and final frame numbers N of a segment are obtained from the label times and 'LabCoefficient'.
A fragment of a label file follows:
0 48 sil
48 70 z
70 88 ih
88 91 r
91 113 ow
113 137 sp
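The label format and the role of LabCoefficient can be sketched as follows. This is an illustration under assumptions, not the tool's code: it assumes the frame number is the label time divided by LabCoefficient and truncated to an integer, which is consistent with the 160-samples-per-frame example above but is not confirmed by the manual. The function name is hypothetical.

```python
def parse_label_lines(lines, lab_coefficient=1.0):
    """Parse 'start end phone' lines into (start_frame, end_frame, phone).

    Assumption: frame = int(label_time / lab_coefficient).
    Lines that do not have exactly three fields are skipped.
    """
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        start, end, phone = float(parts[0]), float(parts[1]), parts[2]
        segments.append((int(start / lab_coefficient),
                         int(end / lab_coefficient),
                         phone))
    return segments

# Labels already expressed in frames (LabCoefficient = 1.0).
frames = parse_label_lines(["0 48 sil", "48 70 z", "70 88 ih"])

# Labels expressed in samples of a 16 kHz file at 100 frames per second.
from_samples = parse_label_lines(["0 7680 sil"], lab_coefficient=160)
```

With LabCoefficient = 1.0 the label values pass through unchanged, which matches a label file whose times are already frame indices.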
The monophone list file lists the transcriptions of the monophones, one per line, for example:
ah
ao
ax
The label list file contains the list of label files used, in the following format:
{
LabelFileName1
LabelFileName2
}
A simple Gaussian distribution with a diagonal covariance matrix is used to compute probabilities in the CHMM. A simple distribution means that there is only one Gaussian in a mixture and that its weight is equal to 1. For each state and for each segment, the mean vector of the mixture component, the covariance matrix of the mixture component, and the transition probabilities are initialized. All parameters are initialized by uniform segmentation.
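A minimal sketch of uniform-segmentation initialization with a single diagonal Gaussian per state is given below. It handles one segment of one stream only and uses hypothetical function names; how monotrain pools statistics over segments and streams is not described here, so this should be read as an illustration of the idea rather than the tool's procedure.

```python
import math

def uniform_segmentation_init(frames, num_states, var_floor=1.0e-7):
    """Split the frames evenly across states; return [(mean, var), ...].

    frames is a list of equal-length feature vectors. Variances are
    floored at var_floor (playing the role of a variance floor).
    """
    n = len(frames)
    if n < num_states:
        raise ValueError("need at least one frame per state")
    dim = len(frames[0])
    params = []
    for s in range(num_states):
        chunk = frames[s * n // num_states:(s + 1) * n // num_states]
        mean = [sum(f[d] for f in chunk) / len(chunk) for d in range(dim)]
        var = [max(sum((f[d] - mean[d]) ** 2 for f in chunk) / len(chunk),
                   var_floor)
               for d in range(dim)]
        params.append((mean, var))
    return params

def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

params = uniform_segmentation_init([[0.0], [2.0], [10.0], [14.0]], num_states=2)
```

In this toy case the first two frames go to state 0 and the last two to state 1, so each state starts from the mean and variance of its own half of the segment.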
The segments are read from the feature files. The Viterbi component of the monotrain tool searches the input data for up to MaxSegNum segments of each monophone. If fewer than MinSegNum segments are found for a given monophone, initialization is canceled and no output files are created.
The next steps are the Viterbi state realignment and the selection of the best likelihood path for each segment. The Viterbi component of the monotrain tool executes up to MaxIterNumVTB iterations. However, it may terminate earlier if the change in the best likelihood per frame becomes relatively insignificant.
The Viterbi part of the monotrain tool creates two files: a binary file with the extension *.hmm and a text file with the extension *.txt. The number of segments found for each monophone in the MonoPhnList must not be smaller than MinSegNum.
The forward-backward component of the monotrain tool trains the CHMM of each monophone using the forward-backward algorithm. The input data for this component are the CHMMs produced by the Viterbi part of the tool and the feature files. It operates on the same segments as the Viterbi part and re-estimates the parameters with the Baum-Welch algorithm.
The forward-backward component produces CHMM files for all monophones in the MonoPhnList except the 'tee' model, which is not trained at this stage. The format tool creates the CHMM for the 'tee' model later.
The forward-backward component of the monotrain tool executes up to MaxIterNumFB iterations of Baum-Welch re-estimation. However, it may terminate earlier if the change in the likelihood per frame becomes relatively insignificant.
Like the Viterbi part, the forward-backward component of the monotrain tool creates two files, a binary file with the extension *.hmm and a text file with the extension *.txt, for every monophone in the MonoPhnList except the 'tee' model.
The CMS and adaptCMSnVNWinSize parameters apply to the audio feature only.
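Utterance-based cepstral mean subtraction can be sketched as below. The sketch assumes that a positive CMS value gives the number of leading coefficients in each audio frame that are mean-normalized and that the remaining coefficients are left untouched; the manual does not spell this out, so treat both the assumption and the function name as hypothetical. Variance normalization and the adaptive (windowed) variant are not shown.

```python
def utterance_cms(frames, num_cepstra):
    """Subtract the per-utterance mean from the first num_cepstra coefficients.

    frames is a list of equal-length feature vectors for one utterance.
    Coefficients at index >= num_cepstra are returned unchanged.
    """
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(num_cepstra)]
    return [[x - means[d] if d < num_cepstra else x
             for d, x in enumerate(f)]
            for f in frames]

normalized = utterance_cms([[1.0, 5.0], [3.0, 7.0]], num_cepstra=1)
```

Here only the first coefficient is centered, so its mean over the utterance becomes zero while the second coefficient keeps its original values.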
3.2 gengv
Command line: gengv -config avsr.cfg
Parameters:
Parameter | Type | Description
AudioFeatureDir | String, directory | Root directory of all audio feature files.
VisualFeatureDir | String, directory | Root directory of all visual feature files.
LabelFileList | String, file name | Label file name list with starting and ending frame index.
AudioVarOutputFile | String, file name | Output file name of the audio feature variance.
VisualVarOutputFile | String, file name | Output file name of the visual feature variance.
AudioVecSize | Integer | Vector size of the audio feature.
VisualVecSize | Integer | Vector size of the visual feature.
CMS | Integer | If the value (the number of cepstrum coefficients) is positive, Cepstrum Mean Subtraction is performed; if negative, Cepstrum Variance Normalization; if 0, no input feature transformation is done.
adaptCMSnVNWinSize | Integer | Window size for adaptive CMS and VN. If greater than 0, adaptive CMS and VN are used; if less than 0, adaptive CMS and VN are not used; if equal to 0, utterance-based CMS and VN are used.
Discussion
The gengv tool generates the global variance file for the training data.
The input data are feature files with the extension *.mfcc, stored in the corresponding directories and listed in the file given by the parameter LabelFileList. Each LabelFileList entry has the format:
filename (without extension), start frame~end frame
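The command gengv produces the global variance vector computed by the formula:
$$\sigma^2 = \frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \left(x_{i,t} - \mu\right)^2, \qquad \mu = \frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \sum_{t=1}^{T_i} x_{i,t}$$
where $x_{i,t}$ is the t-th frame of the i-th feature file, $N$ is the number of feature files, and $T_i$ is the number of frames in the i-th feature file. The square is taken element-wise.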
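A global variance computation of this kind can be sketched as follows. The sketch assumes the biased estimator (division by the total number of frames) and a single global mean over all files; the exact estimator used by gengv is an assumption here. The helper names are hypothetical, and the entry parser follows the "filename, start~end" format described above.

```python
def parse_range_entry(line):
    """Parse 'filename, start~end' into (filename, start, end)."""
    name, frame_range = line.split(",")
    start, end = frame_range.strip().split("~")
    return name.strip(), int(start), int(end)

def global_variance(files):
    """Per-dimension variance over all frames of all files.

    files is a list of per-file frame lists, each frame a list of floats,
    already restricted to the frame range of interest.
    """
    dim = len(files[0][0])
    total = sum(len(frames) for frames in files)
    mean = [sum(f[d] for frames in files for f in frames) / total
            for d in range(dim)]
    return [sum((f[d] - mean[d]) ** 2 for frames in files for f in frames) / total
            for d in range(dim)]

entry = parse_range_entry("trainData/003_1_1to2, 0~1189")
variance = global_variance([[[1.0], [3.0]], [[5.0], [7.0]]])
```

In the toy input the four one-dimensional frames have mean 4, so the variance is (9 + 1 + 1 + 9) / 4 = 5.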
3.3 format
Command line: format -config avsr.cfg
Parameters:
Parameter | Type | Description
ChainNum | Integer | Number of streams.
MonoPhnList | String, file name | Monophone list file name.
InputDir | String, directory | Root directory of the monophone models produced by the monotrain tool.
OutputDir | String, directory | Directory of the output triphone model mapping, structure and parameter files.
CHMMMapFile_W | String, file name | Output CHMM mapping file name.
CHMMPhysFile_W | String, file name | Output CHMM physical file name.
CHMMParamFile_W | String, file name | Output CHMM parameter file name.
TeeModel | String | Transcription of the tee model.
SilenceModel | String | Transcription of the silence model.
Discussion
The format tool produces the CHMM mapping, structure and parameter files (*.map, *.phys and *.param) that are used for embedded training and decoding. It builds these model files from the individual monophone models (files with the extension *.hmm) produced by the monotrain tool. The middle state of the 'silence' CHMM is used to initialize the CHMM that is added for the 'sp' model.
3.4 maptlist
Command line: maptlist -config avsr.cfg
Parameters:
Parameter | Type | Description
MonoPhnList | String, file name | Monophone list file name.
InputDir | String, directory | Directory in which the CHMM mapping and parameter files are placed.
OutputDir | String, directory | Directory for the output physical transcript and CHMM physical files.
CHMMMapFile | String, file name | Input monophone CHMM mapping file name.
CHMMPhysFile | String, file name | Output CHMM physical file name (only for decision tree mapping). If the CHMM mapping file is a decision tree, the physical CHMM file must be rebuilt with the help of the original CHMM parameter file and the decision tree. If the CHMM mapping file is not a decision tree, the original physical CHMM file must remain unchanged.
CHMMParamFile | String, file name | Input monophone CHMM parameter file name.
TeeModel | String | Transcription of the tee model.
LogicTListFile | String, file name | Input logical transcript file.
PhysTListFile | String, file name | Output physical transcript file name.
GroupSize | Integer | Defines the number of tasks in each section of the output task list file.
Discussion
The maptlist tool transforms the logical task list file LogicTListFile, which contains the transcriptions of the utterances, into the physical task list file PhysTListFile used by the emtrain tool during embedded training. For each training utterance, the logical task list contains the monophone or triphone transcription of the utterance, the feature file name, and the frame range.
A fragment of a logical monophone task list follows:
{
[transcript] sil z ih s r ah w ao ow ah t uw ah
trainData/003_1_1to2, 0~1189
}
The physical task list has the same structure, except that the names are replaced by the indices of the corresponding CHMMs in the hmm.phys file.
A fragment of a physical task list follows:
{
[transcript] 16 22 9 14 13 0 21 1 12 0 17 19 0
trainData/003_1_1to2, 0~1189
}
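The logical-to-physical conversion for the one-to-one mapping case can be sketched as a simple name-to-index lookup. The sketch assumes that a model's index is its position in a list of physical model names; the actual layout of hmm.phys and hmm.map is not described here, and the function names and the three-model list are hypothetical.

```python
def to_physical(transcript, phys_names):
    """Replace each model name by its index in phys_names."""
    index = {name: i for i, name in enumerate(phys_names)}
    return [index[name] for name in transcript.split()]

def physical_transcript_line(transcript, phys_names):
    """Format one '[transcript]' line of a physical task list."""
    indices = to_physical(transcript, phys_names)
    return "[transcript] " + " ".join(str(i) for i in indices)

phys_names = ["ah", "ao", "ax"]  # hypothetical order of physical models
line = physical_transcript_line("ah ao ah ax", phys_names)
```

A name that is missing from the list raises a KeyError, which makes transcript and model-set mismatches visible early.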
The hmm.map file can be of two types: DECISIONTREE or ONE2ONEMAPPING. If the ONE2ONEMAPPING type is used, the file hmm.phys already exists. If the DECISIONTREE type is used, the file hmm.phys is built together with the physical task list.
3.5 emtrain
Command line: emtrain -config avsr.cfg -group mono -pass mono11
-group and -pass indicate the right parameters in the config file.
Parameters:
Parameter | Type | Description
ChainNum | Integer | Number of streams.
GlobalVarFile0 | String, file name | Input global variance file of the audio feature (the result of the gengv tool).
GlobalVarFile1 | String, file name | Input global variance file of the visual feature (the result of the gengv tool).
AFeatureRootDir | String, directory | Root directory of the audio feature files for embedded training.
VFeatureRootDir | String, directory | Root directory of the visual feature files for embedded training.
PruneInit | Float | Initial value of the beam width for pruning in the backward calculation.
PruneLimit | Float | Upper limit of the backward beam width. Once this value is exceeded, the current utterance is not used for training.
PruneInc | Float | Beam width increment at every pruning failure.
MinTrainSeg | Integer | Minimal number of phone examples required to update the corresponding CHMM parameters.
UpdateMode | Integer | Update mode, 4-bit encoded (0 = disable, 1 = enable): bit 0, transition matrix; bit 1, Gaussian weights; bit 2, mean vectors; bit 3, variance vectors.
VarFloorCoeff | Float | Variance floor value.
CHMMPhysFile | String, file name | Input CHMM physical file name (path from the working directory).
CHMMParamFile | String, file name | Input CHMM parameter file name (path from the InputDir directory).
CHMMMapFile | String, file name | Input CHMM mapping file name (path from the working directory).
PhysTListFile | String, file name | Input physical list file. Contains the list of feature files used for training and the physical transcription.
TotalOcctFile | String, file name | Output file name of the statistics file.
adaptCMSnVNWinSize | Integer | Window size for adaptive CMS and VN on the audio feature. If greater than 0, adaptive CMS and VN are used; if less than 0, adaptive CMS and VN are not used; if equal to 0, utterance-based CMS and VN are used.
CMS | Integer | If the value (the number of cepstrum coefficients) is positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if 0, no input feature transformation is done.
StateOrModelOcct_B | Boolean | Defines the type of the statistics (occupation) file. If true, the state-oriented statistics file is generated; this file is used by the mixture merging and splitting commands. If false, the model-oriented statistics file is generated; this file is used by the decision and regression tree building tools. Currently StateOrModelOcct_B can only be set to true.
MinMixOcc | Float | Minimal CHMM state mixture occupation required to update the parameters of its Gaussian mixture.
NewHMMParamFile | String, file name | Output new CHMM parameter file name (path from the OutputDir directory).
InputDir | String, directory | Directory in which the CHMM parameter file is placed.
OutputDir | String, directory | Directory for the output CHMM parameter file and statistics files.
Discussion
The emtrain tool performs a training iteration of the input CHMM model set on a given set of input feature files. The training procedure is based on the expectation-maximization (EM) method, an iterative method for maximum likelihood (ML) estimation with incompletely observed data. The EM method consists of an expectation step (E-step), which obtains the Baum auxiliary function, and a maximization step (M-step). The Baum-Welch (forward-backward, F-B) algorithm is used. The tool applies the F-B algorithm to each feature file and collects accumulators and occupation counters for each CHMM state in the model, which are then used to re-estimate the CHMM parameter values. If the number of training examples of a CHMM is less than the value of the parameter MinTrainSeg, the parameters of that model are not updated. The mean and variance vectors whose occupation counter is less than the value of the parameter MinMixOcc are not updated either.
The parameter UpdateMode tells the emtrain tool which parameters need to be updated. Beam pruning is used to make the F-B algorithm faster. The initial value of the beam width is set by the parameter PruneInit. If either the forward or the backward calculation fails, the beam width is repeatedly increased by the value of the parameter PruneInc until it exceeds the value of the parameter PruneLimit. If pruning still fails at the maximum beam width, the feature file is not used for training during this iteration.
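The UpdateMode bit field can be decoded as in the sketch below. It assumes that "bit 0" refers to the least significant bit of the integer, which is the usual convention but is not stated explicitly in the parameter description; the names in the code are hypothetical.

```python
# Bit positions as listed for the UpdateMode parameter.
UPDATE_FLAGS = {
    0: "transition matrix",
    1: "Gaussian weights",
    2: "mean vectors",
    3: "variance vectors",
}

def decode_update_mode(mode):
    """Return the names of the parameter groups enabled in UpdateMode."""
    return [name for bit, name in UPDATE_FLAGS.items() if mode & (1 << bit)]

all_enabled = decode_update_mode(15)   # binary 1111
means_and_vars = decode_update_mode(12)  # binary 1100
```

Under this reading, a value of 15 enables all four parameter groups, while 12 restricts the update to mean and variance vectors.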
The CMS and adaptCMSnVNWinSize parameters apply to the audio feature only.
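The beam-widening retry governed by PruneInit, PruneInc and PruneLimit can be sketched as a small loop. The callback `forward_backward` is a hypothetical stand-in for one pruned F-B pass that returns None on pruning failure; whether emtrain still tries a beam exactly equal to PruneLimit is an assumption of this sketch.

```python
def run_with_beam(forward_backward, prune_init, prune_inc, prune_limit):
    """Retry a pruned pass with a wider beam until it succeeds.

    Returns (beam, result) on success, or None if the beam width
    exceeds prune_limit, in which case the utterance is skipped.
    """
    beam = prune_init
    while beam <= prune_limit:
        result = forward_backward(beam)
        if result is not None:
            return beam, result
        beam += prune_inc
    return None

def fake_pass(min_beam):
    """Build a stand-in pass that succeeds only for beam >= min_beam."""
    return lambda beam: "ok" if beam >= min_beam else None

succeeded = run_with_beam(fake_pass(1500), 800, 1000, 2000)
skipped = run_with_beam(fake_pass(1900), 800, 1000, 2000)
```

With an initial width of 800, an increment of 1000 and a limit of 2000, the widths tried are 800 and 1800; the next width, 2800, exceeds the limit, so an utterance that needs more than 1800 is skipped.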
3.6 mixup
Command line: mixup -config avsr.cfg -group mono -pass mono14
-group and -pass indicate the right parameters in the config file.
Parameters:
Parameter | Type | Description
CHMMParamFile | String, file name | Input and output CHMM parameter file name (path from the InputDir and OutputDir directories respectively).
CHMMPhysFile | String, file name | Input CHMM physical file name (path from the working directory).
CHMMMapFile | String, file name | Input CHMM mapping file name (path from the working directory).
OcctFile | String, file name | Input statistics file name (path from the InputDir directory).
InputDir | String, directory | Directory with the input CHMM parameter files.
OutputDir | String, directory | Directory for the output CHMM parameter files.
MixNum | Integer | The number of components into which the mixtures are split.
MinMixOcc | Float | Minimal CHMM state occupation required to split mixture components.
PertDepth | Float | Perturbation depth of the means of split Gaussians. A Gaussian with mean vector m and variance vector v is split into two new Gaussians whose means are m+PertDepth*v and m-PertDepth*v; the variances are unchanged, i.e., equal to v.
VarFloor | Float | Variance floor value.
SplitPenalty | Float | Gaussian split penalty. Each time a Gaussian is split, this value is added to prevent it from being split too many times. This parameter is usually set to a very large value.
Discussion
The mixup tool increases the number of Gaussians in all mixtures to the MixNum value by successively splitting the most extended Gaussians.
An accurate approximation of a continuous probability density function requires increasing the number of mixture components at the CHMM training stage. This program increases the number of components in all Gaussian mixtures of the CHMM by splitting the components with the largest determinant of the variance matrix. For a diagonal variance matrix with variance vector v, the determinant is
$$\det(V) = \prod_{i=1}^{N} v_i$$
where N is the vector length.
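One splitting step can be sketched as follows, using the perturbation rule given for PertDepth (means m ± PertDepth*v, variances unchanged). Halving the weight of the split component is an assumption of this sketch, and the occupation threshold (MinMixOcc) and SplitPenalty are not modeled. The determinant of a diagonal variance matrix is compared through its logarithm for numerical safety; all names are hypothetical.

```python
import math

def log_det(var):
    """Log-determinant of a diagonal variance matrix."""
    return sum(math.log(v) for v in var)

def split_widest(mixture, pert_depth=0.1):
    """Split the component with the largest determinant into two.

    mixture is a list of (weight, mean, var) tuples. The weight of the
    split component is shared equally (assumption, not from the manual).
    """
    idx = max(range(len(mixture)), key=lambda i: log_det(mixture[i][2]))
    w, m, v = mixture[idx]
    plus = (w / 2, [mi + pert_depth * vi for mi, vi in zip(m, v)], list(v))
    minus = (w / 2, [mi - pert_depth * vi for mi, vi in zip(m, v)], list(v))
    return mixture[:idx] + [plus, minus] + mixture[idx + 1:]

mixture = [(0.5, [0.0, 2.0], [1.0, 4.0]), (0.5, [5.0, 5.0], [0.5, 0.5])]
split = split_widest(mixture, pert_depth=0.1)
```

Repeating this step until every mixture has MixNum components would give the "successive splitting" behavior described above.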
3.7 Config file example
To help users run CHMM training more easily, we provide an example batch file that runs the tools described above in sequence. Fragments of the batch file follow:
gengv -config av.cfg
monotrain -config av.cfg
format -config av.cfg
emtrain -config av.cfg -group mono -pass mono11
emtrain -config av.cfg -group mono -pass mono12
emtrain -config av.cfg -group mono -pass mono13
mixup -config av.cfg -group mono -pass mono14
emtrain -config av.cfg -group mono -pass mono21
emtrain -config av.cfg -group mono -pass mono22
emtrain -config av.cfg -group mono -pass mono23
Fragments of the corresponding config file 'av.cfg' follow:
[monotrain]
ChainNum=2
LabelRootDir=label
AFeatureRootDir=AudioFeature
VFeatureRootDir=VisualFeature
MonoPhnList=mono.list
LabelFileList=lab.list
AVecSize=39
VVecSize=39
LabelFileListType=2
CMS=12
adaptCMSnVNWinSize=30000
LabelFileSuffix=label
VarFloor=1.0e-7
WgtFloor=1.0e-5
MinSegNum=1
MaxSegNum=500
LabCoefficient=1.0
AudioFeatureFileSuffix=mfcc
VideoFeatureFileSuffix=mfcc
Viterbi=true
Forward=true
TeeModel=sp
FileSaveAfterViterbi=false
MaxIterNumVTB=10
OutputDirVTB=.\result\output.vtbtrain
MaxIterNumFB=10
OutputDirFB=.\result\output.fbtrain
[format]
ChainNum=2
MonoPhnList=mono.list
InputHmmDir=.\result\output.fbtrain
InputDir=.\result\output.fbtrain
OutputDir=.\result\output.format
CHMMMapFile_W=hmm.map
CHMMPhysFile_W=hmm.phys
CHMMParamFile_W=hmm.param
TeeModel=sp
SilenceModel=sil
[gengv]
AudioFeatureDir=AudioFeature
VisualFeatureDir=VisualFeature
LabelFileList=feat.list
AudioVarOutputFile=.\result\output.gv\var_0.gv
VisualVarOutputFile=.\result\output.gv\var_1.gv
AudioVecSize=39
VisualVecSize=39
adaptCMSnVNWinSize=30000
CMS=12
[maptlist]
LogicTListFile=tlist.align
InputDir=.\
OutputDir=.\
MonoPhnList=mono.list
TeeModel=sp
CHMMMapFile=hmm.map
CHMMPhysFile=hmm.phys
CHMMParamFile=hmm.param
PhysTListFile=tlist.align.logic
GroupSize=1
[emtrain]
ChainNum = 2
GlobalVarFile0=.\result\output.gv\var_0.gv
GlobalVarFile1=.\result\output.gv\var_1.gv
AFeatureRootDir=AudioFeature
VFeatureRootDir=VisualFeature
PruneInit=800
PruneLimit=2000
PruneInc=1000
MinTrainSeg=1
UpdateMode=15
VarFloorCoeff=1e-04
CHMMPhysFile=.\result\output.format\hmm.phys
CHMMParamFile=hmm.param
CHMMMapFile=.\result\output.format\hmm.map
PhysTListFile=tlist.align.logic
TotalOcctFile=occt.total
adaptCMSnVNWinSize=30000
CMS=12
StateOrModelOcct_B=true
MinMixOcc=3.0
NewHMMParamFile=hmm.param
[emtrain.mono11]
InputDir=.\result\output.format
OutputDir=.\result\output.mono_hmm.11
[emtrain.mono12]
InputDir=.\result\output.mono_hmm.11
OutputDir=.\result\output.mono_hmm.12
[emtrain.mono13]
InputDir=.\result\output.mono_hmm.12
OutputDir=.\result\output.mono_hmm.13
[mixup.mono14]
CHMMParamFile=hmm.param
CHMMPhysFile=.\result\output.format\hmm.phys
CHMMMapFile=.\result\output.format\hmm.map
OcctFile=.\result\output.mono_hmm.13\occt.total
InputDir=.\result\output.mono_hmm.13
OutputDir=.\result\output.hmm_mixup14
MixNum=2
MinMixOcc=3.0
PertDepth=0.1
VarFloor=1.0e-7
SplitPenalty=1e+08
[emtrain.mono21]
InputDir=.\result\output.hmm_mixup14
OutputDir=.\result\output.mono_hmm.21
[emtrain.mono22]
InputDir=.\result\output.mono_hmm.21
OutputDir=.\result\output.mono_hmm.22
[emtrain.mono23]
InputDir=.\result\output.mono_hmm.22
OutputDir=.\result\output.mono_hmm.23
Note that each tool needs some reference files to run properly. Examples of these files are also included in our package.
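The way a pass-specific section such as [emtrain.mono11] appears to complement the base [emtrain] section can be sketched with Python's standard configparser. The merge rule used here (pass-specific values are added to, and override, the base section) is an assumption inferred from the config fragments above, and the role of -group is not modeled. The function name is hypothetical.

```python
import configparser

def resolve_params(cfg_text, tool, pass_name=None):
    """Merge the [tool] section with the optional [tool.pass_name] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(cfg_text)
    params = dict(parser[tool]) if parser.has_section(tool) else {}
    section = "{}.{}".format(tool, pass_name)
    if pass_name and parser.has_section(section):
        params.update(parser[section])
    return params

cfg_text = r"""
[emtrain]
ChainNum = 2
CHMMParamFile=hmm.param
[emtrain.mono11]
InputDir=.\result\output.format
OutputDir=.\result\output.mono_hmm.11
"""

params = resolve_params(cfg_text, "emtrain", "mono11")
```

The same helper applied to "mixup" and "mono14" would return only the [mixup.mono14] values, since the fragments above show no base [mixup] section.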
3.8 CHMMDecoder
The decoding (recognition) components are packed in CHMMDecoder.dll, which provides functions for speech recognition using the CHMM parameter file obtained from the training stage. The functions in CHMMDecoder.dll are as follows.
int chmm_decoder_init(char *cfgFileName)
Function description: Initializes the decoder with the config file.
Input: cfgFileName: config file name
Return: 0 if initialization was successful; 1 if initialization failed (error messages are written to the console)
int chmm_decoder_rec(float *mfcc, float *vmfc, int nAVecSize, int nVVecSize, int nFrame, char *Result, int ResLen)
Function description: Main decoding procedure.
Input: mfcc: input audio feature data
vmfc: input visual feature data
nAVecSize: vector size of the audio feature
nVVecSize: vector size of the visual feature
nFrame: number of frames
Result: decoding result in the format 'word [startframe endframe]'
ResLen: buffer size of Result, to prevent memory overflow
Return: 0 if decoding was successful; the result is stored in the Result buffer
void chmm_decoder_set_weight(float audioW, float VisualW)
Function description: Sets the decoding weights for the audio and visual features.
Input: audioW: audio feature weight in decoding
VisualW: visual feature weight in decoding
Return: none
void chmm_decoder_free()
Function description: Frees the resources.
Return: none
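A possible way to call these functions from Python through ctypes is sketched below. Only the function signatures come from the list above; the calling convention (cdecl), the frame-major contiguous layout of the feature buffers, the result buffer size, the text encoding of the result, and the treatment of any non-zero return from chmm_decoder_rec as a failure are all assumptions. The helper names are hypothetical, and the DLL path must be supplied by the caller.

```python
import ctypes

def bind_decoder(dll_path):
    """Load the DLL and declare the documented signatures (cdecl assumed)."""
    lib = ctypes.CDLL(dll_path)
    lib.chmm_decoder_init.argtypes = [ctypes.c_char_p]
    lib.chmm_decoder_init.restype = ctypes.c_int
    lib.chmm_decoder_rec.argtypes = [
        ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_float),
        ctypes.c_int, ctypes.c_int, ctypes.c_int,
        ctypes.c_char_p, ctypes.c_int,
    ]
    lib.chmm_decoder_rec.restype = ctypes.c_int
    lib.chmm_decoder_set_weight.argtypes = [ctypes.c_float, ctypes.c_float]
    lib.chmm_decoder_set_weight.restype = None
    lib.chmm_decoder_free.argtypes = []
    lib.chmm_decoder_free.restype = None
    return lib

def to_float_array(frames):
    """Flatten a list of frames into a contiguous C float array."""
    flat = [value for frame in frames for value in frame]
    return (ctypes.c_float * len(flat))(*flat)

def recognize(lib, cfg_path, audio, video, result_len=4096):
    """Run one utterance through the decoder and return the result text."""
    if lib.chmm_decoder_init(cfg_path.encode()) != 0:
        raise RuntimeError("decoder initialization failed")
    try:
        result = ctypes.create_string_buffer(result_len)
        status = lib.chmm_decoder_rec(
            to_float_array(audio), to_float_array(video),
            len(audio[0]), len(video[0]), len(audio),
            result, result_len)
        if status != 0:
            raise RuntimeError("decoding failed with status %d" % status)
        return result.value.decode()
    finally:
        lib.chmm_decoder_free()

example_buffer = to_float_array([[1.0, 2.0], [3.0, 4.0]])
```

A call would look like `recognize(bind_decoder(path_to_dll), "decoder.cfg", audio_frames, video_frames)`, where the config file name is a placeholder and both feature lists must have the same number of frames.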
The config file used by chmm_decoder_init sets the decoding parameters, which are described below:
Parameter | Type | Description
REFS_PATH | String, directory | Directory where the CHMM parameter files and reference files are placed.
REC_OutputPath | String, directory | Directory for the decoding result output.
REC_SilenceModel | String | Transcript for the silence model.
REC_TeeModel | String | Transcript for the tee model.
REC_StatePruneBeam | Float | Beam width for state and model level pruning.
REC_WordEndBeam | Float | Beam width for word end pruning.
REC_OutputType | String | Output result file type.
REC_NBest | Integer | Number of best hypotheses.
REC_NBestBeam | Float | Beam width for the best hypothesis search.
REC_NToken | Integer | Size of the stack of tokens.
REC_PhonePenalty | Float | The penalty value that is added to the token probability after decoding a phone.
REC_MonoListFile | String, file name | Input monophone list file name (path from the REFS_PATH directory).
REC_HMMMapFile | String, file name | Input CHMM mapping file name (path from the REFS_PATH directory).
REC_HMMParamFile | String, file name | Input CHMM parameter file name (path from the REFS_PATH directory).
REC_HMMPhysFile | String, file name | Input CHMM physical file name (path from the REFS_PATH directory).
PRELOAD_NETFILE0 | String, file name | Input file where the word lattice is placed (path from the REFS_PATH directory).
PRELOAD_NETNUM | Integer | Number of lattice networks loaded.
TRN_WordDictFile | String, file name | Input word dictionary file (path from the REFS_PATH directory).
REC_ChainNum | Integer | Number of streams.
SEG_LEN | Integer | Cache segment length in the coupled feature.
REC_TokenLimit | Integer | If the number of tokens exceeds this value, the soft token number pruning is activated.
REC_BeamFactor | Float | If the soft token number pruning is active at the beginning of an input frame, the value of REC_StatePruneBeam is multiplied by REC_BeamFactor. This feature helps to accelerate the decoding of poor speech fragments.
REC_AUDIO_CEPSTRUMNUM | Integer | CMS length for the audio feature.
REC_VIDEO_CEPSTRUMNUM | Integer | CMS length for the visual feature (0 by default).
REC_AUDIO_ADAPTCMSWIN | Integer | adaptCMSWinSize for the audio feature.
REC_VIDEO_ADAPTCMSWIN | Integer | adaptCMSWinSize for the visual feature (0 by default).
REC_AUDIO_WEIGHT | Float | Decoding weight of the audio feature.
REC_VIDEO_WEIGHT | Float | Decoding weight of the visual feature.
REC_OUTPUT_SCORE_FORMAT | Integer | Indicates whether or not to output the .score result file.
Examples of the config and reference files are included in the package.
4. Example
An example of CHMM training is also included. Suppose the working directory is "C:\AvcsrDemo\bin\ChmmTrain"; users need to copy the example directories "audiofeature", "visualfeature" and "label" to this working directory.
The directory "C:\AvcsrDemo\bin\ChmmTrain" contains the executable tools and the reference files, whose functionalities can be found in this document.
"emtrain.exe", "format.exe", "gengv.exe", "maptlist.exe", "mixup.exe", "monotrain.exe": tools for CHMM AVSR training;
"tlist.align", "feat.list", "lab.list", "mono.list", "tlist.align.logic": reference files.
"av.cfg": config file for CHMM AVSR training
"train_av.bat": batch file to run under MS-DOS
The directory "AudioFeature" holds the audio feature files, "VisualFeature" holds the visual feature files, and "Label" holds the label files. Examples of the feature and label files are given in these directories.
After running the "train_av.bat", a new directory named "result" will be created in this directory ("C:\AvcsrDemo\bin\ChmmTrain"), which will contain the result model "hmm.param". Depending on the config file, the model is placed in the corresponding directories. Please refer to the config file for more details (in this example, the final CHMM parameter model file "hmm.param" is placed in "result\output.mono_hmm.66" from the current working directory, and the "hmm.map" and "hmm.phys" can be found in "result\output.format" directory).
References
[Nefian02] A. V. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao. A coupled hidden Markov model for audio-visual speech recognition. In International Conference on Acoustics, Speech and Signal Processing, 2002.
[Jensen98] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited, London, UK, 1998.
[Young95] S. Young et al. The HTK Book. Entropic Cambridge Research Laboratory, Cambridge, UK, 1995.