Hidden Markov Model (HMM) in ASR
1. Model Training and Recognition
2. Running Environment and File Formats
2.1 monotrain
2.2 gengv
2.3 format
2.4 maptlist
2.5 emtrain
2.6 bldlogtlist
2.7 clone
2.8 sbuild
2.9 smerge
2.10 savmodel
2.11 gaussmrg
2.12 mixup
2.13 Config file example
2.14 AHMMDecoder and VHMMDecoder
3. Example
References
1. Model Training and Recognition
The fundamental principles of HMMs in ASR (Automatic Speech Recognition) are covered in Rabiner's book [Rab93], Jelinek's book [Jel97] and Lee's book [Lee96], and are therefore omitted from this document.
This package contains tools for acoustic model training and speech recognition testing. The acoustic modeling units are sub-units of word pronunciations, called phonemes (or monophones). A sentence (a string of words) can be represented as a sequence of monophones concatenated from each word pronunciation. To model the speech co-articulation effect, context-dependent triphones are used: each triphone is a phoneme considered together with its left and right neighboring phonemes.
The acoustic training flowchart is shown below. It includes three sub-stages: monophone training, context-triphone training and clustered-triphone training. Each training stage is detailed in the following paragraphs.
Figure 1. Training flowchart
The speech recognizer is a Viterbi search program to find the best state sequence that matches the given speech utterance. The recognizer requires the following items: 1. An acoustic model to match the acoustics; 2. A language model to match the syntax and semantics; 3. A word pronunciation dictionary to organize HMM models during search. A diagram of the testing stage is shown in the figure below. The output of the recognizer can be either the transcription of the speech, or a word-graph (or both).
Figure 2. Testing flowchart
Next we will present the manual for the tools for HMM training and testing.
2. Running Environment and File Formats
The training process of HMM involves the programs listed below, followed by a short description.
monotrain.exe: Initializes HMM parameters of isolated phoneme-viseme pairs using the Viterbi algorithm and the EM algorithm.
gengv.exe: Generates the global variance file for the training data.
format.exe: Binds all individual monophone models into an HMM model set for embedded training.
maptlist.exe: Converts a logical transcript file to a physical transcript file using the HMM mapping file.
emtrain.exe: Provides embedded forward-backward model re-estimation and dumps out statistics (occupation) files.
bldlogtlist.exe: Builds the logical task list with triphone transcription.
clone.exe: Clones monophone models into triphone models.
sbuild.exe: Builds decision trees for monophone states.
smerge.exe: Merges leaves of decision trees for monophone states.
savmodel.exe: Builds the model set according to the decision trees.
gaussmrg.exe: Merges poorly used components of Gaussian mixtures.
mixup.exe: Splits mixture components.
Each executable program needs one config file and other relevant files to function properly. The command line parameters and reference files for each program are described in detail below.
2.1 monotrain
Command line parameters: | monotrain -config asr.cfg |
Parameters: | |
LabelRootDir | String, directory. |
Root directory for all label files. | |
FeatureRootDir | String, directory. |
Root directory for all feature files. | |
MonoPhnList | String, input file. |
Monophone list file name. | |
LabelFileList | String, input file. |
Label list file. Contains the list of the names of label files used. | |
VecSize | Integer. |
Vector size of feature. | |
CMS | Integer. |
If the number of Cepstrum coefficients is positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if = 0, no input feature transformation should be done. | |
adaptCMSnVNWinSize | Integer. |
Window Size for adaptive CMS and VN on the audio feature. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN. | |
LabelFileSuffix | String, suffix of label files |
LabelFileListType | Integer, list type of label files. |
VarFloor | Float |
Variance floor value. | |
WgtFloor | Float. |
Mixture weight floor value. | |
MinSegNum | Integer. |
Minimum segment number used to train the model. | |
MaxSegNum | Integer. |
Maximum segment number used to train the model. If training corpus has more segments, use only MaxSegNum segments. | |
LabCoefficient | Float. |
Used to convert original label to the label in frame. | |
FeatureFileSuffix | String, suffix of feature files. |
Viterbi | Boolean. |
If true, use Viterbi algorithm to initialize segmentation; if false, otherwise. | |
Forward | Boolean. |
If true, use Forward-Backward algorithm to train the model; if false, otherwise. | |
TeeModel | String, transcription of Tee model. |
FileSaveAfterViterbi | Boolean. |
If true, save models after the initial Viterbi iteration; if false, otherwise. | |
MaxIterNumVTB | Integer. |
Maximum iteration number of Viterbi re-estimation. | |
OutputDirVTB | String, directory. |
Output model directory for models created after running through Viterbi training and input directory for the forward-backward training stage. | |
MaxIterNumFB | Integer. |
Maximum iteration number of forward-backward re-estimation. | |
OutputDirFB | String, directory. |
Output model directory for models created after running through the forward-backward algorithm. |
Discussion
The monotrain tool initializes isolated HMMs using the Viterbi algorithm and the forward-backward algorithm. Label and feature files are the input data. A label file consists of a number of lines, each of the form:
start end phone
where 'start' is the segment start time, 'end' is the segment end time, and 'phone' is the monophone corresponding to the segment. The time is measured in units of 1e-7 seconds, and the segment time is converted to frames using the parameter 'LabCoefficient' (for a 16 kHz wave file at 100 frames per second, 'LabCoefficient' = 160). The initial and final frame numbers N of each segment are derived from 'start' and 'end' in this way.
Below is a fragment of a label file:
0 48 sil
48 70 z
70 88 ih
88 91 r
91 113 ow
113 137 sp
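The label format can be read with a few lines of code. The sketch below is only an illustration: the function name parse_label_lines is hypothetical, and the conversion rule (label value divided by LabCoefficient, then truncated) is an assumption, since this manual does not state the exact rounding used by monotrain.

```python
def parse_label_lines(lines, lab_coefficient=1.0):
    """Parse 'start end phone' lines into (start_frame, end_frame, phone) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, phone = float(fields[0]), float(fields[1]), fields[2]
        # Assumption: frame index = label value / LabCoefficient, truncated.
        segments.append((int(start / lab_coefficient),
                         int(end / lab_coefficient),
                         phone))
    return segments


fragment = ["0 48 sil", "48 70 z", "70 88 ih", "88 91 r", "91 113 ow", "113 137 sp"]
segments = parse_label_lines(fragment)
```

With lab_coefficient=160, a label value of 7680 would map to frame 48 under the same assumption.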
The monophone list file lists the monophone transcriptions, one per line, in the following format:
ah
ao
ax
The label list file contains the list of label files used, in the following format:
{
LabelFileName1
LabelFileName2
}
A simple Gaussian distribution with a diagonal covariance matrix is used to compute probabilities in the HMM. 'Simple' means that each mixture contains only one Gaussian, whose weight is equal to 1. For each state and each segment, the mean vector and covariance matrix of the mixture component and the transition probabilities are initialized. All parameters are initialized by uniform segmentation.
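Uniform segmentation can be sketched as follows. This is a minimal illustration, not the monotrain implementation: the function name is hypothetical, each segment is assumed to contain at least as many frames as there are states, and the transition probabilities are left out.

```python
def uniform_segmentation_init(segments, num_states):
    """Estimate one diagonal Gaussian per state from uniformly split segments.

    segments: list of segments, each a list of equal-length frame vectors.
    Returns (means, variances), each a list of num_states vectors.
    """
    dim = len(segments[0][0])
    sums = [[0.0] * dim for _ in range(num_states)]
    sq_sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frames in segments:
        n = len(frames)
        for t, frame in enumerate(frames):
            # Frame t of n is assigned to one of num_states equal-length parts.
            state = t * num_states // n
            counts[state] += 1
            for d, x in enumerate(frame):
                sums[state][d] += x
                sq_sums[state][d] += x * x
    means = [[s / c for s in row] for row, c in zip(sums, counts)]
    variances = [[q / c - m * m for q, m in zip(q_row, m_row)]
                 for q_row, m_row, c in zip(sq_sums, means, counts)]
    return means, variances


one_segment = [[0.0], [2.0], [10.0], [12.0], [20.0], [22.0]]
means, variances = uniform_segmentation_init([one_segment], num_states=3)
```

In a real tool the resulting variances would also be floored (see VarFloor) before use.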
The segments are read from the feature files. The Viterbi component of the monotrain tool searches the input data for up to MaxSegNum segments. If fewer segments than MinSegNum are found for a given monophone, initialization is canceled and no output files are created.
The next steps are the Viterbi state realignment and the best likelihood path selection for each segment. The Viterbi component of the monotrain tool executes up to MaxIterNumVTB iterations. However, the program may terminate earlier if the change in the best likelihood per frame becomes relatively insignificant.
The Viterbi part of the monotrain tool creates two files: a binary file with the extension *.hmm and a text file with the extension *.txt. The number of segments found for every monophone in the MonoPhnList must not be smaller than MinSegNum.
The forward-backward component of the monotrain tool initializes the HMMs using the forward-backward algorithm for each monophone. The input data for that component are the HMMs produced by the Viterbi part of the tool and the feature files. It operates on the same segments and uses the same Baum-Welch re-estimation algorithm as the Viterbi part.
The forward-backward component produces HMM files for all monophones from the MonoPhnList except the 'tee' model, which is not trained at this stage. The format tool creates the HMM for the 'tee' model later.
The forward-backward component of the monotrain tool executes up to MaxIterNumFB iterations of Baum-Welch re-estimation. However, the program may terminate earlier if the change in the likelihood per frame becomes relatively insignificant.
Like the Viterbi part, the forward-backward component of the monotrain tool creates two files, a binary file with the extension *.hmm and a text file with the extension *.txt, for all monophones from the list MonoPhnList except the 'tee' model.
2.2 gengv
Command line parameters: | gengv -config asr.cfg |
Parameters: | |
FeatureDir | String, directory. |
Root directory of all feature files. | |
LabelFileList | String, file name. |
Label file name list with starting and ending frame index. | |
OutputFile | String, file name. |
Output file name of feature variance. | |
Discussion
The gengv tool generates the global variance file for the training data.
The input data are feature files with the extension *.mfcc, stored in the corresponding directories and listed in the file given by the parameter LabelFileList. Each line of LabelFileList has the following format:
filename (without extension), start frame~end frame
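The gengv command produces the global variance vector computed by the formula:
$$ v = \frac{1}{\sum_{i=1}^{M} T_i} \sum_{i=1}^{M} \sum_{t=1}^{T_i} \left( x_{it} - \mu \right)^2, \qquad \mu = \frac{1}{\sum_{i=1}^{M} T_i} \sum_{i=1}^{M} \sum_{t=1}^{T_i} x_{it} $$
where $x_{it}$ is the $t$-th frame of the $i$-th feature file, $M$ is the number of feature files, and $T_i$ is the number of frames in the $i$-th feature file. The square is taken element by element.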
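A global variance vector of this kind can be computed in a single pass over all frames. The sketch below is an illustration under assumptions: the function name is hypothetical, the features are already loaded as lists of frame vectors (reading *.mfcc files and applying the start and end frame ranges are left out), and the variance is normalized by the total frame count.

```python
def global_variance(utterances):
    """Per-dimension variance over all frames of all utterances.

    utterances: list of utterances, each a list of equal-length frame vectors.
    """
    dim = len(utterances[0][0])
    total = 0
    sums = [0.0] * dim
    sq_sums = [0.0] * dim
    for frames in utterances:
        for frame in frames:
            total += 1
            for d, x in enumerate(frame):
                sums[d] += x
                sq_sums[d] += x * x
    means = [s / total for s in sums]
    # Variance as E[x^2] - (E[x])^2, computed separately per dimension.
    return [q / total - m * m for q, m in zip(sq_sums, means)]


features = [
    [[1.0, 0.0], [3.0, 0.0]],   # first feature file, two frames
    [[5.0, 2.0], [7.0, 2.0]],   # second feature file, two frames
]
variance = global_variance(features)
```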
2.3 format
Command line parameters: | format -config asr.cfg |
Parameters: | |
MonoPhnList | String, file name. |
Monophone list file name. | |
InputDir | String, directory. |
Root directory of the monophone model produced by monotrain tool. | |
OutputDir | String, directory. |
Directory of the output triphone model mapping, structure and parameter files. | |
HMMMapFile_W | String, file name. |
Output HMM mapping file name. | |
HMMPhysFile_W | String, file name. |
Output HMM physical file name. | |
HMMParamFile_W | String, file name. |
Output HMM parameter file name. | |
TeeModel | String. |
Transcription of tee model. | |
SilenceModel | String. |
Transcription of silence model. | |
Discussion
The format tool produces the HMM mapping, structure and parameter files (*.map, *.phys and *.param) that are used for embedded training and decoding. It builds these model files from the individual monophone models (files with the extension *.hmm) produced by the monotrain tool. The middle state of the 'silence' HMM is used to initialize the HMM that is added for the 'sp' model.
2.4 maptlist
Command line parameters: | maptlist -config asr.cfg |
Parameters: | |
MonoPhnList | String, file name. |
Monophone list file name. | |
InputDir | String, directory. |
Directory in which the HMM mapping and parameter files are placed. | |
OutputDir | String, directory. |
Directory for output physical transcript and HMM physical files. | |
HMMMapFile | String, file name. |
Input monophone HMM mapping file name. | |
HMMPhysFile | String, file name. |
Output HMM physical file name (only for decision tree mapping). If the HMM mapping file is a decision tree, then the physical HMM file must be rebuilt with the help of the original HMM parameter file and the decision tree. If the HMM mapping file is not a decision tree, then the original physical HMM file must be unchanged. | |
HMMParamFile | String, file name. |
Input monophone HMM parameter file name. | |
TeeModel | String. |
Transcription of tee model. | |
LogicTListFile | String, file name. |
Input logical transcript file. | |
PhysTListFile | String, file name. |
Output physical transcript file name. | |
GroupSize | Integer. |
Defines the number of tasks in each section of the output task list file. |
Discussion
The maptlist tool transforms the logical task list file LogicTListFile, which contains the transcriptions of the utterances, into the physical task list file PhysTListFile used by the emtrain tool during embedded training. For each training utterance, the logical task list contains the monophone or triphone transcription of the utterance, the feature file name and the frame range.
Below is a fragment of a logical monophone task list:
{
[transcript] sil z ih s r ah w ao ow ah t uw ah
trainData/003_1_1to2, 0~1189
}
The physical task list has the same structure, except that the names are replaced by the indices of the corresponding HMMs in the hmm.phys file.
Below is a fragment of a physical task list:
{
[transcript] 16 22 9 14 13 0 21 1 12 0 17 19 0
trainData/003_1_1to2, 0~1189
}
The hmm.map file can be of two types: DECISIONTREE or ONE2ONEMAPPING. If ONE2ONEMAPPING is used, the file hmm.phys already exists. If the DECISIONTREE type is used, the file hmm.phys is built together with the physical task list.
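For the one-to-one case, the logical-to-physical conversion amounts to a table lookup. The sketch below illustrates this with the two task list fragments shown earlier; the function name and the dictionary are hypothetical, and the index values are simply read off those fragments rather than taken from a real hmm.phys file.

```python
def to_physical(transcript, phys_index):
    """Replace each logical HMM name by its index in the physical HMM list."""
    return [phys_index[name] for name in transcript.split()]


# Illustrative subset of a name -> index table (assumed, not a real hmm.phys).
phys_index = {"ah": 0, "ao": 1, "ih": 9, "ow": 12, "r": 13, "s": 14,
              "sil": 16, "t": 17, "uw": 19, "w": 21, "z": 22}

logical = "sil z ih s r ah w ao ow ah t uw ah"
physical = " ".join(str(i) for i in to_physical(logical, phys_index))
print("[transcript] " + physical)
```

The decision tree case is more involved, because the physical index of a triphone then depends on the answers to the phonetic questions stored in the tree.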
2.5 emtrain
Command line parameters: | emtrain -config asr.cfg -group mono -pass mono11 |
The -group and -pass options select the relevant parameters in the config file. | |
Parameters: | |
GlobalVarFile | String, file name. |
Input file name of global variance file of the feature (the result of gengv tool) | |
FeatureRootDir | String, directory. |
Root directory of the feature files for embedded training. | |
PruneInit | Float. |
Initial value of beam width for pruning in the backward calculation. | |
PruneLimit | Float. |
Upper limit of the backward beam width value. Once this number is exceeded, the current utterance is not used for training. | |
PruneInc | Float. |
Beam width value increment at every pruning failure. | |
MinTrainSeg | Integer. |
Minimal number of phone examples required to update the corresponding HMM parameters. | |
UpdateMode | Integer. |
Update mode (4-bit encoded; for each bit, 0 - disable, 1 - enable): | |
bit 0 transition matrix | |
bit 1 Gaussian weights | |
bit 2 mean vectors | |
bit 3 variance vectors. | |
VarFloorCoeff | Float. |
Variance floor value. | |
HMMPhysFile | String, file name. |
Input HMM physical file name (path from working directory) | |
HMMParamFile | String, file name. |
Input HMM parameter file name (path from InputDir directory) | |
HMMMapFile | String, file name. |
Input HMM mapping file name (path from working directory) | |
PhysTListFile | String, file name. |
Input physical list file. Contains the list of feature files used for training and physical transcription. | |
TotalOcctFile | String, file name. |
Output file name of the statistics file. | |
adaptCMSnVNWinSize | Integer. |
Window size for adaptive CMS and VN on the audio feature. If adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, utterance-based CMS and VN are used. | |
CMS | Integer. |
If the number of Cepstrum coefficients is positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if = 0, no input feature transformation should be done. | |
StateOrModelOcct_B | Boolean. |
Defines the type of the statistics (occupation) file. If StateOrModelOcct_B is true, the state-oriented statistics file is generated; this file is used by the mixture merging and splitting commands. If the value is false, the model-oriented statistics file is generated; this file is used by the decision and regression tree building tools (currently StateOrModelOcct_B can only be set to true). | |
MinMixOcc | Float. |
Minimal HMM state mixture occupation required to update parameters of its Gaussian mixture. | |
NewHMMParamFile | String, file name. |
Output new HMM parameter file name (path from the OutputDir directory). | |
InputDir | String, directory. |
Directory in which the HMM parameter file is placed. | |
OutputDir | String, directory. |
Directory for output HMM parameter file and statistics files. | |
Discussion
The emtrain tool performs a training iteration of the input HMM model set on a given set of input feature files. The training procedure is embedded training based on the expectation-maximization (EM) method, i.e., an iterative method for maximum likelihood (ML) estimation with incompletely observed data. The EM method consists of an expectation step (E-step), which obtains the Baum function, and a maximization step (M-step). The Baum-Welch (forward-backward, F-B) algorithm is used. The tool applies the F-B algorithm to each feature file and collects accumulators and occupation counters for each HMM state in the model, which are then used to re-estimate the HMM parameter values. If the number of examples of an HMM in the training data is less than the value of the parameter MinTrainSeg, the parameters of that model are not updated. The mean and variance vectors for which the occupation counter is less than the value of the parameter MinMixOcc are not updated either.
The parameter UpdateMode defines which parameters the emtrain tool updates. Beam pruning is used to make the F-B algorithm faster. The initial value of the beam width is set by the parameter PruneInit. If either the forward or the backward calculation fails, the beam width is repeatedly increased by the value of the parameter PruneInc until it exceeds the value of the parameter PruneLimit. If pruning fails at the maximum beam width, the feature file is not used for training during this iteration.
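Two of these parameters lend themselves to a small illustration. The sketch below decodes the UpdateMode bits and enumerates the beam widths tried for one utterance. Both function names are hypothetical, and treating PruneLimit as an inclusive upper bound is an assumption.

```python
def decode_update_mode(mode):
    """Map the 4-bit UpdateMode value to named on/off switches."""
    names = ["transitions", "weights", "means", "variances"]  # bits 0..3
    return {name: bool((mode >> bit) & 1) for bit, name in enumerate(names)}


def beam_schedule(prune_init, prune_inc, prune_limit):
    """Yield the beam widths tried in turn after each pruning failure."""
    beam = prune_init
    while beam <= prune_limit:  # assumption: the limit itself is still tried
        yield beam
        beam += prune_inc


switches = decode_update_mode(6)             # binary 0110
widths = list(beam_schedule(100, 50, 250))   # illustrative values only
```

An UpdateMode of 15 enables all four parameter types. If every width in the schedule fails, the utterance is skipped for the current iteration.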
2.6 bldlogtlist
Command line parameters: | bldlogtlist -config asr.cfg |
Parameters: | |
InputDir | String, directory. |
Directory with the input logical monophone task list. | |
OutputDir | String, directory. |
Directory for the output logical triphone task list and seen triphone list files. | |
MonoListFile | String, input file. |
Monophone list file name. | |
InputTaskListFile | String, input file. |
The name of the input task file with monophone transcription. | |
IsWithin | Boolean. |
Provides the within-word triphone extension if 'true', cross-word extension otherwise. | |
OutputSeenList | String, output file. |
The name of the output file that contains all seen triphone names. | |
OutputTList | String, output file. |
The name of output task list file with triphone transcription. |
Discussion
The bldlogtlist tool extends the monophone utterance transcription in the input task list file to a triphone transcription, either within-word or cross-word. The output logical task list is used by the maptlist tool for building the physical task list file. The seen triphone list file is used for model cloning by the clone tool and for building decision and regression trees.
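The expansion itself can be sketched in a few lines. The naming scheme 'left-center+right' used here is an assumption borrowed from common practice, as this manual does not show the triphone names that bldlogtlist writes; the function name is hypothetical, and special handling of silence models is left out.

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent names.

    Assumed naming: 'l-c+r', with the missing context dropped at the edges.
    """
    names = []
    for i, center in enumerate(phones):
        name = center
        if i > 0:
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names


triphones = to_triphones(["sil", "z", "ih"])
```

Calling the function once per word would give a within-word expansion, while calling it on the whole utterance would give a cross-word expansion.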
2.7 clone
Command line parameters: | clone -config asr.cfg |
Parameters: | |
InputDir | String, directory. |
Directory where the monophone model parameter file is placed. | |
OutputDir | String, directory. |
Directory for triphone model mapping, structure and parameter files. | |
MonoListFile | String, input file. |
Monophone list file name. | |
HMMMapFile | String, input file. |
Monophone HMM mapping file name. | |
HMMPhysFile | String, input file. |
Monophone HMM physical file name. | |
HMMParamFile | String, input file. |
Monophone HMM parameters file name. | |
TriPhoneList | String, input file. |
Seen triphone list file name. | |
HMMMapFile_W | String, output file. |
Triphone HMM mapping file name. | |
HMMPhysFile_W | String, output file. |
Triphone HMM physical file name. | |
HMMParamFile_W | String, output file. |
Triphone HMM parameters file name. | |
Discussion
The clone tool transforms context-independent (monophone) models into models dependent on the left and right neighboring phones (context-dependent or triphone models). For each seen triphone the monophone HMM for its central monophone is replicated.
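Cloning can be pictured as copying a parameter set under a new name. The sketch below assumes triphone names of the form 'l-c+r' (the naming is not specified in this manual) and represents a model as a plain dictionary; both function names are hypothetical.

```python
import copy


def central_phone(triphone):
    """Extract 'c' from an assumed 'l-c+r' name; either context may be absent."""
    return triphone.split("-")[-1].split("+")[0]


def clone_models(mono_models, seen_triphones):
    """Give every seen triphone its own copy of the central monophone model."""
    return {tri: copy.deepcopy(mono_models[central_phone(tri)])
            for tri in seen_triphones}


mono_models = {"z": {"mean": [1.0], "var": [2.0]}}
tri_models = clone_models(mono_models, ["sil-z+ih", "ah-z+uw"])
```

The deep copy matters: each triphone must own its parameters so that later embedded training can update them independently.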
2.8 sbuild
Command line parameters: | sbuild -config asr.cfg -group dcstree |
Parameters: | |
InputDir | String, directory. |
Directory with input HMM parameter file. | |
OutputDir | String, directory. |
Directory for output decision tree files (*.sbd). | |
TaskListFile | String, input file. |
Task list file name. This file contains the monophone names and state numbers for which decision trees are built. | |
MonoListFile | String, input file. |
Monophone list file name. | |
SeenHMMList | String, input file. |
Seen triphone list file name. Each line of this file contains a logical triphone name. | |
HMMMapFile | String, input file. |
HMM mapping file name. | |
HMMPhysFile | String, input file. |
HMM physical file name. | |
HMMParamFile | String, input file. |
HMM parameter file name. | |
QuestionListFile | String, input file. |
Name of the file with the phonetic question list. | |
TotalOcctFile | String, input file. |
Statistics file after the last iteration of embedded training. This file must have the model form. | |
TraceBuilding | Boolean. |
If 'true', the sequence of the chosen best questions is printed. | |
TresholdBuild | Float. |
Threshold value to stop cluster splitting. | |
Outlier | Float. |
Threshold value to prevent clusters with only one element. | |
Discussion
The sbuild tool clusters the triphone set for each state of each central monophone to improve HMM trainability. Decision trees may be built for each state of each monophone except the 'tee' model. The clustering algorithm selects the best sequence of phonetic questions about the left and right contexts to split the phones into phoneme varieties (clusters) with similar pronunciation. At each step, the best question for each cluster is selected according to its statistical characteristics, and the best cluster is then split according to its best question. Clustering produces a decision tree whose non-terminal nodes contain questions about the context and whose leaf nodes contain phone clusters. After clustering, all decision trees are saved in separate files with the file extension *.sbd.
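This manual does not state which statistic sbuild uses to rank questions. A common choice in decision tree state tying is the gain in log-likelihood when a cluster, modeled by one diagonal Gaussian, is split in two. The sketch below shows that criterion as an assumption, with hypothetical function names; each item carries the occupancy, mean and variance of one triphone state.

```python
import math


def pooled_stats(items):
    """Pool (occupancy, mean, variance) items into one diagonal Gaussian."""
    occ = sum(o for o, _, _ in items)
    dim = len(items[0][1])
    mean = [sum(o * m[d] for o, m, _ in items) / occ for d in range(dim)]
    var = [sum(o * (v[d] + m[d] ** 2) for o, m, v in items) / occ - mean[d] ** 2
           for d in range(dim)]
    return occ, mean, var


def cluster_loglik(items):
    """Approximate log-likelihood of a cluster under its pooled Gaussian."""
    occ, _, var = pooled_stats(items)
    dim = len(var)
    return -0.5 * occ * (dim * (1.0 + math.log(2.0 * math.pi))
                         + sum(math.log(v) for v in var))


def split_gain(yes_items, no_items):
    """Log-likelihood gain of splitting a cluster by one yes/no question."""
    return (cluster_loglik(yes_items) + cluster_loglik(no_items)
            - cluster_loglik(yes_items + no_items))


yes_side = [(10.0, [0.0], [1.0])]
no_side = [(10.0, [4.0], [1.0])]
gain = split_gain(yes_side, no_side)
```

Under this criterion, a question is only worth asking when the gain exceeds a threshold, which is the role a parameter such as TresholdBuild would play.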
2.9 smerge
Command line parameters: | smerge -config asr.cfg -group dcstree |
Parameters: | |
InputDir | String, directory. |
Directory with input HMM parameter file. | |
OutputDir | String, directory. |
Directory for input decision tree files (*.sbd) and for output decision tree files (*.smg). | |
TaskListFile | String, input file. |
Task list file name. This file contains the monophone names and state numbers for which decision trees were built. | |
MonoListFile | String, input file. |
Monophone list file name. | |
HMMMapFile | String, input file. |
HMM mapping file name. | |
HMMPhysFile | String, input file. |
HMM physical file name. | |
HMMParamFile | String, input file. |
HMM parameter file name. | |
QuestionListFile | String, input file. |
Name of the file with the phonetic question list. | |
TotalOcctFile | String, input file. |
Statistics file after the last iteration of embedded training. This file must have the model form. | |
TraceMerging | Boolean. |
If 'true', the sequence of leaf merges is printed. | |
TresholdMerge | Float. |
Threshold value to stop cluster merging. | |
Outlier | Float. |
Threshold value to prevent clusters with only one element. | |
Discussion
The smerge tool prunes and merges decision trees and selects typical states. The pruning step follows the reverse order of the decision tree building process, according to the threshold. Each pair of nodes that was split during the sbuild procedure is merged back into one node holding the previous phone cluster. This 'unbuild' process stops when the next node considered has a splitting improvement, i.e., 'splitting probability' minus 'total probability', that is not less than the threshold. At the next step, the most likely leaf nodes with a merging cost less than the threshold are merged together. Finally, a typical state is selected for each leaf node. Thus, the tool produces more complex decision trees, or cycle trees, which are saved in files with the file extension *.smg.
2.10 savmodel
Command line parameters: | savmodel -config asr.cfg -group dcstree |
Parameters: | |
InputDir | String, directory. |
Directory with input HMM parameter file. | |
OutputDir | String, directory. |
Output directory for the new HMM parameter file. | |
TaskListFile | String, input file. |
Task list file name with the names of all monophone states. | |
MonoListFile | String, input file. |
Monophone list file name. | |
SeenHMMList | String, input file. |
Seen triphone list file name. Each line of this file contains a logical triphone name. | |
HMMMapFile | String, input file. |
HMM mapping file name. | |
HMMPhysFile | String, input file. |
HMM physical file name. | |
HMMParamFile | String, input file. |
HMM parameter file name. | |
QuestionListFile | String, input file. |
Name of the file with the phonetic question list. | |
HMMMapFile_W | String, output file. |
New HMM mapping file. | |
HMMParamFile_W | String, output file. |
New HMM parameter file. | |
Discussion
The savmodel tool creates the HMM mapping and HMM parameter files for the clustered model using the decision trees. The simple linear HMM mapping file for context-dependent models is transformed into a tree-oriented HMM mapping file that contains decision tree nodes, a question list, etc.
After compressing and remapping the mixtures, the tool saves all HMM weights, mean vectors and variance vectors into an HMM parameter file. The tool produces a decision tree mapping mechanism that is useful for handling unseen triphones and for improving HMM trainability.
2.11 gaussmrg
Command line parameters: | gaussmrg -config asr.cfg -group clust -pass clust34 |
Parameters: | |
InputDir | String, directory. |
Directory with input HMM parameter file. | |
OutputDir | String, directory. |
Directory for the output HMM parameter file. | |
HMMParamFile | String, input file. |
Input HMM parameter file name. | |
MonoListFile | String, input file. |
Monophone list file name. | |
TotalOcctFile | String, input file. |
Statistics file for the input HMM parameter file. This file should have the model form. | |
MinMixOcc | Float. |
Minimal HMM state occupation to prevent Gaussians from merging. | |
DetLimit | Float. |
All Gaussian mixture components whose variance determinant is less than DetLimit are deleted, to prevent Gaussians that behave like delta functions in the HMMs of rare phones. | |
Discussion
The gaussmrg tool merges poorly used Gaussian mixture components according to the statistics file of an HMM training iteration.
When two mixture components are merged, the merging algorithm recalculates the weight w, the mean vector m and the variance vector v of the merged component from the weights, means and variances of the two source components, where N is the vector length and w = w1 + w2.
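The recalculation formulas for m and v are not reproduced in this text. A standard way to merge two weighted diagonal Gaussians is moment matching, which preserves the overall mean and variance of the pair. The sketch below shows that approach as an assumption about what gaussmrg does, with a hypothetical function name; only w = w1 + w2 comes from this manual.

```python
def merge_gaussians(w1, m1, v1, w2, m2, v2):
    """Moment-matching merge of two weighted diagonal Gaussians.

    m1, v1, m2, v2 are vectors of length N (mean and variance per dimension).
    """
    w = w1 + w2
    # Merged mean: weighted average of the two means.
    m = [(w1 * a + w2 * b) / w for a, b in zip(m1, m2)]
    # Merged variance: weighted second moment minus the squared merged mean.
    v = [(w1 * (va + a * a) + w2 * (vb + b * b)) / w - c * c
         for a, va, b, vb, c in zip(m1, v1, m2, v2, m)]
    return w, m, v


w, m, v = merge_gaussians(0.5, [0.0], [1.0], 0.5, [2.0], [1.0])
```

The merged variance is larger than either source variance whenever the two means differ, because it also has to cover the distance between them.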
2.12 mixup
Command line parameters: | mixup -config avsr.cfg -group mono -pass mono14 |
The -group and -pass options select the relevant parameters in the config file. | |
Parameters: | |
HMMParamFile | String, file name. |
Input and output HMM parameter file name (path from the InputDir and OutputDir respectively). | |
HMMPhysFile | String, file name. |
Input HMM physical file name (path from the working directory). | |
HMMMapFile | String, file name. |
Input HMM mapping file name (path from the working directory). | |
OcctFile | String, file name. |
Input statistics file name (path from the InputDir directory). | |
InputDir | String, directory. |
Directory with input HMM parameter files. | |
OutputDir | String, directory. |
Directory for output HMM parameter files. | |
MixNum | Integer. |
The number of components to which the mixtures are split. | |
MinMixOcc | Float. |
Minimal HMM state occupation to split mixture components. | |
PertDepth | Float. |
Perturb depth of the mean of split Gaussians. Split one Gaussian with mean m and variance v vectors to two new Gaussians whose means are m+PertDepth*v and m-PertDepth*v, variances are unchanged, i.e., equal to v. | |
VarFloor | Float. |
Variance floor value. | |
SplitPenalty | Float. |
Gaussian split penalty. Each time a Gaussian is split, this number is added to prevent the same Gaussian from being split too many times. This parameter is usually set to a very large value. | |
Discussion
The mixup tool increases the number of Gaussians in all mixtures to the MixNum value by successively splitting the most extended Gaussians.
An accurate approximation of the continuous probability density function requires an increase in the number of mixture densities or mixture coefficients at the HMM training stage. This program increases the number of components in all Gaussian mixtures of the HMMs by splitting the components with the largest determinant of the variance matrix. For a diagonal variance matrix with elements $v_i$, the determinant is the product of the variances:
$$ \det(V) = \prod_{i=1}^{N} v_i $$
where N is the vector length.
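One splitting step can be sketched as follows. The perturbation rule (means m + PertDepth*v and m - PertDepth*v, variances unchanged) follows the PertDepth description above. Halving the weight between the two new components is an assumption, the function names are hypothetical, and the MinMixOcc and SplitPenalty checks are left out.

```python
import math


def log_det(variances):
    """Log-determinant of a diagonal variance matrix."""
    return sum(math.log(v) for v in variances)


def split_widest(mixture, pert_depth):
    """Split the component with the largest variance determinant in two.

    mixture: list of (weight, mean_vector, variance_vector) tuples.
    """
    idx = max(range(len(mixture)), key=lambda k: log_det(mixture[k][2]))
    w, m, v = mixture[idx]
    # Assumption: each half receives half of the original weight.
    plus = (w / 2.0, [a + pert_depth * b for a, b in zip(m, v)], list(v))
    minus = (w / 2.0, [a - pert_depth * b for a, b in zip(m, v)], list(v))
    return mixture[:idx] + [plus, minus] + mixture[idx + 1:]


mixture = [(0.5, [0.0], [1.0]), (0.5, [10.0], [4.0])]
grown = split_widest(mixture, pert_depth=0.5)
```

Repeating this step until every mixture holds MixNum components gives the overall behavior described for the tool. The log-determinant is used only to avoid numerical underflow for long vectors.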
2.13 Config file example
To make HMM training easier to run, we provide an example batch file that calls the tools described above in sequence. Fragments of the batch file follow:
monotrain -config asr.cfg
format -config asr.cfg
gengv -config asr.cfg
maptlist -config asr.cfg -pass mono
emtrain -config asr.cfg -group mono
emtrain -config asr.cfg -group mono
emtrain -config asr.cfg -group mono
bldlogtlist -config asr.cfg
clone -config asr.cfg
maptlist -config asr.cfg -group context -pass context21
emtrain -config asr.cfg -group context -pass context22
sbuild -config asr.cfg -group dcstree
smerge -config asr.cfg -group dcstree
savmodel -config asr.cfg -group dcstree
maptlist -config asr.cfg -pass clust
emtrain -config asr.cfg -group clust -pass clust31
emtrain -config asr.cfg -group clust -pass clust32
gaussmrg -config asr.cfg -group clust -pass clust33
mixup -config asr.cfg -group clust -pass clust34
An example of the config file 'asr.cfg' is included in the package.
Note that each tool needs some reference files to run properly. Examples of these files are also included in the package.
2.14 AHMMDecoder and VHMMDecoder
The decoding (recognition) components are packaged in AHMMDecoder.dll, which provides functions for speech recognition on audio features using the HMM parameter file obtained from the training stage. The functions in AHMMDecoder.dll are as follows.
int ahmm_decoder_init( char *cfgfile ); | |
Function description | Initialize decoder with the config file. |
Input: | Config file name |
Return: | 0, initialization successful |
1, initialization failed and error messages output to console. | |
int ahmm_decoder_rec(float *data, int vecSize, int frameNum, char *result, int resLen); | |
Function description: | Main decoding procedure |
Input: | data: input feature data. |
vecSize: vector size of feature. | |
frameNum: number of frames. | |
resLen: size of the result buffer, to prevent memory overflow. | |
Output: | result: decoding result in the format 'word [startframe endframe]'. |
Return: | 0, decoding successful, result stored in the result buffer. |
void ahmm_decoder_free() | |
Function description: | Free the resources. |
Return: | None (void). |
The functions in VHMMDecoder.dll are as follows.
int vhmm_decoder_init( char *cfgfile ); | |
Function description | Initialize decoder with the config file. |
Input: | Config file name |
Return: | 0, initialization successful |
1, initialization failed and error messages output to console. | |
int vhmm_decoder_rec(float *data, int vecSize, int frameNum, char *result, int resLen); | |
Function description: | Main decoding procedure |
Input: | data: input feature data. |
vecSize: vector size of feature. | |
frameNum: number of frames. | |
resLen: size of the result buffer, to prevent memory overflow. | |
Output: | result: decoding result in the format 'word [startframe endframe]'. |
Return: | 0, decoding successful, result stored in the result buffer. |
void vhmm_decoder_free() | |
Function description: | Free the resources. |
Return: | None (void). |
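The result buffer returned by the decoding functions holds text in the format 'word [startframe endframe]'. A caller might turn it into structured data as sketched below. The function name is hypothetical, and the assumption that several words are separated by whitespace is not confirmed by this manual.

```python
import re

# One item: a word, then '[start end]' with integer frame numbers.
_ITEM = re.compile(r"(\S+)\s*\[\s*(\d+)\s+(\d+)\s*\]")


def parse_result(text):
    """Parse 'word [start end]' items into (word, start_frame, end_frame)."""
    return [(word, int(start), int(end))
            for word, start, end in _ITEM.findall(text)]


words = parse_result("zero [0 48] one [48 70]")
```

Text that does not match the pattern is ignored, so an empty or unexpected buffer yields an empty list rather than an error.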
The config file used by ahmm_decoder_init (vhmm_decoder_init) sets the decoding parameters, which are described below:
REFS_PATH | String, directory. |
Directory where the HMM parameter files and reference files are placed. | |
REC_SilenceModel | String. |
Transcript for silence model. | |
REC_TeeModel | String. |
Transcript for tee model. | |
REC_StatePruneBeam | Float. |
Beam width for state and model level pruning. | |
REC_WordEndBeam | Float. |
Beam width for word end pruning. | |
REC_OutputType | String. |
Output result file type. | |
REC_NBest | Integer. |
Number of best hypotheses (N-best). | |
REC_NBestBeam | Float. |
Beam width for best hypothesis search. | |
REC_NToken | Integer. |
Size of the stack of tokens. | |
REC_PhonePenalty | Float. |
The penalty value that is added to token probability after decoding of a phone. | |
REC_MonoListFile | String, file name. |
Input monophone list file name (path from REFS_PATH directory) | |
REC_HMMMapFile | String, file name. |
Input HMM mapping file name (path from REFS_PATH directory) | |
REC_HMMParamFile | String, file name. |
Input HMM parameter file name (path from REFS_PATH directory) | |
REC_HMMPhysFile | String, file name. |
Input HMM physical file name (path from REFS_PATH directory) | |
PRELOAD_NETFILE0 | String, file name. |
Input file where the word lattice is placed (path from REFS_PATH directory) | |
PRELOAD_NETNUM | Integer. |
Number of the lattice network loaded. | |
TRN_WordDictFile | String, file name. |
Input word dictionary file (path from REFS_PATH directory). | |
SEG_LEN | Integer. |
Cache segment length in the feature. | |
REC_TokenLimit | Integer. |
If the number of tokens exceeds this value, the soft token number pruning is activated. | |
REC_BeamFactor | Float. |
If the soft token number pruning is active at the beginning of each input frame, the value of REC_StatePruneBeam is multiplied by REC_BeamFactor. This feature helps to accelerate decoding of poor speech fragments. | |
REC_CepstrumNum | Integer. |
CMS length for the feature. | |
REC_ADAPTCMSWIN | Integer. |
adaptCMSWinSize for the feature. | |
REC_OUTPUT_SCORE_FORMAT | Integer. |
Indicates whether or not to output the .score result file. |
Examples of the config and reference files are included in the package.
3. Example
An example of HMM training is also included. Suppose the working directory is "C:\AvcsrDemo\bin\HmmTrain". Users need to copy the example directories "visualfeature" and "label" to this working directory.
Directory "C:\AvcsrDemo\bin\HmmTrain" contains the executable tools and the reference files, whose functions are described in this document:
"bldlogtlist.exe", "clone.exe", "emtrain.exe", "format.exe", "gaussmrg.exe", "gengv.exe", "maptlist.exe", "mixup.exe", "monotrain.exe", "savmodel.exe", "sbuild.exe", "smerge.exe": tools for HMM ASR training;
"tlist.align", "tlist_startend.align", "viseme_mono.list", "copyhmm.txt", "tlist_viseme.tree", "question_viseme.tri": reference files;
"hmm.cfg": config file for HMM ASR training;
"hmm_train.bat": batch file to run under MS-DOS.
Directory "visualfeature" holds the training data, and directory "label" holds the label files. Examples of the feature and label files are given in these directories.
After "hmm_train.bat" has run, a new directory named "result" is created in the working directory and contains the resulting model. The model files are placed in the directories specified in the config file; please refer to the config file for details. In this example, the final HMM parameter file "hmm.wwclust.param" is placed in "result\Model\output.clust_hmm.gm32.orig" under the current working directory, and "hmm.wwclust.map" can be found in the "result\Model\output.clustering" directory.
References
[Rab93] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall PTR, ISBN 0-130-15157-2, 1993.
[Jel97] F. Jelinek. Statistical Methods for Speech Recognition. The MIT Press, ISBN 0-262-10066-5, 1997.
[Lee96] C. H. Lee, F. K. Soong, and K. K. Paliwal. Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers, ISBN 0-792-39706-1, 1996.