Hidden Markov Model (HMM) in ASR

 

Content

1.    Model Training and Recognition
2.    Running Environment and File Formats
       2.1    monotrain
       2.2    gengv
       2.3    format
       2.4    maptlist
       2.5    emtrain
       2.6    bldlogtlist
       2.7    clone
       2.8    sbuild
       2.9    smerge
       2.10  savmodel
       2.11  gaussmrg
       2.12  mixup
       2.13  Config file example
       2.14  AHMMDecoder and VHMMDecoder
3.    Example
References

 

1.  Model Training and Recognition

The fundamental principles of HMMs in ASR (Automatic Speech Recognition) can be found in Rabiner's book [Rab93], Jelinek's book [Jel97] and Lee's book [Lee96], and are therefore omitted from this document.

This package contains tools for acoustic model training and speech recognition testing. The acoustic modeling units are sub-units of word pronunciations, called phonemes (or monophones). A sentence (a string of words) can be represented as a sequence of monophones obtained by concatenating the pronunciations of its words. To model the co-articulation effect in speech, context-dependent triphones are used, i.e., phonemes that are renamed according to their left and right context.

The acoustic training flowchart is shown below. It includes three sub-stages: monophone training, context-triphone training and clustered-triphone training. Each training stage is detailed in the following sections.

                           Figure 1. Training flowchart

The speech recognizer is a Viterbi search program to find the best state sequence that matches the given speech utterance. The recognizer requires the following items: 1. An acoustic model to match the acoustics; 2. A language model to match the syntax and semantics; 3. A word pronunciation dictionary to organize HMM models during search. A diagram of the testing stage is shown in the figure below. The output of the recognizer can be either the transcription of the speech, or a word-graph (or both).

                                  Figure 2. Testing flowchart

The following sections describe the tools for HMM training and testing.

 


2.  Running Environment and File Formats

The training process of HMM involves the programs listed below, followed by a short description.

monotrain.exe:     Initializes HMM parameters of isolated phoneme-viseme pairs using the Viterbi algorithm and the EM algorithm.

gengv.exe:         Generates the global variance file for the training data.

format.exe:        Binds all individual monophone models into an HMM model set for embedded training.

maptlist.exe:      Converts a logical transcript file to a physical transcript file using the HMM mapping file.

emtrain.exe:       Performs embedded forward-backward model re-estimation and dumps statistics (occupation) files.

bldlogtlist.exe:   Builds a logical task list with triphone transcription.

clone.exe:         Clones monophone models into triphone models.

sbuild.exe:        Builds decision trees for monophone states.

smerge.exe:        Merges leaves of decision trees for monophone states.

savmodel.exe:      Builds a model set according to the decision trees.

gaussmrg.exe:      Merges poorly used components of Gaussian mixtures.

mixup.exe:         Splits mixtures.

Each executable program needs one config file and other relevant files to function properly. The command line parameters and reference files of each program are described in detail below.

 

2.1  monotrain

  monotrain

 
   
  Command line parameters: monotrain -config asr.cfg
   
  Parameters:  
   
      LabelRootDir String, directory.
  Root directory for all label files.
   
      FeatureRootDir String, directory.
  Root directory for all feature files.
   
      MonoPhnList String, input file.
  Monophone list file name.
   
      LabelFileList String, input file.
  Label list file. Contains the list of the names of label files used.
   
      VecSize Integer.
  Vector size of  feature.
   
      CMS Integer.
  Number of cepstrum coefficients, with the sign selecting the transformation: if positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if 0, no input feature transformation is done.
   
      adaptCMSnVNWinSize Integer.
  Window Size for adaptive CMS and VN on the audio feature. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN.
   
      LabelFileSuffix String.
  Suffix of label files.
   
      LabelFileListType Integer.
  List type of label files.
   
      VarFloor Float.
  Variance floor value.
   
      WgtFloor Float.
  Mixture weight floor value.
   
      MinSegNum Integer.
  Minimum segment number used to train the model.
   
      MaxSegNum Integer.
  Maximum segment number used to train the model. If training corpus has more segments, use only MaxSegNum segments.
   
      LabCoefficient Float.
  Used to convert original label to the label in frame.
   
      FeatureFileSuffix String.
  Suffix of feature files.
   
      Viterbi Boolean.
  If true, the Viterbi algorithm is used to initialize the segmentation; if false, it is not used.
   
      Forward Boolean.
  If true, the forward-backward algorithm is used to train the model; if false, it is not used.
   
      TeeModel String.
  Transcription of the tee model.
   
      FileSaveAfterViterbi Boolean.
  If true, models are saved after the initial Viterbi iteration; if false, they are not saved.
   
      MaxIterNumVTB Integer.
  Maximum iteration number of Viterbi re-estimation.
   
      OutputDirVTB String, directory.
  Output model directory for models created after running through Viterbi training and input directory for the forward-backward training stage.
   
      MaxIterNumFB Integer.
  Maximum iteration number of forward-backward re-estimation.
   
      OutputDirFB String, directory.
  Output model directory for models created after running through the forward-backward algorithm.

Discussion

The monotrain tool initializes isolated HMMs using the Viterbi algorithm and the forward-backward algorithm. Label and feature files are the input data. A label file consists of a number of lines, each of the form:

start    end    phone

where 'start' is the segment start time, 'end' is the segment end time, and 'phone' is the monophone corresponding to the segment. The time is measured in 1e-7 seconds and the segment time is related to the parameter 'LabCoefficient' (for a 16 kHz wave file at 100 frames per second, 'LabCoefficient' = 160). The initial and final frame numbers N of a segment are computed from its start and end times using 'LabCoefficient'.

A fragment of a label file follows:

0 48 sil
48 70 z
70 88 ih
88 91 r
91 113 ow
113 137 sp
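
The label format above can be read with a few lines of code. The following is a minimal sketch, not the monotrain implementation: the function names are hypothetical, and the conversion rule (label value divided by LabCoefficient, truncated to an integer) is an assumption based on the description of the LabCoefficient parameter.

```python
from typing import List, Tuple


def parse_label_lines(lines: List[str]) -> List[Tuple[int, int, str]]:
    """Parse 'start end phone' lines into (start, end, phone) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        segments.append((int(fields[0]), int(fields[1]), fields[2]))
    return segments


def to_frame(label_value: int, lab_coefficient: int) -> int:
    """Assumed rule: frame number = label value / LabCoefficient, truncated."""
    return label_value // lab_coefficient


fragment = ["0 48 sil", "48 70 z", "70 88 ih", "88 91 r", "91 113 ow", "113 137 sp"]
segments = parse_label_lines(fragment)
print(segments[0])           # first segment of the fragment
print(to_frame(16000, 160))  # one second of 16 kHz samples at 100 frames per second
```

Under this assumption, LabCoefficient = 160 corresponds to labels expressed in samples of a 16 kHz recording with 100 frames per second.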

The monophone list file lists the transcriptions of the monophones, one per line, for example:

ah
ao
ax

The label list file contains the list of label files used, in the following format:

{
LabelFileName1
LabelFileName2
}

A simple Gaussian distribution with a diagonal covariance matrix is used to compute probabilities in the HMM. A simple distribution means that there is only one Gaussian in a mixture and the Gaussian weight is equal to 1. For each state and for each segment, the mean vector of the mixture component, the covariance matrix of the mixture component, and the transition probabilities are initialized. All parameters are initialized by uniform segmentation.
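
Initialization by uniform segmentation can be sketched as follows. This is an illustration under stated assumptions, not the code of monotrain: the function names are hypothetical, a single segment is used, and the variance is floored with a value that plays the role of VarFloor.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def uniform_segmentation(num_frames: int, num_states: int) -> List[Tuple[int, int]]:
    """Split frames [0, num_frames) into num_states contiguous, near-equal ranges."""
    if num_frames < num_states:
        raise ValueError("segment is shorter than the number of states")
    bounds = [(k * num_frames) // num_states for k in range(num_states + 1)]
    return [(bounds[k], bounds[k + 1]) for k in range(num_states)]


def mean_and_variance(frames: Sequence[Frame], var_floor: float):
    """Per-dimension mean and floored variance (diagonal covariance)."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def init_states(segment: Sequence[Frame], num_states: int, var_floor: float = 1e-3):
    """One (mean, variance) pair per state, estimated from its share of frames."""
    ranges = uniform_segmentation(len(segment), num_states)
    return [mean_and_variance(segment[a:b], var_floor) for a, b in ranges]


segment = [[0.0], [0.0], [1.0], [1.0], [4.0], [6.0]]
states = init_states(segment, num_states=3)
for mean, var in states:
    print(mean, var)
```

In a real training run the statistics of all segments of a monophone would be pooled before the means and variances are computed.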

The segments are read from the feature files. The Viterbi component of the monotrain tool searches the input data for at most as many segments as indicated by the parameter MaxSegNum. If there are fewer segments for a given monophone than indicated by the parameter MinSegNum, initialization is canceled and no output files are created.

The next steps are the Viterbi state realignment and the selection of the best likelihood path for each segment. The Viterbi component of the monotrain tool executes up to MaxIterNumVTB iterations. However, the program may terminate earlier if the change of the best likelihood per frame becomes relatively insignificant.

The Viterbi part of the monotrain tool creates two files: a binary file with the extension *.hmm and a text file with the extension *.txt. The number of segments found for each monophone from the MonoPhnList must not be smaller than MinSegNum.

The forward-backward component of the monotrain tool re-estimates the HMM of each monophone using the forward-backward algorithm. The input data for that component are the HMMs produced by the Viterbi part of the tool and the feature files. It operates on the same segments as the Viterbi part and uses Baum-Welch re-estimation.

The forward-backward component produces HMM files for all monophones from the MonoPhnList except the 'tee' model. Note that no model is trained for the 'tee' model. The format tool creates the HMM for the 'tee' model later.

The forward-backward component of the monotrain tool executes up to MaxIterNumFB iterations of Baum-Welch re-estimation. However, the program may terminate earlier if the change of the likelihood per frame becomes relatively insignificant.

Like the Viterbi part, the forward-backward component of the monotrain tool creates two files, a binary file with the extension *.hmm and a text file with the extension *.txt, for each monophone from the list MonoPhnList except the 'tee' model.

 


2.2  gengv

  gengv

 
   
  Command line parameters: gengv -config asr.cfg
   
  Parameters:  
   
      FeatureDir String, directory.
  Root directory of all feature files.
   
      LabelFileList String, file name.
  Label file name list with starting and ending frame index.
   
      OutputFile String, file name.
  Output file name of feature variance.
   

Discussion

The gengv tool generates the global variance file for the training data.

The input data are feature files with the extension *.mfcc that are stored in the corresponding directories and listed in the file given by the parameter LabelFileList. Each line of that file has the format:

filename (without extension),    start frame~end frame

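The gengv tool produces the global variance vector computed by the formula:

$$\sigma^2 = \frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \left( x_{i,t} - \mu \right)^2$$

where $x_{i,t}$ is frame $t$ of feature file $i$, $\sigma^2$ is the global variance vector, $\mu$ is the global mean vector, $N$ is the number of feature files, $T_i$ is the length of feature file $i$, and the square is taken element-wise.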
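
A minimal sketch of a global mean and variance computation over several feature files is given below. It assumes that the variance is normalized by the total number of frames and that each feature file is already loaded as a list of frames; the function name is hypothetical and the actual gengv implementation may differ.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def global_mean_variance(files: List[List[Frame]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance over all frames of all feature files."""
    dim = len(files[0][0])
    total = sum(len(frames) for frames in files)  # total number of frames
    mean = [
        sum(x[d] for frames in files for x in frames) / total
        for d in range(dim)
    ]
    var = [
        sum((x[d] - mean[d]) ** 2 for frames in files for x in frames) / total
        for d in range(dim)
    ]
    return mean, var


# Two tiny "feature files" with two-dimensional frames.
files = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[5.0, 3.0]],
]
mean, var = global_mean_variance(files)
print(mean, var)
```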

 


2.3  format

  format

 
   
  Command line parameters: format -config asr.cfg
   
  Parameters:  
   
      MonoPhnList String, file name.
  Monophone list file name.
   
      InputDir String, directory.
  Root directory of the monophone model produced by monotrain tool.
   
      OutputDir String, directory.
  Directory of the output triphone model mapping, structure and parameter files.
   
      HMMMapFile_W String, file name.
  Output HMM mapping file name.
   
      HMMPhysFile_W String, file name.
  Output HMM physical file name.
   
      HMMParamFile_W String, file name.
  Output HMM parameter file name.
   
      TeeModel String.
  Transcription of tee model.
   
      SilenceModel String.
  Transcription of silence model.
   

Discussion

The format tool produces the HMM mapping, structure and parameter files (*.map, *.phys and *.param) that are used for embedded training and decoding. It builds these model files from the individual monophone models (files with the extension *.hmm) that are produced by the monotrain tool. The middle state of the 'silence' HMM is used to initialize the HMM of the 'sp' model.

 


2.4  maptlist

  maptlist

 
   
  Command line parameters: maptlist -config asr.cfg
   
  Parameters:  
   
      MonoPhnList String, file name.
  Monophone list file name.
   
      InputDir String, directory.
  Directory in which the HMM mapping and parameter files are placed.
   
      OutputDir String, directory.
  Directory for output physical transcript and HMM physical files.
   
      HMMMapFile String, file name.
  Input monophone HMM mapping file name.
   
      HMMPhysFile String, file name.
  Output HMM physical file name (only for decision tree mapping). If the HMM mapping file is a decision tree, then the physical HMM file must be rebuilt with the help of the original HMM parameter file and the decision tree. If the HMM mapping file is not a decision tree, then the original physical HMM file must be unchanged.
   
      HMMParamFile String, file name.
  Input monophone HMM parameter file name.
   
      TeeModel String.
  Transcription of tee model.
   
      LogicTListFile String, file name.
  Input logical transcript file.
   
      PhysTListFile String, file name.
  Output physical transcript file name.
   
      GroupSize Integer.
  Defines the number of tasks in each section of the output task list file.

Discussion

The maptlist tool transforms the logical task list file LogicTListFile, which contains the transcriptions of the utterances, into the physical task list file PhysTListFile used by the emtrain tool during embedded training. For each training utterance the logical task list contains the monophone or triphone transcription of the utterance, the feature file name and the frame range.

A fragment of a logical monophone task list follows:

{
[transcript] sil z ih s r ah w ao ow ah t uw ah
trainData/003_1_1to2, 0~1189
}
 

The physical task list has the same structure, except that the names are replaced by the indices of the corresponding HMMs in the hmm.phys file.

A fragment of a physical task list follows:

{
[transcript] 16 22 9 14 13 0 21 1 12 0 17 19 0
trainData/003_1_1to2, 0~1189
}
 

The hmm.map file can be of two types: DECISIONTREE or ONE2ONEMAPPING. If the ONE2ONEMAPPING type is used, the file hmm.phys already exists. If the DECISIONTREE type is used, the file hmm.phys is built together with the physical task list.
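
For the one-to-one case, the conversion of a logical transcript line into a physical one can be sketched as below. The index table is read off from the two fragments shown above; how maptlist actually loads hmm.phys is not described here, so the table and the function name are assumptions, and the decision tree case is not covered.

```python
from typing import Dict

# Name-to-index pairs taken from the logical and physical fragments above.
PHYS_INDEX: Dict[str, int] = {
    "ah": 0, "ao": 1, "ih": 9, "ow": 12, "r": 13, "s": 14,
    "sil": 16, "t": 17, "uw": 19, "w": 21, "z": 22,
}


def map_transcript(line: str, index: Dict[str, int]) -> str:
    """Replace HMM names on a '[transcript] ...' line by their physical indices."""
    tag, *names = line.split()
    return " ".join([tag] + [str(index[name]) for name in names])


logical = "[transcript] sil z ih s r ah w ao ow ah t uw ah"
physical = map_transcript(logical, PHYS_INDEX)
print(physical)
```

The feature file line that follows the transcript (for example `trainData/003_1_1to2, 0~1189`) is copied unchanged.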

 


2.5  emtrain

  emtrain

 
   
  Command line parameters: emtrain -config asr.cfg -group mono -pass mono11
  -group and -pass indicate the right parameters in the config file
   
  Parameters:  
   
      GlobalVarFile String, file name.
  Input file name of global variance file of the feature (the result of gengv tool)
   
      FeatureRootDir String, directory.
  Root directory of the feature files for embedded training.
   
      PruneInit Float.
  Initial value of beam width for pruning in the backward calculation.
   
      PruneLimit Float.
  Upper limit of the backward beam width value. Once this number is exceeded, the current utterance is not used for training.
   
      PruneInc Float.
  Beam width value increment at every pruning failure.
   
      MinTrainSeg Integer.
  Minimal number of phone examples required to update the corresponding HMM parameters.
   
      UpdateMode Integer.
  Update mode (4-bit encoded; for each bit, 0 - disable, 1 - enable):
  bit 0    transition matrix
  bit 1    Gaussian weights
  bit 2    mean vectors
  bit 3    variance vectors
   
      VarFloorCoeff Float.
  Variance floor value.
   
      HMMPhysFile String, file name.
  Input HMM physical file name (path from working directory)
   
      HMMParamFile String, file name.
  Input HMM parameter file name (path from InputDir directory)
   
      HMMMapFile String, file name.
  Input HMM mapping file name (path from working directory)
   
      PhysTListFile String, file name.
  Input physical list file. Contains the list of feature files used for training and physical transcription.
   
      TotalOcctFile String, file name.
  Output file name of the statistics file.
   
     adaptCMSnVNWinSize Integer.
  Window Size for adaptive CMS and VN on the audio feature. If the adaptCMSnVNWinSize exceeds 0, adaptive CMS and VN will be used; if the adaptCMSnVNWinSize is less than 0, adaptive CMS and VN will not be used. If adaptCMSnVNWinSize is equal to 0, use utterance-based CMS and VN.
   
      CMS Integer.
  Number of cepstrum coefficients, with the sign selecting the transformation: if positive, Cepstrum Mean Subtraction is performed on the audio feature; if negative, Cepstrum Variance Normalization; if 0, no input feature transformation is done.
   
      StateOrModelOcct_B Boolean.
  Defines the type of the statistics (occupation) file. If StateOrModelOcct_B is true, the state-oriented statistics file is generated. This file is used by the mixture merging and splitting commands. If the value is false, the model-oriented statistics file is generated. This file is used by the decision and regression tree building tools (currently StateOrModelOcct_B can only be set to true).
   
      MinMixOcc Float.
  Minimal HMM state mixture occupation required to update parameters of its Gaussian mixture.
   
      NewHMMParamFile String, file name.
  Output new HMM parameter file name (path from the OutputDir directory).
   
      InputDir String, directory.
  Directory in which the HMM parameter file is placed.
   
      OutputDir String, directory.
  Directory for output HMM parameter file and statistics files.
   

Discussion

The emtrain tool performs a training iteration of the input HMM model set on a given set of input feature files. The training procedure is based on the expectation-maximization (EM) method, i.e., an iterative method for maximum likelihood (ML) estimation with incompletely observed data. The EM method consists of the expectation step (E-step), which obtains the Baum auxiliary function, and the maximization step (M-step). The Baum-Welch (forward-backward) algorithm is used. The tool applies the forward-backward (F-B) algorithm to each feature file and collects accumulators and occupation counters for each HMM state in the model, which are then used to re-estimate the HMM parameter values. If the number of training examples of an HMM is less than the value of the parameter MinTrainSeg, the parameters of that model are not updated. The mean and variance vectors for which the occupation counter is less than the value of the parameter MinMixOcc are not updated either.

The parameter UpdateMode defines which parameters are updated. To make the F-B algorithm faster, beam pruning is used. The initial value of the beam width is set by the parameter PruneInit. If either the forward or the backward calculation fails, the beam width is repeatedly increased by the value of the parameter PruneInc until it exceeds the value of the parameter PruneLimit. If the calculation still fails at the maximum beam width, the feature file is not used for training during this iteration.
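
The bit encoding of UpdateMode and the beam widening schedule can be illustrated with the sketch below. Both functions are hypothetical helpers written for this manual, not part of emtrain; `forward_backward` stands for any callback that returns True when the calculation succeeds at a given beam width.

```python
from typing import Callable, Dict, Optional


def decode_update_mode(mode: int) -> Dict[str, bool]:
    """Decode the 4-bit UpdateMode value into per-parameter switches."""
    return {
        "transitions": bool(mode & 0b0001),  # bit 0
        "weights": bool(mode & 0b0010),      # bit 1
        "means": bool(mode & 0b0100),        # bit 2
        "variances": bool(mode & 0b1000),    # bit 3
    }


def run_with_beam(
    forward_backward: Callable[[float], bool],
    prune_init: float,
    prune_inc: float,
    prune_limit: float,
) -> Optional[float]:
    """Retry with a wider beam after each failure; None means 'skip utterance'."""
    if prune_inc <= 0:
        raise ValueError("PruneInc must be positive")
    beam = prune_init
    while beam <= prune_limit:
        if forward_backward(beam):
            return beam  # beam width at which the calculation succeeded
        beam += prune_inc
    return None


print(decode_update_mode(12))  # update means and variances only
print(run_with_beam(lambda beam: beam >= 300, 100, 100, 500))
```

A value of 15 enables all four parameter groups, and a return value of None corresponds to an utterance that is dropped for the current iteration.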

 


2.6 bldlogtlist

  bldlogtlist

 
   
  Command line parameters: bldlogtlist -config asr.cfg
   
  Parameters:  
   
      InputDir String, directory.
  Directory with the input logical monophone task list.
   
      OutputDir String, directory.
  Directory for the output logical triphone task list and seen triphone list files.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      InputTaskListFile String, input file.
  The name of the input task file with monophone transcription.
   
      IsWithin Boolean.
  Provides the within-word triphone extension if 'true', cross-word extension otherwise.
   
      OutputSeenList String, output file.
  The name of the output file that contains all seen triphone names.
   
      OutputTList String, output file.
  The name of output task list file with triphone transcription.

Discussion

The bldlogtlist tool extends the monophone transcription of each utterance in the input task list file to a triphone transcription, either within-word or cross-word. The output logical task list is used by the maptlist tool for building the physical task list file. The seen triphone list file is used for model cloning by the clone tool and for building decision and regression trees.
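
The expansion of a monophone transcription into triphones can be sketched as follows. The naming scheme `left-center+right` is an assumption borrowed from common HMM toolkits and is not confirmed by this manual; word boundaries and any special handling of silence models are ignored, so the sketch corresponds to a plain cross-word expansion of one utterance.

```python
from typing import Iterable, List


def to_triphones(phones: List[str]) -> List[str]:
    """Rename each phone by its left and right neighbors (assumed 'l-c+r' form)."""
    triphones = []
    for i, center in enumerate(phones):
        name = center
        if i > 0:
            name = phones[i - 1] + "-" + name  # add left context
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]  # add right context
        triphones.append(name)
    return triphones


def seen_triphones(transcripts: Iterable[List[str]]) -> List[str]:
    """Sorted list of all distinct triphone names that occur in the data."""
    seen = set()
    for phones in transcripts:
        seen.update(to_triphones(phones))
    return sorted(seen)


print(to_triphones(["sil", "z", "ih"]))
print(seen_triphones([["sil", "z", "ih"], ["sil", "z", "ow"]]))
```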

 


2.7  clone

  clone

 
   
  Command line parameters: clone -config asr.cfg
   
  Parameters:  
   
      InputDir String, directory.
  Directory where the monophone model parameter file is placed.
   
      OutputDir String, directory.
  Directory for triphone model mapping, structure and parameter files.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      HMMMapFile String, input file.
  Monophone HMM mapping file name.
   
      HMMPhysFile String, input file.
  Monophone HMM physical file name.
   
      HMMParamFile String, input file.
  Monophone HMM parameters file name.
   
      TriPhoneList String, input file.
  Seen triphone list  file name.
   
      HMMMapFile_W String, output file.
  Triphone HMM mapping file name.
   
      HMMPhysFile_W String, output file.
  Triphone HMM physical file name.
   
      HMMParamFile_W String, output file.
  Triphone HMM parameters file name.
   

Discussion

The clone tool transforms context-independent (monophone) models into models dependent on the left and right neighboring phones (context-dependent or triphone models). For each seen triphone the monophone HMM for its central monophone is replicated.
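
Cloning amounts to copying the parameters of the central monophone once per seen triphone. The sketch below assumes triphone names of the form `left-center+right` and models stored as plain dictionaries; both are illustrative assumptions, not the file formats used by the clone tool.

```python
import copy
from typing import Dict, List


def central_phone(triphone: str) -> str:
    """Extract the central monophone from an assumed 'l-c+r' name."""
    return triphone.split("-")[-1].split("+")[0]


def clone_models(mono_models: Dict[str, dict], seen: List[str]) -> Dict[str, dict]:
    """Give every seen triphone its own copy of the central monophone model."""
    return {name: copy.deepcopy(mono_models[central_phone(name)]) for name in seen}


mono_models = {
    "z": {"means": [[0.0, 1.0]], "variances": [[1.0, 1.0]]},
    "ih": {"means": [[2.0, 3.0]], "variances": [[1.0, 1.0]]},
}
tri_models = clone_models(mono_models, ["sil-z+ih", "z-ih"])

# The copies are independent: later re-estimation of one triphone
# does not change the monophone it was cloned from.
tri_models["sil-z+ih"]["means"][0][0] = 9.0
print(mono_models["z"]["means"][0][0])
```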

 


2.8  sbuild

  sbuild

   
  Command line parameters: sbuild -config asr.cfg -group dcstree
   
  Parameters:  
   
      InputDir String, directory.
  Directory with input HMM parameter file.
   
      OutputDir String, directory.
  Directory for output decision tree files (*.sbd).
   
      TaskListFile String, input file.
  Task list file name. This file contains monophone names and state numbers for which decision trees were built.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      SeenHMMList String, input file.
  Seen triphone file list name. Each line of this file contains a logical triphone name.
   
      HMMMapFile String, input file.
  HMM mapping file name.
   
      HMMPhysFile String, input file.
  HMM physical file name.
   
      HMMParamFile String, input file.
  HMM parameter file name.
   
      QuestionListFile String, input file.
  Name of the file with phonetic question list.
   
      TotalOcctFile String, input file.
  Statistics file after last iteration of embedded training. This file must have the model form.
   
      TraceBuilding Boolean.
  If 'true', the sequence of the chosen best questions is printed.
   
      TresholdBuild Float.
  Threshold value to cease clusters splitting.
   
      Outlier Float.
  Threshold value to prevent clusters with only one element.

Discussion

The sbuild tool performs clustering on triphone sets for each state of each central monophone to improve HMM trainability. Decision trees may be built for each state of each monophone except the 'tee' model. The idea of the clustering algorithm is to select the best sequence of phonetic questions about the left and right contexts in order to split phones into phoneme varieties (clusters) with similar pronunciation. At each step the best question for each cluster is selected according to its statistical characteristics. After that, the best cluster is split according to its best question. Clustering produces a decision tree with non-terminal nodes containing questions about the context and leaf nodes containing phone clusters. After clustering, all decision trees are saved in separate files with the file extension *.sbd.
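
The selection of the best question for one cluster can be sketched as below. The criterion used here, the gain in log-likelihood of a single diagonal Gaussian fitted to each side of the split, is a common choice for decision tree state tying and is an assumption; the manual does not state which statistic sbuild uses. The data layout and the question names are hypothetical.

```python
import math
from typing import List, Optional, Set, Tuple

Question = Tuple[str, str, Set[str]]  # (name, "left" or "right", phone set)


def pooled_loglik(items: List[dict]) -> float:
    """Approximate log-likelihood of one diagonal Gaussian fitted to the pool."""
    occ = sum(it["occ"] for it in items)
    if occ <= 0:
        return 0.0
    total = 0.0
    for d in range(len(items[0]["mean"])):
        mean = sum(it["occ"] * it["mean"][d] for it in items) / occ
        second = sum(it["occ"] * (it["var"][d] + it["mean"][d] ** 2) for it in items) / occ
        var = max(second - mean * mean, 1e-6)  # floor to keep the log finite
        total += math.log(2.0 * math.pi * var) + 1.0
    return -0.5 * occ * total


def best_question(items: List[dict], questions: List[Question]) -> Optional[Tuple[str, float]]:
    """Return (question name, likelihood gain) of the best split, if any."""
    base = pooled_loglik(items)
    best = None
    for name, side, phones in questions:
        yes = [it for it in items if it[side] in phones]
        no = [it for it in items if it[side] not in phones]
        if not yes or not no:
            continue  # the question does not split this cluster
        gain = pooled_loglik(yes) + pooled_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


items = [
    {"left": "m", "right": "a", "occ": 10.0, "mean": [0.0], "var": [1.0]},
    {"left": "n", "right": "b", "occ": 10.0, "mean": [0.2], "var": [1.0]},
    {"left": "p", "right": "a", "occ": 10.0, "mean": [5.0], "var": [1.0]},
    {"left": "t", "right": "b", "occ": 10.0, "mean": [5.2], "var": [1.0]},
]
questions = [("L_Nasal", "left", {"m", "n"}), ("R_Vowel", "right", {"a"})]
best = best_question(items, questions)
print(best)
```

In a full tree builder the split would be accepted only if the gain exceeds a threshold such as TresholdBuild, and the procedure would then be repeated on the resulting clusters.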

 


2.9  smerge

  smerge

   
  Command line parameters: smerge -config asr.cfg -group dcstree
   
  Parameters:  
   
      InputDir String, directory.
  Directory with input HMM parameter file.
   
      OutputDir String, directory.
  Directory for input decision tree files (*.sbd) and for output decision tree files (*.smg).
   
      TaskListFile String, input file.
  Task list file name. This file contains monophone names and state numbers for which decision trees were built.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      HMMMapFile String, input file.
  HMM mapping file name.
   
      HMMPhysFile String, input file.
  HMM physical file name.
   
      HMMParamFile String, input file.
  HMM parameter file name.
   
      QuestionListFile String, input file.
  Name of the file with phonetic question list.
   
      TotalOcctFile String, input file.
  Statistics file after last iteration of embedded training. This file must have the model form.
   
      TraceMerging Boolean.
  If 'true', the sequence of the leaves joining is printed.
   
      TresholdMerge Float.
  Threshold value to cease clusters joining.
   
      Outlier Float.
  Threshold value to prevent clusters with only one element.

Discussion

The smerge tool prunes and merges decision trees and selects typical states. The pruning step follows the reverse order of the decision tree building process according to the threshold. Each pair of nodes that was split during the sbuild procedure is merged back into one node with the previous phone cluster. The 'unbuild' process stops when the next considered node has a splitting improvement, i.e., 'splitting probability' minus 'total probability', less than the threshold. At the next step the most similar leaf nodes, whose merging cost is less than the threshold, are merged together. In the end a typical state is selected for each leaf node. Thus, the tool produces more complex decision trees, or cycle trees, that are saved in files with the file extension *.smg.

 


2.10  savmodel

  savmodel

   
  Command line parameters: savmodel -config asr.cfg -group dcstree
   
  Parameters:  
   
      InputDir String, directory.
  Directory with input HMM parameter file.
   
      OutputDir String, directory.
  Output directory for new HMM parameter file.
   
      TaskListFile String, input file.
  Task list file name with the names of all monophone states.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      SeenHMMList String, input file.
  Seen triphone file list name. Each line of this file contains a logical triphone name.
   
      HMMMapFile String, input file.
  HMM mapping file name.
   
      HMMPhysFile String, input file.
  HMM physical file name.
   
      HMMParamFile String, input file.
  HMM parameter file name.
   
      QuestionListFile String, input file.
  Name of the file with phonetic question list.
   
      HMMMapFile_W String, output file.
  New HMM mapping file.
   
      HMMParamFile_W String, output file.
  New HMM parameter file.

Discussion

The savmodel tool creates the HMM mapping and HMM parameter files for the clustered model using the decision trees. The simple linear HMM mapping file for context-dependent models is transformed into a tree-oriented HMM mapping file that contains decision tree nodes, a question list, etc.

After compressing and remapping mixtures, the tool saves all HMM weights, mean vectors and variance vectors into an HMM parameter file. The tool produces a decision tree mapping mechanism that is useful for handling unseen triphones and for improving HMM trainability.

 


2.11  gaussmrg

  gaussmrg

   
  Command line parameters: gaussmrg -config asr.cfg -group clust -pass clust34
   
  Parameters:  
   
      InputDir String, directory.
  Directory with input HMM parameter file.
   
      OutputDir String, directory.
  Directory for output decision tree files (*.sbd).
   
      HMMParamFile String, input file.
  Input HMM parameters file name.
   
      MonoListFile String, input file.
  Monophone list file name.
   
      TotalOcctFile String, input file.
  Statistics file for the input HMM parameter file. This file should have model form.
   
      MinMixOcc Float.
  Minimal HMM state occupation to prevent Gaussians from merging.
   
      DetLimit Float.
  All Gaussian mixture components whose variance determinant is less than the value DetLimit are deleted, to prevent Gaussians from functioning like delta functions for rare phone HMMs.

Discussion

The gaussmrg tool merges poorly used Gaussian mixture components according to the statistics file of an HMM training iteration.

The merging algorithm recalculates the weight w, mean m and variance v vectors of two mixture components as follows:

where N is the vector length and w = w1 + w2.
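
One common way to merge two diagonal Gaussians is moment matching, which preserves the weight, mean and second moment of the pair. The sketch below uses that rule as an assumption; it agrees with w = w1 + w2, but the exact recalculation formulas of gaussmrg are not reproduced in this manual and may differ.

```python
from typing import List, Tuple


def merge_gaussians(
    w1: float, m1: List[float], v1: List[float],
    w2: float, m2: List[float], v2: List[float],
) -> Tuple[float, List[float], List[float]]:
    """Moment-matching merge of two diagonal Gaussian mixture components."""
    w = w1 + w2
    # Weighted average of the means, per dimension.
    m = [(w1 * a + w2 * b) / w for a, b in zip(m1, m2)]
    # Weighted average of the second moments minus the squared new mean.
    v = [
        (w1 * (va + a * a) + w2 * (vb + b * b)) / w - mm * mm
        for a, va, b, vb, mm in zip(m1, v1, m2, v2, m)
    ]
    return w, m, v


w, m, v = merge_gaussians(0.5, [0.0], [1.0], 0.5, [2.0], [1.0])
print(w, m, v)
```

The merged variance is larger than either input variance whenever the two means differ, because the spread between the means is absorbed into it.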

 


2.12  mixup

  mixup

 
   
  Command line parameters: mixup -config asr.cfg -group mono -pass mono14
  -group and -pass indicate the right parameters in the config file
   
  Parameters:  
   
      HMMParamFile String, file name.
  Input and output HMM parameter file name (path from the InputDir and OutputDir respectively).
   
      HMMPhysFile String, file name.
  Input HMM physical file name (path from the working directory).
   
      HMMMapFile String, file name.
  Input HMM mapping file name (path from the working directory).
   
      OcctFile String, file name.
  Input statistics file name (path from the InputDir directory).
   
      InputDir String, directory.
  Directory with input HMM parameter files.
   
      OutputDir String, directory.
  Directory for output HMM parameter files.
   
      MixNum Integer.
  The number of components to which the mixtures are split.
   
      MinMixOcc Float.
  Minimal HMM state occupation to split mixture components.
   
      PertDepth Float.
  Perturbation depth of the means of split Gaussians. One Gaussian with mean vector m and variance vector v is split into two new Gaussians whose means are m+PertDepth*v and m-PertDepth*v; the variances are unchanged, i.e., equal to v.
   
      VarFloor Float.
  Variance floor value.
   
      SplitPenalty Float.
  Gaussian split penalty. Each time a Gaussian is split, this number is added to prevent it from being split too many times. This parameter is usually set to a very large value.
   

Discussion

The mixup tool increases the number of Gaussians in all mixtures to the MixNum value by successively splitting the most extended Gaussians.

An accurate approximation of the continuous probability density function requires increasing the number of mixture densities (mixture coefficients) at the HMM training stage. This program increases the number of components in all Gaussian mixtures of the HMMs by splitting the components with the biggest determinant of the variance matrix. For a diagonal covariance matrix with variance vector v, the determinant is

$$\det(V) = \prod_{i=1}^{N} v_i$$

where N is the vector length.
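
The splitting procedure can be sketched as follows. The mean perturbation follows the description of PertDepth; halving the weight of the split component is an assumption, and MinMixOcc, VarFloor and SplitPenalty are left out, so this is an illustration rather than the mixup implementation.

```python
import math
from typing import List, Tuple

Component = Tuple[float, List[float], List[float]]  # (weight, mean, variance)


def log_det(variance: List[float]) -> float:
    """Log-determinant of a diagonal covariance: sum of the log variances."""
    return sum(math.log(x) for x in variance)


def split_component(w: float, m: List[float], v: List[float], pert_depth: float) -> List[Component]:
    """Split one Gaussian into two with means m +/- PertDepth * v."""
    up = [a + pert_depth * b for a, b in zip(m, v)]
    down = [a - pert_depth * b for a, b in zip(m, v)]
    return [(w / 2.0, up, list(v)), (w / 2.0, down, list(v))]


def mix_up(mixture: List[Component], mix_num: int, pert_depth: float) -> List[Component]:
    """Split the component with the largest determinant until mix_num is reached."""
    result = list(mixture)
    while len(result) < mix_num:
        k = max(range(len(result)), key=lambda i: log_det(result[i][2]))
        w, m, v = result.pop(k)
        result.extend(split_component(w, m, v, pert_depth))
    return result


mixture = [(1.0, [0.0, 0.0], [1.0, 4.0])]
doubled = mix_up(mixture, mix_num=2, pert_depth=0.2)
for component in doubled:
    print(component)
```

Comparing log-determinants instead of determinants gives the same ordering and avoids numerical underflow for long feature vectors.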

 


2.13  Config file example

To help users run HMM training more easily, we provide an example batch file that runs the tools described above in sequence. Fragments of the batch file follow:

monotrain -config asr.cfg
format -config asr.cfg
gengv -config asr.cfg
maptlist -config asr.cfg -pass mono
emtrain -config asr.cfg -group mono
emtrain -config asr.cfg -group mono
emtrain -config asr.cfg -group mono
bldlogtlist -config asr.cfg
clone -config asr.cfg
maptlist -config asr.cfg -group context -pass context21
emtrain -config asr.cfg -group context -pass context22
sbuild -config asr.cfg -group dcstree
smerge -config asr.cfg -group dcstree
savmodel -config asr.cfg -group dcstree
maptlist -config asr.cfg -pass clust
emtrain -config asr.cfg -group clust -pass clust31
emtrain -config asr.cfg -group clust -pass clust32
gaussmrg -config asr.cfg -group clust -pass clust33
mixup -config asr.cfg -group clust -pass clust34
 

An example of the config file 'asr.cfg' is included in the package.

Note that each tool needs some reference files to run properly. Examples of these files are also included in the package.
 


2.14  AHMMDecoder and VHMMDecoder

The decoding (recognition) components are packed in the AHMMDecoder.dll, which provides functions for speech recognition using the HMM parameter file obtained from the training stage on audio feature. The functions in AHMMDecoder.dll are as follows.

  int ahmm_decoder_init( char *cfgfile );
  Function description: Initialize the decoder with the config file.
  Input:  cfgfile:   config file name.
  Return: 0, initialization successful;
          1, initialization failed; error messages are output to the console.

  int ahmm_decoder_rec( float *data, int vecSize, int frameNum, char *result, int resLen );
  Function description: Main decoding procedure.
  Input:  data:      input feature data.
          vecSize:   vector size of the feature.
          frameNum:  number of frames.
          result:    buffer that receives the decoding result in the format 'word [startframe endframe]'.
          resLen:    size of the result buffer, to prevent buffer overflow.
  Return: 0, decoding successful; the result is stored in the result buffer.

  void ahmm_decoder_free();
  Function description: Free the resources.
  Return: none.
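
The result buffer holds text in the format 'word [startframe endframe]'. The following C fragment is a minimal sketch of how a caller might split that text into words and frame spans. It assumes that the buffer contains one or more such entries separated by whitespace and that a word contains no whitespace; the manual does not specify either point. WordHyp and parse_decoder_result are hypothetical names, not part of AHMMDecoder.dll.

```c
#include <assert.h>
#include <stdio.h>

/* One recognized word with its frame span (hypothetical helper type). */
typedef struct {
    char word[64];
    int  start_frame;
    int  end_frame;
} WordHyp;

/* Parses entries of the form "word [startframe endframe]" from the result
 * buffer. Assumes whitespace-separated entries and words of at most 63
 * characters. Returns the number of entries stored in out (at most max). */
int parse_decoder_result(const char *result, WordHyp *out, int max)
{
    int count = 0;
    assert(result != NULL && out != NULL);
    while (count < max) {
        int consumed = 0;
        int fields = sscanf(result, " %63s [ %d %d ]%n",
                            out[count].word,
                            &out[count].start_frame,
                            &out[count].end_frame,
                            &consumed);
        /* consumed stays 0 unless the closing bracket was matched as well. */
        if (fields != 3 || consumed == 0)
            break;
        result += consumed;
        count++;
    }
    return count;
}
```

Parsing stops at the first entry that does not match the expected pattern, so a truncated buffer yields only the complete entries that precede the truncation.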

The functions in VHMMDecoder.dll are as follows.

  int vhmm_decoder_init( char *cfgfile );
  Function description: Initialize the decoder with the config file.
  Input:  cfgfile:   config file name.
  Return: 0, initialization successful;
          1, initialization failed; error messages are output to the console.

  int vhmm_decoder_rec( float *data, int vecSize, int frameNum, char *result, int resLen );
  Function description: Main decoding procedure.
  Input:  data:      input feature data.
          vecSize:   vector size of the feature.
          frameNum:  number of frames.
          result:    buffer that receives the decoding result in the format 'word [startframe endframe]'.
          resLen:    size of the result buffer, to prevent buffer overflow.
  Return: 0, decoding successful; the result is stored in the result buffer.

  void vhmm_decoder_free();
  Function description: Free the resources.
  Return: none.
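
Both DLLs export functions with the same signatures, so the calling sequence (initialize, decode, free) is the same for either decoder. The following C fragment is a minimal sketch of that sequence. HmmDecoderApi, decode_once and the stub_* functions are hypothetical; the stubs are stand-ins that let the sketch compile without the DLLs. The sketch also assumes that a nonzero return value from the decoding function indicates failure, which the manual does not document.

```c
#include <assert.h>
#include <stdio.h>

/* Table of function pointers that can describe either decoder, because the
 * AHMM and VHMM functions have identical signatures. */
typedef struct {
    int  (*init)(char *cfgfile);
    int  (*rec)(float *data, int vecSize, int frameNum, char *result, int resLen);
    void (*release)(void);
} HmmDecoderApi;

/* Runs init -> rec -> free once. Returns 0 on success, nonzero otherwise. */
int decode_once(const HmmDecoderApi *api, char *cfgfile,
                float *data, int vecSize, int frameNum,
                char *result, int resLen)
{
    int status;
    assert(api != NULL);
    if (api->init(cfgfile) != 0)
        return 1;                 /* initialization reports failure with 1 */
    status = api->rec(data, vecSize, frameNum, result, resLen);
    api->release();               /* free resources whether or not rec succeeded */
    return status;
}

/* Hypothetical stand-ins, not the real decoder: they only mimic the
 * documented return values and result format. */
static int stub_init(char *cfgfile) { return cfgfile == NULL; }

static int stub_rec(float *data, int vecSize, int frameNum, char *result, int resLen)
{
    (void)data;
    (void)vecSize;
    snprintf(result, (size_t)resLen, "sil [0 %d]", frameNum - 1);
    return 0;
}

static void stub_free(void) {}

const HmmDecoderApi stub_decoder = { stub_init, stub_rec, stub_free };
```

When linking against the real library, the table would instead be filled with ahmm_decoder_init, ahmm_decoder_rec and ahmm_decoder_free (or the vhmm_* equivalents), and the rest of the calling code would stay unchanged.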

The config file used by ahmm_decoder_init (vhmm_decoder_init) sets the decoding parameters, which are described below:

  REFS_PATH String, directory.
Directory where the HMM parameter files and reference files are placed.
  REC_SilenceModel String.
Transcript for silence model.
  REC_TeeModel String.
Transcript for tee model.
  REC_StatePruneBeam Float.
Beam width for state and model level pruning.
  REC_WordEndBeam Float.
Beam width for word end pruning.
  REC_OutputType String.
Output result file type.
  REC_NBest Integer.
Number of best hypotheses to keep (N-best).
  REC_NBestBeam Float.
Beam width for the N-best hypothesis search.
  REC_NToken Integer.
Size of the stack of tokens.
  REC_PhonePenalty Float.
The penalty value that is added to token probability after decoding of a phone.
  REC_MonoListFile String, file name.
Input monophone list file name (path relative to the REFS_PATH directory).
  REC_HMMMapFile String, file name.
Input HMM mapping file name (path relative to the REFS_PATH directory).
  REC_HMMParamFile String, file name.
Input HMM parameter file name (path relative to the REFS_PATH directory).
  REC_HMMPhysFile String, file name.
Input HMM physical file name (path relative to the REFS_PATH directory).
  PRELOAD_NETFILE0 String, file name.
Input file that contains the word lattice (path relative to the REFS_PATH directory).
  PRELOAD_NETNUM Integer.
Number of lattice networks to load.
  TRN_WordDictFile String, file name.
Input word dictionary file (path relative to the REFS_PATH directory).
  SEG_LEN Integer.
Cache segment length in the feature.
  REC_TokenLimit Integer.
If the number of tokens exceeds this value, the soft token number pruning is activated.
  REC_BeamFactor Float.
If the soft token number pruning is active at the beginning of each input frame, the value of REC_StatePruneBeam is multiplied by REC_BeamFactor. This feature helps to accelerate decoding of poor speech fragments.
  REC_CepstrumNum Integer.
CMS length for the feature.
  REC_ADAPTCMSWIN Integer.
Adaptive CMS window size (adaptCMSWinSize) for the feature.
  REC_OUTPUT_SCORE_FORMAT Integer.
Indicates whether to output the .score result file.

Examples of the config and reference files are included in the package.
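
The interaction between REC_StatePruneBeam, REC_TokenLimit and REC_BeamFactor can be illustrated with a short C fragment. This is a minimal sketch of one reading of the parameter descriptions above, not the decoder's actual code. It assumes that scores are log-domain values where higher is better, that REC_BeamFactor is a positive value (a value below 1 narrows the beam), and that the configured beam is scaled per frame rather than cumulatively. The function names are hypothetical.

```c
#include <assert.h>

/* Effective state-pruning beam for the current frame.
 * Assumption: while the number of active tokens exceeds token_limit (soft
 * token number pruning is active), the configured beam is multiplied by
 * beam_factor; otherwise the configured beam is used unchanged. */
float effective_beam(float state_prune_beam, float beam_factor,
                     int active_tokens, int token_limit)
{
    assert(beam_factor > 0.0f);
    if (active_tokens > token_limit)
        return state_prune_beam * beam_factor;
    return state_prune_beam;
}

/* Beam pruning test, assuming log-domain scores where higher is better:
 * a token survives if it lies within `beam` of the best score in the frame. */
int token_survives(float score, float best_score, float beam)
{
    return score >= best_score - beam;
}
```

For example, with a beam of 200, a factor of 0.5 and a token limit of 1000, a frame with 1500 active tokens would be pruned with a beam of 100, so fewer tokens survive and decoding of that frame takes less time.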

 


 

3.  Example

An example of HMM training is also included. Assuming the working directory is "C:\AvcsrDemo\bin\HmmTrain", copy the example directories "visualfeature" and "label" into it.

Directory "C:\AvcsrDemo\bin\HmmTrain" contains the following executable tools and reference files, which are described in this document:

"bldlogtlist", "clone", "emtrain.exe", "format.exe", "gaussmrg.exe", "gengv.exe", "maptlist.exe", "mixup.exe", "monotrain.exe", "savmodel.exe", "sbuild.exe", "smerge.exe": tools for HMM ASR training;

"tlist.align", "tlist_startend.align", "viseme_mono.list", "copyhmm.txt", "tlist_viseme.tree", "question_viseme.tri": reference files;

"hmm.cfg": config file for HMM ASR training;

"hmm_train.bat": batch file to run under MS-DOS.

Directory "visualfeature" holds the training data, and directory "label" holds the label files. Example feature and label files are provided in these directories.

After "hmm_train.bat" has run, a new directory named "result" is created in the working directory and contains the resulting model. The config file determines the subdirectories in which the model files are placed; refer to it for details. In this example, the final HMM parameter file "hmm.wwclust.param" is placed in "result\Model\output.clust_hmm.gm32.orig" relative to the current working directory, and "hmm.wwclust.map" is placed in "result\Model\output.clustering".

 


 

References

[Rab93]  L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice Hall PTR, ISBN 0-130-15157-2, 1993.

[Jel97]  F. Jelinek, Statistical Methods for Speech Recognition. The MIT Press, ISBN 0-262-10066-5, 1997.

[Lee96]  C. H. Lee, F. K. Soong, and K. K. Paliwal, Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers, ISBN 0-792-39706-1, 1996.
