Linear Discriminant Analysis
This application implements the linear discriminant analysis (LDA) described in VisualFeatures.htm. In the AVCSR system (see avcsr.htm) the application is used to obtain the viseme-based LDA space (training mode) and the projections onto it (projection mode). In both training and projection modes, the system upsamples the input data to 100 Hz and applies feature mean normalization.
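The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the code used by lda.exe: the source does not say how the upsampling is done or over which span the mean is computed, so linear interpolation and a per-file mean are assumptions, and the function names are hypothetical.

```python
def upsample_linear(frames, src_hz, dst_hz=100.0):
    """Resample equal-length vectors from src_hz to dst_hz.

    Assumption: linear interpolation between neighbouring frames.
    """
    if len(frames) < 2:
        return [list(f) for f in frames]
    duration = (len(frames) - 1) / src_hz       # time of the last source frame (s)
    n_out = int(round(duration * dst_hz)) + 1   # number of output frames
    out = []
    for k in range(n_out):
        pos = k * src_hz / dst_hz               # fractional source index
        i = min(int(pos), len(frames) - 2)      # left neighbour
        frac = pos - i
        a, b = frames[i], frames[i + 1]
        out.append([x + frac * (y - x) for x, y in zip(a, b)])
    return out


def mean_normalize(frames):
    """Subtract the mean vector (assumption: computed per file) from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

For 25 Hz video, each pair of neighbouring frames yields four output frames per input interval, so two input frames become five frames at 100 Hz.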
Training mode:
lda.exe -train <viseme file> <input data folder> <label folder> <LDA matrix file> <subspace dim>
<viseme file> Text file that lists all the visemes used. The visemes used in the current system, which are a subset of the viseme set described in [Neti, 00], are shown in the table below:
Viseme name | Description               | Corresponding phones
sil         | Silence                   | sil, sp
a           | Lip-rounding based vowels | ao, ah, aa, er, oy, aw, hh
u           | Lip-rounding based vowels | uw, uh, ow
e           | Lip-rounding based vowels | ae, eh, ey, ay
i           | Lip-rounding based vowels | ih, iy, ax
lr          | Alveolar-semivowels       | l, el, r, y
sz          | Alveolar-fricatives       | s, z
td          | Alveolar                  | t, d, n, en
szh         | Palato-alveolar           | sh, zh, ch, jh
pb          | Bilabial                  | p, b, m
tdh         | Dental                    | th, dh
fv          | Labio-dental              | f, v
kg          | Velar                     | ng, k, g, w
The file visemes.txt is the viseme file used for the LDA transform of the numeric string data; it lists 11 visemes.
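When preparing label files from phone-level transcriptions, the table can be turned into a lookup structure. The sketch below simply transcribes the table rows; the names VISEME_PHONES and PHONE_TO_VISEME are hypothetical and are not part of lda.exe.

```python
# Viseme name -> phones, copied row by row from the table above.
VISEME_PHONES = {
    "sil": ["sil", "sp"],
    "a":   ["ao", "ah", "aa", "er", "oy", "aw", "hh"],
    "u":   ["uw", "uh", "ow"],
    "e":   ["ae", "eh", "ey", "ay"],
    "i":   ["ih", "iy", "ax"],
    "lr":  ["l", "el", "r", "y"],
    "sz":  ["s", "z"],
    "td":  ["t", "d", "n", "en"],
    "szh": ["sh", "zh", "ch", "jh"],
    "pb":  ["p", "b", "m"],
    "tdh": ["th", "dh"],
    "fv":  ["f", "v"],
    "kg":  ["ng", "k", "g", "w"],
}

# Inverted index: phone -> viseme name.
PHONE_TO_VISEME = {
    phone: viseme
    for viseme, phones in VISEME_PHONES.items()
    for phone in phones
}
```

Each phone appears in exactly one row, so the inverted mapping is unambiguous.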
<input data folder> Folder that stores the input data files (in the current application, the input files contain the PCA projections described in PcaDemo.doc). Each input data file is a binary file with the following format:
[nSamples] [samplePeriod] [sampleSize] [paramKind] [data body], where
[nSamples] 4-byte integer that indicates the number of data vectors in the file.
[samplePeriod] 4-byte integer that indicates the sample period in 100 ns units, e.g. for 25 Hz video the period is 40 ms, so the value should be 400000.
[sampleSize] 2-byte integer that indicates the size of one data vector in bytes. For an n-dimensional vector, it should be n×sizeof(float).
[paramKind] 2-byte integer that indicates the type of the data file; it should be set to 99.
[data body] 4-byte float data, stored vector by vector. For m n-dimensional data vectors, there should be m×n float values.
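The layout above can be read and written with a few lines of Python. This is a sketch under stated assumptions: the source does not specify the byte order, so little-endian is assumed by default and exposed as a parameter (pass ">" for big-endian files); the function names are hypothetical.

```python
import struct


def write_data_file(path, vectors, sample_period, param_kind=99, byte_order="<"):
    """Write vectors using the 12-byte header followed by float32 data.

    Assumption: byte_order "<" (little-endian); the source does not state it.
    """
    dim = len(vectors[0])
    with open(path, "wb") as f:
        # nSamples (4 bytes), samplePeriod (4), sampleSize (2), paramKind (2)
        f.write(struct.pack(byte_order + "iihh",
                            len(vectors), sample_period, dim * 4, param_kind))
        for v in vectors:
            f.write(struct.pack(byte_order + "%df" % dim, *v))


def read_data_file(path, byte_order="<"):
    """Read a data file and return its header fields and vectors."""
    with open(path, "rb") as f:
        n, period, size, kind = struct.unpack(byte_order + "iihh", f.read(12))
        if size % 4 != 0:
            raise ValueError("sampleSize is not a multiple of sizeof(float)")
        dim = size // 4                      # vector dimension n
        vectors = [list(struct.unpack(byte_order + "%df" % dim, f.read(size)))
                   for _ in range(n)]
    return {"n_samples": n, "sample_period": period,
            "sample_size": size, "param_kind": kind, "vectors": vectors}
```

A file holding m vectors of dimension n is therefore 12 + 4×m×n bytes long, which gives a quick sanity check on the assumed layout.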
<label folder> Folder that stores the label files. Each label file corresponds to one input data file in <input data folder>. For example, the label file test01.vis corresponds to the data file test01.pmfc and describes the visemes and their time boundaries in text format, as shown in the figure below.
As shown in the figure, the first two numbers on each line are the start and end points of one viseme, followed by the viseme label. Note that the time boundaries are given in units of 10 milliseconds. For example, the line “48 70 sz” means that the viseme “sz” starts at 48×10 = 480 ms and ends before 70×10 = 700 ms.
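A label file in this format can be parsed as sketched below. Because the boundaries are in 10 ms units, they coincide with frame indices at the 100 Hz rate used by the system. The function names are hypothetical, and skipping malformed lines is an assumption about how to handle input the source does not describe.

```python
def parse_label_lines(lines):
    """Parse 'start end viseme' lines into (start, end, viseme) tuples."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                      # assumption: ignore blank or malformed lines
        segments.append((int(parts[0]), int(parts[1]), parts[2]))
    return segments


def segment_ms(segment):
    """Convert a segment's boundaries from 10 ms units to milliseconds."""
    start, end, viseme = segment
    return start * 10, end * 10, viseme


def frame_labels(segments, n_frames):
    """Assign a viseme to each 100 Hz frame; the end boundary is exclusive."""
    labels = [None] * n_frames
    for start, end, viseme in segments:
        for t in range(start, min(end, n_frames)):
            labels[t] = viseme
    return labels
```

For the example line, `segment_ms((48, 70, "sz"))` gives a start of 480 ms and an end of 700 ms, and frames 48 through 69 receive the label "sz".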
<LDA matrix file> File that stores the LDA training result; see the sample file test.lda.
<subspace dim> Integer that indicates the dimension of the LDA subspace. For the numeric string data, the recommended subspace dimension is 10.
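As background for the choice of subspace dimension, the textbook LDA criterion can be written as follows. This is the standard formulation, given here as a sketch; the exact variant implemented by lda.exe is the one described in VisualFeatures.htm and may differ in detail. The symbols (C classes, class c with sample set X_c, N_c samples and mean mu_c, global mean mu) are introduced here for illustration only.

```latex
S_w = \sum_{c=1}^{C} \sum_{x \in \mathcal{X}_c} (x - \mu_c)(x - \mu_c)^{T},
\qquad
S_b = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{T},
\qquad
S_b \, w_k = \lambda_k \, S_w \, w_k ,
\qquad
\operatorname{rank}(S_b) \le C - 1 .
```

The projection directions w_k are the generalized eigenvectors with the largest eigenvalues. Since at most C − 1 eigenvalues are non-zero, the 11 visemes in visemes.txt give at most 10 discriminative directions, which is consistent with the recommended subspace dimension of 10.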
Projection mode:
lda.exe -proj <LDA matrix file> <input data folder> <result file folder>
<LDA matrix file> Data file obtained in training mode. It contains the generalized eigenvectors of the LDA.
<input data folder> folder that stores the input data files (See the description of argument <input data folder> in training mode).
<result file folder> folder that stores the LDA projection files. Each file corresponds to one input data file, e.g. for input file test01.pmfc, the corresponding projection file is test01.mfcc. The format of the projection file is the same as the input data file. (See the description of argument <input data folder> in training mode.)
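The projection step itself amounts to one dot product per eigenvector and input vector. The sketch below works on in-memory lists only, because the layout of the LDA matrix file is not described here; the function names and the representation of the eigenvectors as a list of rows are assumptions for illustration.

```python
import os


def project(vectors, eigenvectors):
    """Project each n-dimensional vector onto k eigenvectors.

    eigenvectors: list of k lists, each of length n (assumed representation).
    Returns a list of k-dimensional vectors.
    """
    return [[sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in eigenvectors]
            for x in vectors]


def projection_file_name(input_name):
    """Map an input file name to its projection file name, e.g. .pmfc -> .mfcc."""
    base, _ext = os.path.splitext(input_name)
    return base + ".mfcc"
```

With a subspace dimension of k, each projected vector has k components, so the sampleSize field of the projection file would be k×sizeof(float).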
References
[Neti, 00] Neti C., Potamianos G., Luettin J., et al. “Audio-visual speech recognition”, Final Workshop 2000 Report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD (Oct. 12, 2000).