NAME
Julian - Grammar based continuous speech recognition parser.
SYNOPSIS
julian [-C jconffile] [options ...]
DESCRIPTION
Julian is a continuous speech recognition parser based on
finite state grammar. High precision recognition is
achieved using a two pass hierarchical search.
Julian can perform recognition on microphone input, audio
files, and feature parameter files. Also, as standard
format acoustic models and language models can be used,
these models can be changed to perform recognition under
various conditions.
The maximum vocabulary is 65,535 words.
Model Usage
Julian uses the following models.
Acoustic Models
Acoustic HMMs (Hidden Markov Models) are used.
Phoneme models (monophone), context dependent
phoneme models (triphone), tied-mixture and
phonetic tied-mixture models can be used. When
using context dependent models, interword
context is taken into consideration. Files
written in HTK's HMM definition language can be
used.
Language Model
For the task grammar, sentence structures are
written in a BNF style to a grammar file, using
word categories as terminal symbols. A voca file
is created containing the pronunciation (phoneme
sequence) of every word within each category.
These files are converted with mkdfa.pl(1) to a
deterministic finite automaton file (.dfa) and a
dictionary file (.dict).
Speech Input
It is possible to recognize live input from either a
microphone A-D or a DatLink (NetAudio) system. Speech
waveform files (16bit WAV (no compression), or RAW format)
and feature parameter files (HTK format) can be used.
Warning: Julian can only extract MFCC_E_D_N_Z features
internally. If it is necessary to use HMMs based on
another type of feature extraction then microphone input
and speech waveform files cannot be used. Use an external
tool such as wav2mfcc to create the appropriate feature
parameter files.
Search Algorithm
Recognition in Julian uses a two pass structure. In the
first pass a high-speed, approximate search is performed
using weaker constraints than the given grammar. Here
a LA beam search using only inter-category constraints
extracted from the grammar is performed.
Using the original grammar rules, the second pass
re-searches the results of the first pass, so a high
precision result is gained quickly. In the second pass
the optimal solution is guaranteed using the A* search.
When using a context dependent phoneme model (triphone),
interword contexts are considered on both
the first and second passes. For tied-mixture and phonetic
tied-mixture models, high speed acoustic likelihood
calculations using Gaussian pruning are performed.
OPTIONS
The options below allow you to set models used and system
parameters. You can set these options at the command line;
however, it is recommended that you combine these options
in the jconf settings file and use the "-C" option at
run time.
Below we give an explanation of each of the options.
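As a rough illustration of the synopsis, the following Python sketch assembles an argument list of the form "julian [-C jconffile] [options ...]". The helper name build_julian_argv and the file name sample.jconf are hypothetical. Only the option names are taken from this page, and nothing is assumed here about how command line options and jconf settings interact.

```python
def build_julian_argv(jconf=None, options=None):
    """Assemble an argument list following `julian [-C jconffile] [options ...]`.

    `options` maps an option name to its value, or to None for options
    that take no value (for example -quiet).
    """
    argv = ["julian"]
    if jconf is not None:
        argv += ["-C", jconf]
    for name, value in (options or {}).items():
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    return argv


# Hypothetical file name, used for illustration only.
example = build_julian_argv("sample.jconf", {"-input": "rawfile", "-quiet": None})
print(example)
```

The resulting list could be passed to a process launcher such as subprocess.run on a system where julian is installed.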
Speech Input
-input {rawfile|mfcfile|mic|netaudio|adinserv}
Select the speech wave data input source.
(default: mfcfile)
For information on file formats refer to the Julius
documentation.
-NA server:unit
When using (-input netaudio) set the server name
and unit ID of the DatLink unit to connect to.
-filelist file
With (-input rawfile|mfcfile) perform
recognition on all files contained within the target
filelist.
-adport portnum
With (-input adinserv), the A-D server port number.
Speech segmentation
-pausesegment
-nopausesegment
Force speech segmentation (segment detection) ON / OFF.
(For mic, adinnet default = ON. For files, default = OFF)
-lv threslevel
Amplitude threshold (0 - 32767). When the amplitude
rises above this threshold it is taken as the
beginning of the speech segment; when it drops below
this level it is taken as the end of the speech segment.
(default: 3000)
-headmargin msec
Margin at the start of the speech segment (msec).
(default: 300)
-tailmargin msec
Margin at the end of the speech segment (msec).
(default: 400)
-zc zerocrossnum
Zero crossing threshold. (default: 60)
-nostrip
Do not remove invalid "0" samples. Depending on the
sound device, invalid "0" samples may be recorded at the
start and end of recording. The default is to remove
them automatically.
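The interplay of -lv, -headmargin and -tailmargin can be sketched in Python as follows. This is a simplified illustration and not Julian's actual detector: it ignores the zero crossing count (-zc), treats everything between the first and last sample above the threshold as one segment, and the function name detect_segment is hypothetical.

```python
def detect_segment(samples, sample_rate=16000, level=3000,
                   head_ms=300, tail_ms=400):
    """Return (start, end) sample indices of the detected speech, or None.

    Simplification: only the amplitude threshold is used, and the span
    from the first to the last loud sample is treated as one segment.
    """
    head = sample_rate * head_ms // 1000   # margin before the segment
    tail = sample_rate * tail_ms // 1000   # margin after the segment
    loud = [i for i, s in enumerate(samples) if abs(s) > level]
    if not loud:
        return None
    start = max(0, loud[0] - head)
    end = min(len(samples), loud[-1] + 1 + tail)
    return start, end


# 0.5 s of silence, 0.1 s of loud signal, 0.5 s of silence at 16 kHz.
signal = [0] * 8000 + [5000] * 1600 + [0] * 8000
print(detect_segment(signal))
```

The margins widen the detected span so that quiet onsets and trailing sounds around the loud region are not cut off.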
Acoustic Analysis
-smpFreq frequency
Sampling frequency (Hz).
(default: 16000 = period of 625 in 100 ns units).
-smpPeriod period
Sampling period (in units of 100 ns).
(default: 625 = 16 kHz).
-fsize sample
Analysis window size (No. samples).
(default: 400, 25 ms)
-fshift sample
Frame shift (No. samples). (default: 160, 10 ms)
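The default values above are related by simple arithmetic, sketched here in Python. The sketch assumes the sampling period is expressed in HTK-style units of 100 ns, under which a period of 625 corresponds to 16 kHz. The function names are hypothetical.

```python
def period_to_freq(period_100ns):
    """Convert a sampling period in 100 ns units to a frequency in Hz."""
    return 1e7 / period_100ns


def samples_to_ms(num_samples, freq_hz):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * num_samples / freq_hz


freq = period_to_freq(625)
print(freq)                      # sampling frequency for the default period
print(samples_to_ms(400, freq))  # default -fsize in milliseconds
print(samples_to_ms(160, freq))  # default -fshift in milliseconds
```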
-hipass frequency
Highpass filter cutoff frequency (Hz).
(default: -1 = disable)
-lopass frequency
Lowpass filter cutoff frequency (Hz).
(default: -1 = disable)
Language Model (BNF type Grammar)
-dfa dfa_filename
Select the finite state automaton grammar file
(.dfa) to use. (Required)
-penalty1 float
First pass word insertion penalty. (default: 0.0)
-penalty2 float
Second pass word insertion penalty.
(default: 0.0)
Recognition Dictionary
-v dictionary_file
Recognition Dictionary File (Required).
-silhead {WORD|WORD[OUTSYM]|#num}
-siltail {WORD|WORD[OUTSYM]|#num}
Sentence start and end silence as defined in the
word dictionary.
(default: "<s>" / "</s>")
These are dealt with specially during recognition to
hypothesise start and end points (margins). They can
be defined as shown below.
Example
Word_name <s>
Word_name[output_symbol] <s>[silB]
#Word_ID #14
(Word_ID is the word position in the dictionary file
(order) starting from 0)
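The three forms accepted by -silhead and -siltail can be told apart as in the Python sketch below. The function name parse_word_spec and the returned dictionary layout are hypothetical, and the sketch says nothing about how Julian itself parses the argument.

```python
import re


def parse_word_spec(spec):
    """Classify a -silhead/-siltail argument into one of the three forms."""
    if spec.startswith("#"):
        # "#num": word position in the dictionary file, counted from 0.
        return {"word_id": int(spec[1:])}
    match = re.fullmatch(r"(.+)\[(.*)\]", spec)
    if match:
        # "WORD[OUTSYM]": word name plus output symbol.
        return {"word": match.group(1), "output": match.group(2)}
    # "WORD": word name only.
    return {"word": spec}


for spec in ("<s>", "<s>[silB]", "#14"):
    print(spec, parse_word_spec(spec))
```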
-forcedict
Disregard dictionary errors.
(Skip word definitions with errors)
Acoustic Model (HMM)
-h hmmfilename
The name of the HMM definition file to use.
(Required)
-hlist HMMlistfilename
HMMList filename. Required when using triphone
based HMMs. Details are contained in the Julius
documentation.
This file provides a mapping between the logical
triphone names generated from the phonetic
representation in the dictionary and the HMM
definition names.
-force_ccd / -no_ccd
When using a triphone acoustic model these options
control interword context dependency. If neither of
these options is set then the use of interword
context dependency will be determined from the
model's definition names.
If the "-force_ccd" option is set when using
something other than a triphone model, there is no
guarantee that Julian will run.
-notypecheck
Do not check the input parameter type.
(default: Perform the check)
-iwcd1 {max|avg}
When using a triphone acoustic model set the
interword acoustic likelihood calculation method
used in the first pass.
max: The maximum same context triphone value (default)
avg: The average same context triphone value
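The difference between the two -iwcd1 methods can be sketched in Python as follows. The sketch assumes that a log likelihood has already been computed for each triphone sharing the same context, and that "avg" is a plain arithmetic mean of those values. Both points are assumptions made for illustration, and the function name is hypothetical.

```python
def interword_likelihood(candidate_scores, method="max"):
    """Combine the scores of same-context triphones into a single value."""
    if method == "max":
        return max(candidate_scores)
    if method == "avg":
        return sum(candidate_scores) / len(candidate_scores)
    raise ValueError("method must be 'max' or 'avg'")


scores = [-10.0, -12.0, -14.0]   # illustrative log likelihoods
print(interword_likelihood(scores, "max"))
print(interword_likelihood(scores, "avg"))
```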
Options for tied-mixture and PTM acoustic models
-tmix K
With Gaussian pruning, calculate only the top
K Gaussian densities per codebook. (default: 2)
-gprune {safe|heuristic|beam|none}
Set the Gaussian pruning technique to use.
(default: safe (standard), beam (high-speed))
-gshmm hmmdefs
Set the Gaussian Mixture Selection monophone
model to use. A GMS monophone model is generated
from an ordinary monophone HMM model using the
attached program mkgshmm(1).
(default : none (do not use GMS))
-gsnum N
When using GMS, only perform triphone calculations
for the top N monophone states. (default: 24)
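The effect of the -tmix value can be sketched in Python: only the K best Gaussians of a codebook contribute to the mixture likelihood. The sketch takes the per-Gaussian log densities as given and only shows the selection step. A real pruning method such as those selected by -gprune would avoid fully evaluating the discarded densities in the first place. The function name is hypothetical.

```python
import heapq
import math


def pruned_mixture_loglik(log_densities, log_weights, k=2):
    """Approximate log(sum_i w_i * N_i(x)) using only the k best Gaussians."""
    top = heapq.nlargest(k, range(len(log_densities)),
                         key=lambda i: log_densities[i])
    terms = [log_weights[i] + log_densities[i] for i in top]
    peak = max(terms)
    # Log-sum-exp over the surviving terms, shifted for numerical stability.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


densities = [-1.0, -5.0, -2.0]   # illustrative log densities for one frame
weights = [0.0, 0.0, 0.0]        # log weights, all set to log(1) for simplicity
print(pruned_mixture_loglik(densities, weights, k=2))
```

Because the dropped terms are the smallest ones, the pruned value is never larger than the full sum and is usually close to it.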
Search Parameters (First Pass)
-b beam_width
Beam width (Number of HMM nodes).
As this value increases the precision also increases;
however, processing time and memory usage also
increase.
default values: Model dependent,
400 (monophone)
800 (triphone,PTM)
1000 (triphone,PTM,engine=v2.1)
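A beam of fixed width measured in HMM nodes can be sketched in Python as keeping only the best scoring nodes at each frame. This is a generic rank-based beam written for illustration, with hypothetical names, and is not taken from Julian's source.

```python
import heapq


def prune_to_beam(node_scores, beam_width):
    """Keep the beam_width best-scoring nodes.

    node_scores maps a node identifier to its log score at this frame.
    """
    best = heapq.nlargest(beam_width, node_scores.items(),
                          key=lambda item: item[1])
    return dict(best)


frame_scores = {"a": -1.0, "b": -3.0, "c": -2.0}
print(prune_to_beam(frame_scores, 2))
```

A wider beam keeps more nodes alive, which is why precision, processing time and memory usage all grow with the value of -b.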
-1pass
Only perform the first pass search. This mode is
automatically set when no 3-gram language model
has been specified (-nlr).
-realtime
-norealtime
Explicitly state whether real time processing will be
used in the first pass or not. For file input the
default is OFF (-norealtime), for microphone, or
NetAudio network input the default is ON
(-realtime). This option relates to the way CMN is
performed: when OFF, CMN is calculated for each
input independently; when the realtime option is ON,
the previous 5 seconds of input are always used.
Refer to -progout.
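The two CMN (cepstral mean normalization) modes described for -realtime can be sketched in Python as follows. The sketch assumes a 10 ms frame shift, so that 5 seconds correspond to 500 frames, and assumes that in real time mode the mean is taken once from the preceding input. Whether Julian updates the mean during an input is not stated here. All names are hypothetical.

```python
def _mean(frames):
    """Per-dimension mean of a list of feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cmn_offline(frames):
    """Subtract the mean of the whole input from every frame."""
    mean = _mean(frames)
    return [[x - m for x, m in zip(f, mean)] for f in frames]


def cmn_realtime(frames, history, window=500):
    """Subtract the mean of the most recent `window` past frames."""
    past = history[-window:]
    if not past:
        return [list(f) for f in frames]   # nothing to normalize against yet
    mean = _mean(past)
    return [[x - m for x, m in zip(f, mean)] for f in frames]


print(cmn_offline([[1.0, 2.0], [3.0, 4.0]]))
print(cmn_realtime([[2.0]], history=[[1.0], [3.0]]))
```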
Search Parameters (Second Pass)
-b2 hyponum
Hypothesis envelope width. This number of hypotheses
are expanded (sorted by length); shorter hypotheses are
not expanded. This prevents search failures. (default: 30)
-n candidate_num
The search continues until "candidate_num" sentence
hypotheses have been found. These hypotheses are
re-sorted by score and the final result is displayed.
(Refer to the "-output" option). As Julian does not
strictly guarantee an optimal second pass search,
the maximum likelihood candidate is not always
given first.
As this value is increased the probability that the
maximum likelihood hypothesis is returned increases,
but as a prolonged search must be performed, the
processing time also becomes large. (default: 1)
default value is dependent on the recognition engine
settings ("--enable-setup= ").
10 (standard)
1 (fast,v2.1)
-output N
Used with the "-n" option above. Output the top N
sentence hypotheses. (default: 1)
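The relationship between -n and -output can be sketched in Python: the search collects hypotheses in the order they are found, re-sorts them by score, and prints the top N. The data and names below are hypothetical and higher scores are assumed to be better.

```python
def final_results(found, output_n=1):
    """Re-sort found hypotheses by score (best first) and keep the top N.

    found is a list of (sentence, score) pairs in the order the search
    produced them, which is not necessarily best first.
    """
    ranked = sorted(found, key=lambda hyp: hyp[1], reverse=True)
    return ranked[:output_n]


# Illustrative hypotheses as they might be found with -n 3.
found = [("a", -120.0), ("b", -100.0), ("c", -110.0)]
print(final_results(found, output_n=2))
```

In this example the first hypothesis found is not the best one, which is why a larger -n value raises the chance that the maximum likelihood hypothesis is among the results.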
-sb score
Score envelope width. For each frame, do not scan
areas that deviate from the highest score by more
than this envelope. This directly relates to the speed
of the second pass acoustic likelihood calculations.
(default: 80.0)
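A score envelope differs from a fixed-width beam in that it prunes by distance from the best score and not by rank. A minimal Python sketch, with hypothetical names and with the per-frame scores taken as given:

```python
def within_envelope(scores, envelope=80.0):
    """Keep entries whose score is within `envelope` of the best score."""
    best = max(scores.values())
    return {key: s for key, s in scores.items() if best - s <= envelope}


frame_scores = {"x": -100.0, "y": -150.0, "z": -200.0}
print(within_envelope(frame_scores, envelope=80.0))
```

A narrower envelope discards more candidates per frame, which reduces the number of acoustic likelihood calculations in the second pass.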
-s stack_size
The maximum number of hypotheses that can be stored
on the stack during the search. A larger value gives more
stable results, but increases the amount of memory
required. (default: 500)
-m overflow_pop_times
Number of expanded hypotheses required to
discontinue the search. If the number of expanded
hypotheses is greater than this threshold, the search
is discontinued at that point. The larger this
value is, the longer the search will continue, but
processing time for search failures will also
increase. (default: 2000)
-lookuprange nframe
When performing word expansion, this option sets
the number of frames before and after in which to consider
word expansion. This prevents the omission of short
words but, with a large value, the number of hypotheses
expanded increases and the system slows down. (default: 5)
Forced alignment
-walign
Return the result of Viterbi alignment of the word
units from the recognition results.
-palign
Return the result of Viterbi alignment of the
phoneme units from the recognition results.
Message Output
-quiet Omit phoneme sequence and score, only output
the best word sequence hypothesis.
-progout
Gradually output the interim results from the
first pass at regular intervals.
-proginterval msec
Set the -progout output time interval (msec).
-demo The same as "-progout -quiet".
Other
-debug Display debug information.
-C jconffile
Load the jconf settings file. Here runtime options
can be loaded that are set in this file.
-version
Display program name, compile time, and compile
time options.
-help
Display a brief overview of options.
EXAMPLES
For examples of system usage refer to the Julian documentation.
SEE ALSO
mkbingram(1), adinrec(1), adintool(1), mkdfa(1),
mkgshmm(1), wav2mfcc(1)
DIAGNOSTICS
On exiting normally, Julian will return the exit status
0. If an error is found then Julian exits abnormally, and the
exit status 1 is returned.
If an input file cannot be found or cannot be loaded for
some reason then Julian will skip processing for that file.
BUGS
There are a number of restrictions on the type and size of the
models Julian can use. For a detailed explanation refer
to the Julian and Julius documentation.
For bug reports, inquiries and comments please contact
julius@kuis.kyoto-u.ac.jp
AUTHORS
Rev.1.0 (1998/07/20)
Designed by Tatsuya Kawahara and Akinobu Lee
(Kyoto University)
Rev.2.0 (1999/02/20)
Rev.2.1 (1999/04/20)
Rev.2.2 (1999/10/04)
Rev.3.1 (2000/05/11)
Development by Akinobu Lee (Kyoto University)
Rev.3.2 (2001/08/15)
Development mainly by Akinobu Lee
(Nara Institute of Science and Technology)
THANKS TO
Up to Rev.3.1 this program was released under the speech
media laboratory, Kyoto University (Doshiya Lab). From
Rev.3.2 Julian has been integrated with Julius and released
under the "Information Processing Society, Continuous
Speech Recognition Consortium".
The Windows Microsoft Speech API compatible version was
developed by Takashi Sumiyoshi (Kyoto University).
I am very grateful to all those that provided me with
timely advice, comments and guidance.
Last modified: 2001/11/16 07:27:14