NAME
Julius - Japanese LVCSR engine
SYNOPSIS
julius [-C jconffile] [options ...]
DESCRIPTION
Julius is an open source speech recognition engine that can
perform continuous speech recognition with a vocabulary in
the tens of thousands of words. High precision recognition
can be obtained using a 3-gram based two pass search
technique.
Julius can perform recognition on microphone input, audio
files, and feature parameter files. Because acoustic models
and language models in standard formats can be used,
the models can be swapped to perform recognition under
various conditions.
The maximum vocabulary is 65,535 words.
Model Usage
Julius uses the following models.
Acoustic Models
Acoustic HMMs (Hidden Markov Models) are used.
Phoneme models (monophone), context dependent
phoneme models (triphone), tied-mixture and
phonetic tied-mixture models can be used. When
using context dependent models, interword
context is taken into consideration. Files
written in HTK's HMM definition language can be
used.
Language Model
The system uses 2-gram and reverse 3-gram
language models. Standard format ARPA files can
be loaded. Binary format N-gram models built
using the attached tool mkbingram can also be
used.
Speech Input
It is possible to recognize live input from either a
microphone A-D or a DatLink (NetAudio) system. Speech
waveform files (16bit WAV (no compression), or RAW format)
and feature parameter files (HTK format) can be used.
Warning: Julius can only extract MFCC_E_D_N_Z features
internally. If it is necessary to use HMMs based on
another type of feature extraction then microphone input
and speech waveform files cannot be used. Use an external
tool such as wav2mfcc to create the appropriate feature
parameter files.
Search Algorithm
Julius recognition is based on a two pass strategy. On the
first pass the entire input is processed and an interim
result is displayed. The model used in this pass is a word
2-gram and a word HMM tree structured network. Decoding
is performed by a frame synchronous beam search.
The second pass searches using a reverse 3-gram in an
attempt to obtain a higher precision recognition result.
Word unit stack decoding is performed using the
restrictions from interim results of the first pass and
look-ahead information.
When using context dependent phones (triphone), interword
contexts are taken into consideration. For tied-mixture
and phonetic tied-mixture models, high-speed acoustic
likelihood calculation is possible using Gaussian pruning.
OPTIONS
The options below allow you to select the models used and
set system parameters. You can set these options on the
command line; however, it is recommended that you combine
these options in the jconf settings file and use the "-C"
option at run time.
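As an illustration of combining "-C" with individual options, the following Python sketch assembles a julius command line. The helper name julius_command and the file names main.jconf and words.dict are hypothetical; only option names documented on this page are used, and nothing is assumed about how Julius resolves an option that appears both in the jconf file and on the command line.

```python
import shlex


def julius_command(jconf=None, **options):
    """Build an argv list for julius (hypothetical helper).

    Keyword names are option names without the leading "-".
    A value of True marks a flag that takes no argument.
    """
    argv = ["julius"]
    if jconf is not None:
        argv += ["-C", jconf]
    for name, value in options.items():
        argv.append("-" + name)
        if value is not True:
            argv.append(str(value))
    return argv


# File names below are placeholders, not files shipped with Julius.
cmd = julius_command(jconf="main.jconf", input="rawfile", v="words.dict")
print(shlex.join(cmd))
```

The resulting list can be handed to a process launcher unchanged; building a list rather than a single string avoids quoting problems with file names that contain spaces.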
Below is an explanation of all the possible options.
Speech Input
-input {rawfile|mfcfile|mic|netaudio|adinserv}
Select the speech wave data input source.
(default: mfcfile)
For information on file formats refer to the Julius
documentation.
-NA server:unit
When using (-input netaudio) set the server name
and unit ID of the DatLink unit to connect to.
-filelist file
With (-input rawfile|mfcfile) perform
recognition on all files listed in the target
filelist.
-adport portnum
With (-input adinserv), the A-D server port number.
Speech segmentation
-pausesegment
-nopausesegment
Force speech segmentation (segment detection) ON / OFF.
(For mic, adinnet default = ON. For files, default = OFF)
-lv threslevel
Amplitude threshold (0 - 32767). When the amplitude
rises above this threshold it is treated as the
beginning of a speech segment; when it drops below
this level it is treated as the end of the segment.
(default: 3000)
-headmargin msec
Margin at the start of the speech segment (msec).
(default: 300)
-tailmargin msec
Margin at the end of the speech segment (msec).
(default: 400)
-zc zerocrossnum
Zero-crossing threshold. (default: 60)
-nostrip
Do not strip invalid "0" samples. Depending on the
sound device, such samples may appear at the start and
end of a recording. The default is to remove them automatically.
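The interaction of -lv, -headmargin and -tailmargin can be sketched in Python as follows. This is a simplified illustration, not the Julius implementation: it ignores the zero-crossing criterion (-zc), and it assumes that a segment ends once the signal has stayed at or below the threshold for the whole tail margin. The function name find_segment is hypothetical.

```python
def find_segment(samples, thres=3000, head_ms=300, tail_ms=400, rate=16000):
    """Return (start, end) sample indices of the first segment, or None.

    Simplified sketch: level threshold plus margins only.
    """
    head = head_ms * rate // 1000   # head margin in samples
    tail = tail_ms * rate // 1000   # tail margin in samples
    start = None
    last_loud = None
    for i, s in enumerate(samples):
        if abs(s) > thres:
            if start is None:
                # Segment begins head margin before the first loud sample.
                start = max(0, i - head)
            last_loud = i
        elif start is not None and i - last_loud >= tail:
            # Quiet for the whole tail margin: close the segment here.
            return start, i + 1
    if start is None:
        return None
    # Input ended before the tail margin elapsed.
    return start, min(len(samples), last_loud + 1 + tail)
```

With the defaults and 16 kHz input, the margins correspond to 4800 and 6400 samples respectively.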
Acoustic Analysis
-smpFreq frequency
Sampling frequency (Hz).
(default: 16000 Hz, i.e. a period of 625 x 100 ns)
-smpPeriod period
Sampling period (in units of 100 ns).
(default: 625 = 16 kHz)
-fsize sample
Analysis window size (No. samples).
(default: 400, 25 ms)
-fshift sample
Frame shift (No. samples). (default: 160, 10 ms)
-hipass frequency
Highpass filter cutoff frequency (Hz).
(default: -1 = disable)
-lopass frequency
Lowpass filter cutoff frequency (Hz)
(default: -1 = disable)
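The relationship between the default values above can be checked with a short Python sketch. It assumes that the sampling period is expressed in HTK-style units of 100 ns, which is the reading under which a value of 625 corresponds to 16 kHz. The function names are hypothetical.

```python
def period_to_hz(period_100ns):
    """Convert a sampling period in units of 100 ns to a frequency in Hz."""
    return 1e7 / period_100ns


def samples_to_ms(num_samples, hz):
    """Convert a length in samples to milliseconds at the given rate."""
    return 1000.0 * num_samples / hz


hz = period_to_hz(625)
print(hz)                      # sampling frequency for the default period
print(samples_to_ms(400, hz))  # default -fsize expressed in ms
print(samples_to_ms(160, hz))  # default -fshift expressed in ms
```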
Language Model(N-gram)
-nlr 2gram_filename
2-gram language model filename (standard ARPA format)
-nrl rev_3gram_filename
Reverse 3-gram language model filename. This is
required for the second search pass. If this is
not defined then only the first pass will take
place.
-d bingram_filename
Use a binary language model built with
mkbingram(1). This is used in place of the "-nlr"
and "-nrl" options above, and allows Julius to
perform initialization quickly.
-lmp lm_weight lm_penalty
-lmp2 lm_weight2 lm_penalty2
Language model score weights and word insertion
penalties for the first and second passes respectively.
The word score of a hypothesis is an N-gram
log-likelihood, scaled as shown below:
lm_score1 = lm_weight * 2-gram_score + lm_penalty
lm_score2 = lm_weight2 * 3-gram_score + lm_penalty2
The default values depend on the acoustic model and
engine setup (weight/penalty):
First-Pass | Second-Pass
--------------------------
5.0/-1.0 | 6.0/0.0 (monophone)
8.0/-2.0 | 8.0/-2.0 (triphone,PTM)
9.0/8.0 | 11.0/-2.0 (triphone,PTM,engine=v2.1)
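The scaling applied by -lmp and -lmp2 can be written out as a small Python sketch. The function name and the example log-likelihood of -2.0 are illustrative only; the weights and penalties are the triphone/PTM defaults from the table above.

```python
def scaled_lm_score(ngram_loglik, weight, penalty):
    """Scale an N-gram log-likelihood by a weight and add the insertion penalty."""
    return weight * ngram_loglik + penalty


# Triphone/PTM defaults: 8.0/-2.0 on both passes.
first_pass = scaled_lm_score(-2.0, 8.0, -2.0)    # uses the 2-gram score
second_pass = scaled_lm_score(-2.0, 8.0, -2.0)   # uses the 3-gram score
print(first_pass, second_pass)
```

Because the penalty is added once per word, a more negative penalty discourages hypotheses with many short words, while a larger weight makes the language model count for more relative to the acoustic score.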
-transp float
Insertion penalty for [transparent words].
(default: 0.0)
Word Dictionary
-v dictionary_file
Word Dictionary File (Required)
-silhead {WORD|WORD[OUTSYM]|#num}
-siltail {WORD|WORD[OUTSYM]|#num}
Sentence start and end silence as defined in the
word dictionary.
(default: "<s>" / "</s>")
These are handled specially during recognition as the
start and end points (margins) of hypotheses. They can
be specified in any of the forms shown below.
Form                       Example
Word_name                  <s>
Word_name[output_symbol]   <s>[silB]
#Word_ID                   #14
(Word_ID is the position of the word in the dictionary
file, counting from 0)
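The three accepted forms can be told apart as in the following Python sketch. It only illustrates the syntax described above; the function name parse_sil_spec and the returned dictionary layout are hypothetical, and the way Julius itself parses these arguments is not shown here.

```python
import re


def parse_sil_spec(spec):
    """Classify a -silhead / -siltail argument (illustrative only)."""
    if spec.startswith("#"):
        # "#num": position of the word in the dictionary, counting from 0.
        return {"word_id": int(spec[1:])}
    match = re.fullmatch(r"(.+?)\[(.*)\]", spec)
    if match:
        # "WORD[OUTSYM]": word name plus output symbol.
        return {"word": match.group(1), "outsym": match.group(2)}
    # Plain "WORD".
    return {"word": spec}
```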
-forcedict
Disregard dictionary errors.
(Skip word definitions with errors)
Acoustic Model(HMM)
-h hmmfilename
The name of the HMM definition file to use.
(Required)
-hlist HMMlistfilename
HMMList filename. Required when using triphone
based HMMs. Details are contained in the Julius
documentation.
This file provides a mapping between the logical
triphone names generated from the phonetic
representation in the dictionary and the HMM
definition names.
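The idea of mapping logical triphone names to defined HMM names can be sketched in Python. The sketch assumes the HTK-style "L-P+R" naming for triphones, which this page does not spell out, and it covers word-internal contexts only. The mapping entries are invented for illustration and do not come from any real HMMList file.

```python
def word_internal_triphones(phones):
    """Generate logical triphone names (assumed L-P+R notation) for one word."""
    names = []
    for i, phone in enumerate(phones):
        name = phone
        if i > 0:
            name = phones[i - 1] + "-" + name      # left context
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]      # right context
        names.append(name)
    return names


# Hypothetical mapping: logical name -> name defined in the HMM file.
hmm_map = {"k+a": "k+a", "k-a+i": "k-a+i", "a-i": "o-i"}


def resolve(phones, mapping):
    """Look up the defined HMM name for each logical triphone of a word."""
    return [mapping[name] for name in word_internal_triphones(phones)]
```

In this toy mapping the logical triphone "a-i" has no model of its own and shares the one defined as "o-i", which is the kind of substitution such a list makes possible.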
-force_ccd / -no_ccd
When using a triphone acoustic model these options
control interword context dependency. If neither of
these options is set then the use of interword
context dependency will be determined from the
model's definition names.
If the "-force_ccd" option is set when using
something other than a triphone model, there is no
guarantee that Julius will run.
-notypecheck
Do not check the input parameter type.
(default: Perform the check)
-iwcd1 {max|avg}
When using a triphone acoustic model set the
interword acoustic likelihood calculation method
used in the first pass.
max: The maximum, identical context triphone value (default)
avg: The average, identical context triphone value
Options for tied-mixture and PTM acoustic models
-tmix K
When performing Gaussian pruning, only calculate the top
K Gaussian densities per codebook. (default: 2)
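The effect of -tmix on a mixture score can be illustrated with the Python sketch below. It shows only the outcome, that the K best Gaussians of a codebook contribute to the likelihood; it does not reproduce the pruning itself, whose purpose is to avoid fully evaluating the discarded Gaussians. The function name and the input values are hypothetical.

```python
import heapq
import math


def topk_mixture_loglik(weighted_logliks, k=2):
    """Log-likelihood of a mixture using only its k best components.

    weighted_logliks: log(weight * density) for each Gaussian in the codebook.
    """
    if not weighted_logliks:
        raise ValueError("codebook has no Gaussians")
    top = heapq.nlargest(k, weighted_logliks)
    peak = top[0]
    # Log-sum-exp over the retained components, stabilised by the peak value.
    return peak + math.log(sum(math.exp(x - peak) for x in top))


scores = [-1.0, -50.0, -1.0, -60.0]
print(topk_mixture_loglik(scores, k=2))
print(topk_mixture_loglik(scores, k=4))
```

When the discarded components have much lower likelihoods than the retained ones, as in this example, the two results are almost identical, which is why a small K can be used.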
-gprune {safe|heuristic|beam|none}
Set the Gaussian pruning technique to use.
(default: safe (standard setup), beam (high-speed setup))
-gshmm hmmdefs
Set the Gaussian Mixture Selection monophone acoustic
model to use. A GMS monophone model is generated
from an ordinary monophone HMM model using the
attached program mkgshmm(1).
(default : none (do not use GMS))
-gsnum N
When using GMS, only perform triphone calculations
for the top N monophone states. (default: 24)
Short pause segmentation
-spdur Set the sp threshold length for use in the first
pass (number of frames). If the number of frames in
which the sp "unit" has the maximum likelihood is greater
than this threshold, interrupt the first pass
and start the second pass. (default: 10)
By default short pause segmentation is not used. To
enable it, pass the "--enable-sp-segment" option to
configure at build time.
(For details refer to the Julius documentation)
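The -spdur test can be sketched in Python as follows. The sketch assumes that the frames in question must be consecutive, which this page does not state explicitly, and it takes as input the best-scoring unit of each frame. The function name first_pause_cut is hypothetical.

```python
def first_pause_cut(best_units, spdur=10, sp="sp"):
    """Return the frame index at which the first pass would be interrupted.

    best_units: the maximum likelihood unit for each frame.
    Returns None if no sufficiently long run of sp frames occurs.
    """
    run = 0
    for t, unit in enumerate(best_units):
        run = run + 1 if unit == sp else 0
        if run > spdur:
            return t
    return None
```

With the default frame shift of 10 ms, the default threshold of 10 frames corresponds to a pause of a little over 100 ms.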
Search Parameters (First Pass)
-b beam_width
Beam width (Number of HMM nodes).
As this value increases the precision also increases;
however, processing time and memory usage also
increase.
Default values are model dependent:
400 (monophone)
800 (triphone,PTM)
1000 (triphone,PTM,engine=v2.1)
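The role of the beam width in a frame synchronous search can be illustrated with a toy Python sketch. It is not the Julius decoder: the network is a plain dictionary of transitions instead of a tree structured lexicon, there is no language model, and all node names and scores are invented.

```python
def beam_search(frames, transitions, start, beam_width):
    """Toy frame synchronous Viterbi search with beam pruning.

    frames: list of dicts, node -> emission log score for that frame.
    transitions: dict, node -> list of (next_node, transition log score).
    Returns the surviving nodes and their scores after the last frame.
    """
    active = {start: 0.0}
    for emission in frames:
        expanded = {}
        for node, score in active.items():
            for succ, logp in transitions.get(node, []):
                cand = score + logp + emission.get(succ, float("-inf"))
                # Viterbi: keep only the best path into each node.
                if cand > expanded.get(succ, float("-inf")):
                    expanded[succ] = cand
        # Beam pruning: keep only the beam_width best nodes for the next frame.
        best = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)
        active = dict(best[:beam_width])
    return active
```

A narrow beam discards nodes that score poorly early on even if they would have led to the best path later, which is the precision cost mentioned above; a wide beam keeps more nodes alive per frame and therefore costs more time and memory.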
-sepnum N
(Used with the configure option "--enable-lowmem2")
Number of high frequency words to separate from the
dictionary tree. (default: 150)
-1pass
Only perform the first pass search. This mode is
automatically set when no 3-gram language model
has been specified (-nrl).
-realtime
-norealtime
Explicitly state whether real time processing will be
used in the first pass or not. For file input the
default is OFF (-norealtime); for microphone or
NetAudio network input the default is ON
(-realtime). This option relates to the way CMN is
performed: when OFF, CMN is calculated for each
input independently; when the realtime option is ON,
the previous 5 seconds of input are always used.
See also -progout.
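The two CMN modes can be contrasted with a minimal Python sketch. It assumes that CMN means subtracting a per-dimension mean from each feature vector, and that 5 seconds corresponds to 500 frames at the default 10 ms frame shift. The function names, and the choice to leave frames unchanged when no history exists yet, are assumptions of the sketch rather than documented Julius behaviour.

```python
def mean_vector(frames):
    """Per-dimension mean of a non-empty list of feature vectors."""
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]


def subtract(frames, mean):
    """Subtract the mean vector from every frame."""
    return [[x - m for x, m in zip(f, mean)] for f in frames]


def cmn_per_input(frames):
    """-norealtime style: the mean is taken over the current input itself."""
    return subtract(frames, mean_vector(frames))


def cmn_realtime(frames, history, window=500):
    """-realtime style: the mean is taken over the most recent past frames."""
    recent = history[-window:]
    if not recent:
        # Assumption: without any history, leave the frames unchanged.
        return [list(f) for f in frames]
    return subtract(frames, mean_vector(recent))
```

The per-input variant needs the whole input before it can normalize the first frame, which is why it does not suit live input.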
Search Parameters (Second Pass)
-b2 hyponum
Hypothesis envelope width. This number of hypotheses
are expanded (sorted by length); shorter hypotheses are
not expanded. This prevents search failures. (default: 30)
-n candidate_num
The search continues until "candidate_num" sentence
hypotheses have been found. These hypotheses are
re-sorted by score and the final result is displayed.
(Refer to the "-output" option). As Julius does not
strictly guarantee an optimal second pass search,
the maximum likelihood candidate is not always
given first.
As this value is increased the probability that the
maximum likelihood hypothesis is returned increases,
but as a longer search must be performed, the
processing time also increases.
The default value depends on the recognition engine
settings ("--enable-setup= "):
10 (standard)
1 (fast,v2.1)
-output N
Used with the "-n" option above. Output the top N
sentence hypotheses. (default: 1)
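How "-n" and "-output" work together can be sketched in Python. The hypotheses and scores are invented, and the function name final_results is hypothetical; the sketch only mirrors the description above, in which the first n hypotheses found are re-sorted by score and the best ones are output.

```python
def final_results(found_in_order, n=1, output=1):
    """Re-sort the first n hypotheses found and return the top `output`.

    found_in_order: list of (sentence, score) in the order the search found them.
    """
    candidates = found_in_order[:n]
    ranked = sorted(candidates, key=lambda h: h[1], reverse=True)
    return ranked[:output]


# Invented example: the best-scoring sentence is found second.
found = [("a b", -120.0), ("a c", -110.0), ("d", -130.0)]
print(final_results(found, n=1, output=1))
print(final_results(found, n=3, output=2))
```

With n=1 the search stops at the first hypothesis found, which here is not the best one; with n=3 the better-scoring sentence is ranked first.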
-sb score
Score envelope width. For each frame, do not scan
areas that deviate from the highest score by more
than this envelope. This directly relates to the speed
of the second pass acoustic likelihood calculations.
(default: 80.0)
-s stack_size
The maximum number of hypotheses that can be stored
on the stack during the search. A larger value gives more
stable results, but increases the amount of memory
required. (default: 500)
-m overflow_pop_times
Number of expanded hypotheses required to
discontinue the search. If the number of expanded
hypotheses is greater than this threshold, the search
is discontinued at that point. The larger this
value is, the longer the search will continue, but
processing time for search failures will also
increase. (default: 2000)
-lookuprange nframe
When performing word expansion, this option sets
the number of frames before and after in which to consider
word expansion. This prevents the omission of short
words but, with a large value, the number of hypotheses
expanded increases and the system slows down. (default: 5)
Forced alignment
-walign
Return the result of Viterbi alignment of the word
units from the recognition results.
-palign
Return the result of Viterbi alignment of the
phoneme units from the recognition results.
Message Output
-separatescore
Output the language and acoustic scores separately.
-quiet Omit phoneme sequence and score, only output
the best word sequence hypothesis.
-progout
Gradually output the interim results from the
first pass at regular intervals.
-proginterval msec
Set the -progout output time interval (msec).
-demo The same as "-progout -quiet".
Other
-debug Display debug information.
-C jconffile
Load the jconf settings file. The runtime options
set in this file are loaded.
-version
Display program name, compile time, and compile
time options.
-help
Display a brief overview of options.
EXAMPLES
For examples of system usage refer to the Julius documentation.
SEE ALSO
mkbingram(1), adinrec(1), adintool(1), mkdfa(1),
mkgshmm(1), wav2mfcc(1)
DIAGNOSTICS
On exiting normally, Julius returns exit status
0. If an error is found, Julius exits abnormally and
returns exit status 1.
If an input file cannot be found or cannot be loaded for
some reason then Julius will skip processing for that file.
BUGS
There are some restrictions on the type and size of the
models Julius can use. For a detailed explanation refer
to the Julius documentation.
For bug reports, inquiries and comments please contact
Julius@kuis.kyoto-u.ac.jp
AUTHORS
Rev.1.0 (1998/02/20)
Designed by Tatsuya Kawahara and Akinobu Lee
(Kyoto University)
Development by Akinobu Lee (Kyoto University)
Rev.1.1 (1998/04/14)
Rev.1.2 (1998/10/31)
Rev.2.0 (1999/02/20)
Rev.2.1 (1999/04/20)
Rev.2.2 (1999/10/04)
Rev.3.0 (2000/02/14)
Rev.3.1 (2000/05/11)
Development by Akinobu Lee (Kyoto University)
Rev.3.2 (2001/08/15)
Development mainly by Akinobu Lee
(Nara Institute of Science and Technology)
THANKS TO
From Rev.3.2 Julius is released by the "Information
Processing Society, Continuous Speech Consortium"
The Windows DLL version was developed and released by
Hideki Banno (Nagoya University)
The Windows Microsoft Speech API compatible version was
developed by Takashi Sumiyoshi (Kyoto University)
I am very grateful to all those that provided me with
timely advice, comments and guidance.
Last modified: 2001/11/15 07:27:14