Sphinx-4
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java™ programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
Sphinx-4 started out as a port of Sphinx-3 to the Java
programming language, but evolved into a recognizer designed to be much
more flexible than Sphinx-3, thus becoming an excellent platform for
speech research.
Live mode and batch mode speech recognizers, capable of recognizing discrete and continuous speech.
Generalized pluggable front end architecture. Includes pluggable implementations of preemphasis, Hamming window, FFT, Mel frequency filter bank, discrete cosine transform, cepstral mean normalization, and feature extraction of cepstra, delta cepstra, double delta cepstra features.
Generalized pluggable language model architecture. Includes pluggable language model support for ASCII and binary versions of unigram, bigram, trigram, Java Speech API Grammar Format (JSGF), and ARPA-format FST grammars.
Generalized acoustic model architecture. Includes pluggable support for Sphinx-3 acoustic models.
Generalized search management. Includes pluggable support for breadth first and word pruning searches.
Utilities for post-processing recognition results, including obtaining confidence scores, generating lattices and embedding ECMAScript into JSGF tags.
Standalone tools. Includes tools for displaying waveforms and spectrograms and generating features from audio.
(NOTE: The links in this section point
to local files created by javadoc. If they are broken, please follow
the instructions on Creating
Javadocs to create these links.)
Sphinx-4 is a very flexible system capable of performing many different types of recognition tasks. As such, it is difficult to characterize the performance and accuracy of Sphinx-4 with just a few simple numbers such as speed and accuracy. Instead, we regularly run regression tests on Sphinx-4 to determine how it performs under a variety of tasks. These tasks and their latest results are as follows (each task is progressively more difficult than the previous task):
The following table compares the performance of Sphinx 3.3 with Sphinx-4.
Test | S3.3 WER | S4 WER | S3.3 RT | S4 RT(1) | S4 RT (2) | Vocabulary Size | Language Model |
---|---|---|---|---|---|---|---|
TI46 | 1.217 | 0.168 | 0.14 | 0.03 | 0.02 | 11 | isolated digits recognition |
TIDIGITS | 0.661 | 0.549 | 0.16 | 0.07 | 0.05 | 11 | continuous digits |
AN4 | 1.300 | 1.192 | 0.38 | 0.25 | 0.20 | 79 | trigram |
RM1 | 2.746 | 2.88 | 0.50 | 0.50 | 0.41 | 1,000 | trigram |
WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram |
HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram |
Note that performance work on the HUB4 test is not complete.
Key:
This data was collected on a dual CPU UltraSPARC(R)-III running at 1015 MHz with 2G of memory.
Sphinx-4 has been built and tested on the Solaris™ Operating Environment, Mac OS X, Linux and Win32 operating systems. Running, building, and testing Sphinx-4 requires additional software. Before you start, you will need the following software available on your machine.
Sphinx-4 has two packages available for download:
See this FAQ question to help determine whether you should get the binary or the source distribution.
After you have downloaded the distribution, unjar the ZIP files using the jar command, which is in the bin directory of your Java installation:
jar xvf sphinx4-{version}-bin.zip
jar xvf sphinx4-{version}-src.zip
For both downloads, a directory called "sphinx4-{version}" will be created.
There are also the RM1 acoustic model, and HUB4 acoustic and language models, available for download at the same location on SourceForge. Download them only if you want to run the regression tests for RM1 and HUB4.
If you want to be able to get the latest updates from the svn repository, you should retrieve the code from the repository on SourceForge. The Sphinx-4 code is located at sourceforge.net as open source. Please follow the instructions below to retrieve it.
% svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinx4
Since the sphinx4-{version}-bin.zip distribution does not contain the source code, you must download sphinx4-{version}-src.zip, or retrieve the code from SourceForge using svn, in order to build from the sources. The software required for building Sphinx-4 is listed in the Required Software section.
Setup JSAPI 1.0
Before you build Sphinx-4, it is important to set up your environment to support the Java Speech API (JSAPI), because a number of tests and demos rely on having JSAPI installed.
To build Sphinx-4, at the command prompt change to the directory where you installed Sphinx-4 (usually, a simple "cd sphinx4" will do). Set your JAVA_HOME, ANT_HOME and PATH environment variables as described above. Then type the following:
ant
This executes the Apache Ant command to build the Sphinx-4 classes under the bld directory, the jar files under the lib directory, and the demo jar files under the bin directory.
To delete all the output from the build to give you a fresh start:
ant clean
The javadocs have already been built if you downloaded the sphinx4-{version}-bin.zip. In order to build the javadocs yourself, you must download the sphinx4-{version}-src.zip distribution instead. To build the javadocs, go to the top level directory ("sphinx4-{version}"), and type:
ant javadoc
This will build javadocs from public classes, displaying only the public methods and fields. In general, this is all the information you will need. If you need more details, such as private or protected classes, you can generate the corresponding javadoc by doing, for example:
ant -Daccess=private javadoc
The setup is straightforward:
Add the src directory as a source folder to your project.
Add lib/js.jar, lib/tags.jar and lib/jsapi.jar to your project classpath. Due to license restrictions, jsapi.jar is not shipped directly with Sphinx-4, but it can be created easily by running lib/jsapi.sh (or lib/jsapi.bat on Windows) once.
Add build.xml as the project Ant file. In most cases this can be done by right-clicking build.xml in the file navigator pane of your IDE and selecting "Add as project ant file".
To debug the demo applications you also need to add the src/apps folder and the acoustic model jars (which can be deployed to the lib directory with a simple ant all) to your classpath.
Sphinx-4 contains a number of demo programs. If you downloaded the binary distribution (sphinx4-{version}-bin.zip), the JAR files of the demos are already built, so you can just run them directly. However, if you downloaded the source distribution (sphinx4-{version}-src.zip or via svn), you need to build the demos. Click on the links below for instructions on how to build and run the demos.
There is also a live-mode test program. It is included in the source distribution (sphinx4-{version}-src.zip) but not in the binary distribution (sphinx4-{version}-bin.zip), so the link works only if you downloaded the source distribution.
The AudioTool is a visual tool that records and displays the waveform and spectrogram of an audio signal. It is available in both the binary and source releases.
The document Sphinx-4 Configuration Management describes, in detail, how to configure a Sphinx-4 system.
The document Sphinx-4 Instrumentation describes, in detail, how to use the instrumentation facilities of the Sphinx-4 system.
Sphinx-4 contains a number of regression tests using common speech databases. Again, you have to download the source distribution or check out the source tree using svn in order to get the regression tests directory. The regression tests we have are:
Before you run any of the tests, make sure that you have built Sphinx-4 already. To do so, go to the top level and type:
ant
You also need to make sure you have the appropriate acoustic model(s) installed. More details below.
The Sphinx-4 regression tests have different directories for the different tasks. The directory sphinx4/tests/performance contains directories named ti46, tidigits, an4, rm1, hub4, and some other tests. Each of these directories contains a build.xml with targets specific to the particular task. The build.xml allows you to run a number of different tests. Type:
ant -projecthelp
to list a help text with the possible targets.
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately. You must have the TI46 test data, available from the LDC TI46 website.
You need to edit the batch file called ti46.batch, located in the tests/performance/ti46 directory. You will need to change it so that it matches where you stored the TI46 test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/ti46
% ant -projecthelp # to see a list of possible targets
% ant ti46_wordlist
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately.
You must have the TIDIGITS test data, available from the LDC TIDIGITS website.
You need to edit the batch file called tidigits.batch, located in the tests/performance/tidigits directory. You will need to change it so that it matches where you stored the TIDIGITS test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/tidigits
% ant -projecthelp # to see a list of possible targets
% ant tidigits_flat_unigram
The Wall Street Journal (WSJ) models are already included as part of the distribution. Therefore, you do not need to download them separately.
Download the big endian raw audio format of the AN4 Database. Unpack it in a directory of your choice:
% gunzip an4_raw.bigendian.tar.gz
% tar -xvf an4_raw.bigendian.tar
Then update the following batch files (located in the tests/performance/an4 directory), so that they match where you unpacked the AN4 data. You probably just need to replace all instances of the string "/lab/speech/sphinx4/data" inside these batch files. Please refer to the Batch Files section for details about batch files:
an4_full.batch
an4_spelling.batch
an4_words.batch
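The path replacement described above can be scripted. The following is a minimal sketch, not part of Sphinx-4: the helper names rewrite_prefix and rewrite_batch_files are hypothetical, and only the string "/lab/speech/sphinx4/data" and the three batch file names come from the text.

```python
from pathlib import Path

OLD_ROOT = "/lab/speech/sphinx4/data"  # prefix used in the shipped batch files


def rewrite_prefix(text: str, new_root: str, old_root: str = OLD_ROOT) -> str:
    """Return batch-file text with every occurrence of old_root replaced."""
    return text.replace(old_root, new_root.rstrip("/"))


def rewrite_batch_files(test_dir: str, new_root: str) -> None:
    """Rewrite the three AN4 batch files in place (hypothetical helper)."""
    for name in ("an4_full.batch", "an4_spelling.batch", "an4_words.batch"):
        path = Path(test_dir) / name
        path.write_text(rewrite_prefix(path.read_text(), new_root))
```

For example, rewrite_batch_files("sphinx4/tests/performance/an4", "/home/me/data") would point all three files at data unpacked under /home/me/data (a placeholder location).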
After you have updated the batch files, you can run the tests by:
% cd sphinx4/tests/performance/an4
% ant -projecthelp # to see a list of possible targets
% ant an4_words_unigram
Make sure that you have downloaded the binary RM1 model file, called RM1_13dCep_16k_40mel_130Hz_6800Hz.jar, located in the sphinx4 package on the downloads page.
Then, in the build file for the RM1 tests, sphinx4/tests/performance/rm1/build.xml, change the classpath property to point to the location of your RM1_13dCep_16k_40mel_130Hz_6800Hz.jar.
You must have the RM1 test data, available from the LDC RM1 website.
You also need to prepare a batch file called rm1.batch, by following the instructions in the Batch Files section. There is already one in the RM1 test directory, but it will not work for you, since the paths to test files will not match your setup.
To run the tests:
% cd sphinx4/tests/performance/rm1
% ant -projecthelp # to see a list of possible targets
% ant rm1_bigram
You must have the HUB4 test data, available from the LDC HUB4 website.
You must download the binary HUB4 model file, called HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar, and the binary HUB4 trigram language model, called HUB4_trigram_lm.zip, both located in the sphinx4 package on the downloads page. Unpack the trigram language model file with:
jar xvf HUB4_trigram_lm.zip
The trigram model file is called language_model.arpaformat.DMP.
Then, in the build file for the HUB4 tests, sphinx4/tests/performance/hub4/build.xml, change the classpath property to point to the location of your HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar.
In the configuration file, tests/performance/hub4/hub4.config.xml, change the 'location' of the 'trigramModel' component to where your language_model.arpaformat.DMP file is located.
You also need to prepare a batch file, which is currently called f0_hub4.batch in the build.xml file, by following the instructions in the Batch Files section.
To run the test:
% cd sphinx4/tests/performance/hub4
% ant -projecthelp # to see a list of possible targets
% ant hub4_trigram
Each batch mode regression test consists of the following components:
To learn how to set up a regression test, take a look at the walkthrough of setting up the AN4 tests.
Batch files are used in batch mode regression tests. A batch file is a text file that lists the files to be processed, together with the transcription for each file. The format is as shown below: one line for each file, where the first element in a line is the file name, which can be an absolute or relative path and includes the file extension; the words after the file name make up the transcription of the audio. Sphinx-4 uses the transcription provided here to compute the system's accuracy after each sentence is processed. Processing an utterance produces a hypothesis for what was said. This hypothesis is compared with the transcription, i.e., the hypothesis is aligned against the reference transcript, and a summary of the results is reported.
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.24z982za.raw two four zero nine eight two zero
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.25896o4a.raw two five eight nine six oh four
An example batch file is tidigits.batch (this link only works if you downloaded the source distribution).
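To make the batch line format and the hypothesis-versus-reference comparison concrete, here is a minimal sketch. It is not Sphinx-4 code: the function names parse_batch_line and word_errors are hypothetical, the parser assumes file names contain no spaces, and the alignment is a generic word-level Levenshtein distance, which may differ in detail from the alignment Sphinx-4 itself reports.

```python
from typing import List, Tuple


def parse_batch_line(line: str) -> Tuple[str, List[str]]:
    """Split one batch-file line into (audio file name, reference words)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty batch line")
    return fields[0], fields[1:]


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum number of substitutions, insertions and deletions
    separating the hypothesis from the reference (Levenshtein distance)."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]
```

Dividing the error count by the number of reference words gives a word error rate for that utterance.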
The audio files used by Sphinx-4 can contain raw audio or cepstra, which is a form of encoded speech. The Java platform has support for other data formats, such as MS WAV or Sun's au, but, provided as is, Sphinx-4 can handle only raw data.
The audio defaults to 2 bytes/sample, at 16000 samples per second. The files are expected to be binaries without header. The Java platform assumes big endian order, always. These defaults can be changed. For example, the byte order or the sampling rate can be changed.
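As an illustration of the default raw format, the sketch below decodes headerless audio bytes into integer samples. It is independent of Sphinx-4; the function names are hypothetical, and it assumes the 2-byte samples are signed, which the text above does not state explicitly.

```python
import struct
from typing import List


def decode_raw_samples(data: bytes, big_endian: bool = True) -> List[int]:
    """Decode headerless 16-bit PCM bytes into integer samples.

    Assumption: samples are signed. Big endian is the default,
    matching the defaults described in the text.
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes (2 bytes/sample)")
    count = len(data) // 2
    fmt = (">" if big_endian else "<") + str(count) + "h"
    return list(struct.unpack(fmt, data))


def duration_seconds(num_samples: int, sample_rate: int = 16000) -> float:
    """Length of the audio in seconds at the given sampling rate."""
    return num_samples / sample_rate
```

Reading a file then amounts to decode_raw_samples(open(path, "rb").read()), with big_endian=False for little endian data.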
The input can also be cepstra. A cepstral file starts with a 4-byte integer containing the number of floats that follow. Those floats are concatenated 13-dimensional vectors. Because the first piece of information is the number of floats, the expected total file size can be computed. If a comparison with the actual size fails, either the byte order has to be reversed or the file is corrupted. This means the byte order can be detected automatically.
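The size check described above can be sketched as follows. This is an illustration rather than the Sphinx-4 reader: the function name parse_cepstra is hypothetical, and it assumes each float occupies 4 bytes (IEEE 754 single precision), which the text implies but does not state.

```python
import struct
from typing import List

VECTOR_SIZE = 13  # cepstral coefficients per vector, as stated in the text


def parse_cepstra(data: bytes) -> List[List[float]]:
    """Parse the bytes of a cepstral file into 13-dimensional vectors."""
    if len(data) < 4:
        raise ValueError("file too short to hold the float count")
    for order in (">", "<"):  # try big endian first, then little endian
        (count,) = struct.unpack(order + "i", data[:4])
        # The header is plausible only if it predicts the actual file size.
        if count >= 0 and 4 + 4 * count == len(data):
            break
    else:
        raise ValueError("float count matches file size in neither byte order")
    if count % VECTOR_SIZE:
        raise ValueError("float count is not a multiple of the vector size")
    values = struct.unpack(order + str(count) + "f", data[4:])
    return [list(values[i:i + VECTOR_SIZE]) for i in range(0, count, VECTOR_SIZE)]
```

A file that fails the check in both byte orders is reported as corrupted instead of being decoded into meaningless numbers.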
Walkthrough of Setting up the AN4 Tests
To illustrate the process of setting up a regression test, let's use AN4, an existing test, as an example. Use the following steps to create the AN4 tests.
Create a directory for the test under tests/performance. For example, the AN4 tests reside in tests/performance/an4.
Obtain the test data. In this walkthrough the AN4 data is stored under /lab/speech/sphinx4/data/an4. Since the AN4 test data already comes in raw audio format, no conversion is necessary. However, other test databases might require conversion to raw audio. For example, the TIDIGITS test files are in SPHERE format, so it is necessary to convert them to raw audio format before they can be read by the Sphinx-4 front end. This is usually accomplished by using the program sox on UNIX platforms.
Create the batch file(s). For example, the tests/performance/an4/an4_full.batch file looks like:
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an251-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an253-fash-b.raw go
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an254-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an255-fash-b.raw u m n y h six
...
All batch files should reside in the test directory, in this case tests/performance/an4.
Typing ant at the top level directory will create the JAR file for the WSJ model. The JAR file should be included in the classpath of the application you are deploying. In this case, the WSJ JAR file (lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar) is included in the java command line inside the build.xml run file. We also need to specify in the config file (see the next item below) the acoustic model class we are using, which in this case is edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz. The dictionary is also specified in the config file using the resource mechanism of Sphinx-4.
The configuration file for the AN4 tests is tests/performance/an4/an4.config.xml; please take a look at it. This file describes how the batch-mode recognizer and its various sub-components should be configured. Note that this file also contains configurations for the live-mode recognizer, which is not the subject of this walkthrough. In the following we will refer to components in the config file by their component names.
In an4.config.xml, the batch-mode recognizer is called batch. It uses the Recognizer called wordRecognizer, which contains the decoder, as well as various monitors that keep track of recognition accuracy, speed, and memory. The decoder contains the searchManager, which in turn contains the linguist, the pruner, the scorer, and the activeList. Refer to the Javadoc (go to bottom of the page) for a description of each of these components. The linguist used is the flatLinguist, and the grammar of the flatLinguist is either the wordListGrammar, which is a file with a list of words, e.g.,
AND
APOSTROPHE
APRIL
AREA
AUGUST
CODE
the lmGrammar (i.e., N-gram language model), or the fstGrammar (i.e., finite state transducer grammar). The lmGrammar uses a language model file (text-based for AN4) generated by the CMU Statistical Language Modeling (SLM) Toolkit. The flatLinguist also specifies the acoustic model used, and in this case it is the WSJ models. The location and format of the WSJ model, as well as the location of the various files in the model, are also specified. The scorer contains the front end, which is called mfcFrontEnd since it produces MFCC features.
A build.xml file is necessary to run Ant. This file is the Ant version of the Makefile in Make. All Ant targets are listed in this file. For details on how to write this file, refer to the documentation at http://ant.apache.org/.
Let's use the first Ant target, an4_words_wordlist, as an example. This Ant target invokes the java command on the class edu.cmu.sphinx.tools.batch.BatchModeRecognizer. This class takes a configuration file (an4.config.xml) and a batch file (an4_words.batch) as arguments. It looks for the component named batch in the configuration file. The configuration manager will create this component (and its subcomponents). Therefore, the component of type edu.cmu.sphinx.tools.batch.BatchModeRecognizer should always be named "batch" in the config.xml file. Other AN4 Ant targets are created similarly.
Currently, Sphinx-4 uses models created with SphinxTrain, also available at cmusphinx.org. SphinxTrain generates acoustic models in the format used by Sphinx-3. To create a package as used by Sphinx-4, please check the page about using SphinxTrain models.
The two main acoustic models that are used by Sphinx-4, TIDIGITS and Wall Street Journal, are already included in the "lib" directory of the binary distribution. For the source distribution, you will build them when you type ant at the top level directory. Our regression tests also use the RM1 and HUB4 models, which are available for download separately on the download page.
Sphinx-4 can handle model packages provided as a jar file.
Each acoustic model implements the AcousticModel interface. For example, the WSJ models are wrapped by a class called edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz, which implements the AcousticModel interface. This implementation class is in the JAR file of the models, together with the actual data files of the model. This way, two simple steps are needed to use a particular acoustic model:
You can find out the model implementation class of a JAR file using the java -jar command. For example, you can find out the model class of the WSJ model with the command below. The printout also includes details about how the model was trained, but this is not important for the average user.
sphinx4> java -jar lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar
Wall Street Journal acoustic models
Class: edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz
Is Binary: true
Sparse Form: false
Filters: 40
Vector Length: 39
Gaussians: 8
Model Definition: etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
Data Location: cd_continuous_8gau
Feature Type: cepstra_delta_doubledelta
Sample Rate: 16000
Description: Wall Street Journal acoustic models
Number Fft Points: 512
Max Freq: 6800
Min Freq.: 130
The language model used by Sphinx-4 follows the ARPA format. Language models provided with the acoustic model packages were created with the Carnegie Mellon University Statistical Language Modeling toolkit (CMU SLM toolkit), available at CMU. A manual is available there.
The language model is created from a list of transcriptions. Given a file with training transcriptions, the following script creates a list of the words that appear in the transcriptions, then creates bigram and trigram LM files in the ARPA format. The file with extension ccs contains the context cues; it is usually a list of words used as markers, such as beginning or end of speech.
set task = RM
# Location of the CMU SLM toolkit
set bindir = ~/src/CMU-SLM_Toolkit_v2/bin
cat $task.transcript | $bindir/text2wfreq | $bindir/wfreq2vocab > $task.vocab
set mode = "-absolute"
# Create bigram
cat $task.transcript | $bindir/text2idngram -n 2 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 2 -vocab $task.vocab \
-idngram - -arpa $task.bigram.arpa
# Create trigram
cat $task.transcript | $bindir/text2idngram -n 3 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 3 -vocab $task.vocab \
-idngram - -arpa $task.trigram.arpa
Sphinx-4 uses the Java Speech API Grammar Format (JSGF) to perform speech recognition using a BNF-style grammar. Currently, you can only use JSGF grammars with the FlatLinguist. To specify JSGF grammars, set the following in the configuration file:
<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
<property name="grammar" value="jsgfGrammar"/>
<!-- ... other properties ... -->
</component>
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
<property name="grammarLocation" value="...URL of grammar directory"/>
</component>
For information on how to write JSGF grammars, how to specify the location of your JSGF grammar file(s), and the limitations of the current implementation of JSGF grammar, please refer to the Javadocs for JSGFGrammar.
The Sphinx-4 API can be found in the javadoc documentation.
If the link is broken, please build the javadocs using the instructions in Creating Javadocs. In fact, you should rebuild the javadocs every time you change code in Sphinx-4.
In this section, we will provide an overview of Sphinx-4, starting with an introduction of HMM-based recognizers. We will highlight in
Sphinx-4 is an HMM-based speech recognizer.
During speech recognition, features are derived from the incoming speech (we will use "speech" to mean the same thing as "audio") in the same way as in the training process. The component of the recognizer that generates these features is called the front end.
The process of speech recognition is to find the best possible sequence of words (or units) that will fit the given input speech. It is a
Constructing the above graph requires knowledge from various sources. It requires a
Usually, the search graph also has information about how likely certain words are to occur. This information is supplied by the language model.
Once this graph is constructed, the sequence of parametrized speech signals (i.e., the features) is matched against different paths through the graph to find the best fit. The best fit is usually the least cost or highest scoring path, depending on the implementation. In Sphinx-4, the task of searching through the graph for the best path is done by the search manager.
As you can see from the above graph, a lot of the nodes have self transitions. This can lead to a very large number of possible paths through the graph. As a result, finding the best possible path can take a very long time. The purpose of the
As we described earlier, the input speech signal is transformed into a sequence of feature vectors. After the last feature vector is decoded, we look at all the paths that have reached the final exit node (the red node). The path with the highest score is the best fit, and a
In this section, we describe the main components of Sphinx-4, and how they work together during the recognition process. First of all, let's look at the architecture diagram of Sphinx-4. It contains almost all the concepts (the words in red) that were introduced in the previous section. There are a few additional concepts in the diagram, which we will explain shortly.
When the recognizer starts up, it constructs the front end (which generates features from speech), the decoder, and the linguist (which generates the search graph) according to the configuration specified by the user. These components will in turn construct their own subcomponents. For example, the linguist will construct the acoustic model, the dictionary, and the language model. It will use the knowledge from these three components to construct a search graph that is appropriate for the task. The decoder will construct the search manager, which in turn constructs the scorer, the pruner, and the active list.
Most of these components represent interfaces. The search manager, linguist, acoustic model, dictionary, language model, active list, scorer, pruner, and search graph are all Java interfaces. There can be different implementations of these interfaces. For example, there are two different implementations of the search manager. How, then, does the system know which implementation to use? It is specified by the user via the configuration file, an XML-based file that is loaded by the configuration manager.
The
When the application asks the recognizer to perform recognition, the search manager will ask the scorer to score each token in the active list against the next feature vector obtained from the front end. This gives a new score for each of the active paths. The pruner will then prune the tokens (i.e., active paths) using certain heuristics. Each surviving path will then be expanded to the next states, where a new token will be created for each next state. The process repeats itself until no more feature vectors can be obtained from the front end for scoring. This usually means that there is no more input speech data. At that point, we look at all paths that have reached the final exit state, and return the highest scoring path as the result to the application.
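The score, prune and expand loop described above can be illustrated with a toy frame-synchronous search. This sketch does not use or mirror the Sphinx-4 classes: the function decode, its parameters, the simple beam pruning rule, and the choice to keep only the best token per state are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[float, Tuple[str, ...]]  # (log score, path of states so far)


def decode(
    features: List[float],
    successors: Dict[str, List[str]],
    score: Callable[[str, float], float],
    start: str,
    final: str,
    beam: float = 10.0,
) -> Token:
    """Return the best-scoring token that reaches the final state."""
    active: Dict[str, Token] = {start: (0.0, (start,))}
    for feature in features:
        if not active:
            raise ValueError("all paths died before the input ended")
        # Scorer: add the log score of each token's state for this feature.
        scored = {s: (tok[0] + score(s, feature), tok[1]) for s, tok in active.items()}
        # Pruner: drop tokens that fall too far below the best score.
        best = max(tok[0] for tok in scored.values())
        survivors = {s: tok for s, tok in scored.items() if tok[0] >= best - beam}
        # Expansion: create a token for each next state, keeping the best one per state.
        active = {}
        for s, (logp, path) in survivors.items():
            for nxt in successors.get(s, []):
                if nxt not in active or logp > active[nxt][0]:
                    active[nxt] = (logp, path + (nxt,))
    if final not in active:
        raise ValueError("no path reached the final state")
    return active[final]
```

Here the final state plays the role of the non-emitting exit node: it is entered only by expansion after the last feature has been scored. A narrower beam makes the search faster at the risk of pruning the path that would have won.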
The performance of Sphinx-4 critically depends on your task and how you configured Sphinx-4 to suit your task. For example, a large vocabulary task needs a different linguist than a small vocabulary task. Your system has to be configured differently for the two tasks. This section will not tell you the exact configuration for different tasks, which will be dealt with later. Instead, this section will introduce you to the configuration mechanism of Sphinx-4, which is via an XML-based configuration file. Please click on the document Sphinx-4 Configuration Management to learn how to do this. It is important that you read this document before you proceed.