AbbotDemo mini-FAQ
==================

Q1:   What is AbbotDemo?
Q2:   Why was AbbotDemo released?
Q3:   How do I install AbbotDemo?
Q4:   How do I run AbbotDemo?
Q5:   Selecting the input device
Q6:   Troubleshooting
Q7:   Upgrading AbbotDemo
Q8:   How does AbbotDemo work?
Q9:   What short-cuts have been made in this system?
Q10:  Known bugs
Q11:  Is this package supported?
Q12:  Can I do phone recognition?
QN-2: Legalities
QN-1: Who is responsible for AbbotDemo?
QN:   Where can I find out more?

Q1: What is AbbotDemo?
----------------------

AbbotDemo is a packaged demonstration of the Abbot connectionist/HMM
continuous speech recognition system developed by the Connectionist
Speech Group at Cambridge University. The system is designed to
recognise British English and American English clearly spoken in a
quiet acoustic environment.

This demonstration system has a vocabulary of 10,000 words - anything
spoken outside this vocabulary cannot be recognised (and therefore
will be recognised as another word or string of words). The
vocabulary and grammar are based around the task of reading from a
North American business newspaper (the word list is given in the file
spring96-10k.lst).

Q2: Why was AbbotDemo released?
-------------------------------

a) For information: We want to show what speech recognition systems
   are capable of at the moment.

b) For publicity: Connectionist HMM systems have some advantages over
   traditional HMM systems. We are open to people who wish to license
   this technology and we are looking for funding.

Q3: How do I install AbbotDemo?
-------------------------------

This is a binary-only release (compilation free :). Binaries are
available from the svr-ftp.eng.cam.ac.uk FTP site in the directory
comp.speech/recognition/AbbotDemo. The file AbbotDemo-0.x.tar.gz
contains binaries for all supported architectures. The files
AbbotDemo-0.x-${OS}.tar.gz contain complete releases for specific
operating systems only. The available architectures are SunOS
(version 4), Solaris (version 2), IRIX (version 5), HP-UX (version 9)
and Linux (ELF).

To install, fetch the appropriate binary release and extract the
files using gzip and tar. Typically this will look something like:

  unix$ gunzip -c AbbotDemo-0.6.tar.gz | tar xvf -

Q4: How do I run AbbotDemo?
---------------------------

The recognition system is called from the "AbbotDemo" shell script.
This script must be given an argument of either "-uk" or "-us" to run
with the British or American English models respectively. For
example:

  unix$ ./AbbotDemo -us

A window called AbbotAudio should appear for controlling the
recording of the speech. A sample session is described below.

Initialization: Before processing any speech, first click on
"Calibrate". This calibrates the automatic speech start- and
end-point detection algorithm based on the background noise level.
This calibration process should be repeated whenever the speech
capture environment changes.

Speech Collection: Click on "Acquire" and say something; for example,
"President Clinton denied it". The system has a rudimentary automatic
start and end point detector, and the waveform will be displayed once
recording has finished. If a waveform does not appear, check that the
input levels are set to reasonable values. There is a "-audiogain"
flag to AbbotDemo which will pop up an additional window for setting
the recording gain. Be sure to repeat the calibration step if the
recording levels are changed. You may also make a second click on
"Acquire" to stop the recording.
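The internals of this start/end-point detector are not documented
here, but the calibrate-then-detect idea can be sketched in a few
lines of Python. This is a minimal illustration only: the function
names, frame length, threshold margin and hangover time below are
invented for the example and are not AbbotDemo's actual parameters.

  FRAME = 160  # 10 ms frames at 16 kHz

  def frame_energies(samples):
      """Mean absolute amplitude of each 10 ms frame."""
      return [sum(abs(s) for s in samples[i:i + FRAME]) / FRAME
              for i in range(0, len(samples) - FRAME + 1, FRAME)]

  def calibrate(silence_samples):
      """Set the speech/silence threshold from a recording of
      background noise (what clicking "Calibrate" gathers)."""
      return 3.0 * max(frame_energies(silence_samples))

  def endpoints(samples, threshold, hangover=30):
      """Return (start, end) sample indices of the detected utterance:
      speech starts at the first frame above threshold and ends once
      `hangover` consecutive frames (300 ms) drop back below it."""
      start = end = None
      quiet = 0
      for n, e in enumerate(frame_energies(samples)):
          if e > threshold:
              if start is None:
                  start = n * FRAME
              end = (n + 1) * FRAME
              quiet = 0
          elif start is not None:
              quiet += 1
              if quiet >= hangover:
                  break
      return start, end

Here `samples` would be the 16-bit values read from a raw 16 kHz file
such as etc/test.raw. A scheme like this also makes clear why
recalibration matters: the threshold is only valid for the noise
level it was derived from.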
Speech Validation: Click on "Play" to confirm the recording quality.
This will play the sampled waveform. If you want to see a
time-frequency plot of the recorded speech, click on the
"Spectrogram" button.

Recognition: Now click on "Pipe to NOWAY" to start the recognition
process. The screen should show something like this (with each line
overwriting the last):

  1 THIS
  1 SPEECH READING
  1 SPEECH RECOGNITION THE
  1 SPEECH RECOGNITION IS A B.
  1 SPEECH RECOGNITION IS A PIECE OF THE
  1 SPEECH RECOGNITION IS A PIECE OF CAKE

The script prints out the best guess at the word string as the
recognition proceeds and the final recognised word string at the end.
Recognition should take about 20 Mbyte of memory and run in a few
times real time on a Pentium or faster processor.

File Access: The "Import" button provides an alternate method for
acquiring the speech waveform. Clicking on this button causes
AbbotAudio to read ASCII, linearly encoded, 16 kHz data from the file
"timeData" (in the current directory). Similarly, clicking on
"Export" causes AbbotAudio to write ASCII, linearly encoded, 16 kHz
data to the file "timeData".

There exists another flag called "-showguts". When AbbotDemo is
invoked with this flag set, another window is created that shows the
phonemes that were recognised in the sentence. Like the spectrogram
option in AbbotAudio, time is displayed on the horizontal axis. The
vertical axis has one line for every phoneme in the system; the width
of the line indicates the estimate of the probability that the given
phoneme was present.

Alternatively, if you do not have X or have problems associated with
AbbotAudio, you can send prerecorded files through the recogniser by
specifying the names of the audio files on the command line. These
files should be of speech sampled at 16 kHz with 16 bits/sample in
the natural byte order and with no header. For example:

  unix$ srec -t 3 -s16000 -b16 test.raw
  Speed 16000 Hz (mono) 16 bits per sample
  unix$ ~/AbbotDemo-0.4/AbbotDemo test.raw
  1 SPEECH RECOGNITION IS A PIECE OF CAKE

The file test.raw is included as an example in the 'etc' directory.
This file is in the natural byte order for all supported machines
apart from Linux, whereon it should be byte swapped (e.g. with
"dd conv=swab if=test.raw of=ettsr.wa").

Q5: Selecting the input device
------------------------------

The input device can be selected with a command line option to
AbbotAudio or using an environment variable.

command line (checked first):

  -input <port>                    set input port
  -output <port>                   set output port

environment variable (checked if not specified on command line):

  setenv ABBOTAUDIO_INPUT <port>   set input port
  setenv ABBOTAUDIO_OUTPUT <port>  set output port

where the input <port> is one of:

                                        Default
  SUN  : mic, line                      mic
  SGI  : mic, line, digital             mic
  HP   : mic, line                      line
  LINUX: NONE                           -

and the output <port> is one of:

                                        Default
  SUN  : speaker, headphone, line       speaker
  SGI  : NONE                           -
  HP   : speaker, headphone, line-out,
         jack                           jack
  LINUX: NONE                           -

Q6: Troubleshooting
-------------------

If no output:

 * Did AbbotDemo produce any warning messages?
 * Did a waveform appear after recording?
 * Check the operation of the rest of the system with:
     AbbotDemo etc/test.raw

No waveform may indicate a number of trouble spots. Consider the
following (a quick level check is sketched after this list):

 * the microphone is connected to an inappropriate jack
 * line levels are set incorrectly
 * recording levels are set incorrectly
 * the noise level of the audio front-end has unexpected
   characteristics which cause problems for the speech detector. If
   you suspect this to be the case, click on "Calibrate" in
   AbbotAudio and collect some silence.
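Several of the trouble spots above show up directly in the sample
values of a captured file: silence, clipping or byte-swapped data are
all easy to spot. As a quick diagnostic, a short Python sketch (the
thresholds are arbitrary choices for illustration) can report the
levels of a raw 16 kHz, 16-bit file such as etc/test.raw:

  import array, sys

  # Read a headerless file of 16-bit linear samples, e.g. etc/test.raw.
  data = array.array('h')
  with open(sys.argv[1], 'rb') as f:
      data.frombytes(f.read())
  # data.byteswap()  # uncomment if the file came from the other
  #                  # byte order (the dd conv=swab case above)

  peak = max(max(data), -min(data))
  mean_abs = sum(abs(s) for s in data) / len(data)
  print(f"{len(data) / 16000.0:.2f} s at 16 kHz, "
        f"peak {peak}, mean |x| {mean_abs:.0f}")

  if peak < 1000:
      print("very quiet - check the microphone jack and recording gain")
  elif peak >= 32767:
      print("clipped - reduce the recording gain and recalibrate")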
If poor output:

 * Was the signal recorded in noise-free conditions?
 * Are you putting on your best British accent?
 * Are there very many out-of-vocabulary words?
 * Is the text similar to that of a business newspaper?

Q7: Upgrading AbbotDemo
-----------------------

The basic distribution comes with a language model for 10,000 words
and pronunciations for 20,000 words. The system can be upgraded to
20,000 words with a better language model by fetching the file
spring96-20k-16-16.bin from the AbbotDemo FTP directory (see Q3). You
will need the GNU gzip utility to uncompress this file. To run this
system we recommend 64 Mbyte of RAM. Please note that this file is 18
Mbyte, which is why it is not included in the core distribution.

To run AbbotDemo with the new vocabulary and language model, simply
specify the file name on the command line with the -lm flag like
this:

  unix$ AbbotDemo -uk -lm spring96-20k-16-16.bin

The new language model will run more slowly than the old one if most
of the words used are in the 10,000 word vocabulary. If many of the
words used are not in the 10,000 word vocabulary but are in the new
vocabulary, you may find it runs faster, and it should certainly be
more accurate.

In addition, we have produced a language model for the EuroSpeech
conference based on the EuroSpeech93 proceedings and some speech
papers that were available locally. The amount of training data is
much less than for the standard North American Business news domain,
and hence the language model quality is not as good, but it does show
the use of speech recognition in another domain. Pick up a copy of
the ICASSP, EuroSpeech or ICSLP proceedings and start reading to both
systems to see the effect of having an appropriate language model.
The files are available from host svr-ftp.eng.cam.ac.uk in directory
pub/comp.speech/recognition/AbbotDemo/ as euro16k-00.bin.gz and
euro16k.dict.gz. Only British English pronunciations are supported.
The system is run as:

  AbbotDemo -uk -dictionary euro16k.dict -lm euro16k-00.bin

euro16k.dict contains pronunciations for 17372 words and the FTP size
is 4.5 Mbyte. 32-64 Mbyte of RAM is recommended.

Q8: How does AbbotDemo work?
----------------------------

AbbotDemo is just a shell script. If you look through it you'll see
that most of it is occupied with setting the correct options; all the
hard work is done by about ten lines at the end. Each program is
glued to the next by UNIX pipes - the aim of this section is to
describe each program and the data formats used at the interfaces.

AbbotAudio: This is the front end to the whole system. It is written
for X (which has proved to be remarkably non-portable). Basic use is
described earlier in this FAQ. The data format of the output is a
stream of 16 bit samples at 16 kHz. The special sample value of
-32768 (the most negative 16 bit number) is used to flag the end of
sentence.

AbbotAudioCat: A very simple substitute for AbbotAudio which glues
together the files on the command line in the same format as
AbbotAudio.

rasta: Performs perceptual linear prediction (PLP) on the incoming
waveform and writes out a set of PLP/rasta coefficients. This version
was written by Nelson Morgan at ICSI. The format of the output stream
is 13 floating point values at four bytes each.

rnnInputNorm: Normalises each channel to zero mean and unit variance.
This helps to compensate for variations in the channel. The first
sentence is buffered before normalisation; subsequent sentences are
either buffered and processed independently, or not buffered, in
which case the means and variances are computed using an IIR filter.
The output format is one byte to flag end-of-sentence followed by 13
bytes containing the (squashed) PLP coefficients and two padding
bytes.
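As an illustration of this normalisation step, here is a minimal
Python sketch of the two modes just described: whole-sentence
normalisation, and a running estimate in the spirit of the IIR filter
variant. The 13 channels match the PLP coefficients above; the decay
constant is an invented value, and the byte-level output format is
not reproduced.

  def normalise_sentence(frames):
      """Zero-mean, unit-variance normalisation of one buffered
      sentence. `frames` is a list of 13-element PLP vectors."""
      n = len(frames)
      for ch in range(13):
          mean = sum(f[ch] for f in frames) / n
          var = sum((f[ch] - mean) ** 2 for f in frames) / n
          sd = var ** 0.5 or 1.0
          for f in frames:
              f[ch] = (f[ch] - mean) / sd
      return frames

  class RunningNormaliser:
      """Unbuffered mode: track per-channel mean and variance with a
      first-order IIR filter (decay constant chosen for
      illustration)."""
      def __init__(self, alpha=0.995):
          self.alpha = alpha
          self.mean = [0.0] * 13
          self.var = [1.0] * 13

      def __call__(self, frame):
          a = self.alpha
          out = []
          for ch, x in enumerate(frame):
              self.mean[ch] = a * self.mean[ch] + (1 - a) * x
              d = x - self.mean[ch]
              self.var[ch] = a * self.var[ch] + (1 - a) * d * d
              out.append(d / (self.var[ch] ** 0.5 or 1.0))
          return out

The buffered mode gives the better estimate but must wait for the
whole sentence; the running mode trades some accuracy for the ability
to stream.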
rnnForward: Estimates the probabilities of each of the phonemes from
the normalised PLP stream. This is achieved with a recurrent neural
network and is one of the two biggest users of CPU cycles. The output
is one byte to flag end-of-sentence and N bytes to encode the log
probabilities of each of the phonemes, where N is the size of the
phone set used.

rnnGrabUtterance: Sits in the pipe rather like a tee(1) command, but
just writes the last sentence that was passed through to the output
file.

xshowGuts: Takes the single sentence provided by rnnGrabUtterance and
displays it on the screen in an X window. For a very clean segment of
speech it is often possible to read off the phonemes that were said,
and if you are good at segmentation perhaps the words as well.

noway: Converts the phone probabilities into a word string. This is
the other process that needs a lot of CPU power. There is a tradeoff
between speed and accuracy that can be adjusted via the values of
-beam, -state_beam and -prob_min: values of 4, 3 and 0.0005 will run
more slowly and more accurately than 2, 1 and 0.005. The
-inc_output N option displays the best word string every N frames.

howAreWeDoing: Takes the -inc_output from noway and, using the
backspace and carriage return control codes, produces a mock-up of a
dictation interface. There is no provision for changing the last
line, so the whole sentence is redone if that becomes necessary.
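These programs communicate through byte streams with an in-band
end-of-sentence marker. As a concrete illustration of the first
interface, here is a minimal Python sketch that splits a stream in
AbbotAudio's output format into sentences; only the sample format and
the -32768 sentinel come from the description above, the rest is glue
invented for the example.

  import struct, sys

  SENTINEL = -32768  # most negative 16-bit value: end of sentence

  def sentences(data):
      """Split bytes in AbbotAudio's output format (16 kHz, 16
      bits/sample, native byte order) into per-sentence lists."""
      out, current = [], []
      for (s,) in struct.iter_unpack('h', data):
          if s == SENTINEL:
              out.append(current)
              current = []
          else:
              current.append(s)
      if current:  # trailing samples with no sentinel
          out.append(current)
      return out

  for i, sent in enumerate(sentences(sys.stdin.buffer.read())):
      print(f"sentence {i}: {len(sent) / 16000.0:.2f} s",
            file=sys.stderr)

Flagging sentence boundaries in-band like this is what lets the whole
system run as a single chain of UNIX pipes with no side channel.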
Q9: What short-cuts have been made in this system?
--------------------------------------------------

To operate in near real-time and to be accessible using FTP, a number
of compromises were made.

AbbotDemo uses one recurrent network to estimate phone probabilities.
We find that using four networks and combining the outputs can result
in 20% fewer errors.

AbbotDemo uses context independent phone models. Using context
dependent phone models approximately doubles the number of parameters
(and therefore the CPU required to run the network in real-time) but
does result in about 15% fewer errors.

The vocabulary size of the basic AbbotDemo is 10,000 words. If you
pick up a copy of the Wall Street Journal, chances are this will only
cover about 94% of the words in a given passage of text, and
therefore there will be at least 6% errors, as the system cannot
recognise words that are not in the vocabulary. If you obtain the 20k
language model this should improve to 97% coverage.

The size of the language models was constrained in order to allow
ease of FTP access and reasonable disk usage. This results in a
significant increase in the perplexity (the average number of words
that are considered as the next word) of about 80% for the standard
distributions and about 40% for the additional language models. The
corresponding increase in word error rate has not been measured but
it is expected to be about 20%.

The decoder, noway, has tighter pruning than used in our "evaluation
quality" decodes in order to achieve faster operation. These settings
are adjustable with the -beam, -state_beam, -n_hyps and -prob_min
options at the end of the AbbotDemo script. Our tests show that the
faster options supplied with this demo result in about a 10% increase
in word error rate.

For the US English version of AbbotDemo, the "evaluation quality"
pronunciation dictionary has been replaced by the publicly available
CMU lexicon. This causes a mismatch between training and testing
pronunciations and results in an additional 20-30% increase in word
error rate for this system (please note that the increased error rate
is due to the mismatch between lexica and is not necessarily due to
the quality of the CMU pronunciations).

Totalling all these numbers for a 20k system, the error rate of
AbbotDemo is about two or three times what might otherwise be
achieved (we have not yet measured the combined degradation). If you
don't use a 20k language model you can expect considerably more
errors. Our typical word error rate on clean read speech is about
11-17%.

Q10: Known bugs
---------------

This is the list of bugs that we know exist. We will work on these
when we get the time/funds to do so.

 * AbbotAudio and xshowGuts have display problems if they are
   partially overlaid with another window
 * There is a mismatch between the pronunciations used for training
   the American English system and those provided in this package

Q11: Is this package supported?
-------------------------------

No (but see Q2b). If you know how to submit bug reports, then please
do so.

Q12: Can I do phone recognition?
--------------------------------

Yes. Invoke AbbotDemo with the -phone option, e.g. for British
English phone recognition:

  unix$ ./AbbotDemo -uk -phone
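For a rough picture of what a phone-level readout involves (this is
deliberately naive and is not how the recogniser decodes), one can
take per-frame phone probabilities such as those produced by
rnnForward, pick the most likely phone in each frame, and collapse
consecutive repeats. The tiny three-phone example below is entirely
hypothetical:

  def greedy_phone_string(prob_frames, phone_names):
      """Naive frame-wise readout: argmax per frame, then collapse
      runs of the same phone. prob_frames is a list of per-frame
      phone probability vectors; phone_names labels each position."""
      best = [max(range(len(f)), key=f.__getitem__)
              for f in prob_frames]
      phones = []
      for p in best:
          if not phones or phones[-1] != p:
              phones.append(p)
      return [phone_names[p] for p in phones]

  # Hypothetical example: each row is one frame over three phones.
  frames = [[0.8, 0.1, 0.1],
            [0.7, 0.2, 0.1],
            [0.1, 0.8, 0.1],
            [0.2, 0.2, 0.6]]
  print(greedy_phone_string(frames, ['k', 'ey', 't']))
  # -> ['k', 'ey', 't']

A real decoder such as noway instead searches over phone durations
and sequences, which is why its output is far more reliable than a
frame-by-frame readout.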
QN-2: Legalities
----------------

The user is granted a royalty-free licence to use this software as is
for the purposes of evaluating speech recognition technology. No
commercial use is permitted. No changes may be made to this software
or any of the associated data files. The complete package may be
redistributed provided that no charge is made other than reasonable
distribution costs. The software may not be incorporated into any
other software without prior permission.

QN-1: Who is responsible for AbbotDemo?
---------------------------------------

  Tony Robinson   (Cambridge University)
  Mike Hochberg   (now Nuance)
  Steve Renals    (Sheffield University)
  Gary Cook       (Cambridge University)
  Dan Kershaw     (Cambridge University)
  Beth Logan      (Cambridge University)
  Carl Seymour    (Cambridge University)
  James Christie  (Cambridge University)

Much of the funding for the recent development of this system was
provided by the ESPRIT Wernicke Project with partners:

  CUED   Cambridge University Engineering Department, UK
  ICSI   International Computer Science Institute, USA
  INESC  Instituto de Engenharia de Sistemas e Computadores, Portugal
  LHS    Lernout & Hauspie Speech Systems, Belgium

and associates:

  SU     Sheffield University, UK
  FPMs   Faculte Polytechnique de Mons, Belgium

Dedicated hardware for training the recurrent networks and system
software for that hardware were provided by ICSI.

The Perceptual Linear Prediction code was researched and implemented
by Hynek Hermansky (Oregon Graduate Institute).

The acoustic and language models for AbbotDemo were derived from
materials distributed by the Linguistic Data Consortium.
  ftp://ftp.cis.upenn.edu/pub/ldc

The CMU statistical language modelling toolkit was used to generate
the trigram language model.

The BEEP dictionary was used for British English pronunciations.
  ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/beep-0.7.tar.gz

The CMU dictionary was used for American English pronunciations.
  ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.4.Z

The CMU phone set was expanded using code provided by ICSI.

The X-windows interface for speech capture was derived from speech
processing software developed at the Laboratory for Engineering
Man/Machine Systems (LEMS) at Brown University.

QN: Where can I find out more?
------------------------------

Specific publications on this system include:

  Tony Robinson, "The Application of Recurrent Nets to Phone
  Probability Estimation", IEEE Transactions on Neural Networks,
  volume 5, number 2, March 1994.

  M. M. Hochberg, A. J. Robinson and S. J. Renals, "ABBOT: The CUED
  Hybrid Connectionist-HMM WSJ Speech Recognition System", Proc. of
  ARPA SLS Workshop, Morgan Kaufmann, March 1994.

  Mike Hochberg, Tony Robinson and Steve Renals, "Large Vocabulary
  Continuous Speech Recognition using a Hybrid Connectionist HMM
  System", International Conference on Spoken Language Processing,
  pages 1499-1502, 1994.

  M. M. Hochberg, G. D. Cook, S. J. Renals, A. J. Robinson and
  R. T. Schechtman, "The 1994 Abbot Hybrid Connectionist-HMM
  Large-Vocabulary Recognition System", ARPA Spoken Language Systems,
  Morgan Kaufmann, 1995.

  Tony Robinson, Mike Hochberg and Steve Renals, "The use of
  recurrent networks in continuous speech recognition", chapter 19,
  Automatic Speech and Speaker Recognition - Advanced Topics, edited
  by C. H. Lee, K. K. Paliwal and F. K. Soong, Kluwer Academic
  Publishers, 1995 (hopefully).

  Steve Renals and Mike Hochberg, "Efficient Search Using Posterior
  Phone Probability Estimates", Proceedings of the IEEE International
  Conference on Acoustics, Speech, and Signal Processing (ICASSP),
  pages 596-599, 1995.

A good tutorial on speech recognition and hybrid connectionist/HMM
techniques is:

  Nelson Morgan and Herve Bourlard, "Continuous Speech Recognition",
  IEEE Signal Processing Magazine, volume 12, number 3, pages 24-42,
  May 1995.

The definitive book on this subject is:

  Herve Bourlard and Nelson Morgan, "Connectionist Speech
  Recognition: A Hybrid Approach", Kluwer Academic Publishers, 1993.

More general information on speech recognition and pointers to
tutorial articles and books can be found in the comp.speech FAQ,
http:// and http://svr-www.eng.cam.ac.uk/comp.speech.

The ABBOT home page is http://svr-www.eng.cam.ac.uk/~ajr/abbot.html

Please direct enquiries to AbbotDemo@softsound.com