wintermute-psych team mailing list archive

The current state of speech recognition.


Right, everyone.

Not everyone is familiar with speech recognition, so here are a few basic
elements, facts and some notes on the three 'champions'.

Open-source speech recognition systems could be better. There really are
only two contenders, described below.

GnomeVoiceControl et al either died under the complexity of this technology, or could only make sense of small (normally one-word) commands, which limited their usefulness.

Speech recognition is a hideously advanced technology, and NO-ONE has
perfected it yet, though many (commercial) companies are close to doing so.

Most Android devices can record and send snippets of voice data back to
Google to be analysed and returned to the device as text. These snippets are
very limited, and only sometimes accurate; this is most likely because
Google's system (licensed from Nuance Communications) is not trained to each
voice.

Speech recognition requires a low-noise environment (or a very good, but
expensive, microphone to deal with that issue). Only 'trained' systems (like
Dragon, for instance) can really make any sense of a human speaking
normally; for the best accuracy, you have to speak in a flat, monotone
voice, and do so very clearly and distinctly.
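As a rough illustration of why noise matters: a recognizer's front-end can estimate the signal-to-noise ratio before bothering to decode. The sketch below is a toy, not any particular engine's actual check, and the sample values are invented.

```python
import math

def rms(samples):
    """Root-mean-square level of a chunk of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech_chunk, noise_chunk):
    """Rough signal-to-noise ratio in decibels: compare the level of a
    chunk known to contain speech against a chunk of background noise."""
    return 20 * math.log10(rms(speech_chunk) / rms(noise_chunk))

# Toy data: a 'speech' chunk ten times louder than the noise floor.
speech = [0.5, -0.5, 0.4, -0.4]
noise = [0.05, -0.05, 0.04, -0.04]
print(round(snr_db(speech, noise)))  # prints 20
```

A real front-end would do this over sliding windows of microphone input, but the principle (compare speech energy against the noise floor, refuse to decode when the ratio is too low) is the same.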

Speech recognition systems generally require two distinct models to operate.
The first is a vocal model, which picks out human speech (as distinct from
environmental sounds); in some systems these vocal models can be trained to
better recognize users, or to better recognize (and ignore) sounds from
their environments. The second is the language model, which sifts through
the data filtered by the vocal model and ascribes sounds to words. There are
currently no complete GPL models. (This is why VoxForge exists.)
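The two-model split can be sketched with toy numbers: an 'acoustic' score says how well each candidate word matches the sound that was heard, and a bigram 'language' score rescores the candidates by context. Every probability below is invented for the example; real engines use HMM acoustic models and much larger n-gram language models.

```python
# "Acoustic" scores: how well each candidate word matches the sound heard.
# The sound is ambiguous between 'ware' and 'where'.
acoustic = {"ware": 0.50, "where": 0.48, "wear": 0.02}

# Bigram "language" scores: how likely each word is after the previous word.
bigram = {
    ("asked", "ware"): 0.01,
    ("asked", "where"): 0.30,
    ("asked", "wear"): 0.01,
}

def decode(prev_word, candidates):
    """Pick the candidate maximizing acoustic score x language-model score."""
    return max(candidates, key=lambda w: acoustic[w] * bigram[(prev_word, w)])

print(decode("asked", ["ware", "where", "wear"]))  # prints where
```

Note that the acoustic model alone would have picked 'ware'; the language model flips the decision because 'where' is far more plausible after 'asked'. That division of labour is why both models are needed, and why a missing GPL model on either side blocks a fully free system.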

I am currently unaware of any speech recognition system capable of
recognizing and identifying multiple users at the same time; this system
will need to be built for Wintermute.
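To make the multi-user problem concrete, here is a minimal sketch of speaker identification by nearest-centroid matching over enrollment samples. The 2-D 'voice features' are invented for illustration; real systems extract MFCC-style features and use far richer statistical models, and nothing here reflects an existing Wintermute design.

```python
def centroid(vectors):
    """Average feature vector over one speaker's enrollment samples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def identify(sample, profiles):
    """Return the enrolled speaker whose centroid is nearest the sample."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(profiles, key=lambda name: dist(sample, profiles[name]))

# Enrollment: invented 2-D 'voice features' (say, average pitch and timbre).
profiles = {
    "alice": centroid([(220.0, 0.8), (230.0, 0.7)]),
    "bob": centroid([(110.0, 0.3), (115.0, 0.4)]),
}
print(identify((225.0, 0.75), profiles))  # prints alice
```

Identifying *simultaneous* speakers is much harder again, since the audio must first be separated into per-speaker streams before anything like the matching above can run.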

Both Microsoft's and Apple's operating systems have inbuilt speech recognition.

This technology is nowhere near as simple as a file browser or a music
player; in terms of complexity, it's on the same level as an operating
system kernel; you need a HELL of a lot of knowledge to build it.

Below are my notes for the three speech recognition engines; the first two
are open-source, the last is not, but we can learn from it, at least at the
user-experience level, where we should generally be heading.

In terms of licenses, both open-source systems are permissive: just the
usual "include this notice and disclaimer", with the addition of "if you
edit the code, say when and by whom" for Sphinx.

*CMU Sphinx;* an open-source speech recognition engine.

1. BSD-licensed project; open-source version.

2. Mostly used as a research project; expect spaghetti code.

3. Functioning models; some parts of it fully in public domain.

4. No GUI; mostly console; third-party GUIs available.

5. Latest version is written in Java.

6. Very, VERY fast recognition.

7. The most complete system available to us.

8. Used by multiple robotics labs around the world.

*Julius;* an open-source speech recognition engine.

1. Julius started development in 1997.

2. Julius' English models are complete, but they are under the HTK license;
they are not open-source and cannot be redistributed without permission.

3. Julius can only run with HTK models. There is apparently a F/OSS one, but
it's Japanese (and before you ask: no, it isn't possible to just 're-code'
it; it's a model, and the whole of it needs to be re-created).

4. Julius itself is open-source, but under what appears to be a custom
license (this needs to be checked); if so, this could make things difficult
for us.

5. Used by the open-source 'Q.bo' robotics project.

*Dragon NaturallySpeaking;* a commercial (closed-source) speech recognition
engine.

1. Uses some simple tests (volume check, quality check) to make
modifications to audio input.

2. Has no capacity to recognize multiple speakers; it supports a number of
users, but these require switching between accounts.

3. Has a number of modes (Dictation mode, spell mode, command mode, numeral
mode), as it appears their system is not capable enough to distinguish
between numbers spoken as words and numbers meant as numerals. Sometimes it
accidentally triggers a command when you meant to dictate something, etc.
These modes have to be set manually. There *is* a 'normal mode' which can
dictate, spell, enumerate and accept commands, but its frequency of
misunderstandings is, of course, higher than in the pre-set modes.

4. The quality of speech input depends upon the microphone; a microphone
operating in a noisy environment (say, next to a hard-drive, as some laptop
microphones are) is not nearly as effective as a headset microphone.

5. When the program has finished checking volume and quality, it then gets
the user to read aloud a prepared section of speech and, after doing so,
appears to use some form of hidden Markov model to adjust itself.

6. It has functions to pull in and establish the user's writing style from
multiple sources (email program, word processors, local documents) in an
attempt to expand its vocabulary and to better guess the context of the
user's words.

7. Recognition rate of Dragon after installation, with no training, is
60-70%. Recognition after one hour's training shoots to 90% (dependent on
the environment). Estimated recognition rate after Dragon has processed six
hours of training with complete analysis of local sources of data: 99.65%.

8. It uses a 'best guess' method of establishing context (whether it should
write '1' or 'one', or 'where' or 'ware') by getting a basic idea of what
was said and where it was mentioned; examination of the user's previous
works appears to help here.

9. It requires the user to speak punctuation; if, however, a modern grammar
checker is used with it, full documents can be grammatically correct with
only a slight margin of error.

10. The system cannot easily handle a change in tempo, pitch or other
alterations to your voice. This makes for a rather brittle model, as
someone speaking with a stuffy nose would have a very high error rate.

11. Dragon is processor-intensive; its modern version (version 11) requires
that the machine be active but idle at a user-defined time, at which point
the program will use the full resources of the computer to retrain itself on
a more in-depth level. The initial training data provided during the setup
of the user account takes about half an hour to process on a modern machine,
despite being only 10 minutes' worth of voice recordings. It is estimated
that six hours' worth of data would take most of a day to compute. It is
also worth pointing out that the data computed by Dragon in this
'optimization' process can reach several dozen gigabytes in size.

I'm unsure about Dragon 11, but Dragon 9 (an earlier version) would take
about 10 minutes' worth of recordings and output 5 gigabytes of data.
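Dragon's internals are closed, but the hidden Markov model machinery mentioned in point 5 above is textbook material. Below is a minimal Viterbi decoder over a toy two-phoneme model: hidden states are phonemes, observations are coarse acoustic symbols, and every probability is invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a sequence of observations."""
    # paths[s] = (probability, best state sequence ending in s)
    paths = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        paths = {
            s: max(
                ((p * trans_p[prev][s] * emit_p[s][obs], seq + [s])
                 for prev, (p, seq) in paths.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(paths.values(), key=lambda t: t[0])[1]

# Toy model: two hidden phonemes, two observable acoustic symbols.
states = ["w", "eh"]
start_p = {"w": 0.6, "eh": 0.4}
trans_p = {"w": {"w": 0.2, "eh": 0.8}, "eh": {"w": 0.3, "eh": 0.7}}
emit_p = {"w": {"lo": 0.7, "hi": 0.3}, "eh": {"lo": 0.2, "hi": 0.8}}
print(viterbi(["lo", "hi"], states, start_p, trans_p, emit_p))  # ['w', 'eh']
```

Training in a system like Dragon amounts to re-estimating the transition and emission probabilities from the user's recorded speech, which is presumably where the hours of background computation go; the decoding step itself is cheap by comparison.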



-Danté Ashton

Vi Veri Veniversum Vivus Vici

Sent from Ubuntu