
speechcontrolteam team mailing list archive

Re: [Speechcontrol-devel] Speech Recognition In Ubuntu


On Mon, Jan 24, 2011 at 7:15 AM, Jacky Alcine <jackyalcine@xxxxxxxxx> wrote:
> One of the vital components (in my opinion) of accessibility in any platform
> would be its prowess of speech recognition.

Agreed.  This is why I'm working on speech recognition in my spare
time, as I'm hoping to build a FOSS SR system that's as good as or
better than Naturally Speaking.  It may be wishful thinking, but I'm
excited about progress so far.  Here are links to the work so far:


I wrote libsonic because I believe formant synthesizers other than
espeak will be going away, as DECtalk did, and "natural" speech
engines are too distorted at high speed.  Libsonic allows any natural
speech engine to generate low-distortion, high-speed speech.  However,
another goal of libsonic is to help me normalize incoming speech by
adjusting pitch, volume, and time to more closely match recordings in
the database.  It can also be used to add prosody to generated speech.
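To make the normalization idea concrete, here's a toy pure-Python
sketch (not libsonic's actual code, which works quite differently
internally): volume is matched by RMS scaling, and speed is changed by
a crude time-domain overlap-add that keeps one pitch period out of
every block.  The `period` argument and the linear cross-fade are
simplifications.

```python
import math

def change_volume(samples, target_rms):
    """Scale samples so their RMS matches target_rms (volume
    normalization)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = target_rms / rms if rms > 0 else 0.0
    return [s * gain for s in samples]

def speed_up(samples, period, factor):
    """Crude time-domain speedup: from every period * factor input
    samples, emit one pitch period, cross-faded into the start of the
    next block so the waveform stays continuous.  `period` is the
    pitch period in samples; factor > 1 shortens the speech."""
    step = int(period * factor)
    out = []
    pos = 0
    while pos + step + period <= len(samples):
        a = samples[pos:pos + period]
        b = samples[pos + step:pos + step + period]
        n = len(a)
        # Linear cross-fade from the kept period into the next block.
        out.extend(a[i] * (1 - i / n) + b[i] * (i / n) for i in range(n))
        pos += step
    out.extend(samples[pos:])  # keep the unprocessed tail
    return out
```

A real implementation has to estimate the pitch period from the signal
and handle unvoiced sounds separately; this sketch only shows the
shape of the idea.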

I also put a ton of work into analyzing the speech signal, including
LPC analysis, log-scale DFTs (cochlear FFT), and testing my hearing
and in some cases Sina's.  This resulted in my re-discovery (it's an
old technique) of using time-aliased FFTs with Hanning windows.  I
believe I'm generating the cleanest speech spectrograms for SR.
The code I wrote while doing all that testing is in the speechbox
repository.  It's pretty hacked up, because it's my speech sand-box,
but everything I've done is there.

Finally, I've started a new TTS project called mytts.  Currently it is
very similar to Chaumont Devin's TTS engine, but I hope to upgrade
this code as described below.

> This being said, I'm curious as to how one would go about to collect voices
> to further enhance one's experience with speech recognition. VoxForge seems
> to have a large GPL corpus, but it's a bit tedious to install (that of which
> is simply remedied in a later project). I'm considering having a means of
> compiling local voices that are calibrated to one's voice for local usage.

Chaumont Devin has generously allowed Vinux developers access to his
80,000 recordings of English words.  It is not FOSS, but projects
specifically devoted to accessibility are allowed to redistribute his
recordings.  I hope to use this database for initial work on single
word recognition.

Assuming I can get to solid single word recognition, I'll want lots
of clean samples of continuous speech along with the text being read.
Fortunately, Librivox.org has over 4,000 recordings of ebooks, and
many of them are professional recording quality, closely matching the

> As the user speaks with the engine, it should be able to train
> asynchronously, thus giving it the ability to fluently understand a user.

Agreed.  Assuming we're able to match their spoken words to existing
words in the database, it should be fairly straightforward to adapt
the model to the new words.
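The simplest form of that adaptation would be nudging each stored word
template toward the features of a newly matched utterance.  A hedged
sketch (the function name, feature-vector representation, and `rate`
parameter are all hypothetical, not anything we've settled on):

```python
def adapt_template(template, observed, rate=0.1):
    """Move a stored word template a fraction `rate` of the way toward
    the feature vector of a newly matched utterance (exponential
    moving average).  Both arguments are equal-length lists of
    per-frame feature values."""
    return [t + rate * (o - t) for t, o in zip(template, observed)]
```

Run asynchronously after each confident match, this would slowly pull
the model toward the individual speaker's voice.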

> I'm currently working on a presentational document to portray to the team
> regarding the such and if it's a feasible task. Please reply with your
> comments and suggestions.

Yes, it's feasible.  It's been done well several times, and what we
lack is a free software solution.  Here's a rough breakdown of my
development plan:

The next step is to match sounds across the entire database of
Chaumont Devin's speech recordings that sound similar to each other,
and use this to greatly compress the data.  I think this will be the
most important part of the entire system to get right, and I have some
algorithms I'm excited to try.  If it works well, it may well allow me
to get the mytts program data size down to the point that it can be
used even on mobile phones.  Of course, I won't be able to release any
of Chaumont's voice files without an agreement with him, other than
for Vinux, as he's already agreed.
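In its most naive form, that compression is vector quantization: map
each sound's feature vector to a shared representative, so similar
sounds are stored once.  A greedy sketch of the idea (not the
algorithms I actually plan to try; the threshold and Euclidean
distance are placeholders):

```python
import math

def compress_sounds(features, threshold):
    """Greedily assign each sound's feature vector to the first stored
    representative within `threshold` (Euclidean distance), creating a
    new codebook entry when none is close enough.  Returns (codebook,
    indices): similar sounds share one codebook entry, so only the
    codebook's waveforms need storing."""
    codebook, indices = [], []
    for f in features:
        for j, rep in enumerate(codebook):
            if math.dist(f, rep) <= threshold:
                indices.append(j)
                break
        else:
            indices.append(len(codebook))
            codebook.append(f)
    return codebook, indices
```

Three sounds where two are near-duplicates compress to a two-entry
codebook, which is the whole point: the index list is tiny compared to
the waveforms it replaces.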

For the next step, I want to try using the compressed speech database
for speech recognition.  Then, I want to create code to generate
speech for words that aren't in the database.  This would probably use
espeak's text-to-phoneme translator.  I'm going to try to
automatically match the phonemes from espeak to the best candidate
sounds in the compressed speech database.  If this works well, I'll
try to generate speech by concatenating sounds in the database that
match phonemes from espeak.  Getting this working well will be a huge
task.  However, once that works, mytts won't have to spell
unrecognized words.  More importantly, I may be able to generate most
words from a much smaller initial database of words.  If that works
well, I'll be able to help other people create TTS engines simply by
recording some number of words, maybe just a few thousand.  That would
enable an explosion in the languages and dialects supported.  It would
also enable speech recognition for words that do not have recordings
in the database.
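Stripped of everything hard, the concatenation step looks like this (a
hedged sketch; `phoneme_bank` stands in for the phoneme-to-sound
mapping the compressed database would supply, and a real engine would
also cross-fade and smooth at the joins):

```python
def synthesize(phonemes, phoneme_bank):
    """Concatenate the stored waveform for each phoneme in sequence.
    `phonemes` is a list of phoneme symbols (e.g. from espeak's
    text-to-phoneme translator); `phoneme_bank` maps each symbol to a
    list of waveform samples.  Phonemes with no recording are skipped
    here; the real system would fall back to spelling or synthesis."""
    out = []
    for p in phonemes:
        out.extend(phoneme_bank.get(p, []))
    return out
```

The hard part, of course, is everything this sketch leaves out:
picking the best candidate sound per phoneme in context, and making
the joins inaudible.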
