
speechcontrolteam team mailing list archive

Re: [Speechcontrol-devel] Speech Recognition In Ubuntu


On 01/24/2011 10:12 AM, Bill Cox wrote:
> On Mon, Jan 24, 2011 at 7:15 AM, Jacky Alcine <jackyalcine@xxxxxxxxx> wrote:
>> One of the vital components (in my opinion) of accessibility in any platform
>> would be the quality of its speech recognition.
> Agreed.  This is why I'm working on speech recognition in my spare time,
> hoping to build a FOSS SR system that's as good as or better than
> Naturally Speaking.  It may be wishful thinking, but I'm excited about
> the progress so far.  Here are links to the work so far:
> http://vinux-project.org/sonic/
> http://vinux-project.org/time-aliased-hann/
> http://vinux-project.org/gitweb/?p=speechbox.git;a=summary
> http://vinux-project.org/gitweb/?p=mytts.git;a=summary
> I wrote libsonic because I believe formant synthesizers other than
> espeak will be going away, as DECtalk did, and "natural" speech
> engines are too distorted at high speed.  Libsonic allows any natural
> speech engine to generate low-distortion, high-speed speech.  However,
> another goal of libsonic is to help me normalize incoming speech by
> adjusting pitch, volume, and time to more closely match recordings in
> the database.  It can also be used to add prosody to generated speech.
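To make the libsonic idea concrete: speeding speech up without raising its pitch can be done with pitch-synchronous overlap-add (the PICOLA-style approach), where whole pitch periods are crossfaded away. Below is a minimal pure-Python sketch of that core idea; the function names and parameters are illustrative, not libsonic's actual API.

```python
import math

def find_pitch_period(x, min_p=32, max_p=80, corr_len=200):
    """Estimate the pitch period (in samples) by picking the
    autocorrelation lag with the highest correlation."""
    best_p, best_corr = min_p, float("-inf")
    for p in range(min_p, max_p):
        corr = sum(x[i] * x[i + p] for i in range(corr_len))
        if corr > best_corr:
            best_corr, best_p = corr, p
    return best_p

def merge_two_periods(x, p):
    """Crossfade two adjacent pitch periods into one, removing p
    samples without changing the pitch -- the heart of PICOLA-style
    time compression."""
    return [(1 - i / p) * x[i] + (i / p) * x[i + p] for i in range(p)]

# Example: a tone whose period is 50 samples.
tone = [math.sin(2 * math.pi * t / 50) for t in range(400)]
period = find_pitch_period(tone)           # close to 50
shorter = merge_two_periods(tone, period)  # p samples replacing 2*p
```

Repeating the merge across a buffer shortens it while the waveform keeps its local period, which is why the result stays intelligible at high speed.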
> I also put a ton of work into analyzing the speech signal, including
> LPC analysis, log-scale DFTs (cochlear FFT), and testing my hearing,
> and in some cases Sina's.  This resulted in my rediscovery (it's an
> old technique) of using time-aliased FFTs with Hann windows.  I
> believe I'm generating the cleanest speech spectrograms for SR
> anywhere.
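A rough sketch of the time-aliased window technique as I understand it from the page above: window a segment twice the FFT length with a Hann window, then fold (sum) the two halves before the transform. An N-point DFT of the folded frame equals the even-numbered bins of the full 2N-point DFT, so you keep the long window's low spectral leakage at half the transform size. Names here are illustrative.

```python
import cmath
import math

def time_aliased_frame(x, start, fft_size):
    """Window 2*fft_size samples with a Hann window, then time-alias
    (fold-and-add) the two halves down to fft_size samples."""
    win = 2 * fft_size
    frame = [x[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / win))
             for n in range(win)]
    return [frame[n] + frame[n + fft_size] for n in range(fft_size)]

def dft_magnitudes(frame):
    """Plain O(n^2) DFT magnitude spectrum (first half of the bins)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

# A sine completing 8 cycles per 64 samples should peak at bin 8.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(128)]
spectrum = dft_magnitudes(time_aliased_frame(signal, 0, 64))
```

The fold-and-add identity follows directly from the DFT definition: bin 2m of a 2N-point DFT sums the same exponential over both halves of the frame.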
> The code I wrote while doing all that testing is in the speechbox
> repository.  It's pretty hacked up, because it's my speech sand-box,
> but everything I've done is there.
> Finally, I've started a new TTS project called mytts.  Currently it is
> very similar to Chaumont Devin's TTS engine, but I hope to upgrade
> this code as described below.
>> This being said, I'm curious as to how one would go about collecting voices
>> to further enhance one's experience with speech recognition.  VoxForge seems
>> to have a large GPL corpus, but it's a bit tedious to install (something that
>> could be remedied in a later project).  I'm considering a means of
>> compiling local voices that are calibrated to one's voice for local usage.
> Chaumont Devin has generously allowed Vinux developers access to his
> 80,000 recordings of English words.  It is not FOSS, but projects
> specifically devoted to accessibility are allowed to redistribute his
> recordings.  I hope to use this database for initial work on single
> word recognition.
> Assuming I can get to solid single-word recognition, I'll want lots
> of clean samples of continuous speech along with the text being read.
> Fortunately, Librivox.org has over 4,000 recordings of ebooks, and
> many of them are professional recording quality, closely matching the
> text.
>> As the user speaks with the engine, it should be able to train
>> asynchronously, thus giving it the ability to fluently understand a user.
> Agreed.  Assuming we're able to match their spoken words to existing
> words in the database, it should be fairly straightforward to adapt
> the model to the new words.
>> I'm currently working on a presentation document to lay this out for the
>> team and assess whether it's a feasible task.  Please reply with your
>> comments and suggestions.
> Yes, it's feasible.  It's been done well several times, and what we
> lack is a free software solution.  Here's a rough breakdown of my
> development plan:
> The next step is to match sounds across Chaumont Devin's entire
> database of speech recordings that are similar to each other, and use
> this to greatly compress the data.  I think this will be the most
> important part of the entire system to get right, and I have some
> algorithms I'm excited to try.  If it works well, it may allow me
> to get the mytts program's data size down to the point that it can be
> used even on mobile phones.  Of course, I won't be able to release any
> of Chaumont's voice files without an agreement with him, other
> than for Vinux, as he's already agreed.
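Matching similar sounds and storing one representative per group is essentially vector quantization: cluster the feature vectors, keep a small codebook of centroids, and store only a codeword index per frame. A toy pure-Python sketch with made-up 2-D feature vectors (real features would come from the spectrogram analysis):

```python
def kmeans(vectors, k, iters=25):
    """Tiny k-means: cluster similar feature vectors so each can be
    replaced by its cluster's codeword (the centroid)."""
    # Deterministic init: k evenly spaced vectors from the data.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]

    def nearest(v):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(v, centroids[c])))

    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(v)].append(v)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(bucket)
                                for col in zip(*bucket)]

    # Codebook plus one index per vector: the compressed representation.
    return centroids, [nearest(v) for v in vectors]

# Two well-separated toy "sound" clusters compress to two codewords.
data = ([[0.0 + i * 0.01, 0.0] for i in range(10)] +
        [[5.0 + i * 0.01, 5.0] for i in range(10)])
codebook, codes = kmeans(data, k=2)
```

The compression win is that each frame shrinks from a full feature vector to a small integer index, while decoding is a simple codebook lookup.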
> For the next step, I want to try using the compressed speech database
> for speech recognition.  Then, I want to create code to generate
> speech for words that aren't in the database.  This would probably use
> espeak's text-to-phoneme translator.  I'm going to try to
> automatically match the phonemes from espeak to the best candidate
> sounds in the compressed speech database.  If this works well, I'll
> try to generate speech by concatenating database sounds matched to the
> phonemes from espeak.  Getting this working well will be a huge task.
> However, once that works, mytts won't have to spell unrecognized
> words.  More importantly, I may be able to generate most words from a
> much smaller initial database of words.  If that works well, I'll be
> able to help other people create TTS engines simply by recording some
> number of words, maybe just a few thousand.  That would enable an
> explosion in supported languages and dialects.  It would also enable
> speech recognition for words that do not have recordings in the
> database.
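The plan above (phoneme strings from espeak driving unit selection) is classic concatenative synthesis. A schematic sketch, with a made-up unit database and phoneme labels purely for illustration; real units would be cut from the compressed recording database, and espeak integration would replace the hard-coded dictionary:

```python
def synthesize(phonemes, unit_db):
    """Concatenate a recorded unit for each phoneme; unknown phonemes
    fall back to a short silence instead of being spelled out."""
    silence = [0.0] * 80
    samples = []
    for ph in phonemes:
        samples.extend(unit_db.get(ph, silence))
    return samples

# Hypothetical single-phoneme units (placeholder sample values).
unit_db = {
    "h": [0.1] * 100,
    "e": [0.2] * 160,
    "l": [0.3] * 120,
    "o": [0.4] * 180,
}
word = synthesize(["h", "e", "l", "o"], unit_db)
```

A production version would also smooth the joins between units (e.g. with a short crossfade), which is where most of the "huge task" lies.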
> Bill
I don't know what to say besides Godspeed, Bill. I hope that your work
pays off in the end, and if you ever need help of any sort, SpeechControl
is here.
