speechcontrolteam mailing list archive
Message #00014
Re: [Speechcontrol-devel] Speech Recognition In Ubuntu
On 01/24/2011 10:12 AM, Bill Cox wrote:
> On Mon, Jan 24, 2011 at 7:15 AM, Jacky Alcine <jackyalcine@xxxxxxxxx> wrote:
>> One of the vital components (in my opinion) of accessibility in any platform
>> would be its prowess of speech recognition.
> Agreed. This is why I'm working on speech recognition in my spare
> time, as I'm hoping to build a FOSS SR system that's as good as or
> better than Naturally Speaking. It may be wishful thinking, but I'm
> excited about the progress so far. Here are links to the work so far:
>
> http://vinux-project.org/sonic/
> http://vinux-project.org/time-aliased-hann/
> http://vinux-project.org/gitweb/?p=speechbox.git;a=summary
> http://vinux-project.org/gitweb/?p=mytts.git;a=summary
>
> I wrote libsonic because I believe formant synthesizers other than
> espeak will be going away, as DECtalk did, and "natural" speech
> engines are too distorted at high speed. Libsonic allows any natural
> speech engine to generate low-distortion, high-speed speech. However,
> another goal of libsonic is to help me normalize incoming speech by
> adjusting pitch, volume, and time to more closely match recordings in
> the database. It can also be used to add prosody to generated speech.
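The normalization idea above, changing speed without changing pitch, can be illustrated with a naive overlap-add time stretch. This is a minimal sketch in pure Python and is not libsonic's actual algorithm (sonic aligns frames on pitch periods; this sketch just windows and overlaps, so its output would sound rough):

```python
import math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def ola_time_stretch(samples, speed, frame=256):
    """Naive overlap-add time stretch: read frames at speed * hop,
    write them at a fixed hop. speed > 1 shortens, speed < 1 lengthens.
    (A real implementation aligns frames on pitch periods; this doesn't.)"""
    hop_out = frame // 2
    hop_in = max(1, int(round(hop_out * speed)))
    win = hann(frame)
    n_frames = max(0, (len(samples) - frame) // hop_in) + 1
    out = [0.0] * ((n_frames - 1) * hop_out + frame)
    for f in range(n_frames):
        src = f * hop_in
        dst = f * hop_out
        for i in range(frame):
            if src + i < len(samples):
                out[dst + i] += samples[src + i] * win[i]
    return out
```

At speed 2.0 the output is roughly half the input length; at 0.5 it is roughly double.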
>
> I also put a ton of work into analyzing the speech signal, including
> LPC analysis, log-scale DFTs (cochlear FFT), and testing my hearing
> and in some cases Sina's. This resulted in my rediscovery (it's an
> old technique) of using time-aliased FFTs with Hann windows. I
> believe I'm generating the cleanest speech spectrograms for SR
> anywhere.
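The time-aliasing technique rests on a standard DFT identity: folding (summing) a long Hann-windowed frame into an FFT-sized buffer yields exactly every k-th bin of the long frame's spectrum, so a long smooth window can be used without a long FFT. A small self-contained sketch (the signal and sizes are illustrative):

```python
import cmath
import math

def dft(x):
    """Direct DFT, O(n^2) but fine for a small demonstration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def time_alias(x, n):
    """Fold a long frame into n samples by summing length-n segments.
    The DFT of the result equals every (len(x)//n)-th bin of the DFT of x."""
    assert len(x) % n == 0
    out = [0.0] * n
    for i, v in enumerate(x):
        out[i % n] += v
    return out

# Window a long frame with a Hann window, then alias it down to FFT size.
L, N = 64, 16                       # long frame, FFT size (fold factor 4)
sig = [math.sin(2 * math.pi * 3 * t / L) for t in range(L)]
win = [0.5 - 0.5 * math.cos(2 * math.pi * t / L) for t in range(L)]
frame = [s * w for s, w in zip(sig, win)]
folded = time_alias(frame, N)
```

The identity holds exactly: `dft(folded)[k]` matches `dft(frame)[4 * k]` up to floating-point rounding.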
>
> The code I wrote while doing all that testing is in the speechbox
> repository. It's pretty hacked up, because it's my speech sand-box,
> but everything I've done is there.
>
> Finally, I've started a new TTS project called mytts. Currently it is
> very similar to Chaumont Devin's TTS engine, but I hope to upgrade
> this code as described below.
>
>> This being said, I'm curious as to how one would go about to collect voices
>> to further enhance one's experience with speech recognition. VoxForge seems
>> to have a large GPL corpus, but it's a bit tedious to install (something
>> that could easily be remedied in a later project). I'm considering having
>> a means of compiling local voices that are calibrated to one's voice for
>> local usage.
> Chaumont Devin has generously allowed Vinux developers access to his
> 80,000 recordings of English words. It is not FOSS, but projects
> specifically devoted to accessibility are allowed to redistribute his
> recordings. I hope to use this database for initial work on single
> word recognition.
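Single-word recognition against a database of recorded words is classically done by template matching with dynamic time warping (DTW). A minimal sketch, assuming per-frame features and a caller-supplied frame distance (a real system would use cepstral feature vectors, not the raw scalars shown in the usage note):

```python
def dtw(a, b, dist):
    """Dynamic time warping distance between feature sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def recognize(features, templates, dist):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda w: dtw(features, templates[w], dist))
```

For example, with scalar "features" and absolute difference as the frame distance, an utterance that traces the same contour as a template at a different speaking rate still matches it.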
>
> Assuming I can get to solid single-word recognition, I'll want lots
> of clean samples of continuous speech along with the text being read.
> Fortunately, Librivox.org has over 4,000 recordings of ebooks, and
> many of them are of professional recording quality, closely matching
> the text.
>
>> As the user speaks with the engine, it should be able to train
>> asynchronously, thus giving it the ability to fluently understand a user.
> Agreed. Assuming we're able to match their spoken words to existing
> words in the database, it should be fairly straightforward to adapt
> the model to the new words.
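One simple way such adaptation could work, purely a sketch and not a description of any planned implementation, is to blend newly matched, time-aligned frames into the stored template with an exponential moving average:

```python
def adapt_template(template, matched_frames, rate=0.1):
    """Blend newly matched frames into the stored template.
    Assumes the frames are already time-aligned and equal length;
    rate controls how quickly the template tracks the speaker."""
    assert len(template) == len(matched_frames)
    return [(1 - rate) * t + rate * f
            for t, f in zip(template, matched_frames)]
```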
>
>> I'm currently working on a presentation document to show the team
>> regarding all this, and whether it's a feasible task. Please reply
>> with your comments and suggestions.
> Yes, it's feasible. It's been done well several times, and what we
> lack is a free software solution. Here's a rough breakdown of my
> development plan:
>
> The next step is to match sounds across the entire database of
> Chaumont Devin's speech recordings that sound similar to each other,
> and use this to greatly compress the data. I think this will be the
> most important part of the entire system to get right, and I have
> some algorithms I'm excited to try. If it works well, it may allow
> me to get the mytts program data size down to the point that it can
> be used even on mobile phones. Of course, I won't be able to release
> any of Chaumont's voice files without an agreement with him, other
> than for Vinux as he's already agreed.
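Grouping similar sounds and replacing each group with one shared representative is essentially vector quantization. A toy k-means over feature vectors sketches the idea (pure Python and illustrative only; a real system would cluster far larger feature sets):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster feature vectors; each centroid can then stand in for all
    of its members, so only k representatives need to be stored."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster.
        for c in range(k):
            if clusters[c]:
                dim = len(clusters[c][0])
                centroids[c] = tuple(
                    sum(p[d] for p in clusters[c]) / len(clusters[c])
                    for d in range(dim))
    return centroids
```

Two tight, well-separated groups of feature vectors collapse to two centroids, one near each group.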
>
> For the next step, I want to try using the compressed speech database
> for speech recognition. Then, I want to create code to generate
> speech for words that aren't in the database. This would probably use
> espeak's text-to-phoneme translator. I'm going to try to
> automatically match the phonemes from espeak to the best candidate
> sounds in the compressed speech database. If this works well, I'll
> try to generate speech by concatenating sounds in the database to
> match phonemes from espeak. Getting this working well will be a huge
> task. However, once that works, mytts won't have to spell
> unrecognized words. More importantly, I may be able to generate most
> words from a much smaller initial database of words. If that works
> well, I'll be able to help other people create TTS engines simply by
> recording some number of words, maybe just a few thousand. That
> would enable an explosion in languages and dialects supported. It
> would also enable speech recognition for words that do not have
> recordings in the database.
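Mapping each phoneme to a recorded unit and joining the units is the core of concatenative synthesis. A minimal sketch with a linear crossfade at each join to soften the seams (the phoneme-to-unit table layout here is hypothetical, not mytts's actual data format):

```python
def concatenate(units, fade=32):
    """Join audio units, crossfading `fade` samples at each seam."""
    out = list(units[0])
    for u in units[1:]:
        n = min(fade, len(out), len(u))
        for i in range(n):
            w = (i + 1) / (n + 1)           # linear fade weight
            out[-n + i] = out[-n + i] * (1 - w) + u[i] * w
        out.extend(u[n:])
    return out

def synthesize(phonemes, unit_db, fade=32):
    """Look up each phoneme's recorded unit and concatenate them.
    unit_db maps phoneme -> list of samples (hypothetical layout)."""
    return concatenate([unit_db[p] for p in phonemes], fade)
```

Each join consumes `fade` samples, so the output is the total unit length minus `fade` per seam.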
>
> Bill
I don't know what to say besides Godspeed, Bill. I hope that your work
pays off in the end, and if you ever need help of any sort, SpeechControl
is here.