BlogGalleryAbout meContact
Jaganadh's bookshelf: read

Python Text Processing with NTLK 2.0 CookbookPython 2.6 Text Processing Beginners Guide

More of Jaganadh's books »
Jaganadh Gopinadhan's  book recommendations, reviews, quotes, book clubs, book trivia, book lists
Ubuntu GNU/Linux I am nerdier than 94% of all people. Are you a nerd? Click here to take the Nerd Test, get nerdy images and jokes, and write on the nerd forum! Python

Bangalore

SIGAI Workshop on Emerging Research Trends in AI

The Special Interest Group in Artificial Intelligence (SIGAI) of Computer Society of India (CSI) is conducting an one day workshop on AI at C-DAC Mumbai.

For more details visit - http://sigai.cdacmumbai.in/

Happy hacking !!!!!!!!!

Related Entries:
Python workshop at Kongu Engineering College, Perundurai
 Permalink

Playing with TIMIT corpus in NLTK - Part - I

The corpus collection in NLTK contains TIMIT Corpus Sample. I just explored the TMIT corpus and TimitCorpusReader function. If you are interested to know more about TIMIT please check http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.

Some description available in NLTK code is given below.
"This corpus contains selected portion of the TIMIT corpus.

 - 16 speakers from 8 dialect regions
 - 1 male and 1 female from each dialect region
 - total 130 sentences (10 sentences per speaker.  Note that some
   sentences are shared among other speakers, especially sa1 and sa2
   are spoken by all speakers.)
 - total 160 recording of sentences (10 recordings per speaker)
 - audio format: NIST Sphere, single channel, 16kHz sampling,
  16 bit sample, PCM encoding"

If you explore the TMIT directory in nltk data you will find the following items in it

allfilelist.txt   allsenlist.txt  dr2-faem0  dr4-falr0  dr6-fapb0  dr8-fbcg1     prompts.txt  sentences.pos   spkrsent.txt  wrdalign.timit
allphonedur.txt   allsentime.txt  dr2-marc0  dr4-maeb0  dr6-mbma1  dr8-mbcg0     README       sentences.ppp   testset.doc
allphonelist.txt  dr1-fvmh0       dr3-falk0  dr5-ftlg0  dr7-fblv0  l             readme.doc   sentences.tags  timitdic.doc
allphonetime.txt  dr1-mcpm0       dr3-madc0  dr5-mbgt0  dr7-madd0  phoncode.doc  sentences    spkrinfo.txt    timitdic.txt

The items having 'dr' in the name are directories which contains spoken corpora files. In each directory some .wav files and associated transcriptions are located. We can acces and explore these files with the corpus api of NLTK.

Let's see how to explore the data.

Importing the timit corpus from NLTK

    >>> from nltk.corpus import timit
Now the corpus is imported

To view the fileid in the corpus

    >>> timit.fileids()

It will print a long list like
    ......................
    'dr8-mbcg0/sx417.wav',
     'dr8-mbcg0/sx417.wrd',
     'dr8-mbcg0/sx57.phn',
     'dr8-mbcg0/sx57.txt',
     'dr8-mbcg0/sx57.wav',
     'dr8-mbcg0/sx57.wrd',
     'spkrinfo.txt',
     'timitdic.txt']

You might have noticed that there is  .wav file, .wrd file, .phn file and .txt files are there. The .wav file contain utterances, .wrd file contains time marked word file, .phn file contains time marked phone list ..... For each utterance file there will be a .ph, .wrd and .txt file.

From the file ids we can get list of utterance in the NLTK TIMIT corpus.

    >>> timit.utteranceids()

It will print a long list like
    .....................
     'dr8-mbcg0/sa1',
     'dr8-mbcg0/sa2',
     'dr8-mbcg0/si2217',
     'dr8-mbcg0/si486',
     'dr8-mbcg0/si957',
     'dr8-mbcg0/sx147',
     'dr8-mbcg0/sx237',
     'dr8-mbcg0/sx327',
     'dr8-mbcg0/sx417',
     'dr8-mbcg0/sx57']

We can store the utterance ids to a variable for future use.

    >>> utid = timit.utteranceids()

Now the list 'utid' contains all the utterance ids in the TIMIT corpora. The utterance id is in a specified pattern. This pattern helps the corpus reader module to identify information regarding the utterances. Letus see how this is happening.

To get the speaker id from an utterance id. Let us take the first utterance id from the 'utid' list.

    >>> timit.spkrid(utid[0])
    'dr1-fvmh0'
For future use lets store the id in a variable

    >>> sid = timit.spkrid(utid[0])

For the selected utterance we can get the sentence id also.

    >>> timit.sentid(utid[0])
    'sa1'
'sa1' is the sentence id for the given utterence.

    >>> sentid = timit.sentid(utid[0])

Now I am storing the sentence id to a variable for future use.


I think from the above example you might have got an idea about the file name pattern of the timit corpus in NLTK.

Now we can try to get a list of utterance with an utterance is and sentence id.

    >>> timit.utterance(utid, sentid)

Now you will get a long list like this
"['dr1-fvmh0/sa1', 'dr1-fvmh0/sa2', 'dr1-fvmh0/si1466', 'dr1-fvmh0/si2096', 'dr1-fvmh0/si836', 'dr1-fvmh0/sx116', 'dr1-fvmh0/sx206', 'dr1-fvmh0/sx26', 'dr1-fvmh0/sx296',
...................

Now I am going to store the list of utterances to a list.

    >>> utt = timit.utterance(utid, sentid)

We can access the utterance based on the speaker id .

    >>> sputid = timit.spkrutteranceids(sid)

It will give all the utterance associated with the id 'sid'.
Those ids are
    ['dr1-fvmh0/sa1',
     'dr1-fvmh0/sa2',
     'dr1-fvmh0/si1466',
     'dr1-fvmh0/si2096',
     'dr1-fvmh0/si836',
     'dr1-fvmh0/sx116',
     'dr1-fvmh0/sx206',
     'dr1-fvmh0/sx26',
     'dr1-fvmh0/sx296',
     'dr1-fvmh0/sx386']

I stored all he ids to a list 'sputid'.

Let's try to get details about the speaker having the id 'dr1-fvmh0'. This id is stored in the variable 'sid' earlier.

    >>> timit.spkrinfo(sid)
    SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')

Now it is giving the details about the speaker such as gender, birth date, and race etc.......

Continued ..............

Happy Hacking !!!!!!!!1

Related Entries:
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
WordNet sense similarity with NLTK: some basics
NLTK new version released .
Graphical works with NLTK
NLTK and Indian Language corpus processing Part-III
 Permalink
1-2/2