How to plot spectrogram with Python


To read more about spectrograms, please see http://en.wikipedia.org/wiki/Spectrogram.
Here I am going to demonstrate how to plot a spectrogram with Python and audiolab.
Normally, people working in speech processing plot spectrograms with 'Praat', 'wavesurfer' or 'Speech Analyser' (not Open Source). Some advanced users write Matlab scripts to do the same. It is possible to do the same with Python too.

The requirements for this script are listed below.

    1) Python 2.5 or later
    2) Scipy
    3) Pylab
    4) numpy
    5) Audiolab


If these modules are not available on your system, download and install them. Except for audiolab, everything else can be installed with apt on Ubuntu.

if everything == "ok":
    print "copy the script and start"
else:
    print "please complete setup and do it"
(This is not real code. I just wrote it like this for fun.)


So everything is OK? Then copy the script given below and save it to a file called 'spc.py'. Record a word and store it in .wav format. Run the script as 'python spc.py <your wavefile>'. It will plot the spectrogram, and you can save it too.

===== Code Begin =================

#!/usr/bin/env python
import sys
from pylab import specgram, title, show
import scikits.audiolab as audiolab

def show_Specgram(speech):
    '''
    Reads the .wav file given on the command line and plots its spectrogram.
    '''
    sound = audiolab.sndfile(speech, 'read')
    # Reads all frames of the wav file with audiolab
    sound_info = sound.read_frames(sound.get_nframes())
    # Generates the spectrogram with matplotlib's specgram
    spectrogram = specgram(sound_info)
    title('Spectrogram of %s' % speech)
    show()
    sound.close()
    return spectrogram


wav_file = sys.argv[1]
show_Specgram(wav_file)

==== Code End ====================
See the spectrogram which I plotted from a .wav file.
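Incidentally, what specgram does under the hood is slice the signal into short overlapping windows and take the magnitude of the Fourier transform of each one. A minimal pure-Python sketch of the same idea (a naive DFT on a synthetic sine wave; the window and hop sizes here are arbitrary, and matplotlib's FFT-based version is far faster):

```python
import cmath
import math

def naive_spectrogram(samples, window=64, hop=32):
    """Return one magnitude spectrum per window position."""
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        # Naive DFT of one window; keep only non-negative frequencies.
        spectrum = []
        for k in range(window // 2 + 1):
            s = sum(chunk[n] * cmath.exp(-2j * cmath.pi * k * n / window)
                    for n in range(window))
            spectrum.append(abs(s))
        frames.append(spectrum)
    return frames

# Toy signal: a sine completing 8 cycles per 64 samples.
signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = naive_spectrogram(signal)
```

Each inner list is one "column" of the spectrogram image; for this pure tone the energy sits in bin 8 of every column.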


I got fragments of this code from the net. I don't remember all the sources. Anyhow, thanks to those people and to Google.

Happy hacking!!!

Related Entries:
Using Yahoo! Term Extractor web service with Python
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei

N-Gram library for Indian Languages

I have written an N-Gram library for Indic languages in Python. It can be used to generate unigrams, bigrams or trigrams from raw Unicode text in Indian languages. I don't know how it will behave with Pashto or Arabic scripts. It can be downloaded from http://pypi.python.org/pypi/indicngramlib/1.0

I am copying the README here. Please send your suggestions, criticisms etc. to me.

 README


indicngramlib
=============
This script generates n-grams (unigram, bigram, trigram) from Indic Unicode text.
To read more about N-Gram please visit http://en.wikipedia.org/wiki/N-gram.

This script is released under the GNU GPL licence.
No commercial use is allowed.
If you are using the script for your research, please cite it.
Also send a mail to jaganadhg@gamil.com (because I am happy to know that somebody is using it).

The software is provided as is. No warranty.

Installation
===========

To install it, run 'python setup.py install'.

Usage
=======
This library provides the following functions.
unigram()
bigram()
trigram()
printUnifreq()
printBigram()
printTrigram()

The library is designed in such a way that one can use it:
1) To view the output data (on the command line)
2) To store the data in a file
3) To use the output of this library in another program.

The unigram() and printUnifreq() function
=========================================

The unigram() function is used to generate unigrams from a given text.
The text given as input should be in UTF-8 format.

The printUnifreq() function prints the unigrams with frequency info.
It can be used in two ways:
    1) To print the unigrams and frequency info to the command line
    2) To store the unigrams and frequency info in a file.

A sample code snippet is given below.

    #!/usr/bin/env python
    from indicngramlib import *
    ngram = indicNgram()
    ngram.unigram("my_lang.txt")
    ngram.printUnifreq() # It will print the output to command line

If you would like to store the content to a file, then replace the last line with the following line.

    ngram.printUnifreq("your_output.txt")

The bigram() and printBigram() function
======================================

The bigram() function is used to generate bigrams from a text.
The text given as input should be in UTF-8 format.

The printBigram() function prints the bigrams with frequency info.
It can be used in two ways:
    1) To print the bigrams and frequency info to the command line.
    2) To store the bigrams and frequency info in a file.

A sample code snippet is given below.

    #!/usr/bin/env python
    from indicngramlib import *
    ngram = indicNgram()
    ngram.bigram("your_text.txt")
    ngram.printBigram() # It will print the output to command line

If you would like to store the bigrams and frequency info in a file, then replace the last line with the following line.

    ngram.printBigram("your_out.txt")

The trigram() and printTrigram() function
=========================================

The trigram() function is used to generate trigrams from a given text.
The text given as input should be in UTF-8 format.

The printTrigram() function prints the trigrams with frequency info.
It can be used in two ways:
    1) To print the trigrams and frequency info to the command line.
    2) To store the trigrams and frequency info in a file.

A sample code snippet is given below.

    #!/usr/bin/env python
    from indicngramlib import *
    ngram = indicNgram()
    ngram.trigram("your_text.txt")
    ngram.printTrigram() # It will print the output to command line

If you would like to store the output in a file, replace the last line with:

    ngram.printTrigram("your_out.txt")

How to use the output of this library in my Python Program?
===========================================================

It is very easy!!
Import the indicngramlib in your Python script.
If you plan to generate unigram from a text and to use that in your program, follow the steps mentioned below.

    from indicngramlib import *

    ngram = indicNgram()
    myunigram = ngram.unigram("your_text.txt")

Now 'myunigram', which is a dictionary, will contain the unigrams with frequency info.
If you want only the unigrams, just get the keys from 'myunigram'.

Likewise you can use bigram() and trigram().
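For the curious, the idea behind the unigram()/bigram()/trigram() functions can be sketched with the standard library's collections.Counter. This is only an illustration of the concept, not the library's actual implementation:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Count the n-grams (as tuples of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = u"a b a b c".split()
unigrams = word_ngrams(tokens, 1)   # word -> frequency
bigrams = word_ngrams(tokens, 2)    # word pair -> frequency
```

The same sliding-window trick works for any n, and the Counter gives you the frequency info for free.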

There is a __textReader() function in your lib. May I use it?
=================================================================

No!!
It is a private function. You can't use it.

Happy Hacking !!

Related Entries:
Using Yahoo! Term Extractor web service with Python
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei

The Snack toolkit with Python

Snack is a toolkit for creating powerful multi-platform audio applications. With just a few lines we can create useful applications. The toolkit is developed by the Speech, Music and Hearing group of the School of Computer Science and Communication, Royal Institute of Technology (KTH), Sweden. We can write applications in Tcl/Tk or in Python.

For the last two years I was trying to install it properly on a Fedora Core 8 machine. It showed so many errors. Finally I decided to drop the plan to play with Snack :-).

Today I tried Snack again on Ubuntu 9.04 (Jaunty Jackalope). It was successful. Please follow the instructions below to install Snack on Ubuntu 9.04.

    1)  sudo apt-get install libsnack2
    2) sudo apt-get install libsnack2-dev
    3) sudo apt-get install libsnack2-alsa
    4) sudo apt-get install python-tksnack


A small tutorial cum guide is available for Snack Python programming. I just tried to plot a signal and a spectrogram with Snack and Python. To do the same you have to record some utterance and save it as a .wav file. Utterance means: speak a word or two or more (as you like).
Then start the Python interpreter and type the following lines one by one. Don't copy-paste them into a file and run it. It won't work (I don't know why!!).

>>> from Tkinter import *
>>> import tkSnack
>>> root = Tk()
>>> tkSnack.initializeSnack(root)
>>> mysound = tkSnack.Sound()
>>> mysound.read('/home/jaganadh/Desktop/syllables.wav') ## Here you have to give the file name of your wav file.
>>> c = tkSnack.SnackCanvas(root, height=400)
>>> c.pack()
>>> c.create_waveform(0,0, sound=mysound, height=100, zerolevel=1)
1
>>> c.create_spectrogram(0,150,sound=mysound, height=200)
2
>>>


The result looks like this.


I will make more experiments with this tool and post them.

Happy hacking !!!!!!!!!

Related Entries:
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II
Generating pronunciation of English words with Perl.
Finding bigrams with NLTK.
Converting word sequence to title case in Perl and Python.

Generating pronunciation of English words with Perl.

Today we will discuss a Perl module called Lingua::EN::Phoneme. This module generates the pronunciation of words in a given English text. The lexicon used in the module is the CMU Pronouncing Dictionary. CMU has an Open Source Speech Recognition tool called 'Sphinx'. Sphinx-based Speech Recognition work for Indian languages is in progress; some individuals, groups and organisations are engaged in it. But here we are not going to discuss Speech Recognition.

The module Lingua::EN::Phoneme converts a given word to its pronunciation representation in ARPABET, e.g. testing => TEH1STIH0NG. If you give your name as input it may not work, because your name may not be present in the CMU pronouncing dictionary.

Let us see how to install it.

Open a terminal. Invoke the 'cpan' shell as root. Type 'install Lingua::EN::Phoneme'. Follow the instructions.

For testing it, a sample code is given on the CPAN site. I am just copying it here. I will also explain how to use it on a raw text file.

The code for test run of the module.

=== Code Begin ==========

#!/usr/bin/env perl

use Lingua::EN::Phoneme;
my $lep = new Lingua::EN::Phoneme();
for ($lep->phoneme('cakes')) { print "$_ is a phoneme"; }

==== Code End=============

Copy and paste the code into your favourite editor and save it as 'phoneme.pl'. Run it: 'perl phoneme.pl'. If you get a message like "K is a phonemeEY1 is a phonemeK is a phonemeS is a phoneme", the module is working on your system.

Let us see how it can be applied over a raw English text file. I am giving another piece of code below.

==== Code Begin ===========

#!/usr/bin/env perl
use Lingua::EN::Phoneme;
my $cmu = new Lingua::EN::Phoneme;
open( TEF, "<$ARGV[0]" ) or die "Failed to open the file\n";
$text = <TEF>;
# Opens the file and stores the content in $text
$text =~ s/ +/ /g;
$text =~ s/\n/ /g;
@words = split( /\s/, $text );
# Collapses runs of spaces and splits the text into words
for ( $p = 0 ; $p <= $#words ; $p++ ) {
    $words[$p] =~ s/\.//g;
    $words[$p] =~ s/\,//g;
    # Removes full stops and commas; if attached to a word, lexical lookup will fail
    print $words[$p], "\t\t", $cmu->phoneme( $words[$p] ), "\n";
    # Prints the word and its pronunciation
}

==== Code End =============

Copy and paste this code into your favourite editor and save it as 'phone.pl'. Also create a flat file with some English sentences. Run the script as 'perl phone.pl <your_word_list>'. Now you will get the pronunciation of each and every word in the file. One thing to remember: for almost all Indian names it may not give a result.

To get a clear understanding of the output you have to look into the ARPABET scheme. The numerals '0', '1' and '2' will appear in the output: '0' indicates no stress, '1' indicates primary stress and '2' indicates secondary stress (applicable to vowels).
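Pulling those stress digits out of the phonemes is easy in any language. Here is a small Python sketch (the phoneme list is just the module's 'cakes' example, typed by hand):

```python
import re

def stress_of(phonemes):
    """Map each phoneme to its trailing stress digit (None for consonants)."""
    result = []
    for p in phonemes:
        m = re.search(r'(\d)$', p)
        result.append((p, int(m.group(1)) if m else None))
    return result

# 'cakes' in ARPABET: only the vowel EY carries a stress mark.
marks = stress_of(['K', 'EY1', 'K', 'S'])
```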

Another matter (I think it is my mistake): when you run the second script it may throw the error message "Use of uninitialized value in split at /usr/local/share/perl/5.10.0/Lingua/EN/Phoneme.pm line 31, <TEF> line 1.". I am trying to resolve it! One thing: if you redirect the output to a file, the error message will not appear.

If a pronunciation lexicon for Indian languages comes under the GPL, then someone can write Lingua::IN::Phoneme too!! :-)

Happy Hacking!!!!!

Related Entries:
The Snack toolkit with Python
Converting word sequence to title case in Perl and Python.
Finding Bigrams with t-score in Perl
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II

Finding bigrams with NLTK.



In one of my previous posts we discussed how to find bi-grams with Perl. Here we can see how it can be done with Python and NLTK. NLTK is a Python-based toolkit for Natural Language Processing. A nice book is available from O'Reilly; have a look at it.

There is a function available in NLTK called bigrams(). Using this function we can generate bi-grams from text in any language. Let us see how it can be used on an English text. After that I will show how it can be used to generate bi-grams in Indic languages.

I assume that you have installed NLTK on your system. For installation instructions please visit the NLTK site.

Start the Python interpreter. Then type the following commands in it.

    >>> from nltk import bigrams

Here we are just importing the bigrams function from NLTK.

    >>> a = "Jaganadh is testing this application"

Creating a string for generating bi-grams

    >>> tokens = a.split()

Converting the string into tokens

    >>> bigrams(tokens)
    [('Jaganadh', 'is'), ('is', 'testing'), ('testing', 'this'), ('this', 'application')]
    >>>

We are passing tokens to the bigrams() function. That is all folks.
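In fact, bigrams() boils down to pairing every token with its successor, which you can reproduce in plain Python:

```python
def my_bigrams(tokens):
    # Pair every token with the one that follows it.
    return list(zip(tokens, tokens[1:]))

pairs = my_bigrams("Jaganadh is testing this application".split())
```

NLTK's version does the same pairing, so this sketch is handy when you don't want to load the whole toolkit.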

When we come to Indic or other languages it has to be done in a slightly tricky way, because your text will be in Unicode. So we have to use 'codecs' to open and read the Unicode content.

Below I am giving the code for generating bigrams from a Unicode text with NLTK. Copy and paste the code into your favourite editor and save it as 'mlbigram.py'.

==== Code Begin =====

#!/usr/bin/env python
import codecs
import sys
from nltk import bigrams

def gen_ML_Bigram(text):
    # Read the UTF-8 text and split it into tokens
    texfbig = codecs.open(text, 'r', 'utf8').read()
    tokens = texfbig.split()
    ml_bigram = bigrams(tokens)
    # Write one bigram per line to the output file
    out = codecs.open("ml_bigram.txt", 'w', 'utf8')
    for ml in ml_bigram:
        out.write(" ".join(ml))
        out.write("\n")
    out.close()


inp = sys.argv[1]
gen_ML_Bigram(inp)


=== Code End =========

Save your Unicode content in a text file. Run 'python mlbigram.py <your_unicode_text>'. After execution the output will be stored in 'ml_bigram.txt'. The script may seem slow even on a small text file, because it takes time to load NLTK.

Happy hacking!!

Related Entries:
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II
WordNet sense similarity with NLTK: some basics
Graphical works with NLTK
NLTK and Indian Language corpus processing Part-III

Converting word sequence to title case in Perl and Python.

People working in Natural Language Processing often extract multiword units. Some multiword units may be names of organisations or departments. Sometimes they may need to convert them to proper title case ('department of physics' to 'Department of Physics'). There is a Perl module for performing this operation called Lingua::EN::Titlecase. If you are a Pythonist (me too) you may say that this is easy in Python, because there is a title() method on Python strings. But this Perl module is more intelligent than title() in Python. The output of the title() method in Python is like this:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "department of physics"
>>> x.title()
'Department Of Physics'
>>>

See the Perl output:
~/perl$ perl ftit.pl p

Department of Physics

I think now you see why I say that the Perl module is more intelligent than Python's title().
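If you want to stay in Python, you can approximate the Perl module's behaviour by keeping a small list of function words in lower case. This is only a rough sketch; the stop list below is illustrative, not the one Lingua::EN::Titlecase actually uses:

```python
# Hypothetical stop list of short function words to leave in lower case.
SMALL_WORDS = {'a', 'an', 'the', 'of', 'in', 'on', 'and', 'or', 'for', 'to'}

def smart_title(phrase):
    words = phrase.split()
    out = []
    for i, w in enumerate(words):
        # The first word is always capitalised; function words stay lower case.
        if i > 0 and w.lower() in SMALL_WORDS:
            out.append(w.lower())
        else:
            out.append(w.capitalize())
    return ' '.join(out)
```

With this, smart_title('department of physics') gives 'Department of Physics', matching the Perl output above.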

Let us stop debating. First install the module Lingua::EN::Titlecase in your system.
To install the module in GNU/Linux system follow the below given steps.
Open terminal. Type 'cpan' as root user. Type 'install Lingua::EN::Titlecase' and follow the instructions.
Then copy the code given below into your favourite editor and save it as 'title.pl'. Now prepare a list of phrases like 'department of physics'. To run the script, open a terminal, go to the directory where 'title.pl' is located and run 'perl title.pl <your_list>'.
 

=========Code Begin =========
#!/usr/bin/env perl
use Lingua::EN::Titlecase;
my $tle_case = new Lingua::EN::Titlecase;

open( TCF, "<$ARGV[0]" ) or die "Failed to open\n";

while (<TCF>) {
    my ($totc) = $_;
    print $tle_case->title($totc);

}


========Code End ===========

As I mentioned earlier, the same can be done in Python without any external module.
Copy and paste the code given below into your favourite editor and save it as 'tc.py'. Run 'python tc.py <your_wordlist>'. The result will be slightly different from the Perl code; spot the difference.

======== Code Begin===========
#!/usr/bin/env python
import sys

def con2Title(words):
    # Title-case every line of the input file
    word_list = open(words, 'r').readlines()
    for w in word_list:
        print w.title()


word = sys.argv[1]
con2Title(word)


======= Code End =============


Happy hacking!!!

Related Entries:
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II
The Snack toolkit with Python
Generating pronunciation of English words with Perl.
Finding bigrams with NLTK.

Finding Bigrams with t-score in Perl

In this post I will show how we can find bigrams with t-scores in an English text. If you are interested in reading more on bigrams, see the wiki article at http://en.wikipedia.org/wiki/Bigram.

There is a module called Lingua::EN::Bigram on CPAN. I know that if you are a Perl NLP person you might have written your own Perl code for doing the same. Let it be. We can see how it can be done with the above-mentioned module.

To install the module, run 'cpan' as root (assuming you are connected to the Internet). Type 'install Lingua::EN::Bigram' in the cpan shell and follow the instructions. If the installation is OK then copy and paste the code given below into your favourite editor and save it as 'bigram.pl'.

====Code Begin ========
#!/usr/bin/env perl
use Lingua::EN::Bigram;

$bigram = new Lingua::EN::Bigram;

while (<>) {
    $text = $_;
    $text =~ tr/A-Z/a-z/;
    $text =~ s/\.//g;
    $text =~ s/\,//g;
    $text =~ s/\?//g;
    $text =~ s/\!//g;
    $text =~ s/\://g;
    $text =~ s/\;//g;
    $bigram->text($text);
    $tscore = $bigram->tscore;
    foreach ( sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore ) {
        print "$$tscore{$_} \t " . "$_\n";
    }
}


===Code End============


To run the program, open a terminal and go to the directory where bigram.pl is located. Type 'perl bigram.pl <your_textfile>'.
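For reference, the t-score compares the observed count of a bigram with the count expected if the two words occurred independently, using the usual approximation t = (observed - expected) / sqrt(observed). A small Python sketch of the computation (an illustration, not the module's actual code):

```python
import math
from collections import Counter

def bigram_tscores(tokens):
    """t-score = (observed - expected) / sqrt(observed), per bigram."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected count if w1 and w2 were independent.
        expected = unigrams[w1] * unigrams[w2] / float(n)
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed)
    return scores

scores = bigram_tscores("the cat sat on the mat".split())
```

Sorting the dictionary by value gives the same ranked list the Perl script prints.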


Happy hacking !!!!

Related Entries:
Generating pronunciation of English words with Perl.
Converting word sequence to title case in Perl and Python.
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II
The Snack toolkit with Python

Fun with your name

I was just exploring the Lingua::EN:: modules on CPAN. I saw a funny module called Lingua::EN::Namegame by Tim Maher. If you give your name as input it will generate a verse. I will tell you how to install and use it.

Open a terminal. Type 'cpan' as root. Type 'install Lingua::EN::Namegame' in the cpan shell and follow the instructions. If the installation is successful, copy and paste the code given below into your text editor and save it as 'namegame.pl'. Run it and enjoy.

=========Code Begin==========
#!/usr/bin/env perl
use Lingua::EN::Namegame;

print "Enter your name \n";
$name = <STDIN>;
chomp($name);
$verse = name2verse ($name);

print $verse,"\n";

print "How is it\n";


=========Code End============

Happy hacking !!!!!!!

Related Entries:
Generating pronunciation of English words with Perl.
Converting word sequence to title case in Perl and Python.
Finding Bigrams with t-score in Perl
Named Entity Recognition in Perl
Using Perl Lingua::EN::Sentence module for sentence splitting

POS Tagging with Python(Brill Tagger in Python)

In my previous post I demonstrated how to do POS tagging with Perl. Being a fan of the Python programming language, I would like to discuss how the same can be done in Python. I downloaded a Python implementation of the Brill tagger by Jason Wiener. Nice one.

Download the program from http://wareseeker.com/Software-Development/simple-nlp-part-of-speech-tagger-in-python-1.0.zip/200a88782. Create a folder and move the .zip file into it. Unzip it. Open a terminal and run MakeLex.py ('python MakeLex.py'). It will create the lexicon for the POS tagger. Now you can run the NLPlib.py program ('python NLPlib.py'). It will show the result given below.

/Desktop/postag$ python NLPlib.py
beginning test
unpickle the dictionary
Initialized lexHash from pickled data.
Tiger ( NNP )
Woods ( NNP )
finished ( VBD )
the ( DT )
big ( JJ )
tournament ( NN )
at ( IN )
par ( NN )

I just created a small Python script to tag a given English text file. Before using the script you have to comment out lines 119 to the end in the NLPlib.py file. Otherwise you will always get the output shown above.

=== Code Begin=============
#!/usr/bin/env python
import sys
from NLPlib import *

nlp = NLPlib()

def tagText(t):
    # Tag each line of the input file and print word/tag pairs
    text = open(t, 'r').readlines()
    for line in text:
        tokens = nlp.tokenize(line)
        tagged = nlp.tag(tokens)
        for i in range(len(tokens)):
            print tokens[i], "(", tagged[i], ")"


inp = sys.argv[1]
tagText(inp)

===Code End================

Just copy the code into an editor and save it as 'tager.py'. You should run the program in the same directory where you extracted the tagger. Run it like

python tager.py <your_text>

This software is licensed under the GNU GPL licence. If you are interested in using it commercially you have to buy a licence.


Happy hacking !!!!!!!

Related Entries:
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
NLTK and Indian Language corpus processing - Part-II
The Snack toolkit with Python
Finding bigrams with NLTK.
Converting word sequence to title case in Perl and Python.

POS Tagging with Perl

POS tagging (Parts of Speech tagging) is the process of assigning the correct part-of-speech tag to each word in a text. You can find a more elaborate definition in the wiki article "Part-of-speech tagging".

Many POS taggers are available for English under the GPL and other licences. Some of the famous taggers are the 'Stanford POS Tagger', 'Brill POS Tagger', 'Genia Tagger' etc. Have a look at the Stanford NLP repository link.

Here in this post I would like to show how POS Tagging can be done with a simple Perl script.

There is a Perl module available on CPAN called Lingua::EN::Tagger. It is a tagger module for English and some other European languages. Let us see how it can be used for POS tagging.

Installing Lingua::EN::Tagger in GNU/Linux

Open a terminal. Run cpan as root. Type 'install Lingua::EN::Tagger' and press Enter (assuming you are connected to the Internet). Follow the instructions. Make sure that the tagger is installed, then exit CPAN.

Using Lingua::EN::Tagger

Now open your favourite editor (I prefer Vim). Copy and paste/type the code given below and save it to a file called 'tager.pl'. Run 'perl tager.pl your_text_file'.
 

====Code Begin===========

#!/usr/bin/env perl
use Lingua::EN::Tagger qw(add_tags);
my $postagger = new Lingua::EN::Tagger;

while (<>) {
    chomp $_;
    my $text = $_;

    my $tagged = $postagger->add_tags($text);

    print $tagged, "\n";
}

===Code End==============



Code Explanation

use Lingua::EN::Tagger qw(add_tags);

Imports Lingua::EN::Tagger

my $postagger = new Lingua::EN::Tagger;

Creates an object of the tagger.

Then it reads a text file given as argument and performs the tagging.

Happy Hacking!!!!!

Related Entries:
index
POS Tagging with Python(Brill Tagger in Python)
Sentence Boundery detection algo
Life update
Opinion Mining and Sentiment Analysis papers from Computational Linguistics Open Access Journal