Using Yahoo! Term Extractor web service with Python

Yesterday I was listening to Van Lindberg's talk at PyCon US 2011 about patent mining. In his talk he mentioned the Yahoo! Term Extractor web service. Some time ago I had heard that the service was no longer available, but I checked the web site again and found that it is working now. I played with the web service using a simple Python script, and soon after seeing that it works fine I created a quick-and-dirty Python API for it. I am sharing the code and documentation here.
Code:
https://bitbucket.org/jaganadhg/yahootermextract/overview
Sample :
https://bitbucket.org/jaganadhg/yahootermextract/wiki/Home
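
For the curious, calling the service directly is simple. Below is a minimal sketch of my own based on the V1 termExtraction REST endpoint (not the API linked above); the appid value is a placeholder you must replace with your own Yahoo! application ID.

#!/usr/bin/env python
import urllib
from xml.dom import minidom

API_URL = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

def extract_terms(text, appid="YOUR_APP_ID"):
    """POST a piece of text to the Term Extractor service and
    return the list of extracted terms from the XML response."""
    params = urllib.urlencode({"appid": appid, "context": text})
    response = urllib.urlopen(API_URL, params)  # passing data makes it a POST
    dom = minidom.parse(response)
    return [node.firstChild.data
            for node in dom.getElementsByTagName("Result")]

if __name__ == "__main__":
    print extract_terms("Italian sculptors and painters of the renaissance")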

After finishing the work I searched the net and found that some similar scripts are already available :-(
 http://kwc.org/blog/archives/2005/2005-04-04.yahoo_term_extraction_examples.html

http://effbot.org/zone/yahoo-term-extraction.htm

Happy Hacking!


Python workshop at Kongu Engineering College, Perundurai

Last Thursday (17 March 2011) I conducted a Python training session at Kongu Engineering College, Perundurai, Erode, as part of the Infocruise program. I got the opportunity through the ILUGC mailing list. Thanks, Sreenivasan T.

I started my journey from Coimbatore on Thursday morning in a TNSTC bus and reached Perundurai by 10.00 A.M. Maharajan, a student from Kongu, and his friend met me at the bus stand to take me to the college, and by 10.15 we were there. The college is located in a nice, calm place and has a green campus. After a small refreshment we went to the workshop venue. Around 60 students from different colleges participated. The workshop was a partially hands-on one; there were some power issues in the lab, so we were not able to do a full-fledged hands-on session on Python. The students were excited to learn such a simple and elegant programming language, and they came up with a couple of interesting questions regarding Python job opportunities, application development, etc.

The workshop ended by 4.00 P.M. Maharajan dropped me at Perundurai bus stand, where I caught a bus to Coimbatore and reached the office by 6.30 P.M.

Some snaps (Thanks to Maharajan for sharing the photos)


FOSS Workshop at PSR Engineering College Sivakasi

A one-day workshop on Free and Open Source Software was conducted at PSR Engineering College, Sevalpatti, Sivakasi, Tamilnadu, and I was invited to give an introduction to Python. Mr. Chidambaresan, an alumnus of PSR Engineering College, picked me up for the workshop. I reached Sattur by 4.30 in the morning, and Chidambaresan took me to a lodge for refreshments. By 7.30 A.M. Chidambaresan and his friend arrived at the lodge and we started for the college. On the way we picked up Suthan, HOD of MCA at Sivanthi Engineering College, from Kovilpatti. After breakfast at Kovilpatti we headed towards the college. During the journey we discussed the engineering syllabus, FOSS, ILUGC, ILUGCBE, and a bit of politics too ;-). We reached the college by 9.30 A.M. and met the HOD, faculty members, and Principal of the college. We had a nice discussion about the students and their learning mentality, and about the necessity of motivating students to learn FOSS and contribute to it. The college is located in a very nice and pleasant village called Sevalpatti, and most of the students are from nearby villages and towns.

We started the workshop by 10.10 A.M. with a small meeting that introduced Suthan and me to the students. After the introduction, Suthan started his session on FOSS and FOSS philosophy. He began with nice examples and explained the concept of FOSS and its history; he even explained it in Tamil too, to carry the message straight to the hearts of the students. After the talk he gave a demo on how to install Ubuntu, and the students came up with lots of questions about FOSS. After lunch we started the second session, on Python programming. I gave an interactive lecture on the basics of Python, and the students were surprised to learn such a simple and powerful language. Then Chidambaresan gave a short and inspiring talk on the necessity for students to learn and contribute to FOSS. After the Python session there was a valedictory function; the HOD and faculty members of the IT department were present, and the HOD of the IT department distributed certificates and Ubuntu CDs to the participants. Some of the students came forward to give feedback on the workshop. This was the first time FOSS was introduced in the college, and Chidambaresan took pains to make it happen. We hope that some of the students will follow the message of FOSS.

The workshop ended by 4.00 P.M. and we returned to Sattur by car, hoping that FOSS will flourish in the Sivakasi region.

Some snaps from the workshop



CSV to CouchDB data importing, a Python hack

Last month I was playing with Apache CouchDB: just some introductory stuff, map/reduce, etc. Soon I received some linguistic data in .csv format as part of a project I was managing, and there was a need to analyze it. Usually we used MySQL or spreadsheets to store and analyze such data, but suddenly I thought: why can't I do it with CouchDB? There was no direct option to import CSV data into CouchDB. I searched the web and ended up with a hint; Manilal, a friend of mine, also pointed to the same hint: http://www.apacheserver.net/Load-CSV-file-into-couchdb-at1056996.htm .

Soon I created a small script to do the job, i.e. load a CSV file into CouchDB. The script is available in my Bitbucket repo: https://bitbucket.org/jagan/misc/src/84cefb61c86a/csv2couch.py . It is a quick solution; maybe you have a better version! I thought putting it on the web might help somebody else.
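
For reference, here is a minimal sketch of the same idea, assuming the couchdb-python package and a CouchDB server on the default local port; the file and database names are hypothetical.

#!/usr/bin/env python
import csv
import couchdb  # the couchdb-python package

def csv2couch(csvfile, dbname):
    """Load a CSV file into CouchDB, one document per row,
    using the header line for the field names."""
    server = couchdb.Server()  # http://localhost:5984 by default
    db = server[dbname] if dbname in server else server.create(dbname)
    for row in csv.DictReader(open(csvfile, 'rb')):
        db.save(dict(row))

if __name__ == "__main__":
    csv2couch('lingdata.csv', 'lingdata')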


Happy Hacking !!!


Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNeil

Python 2.6 Text Processing Beginner's Guide by Jeff McNeil is one of the latest books from Packt Publishing. I received the review copy of this book about a month and a half ago, but due to a busy schedule I was not able to finish the review. Finally I got enough time to do it. The book gives good insight into different technical aspects of text processing and the use of Python standard and third-party libraries for the task, and it is filled with lots of examples and practical projects. When I started my career in the Natural Language Processing domain, it took me almost a year to gather the knowledge covered in this book. I am giving a somewhat detailed review of the book here.

The first chapter gives some practical and interesting exercises, like implementing a cipher and some basic tricks with HTML. It also discusses how to set up a Python virtual environment for working with the examples in the book. The section on setting up the virtual environment is a nice and well-written one and gives a clear idea of how to do it.

The second chapter deals with Python's IO modules. It narrates the basic file operations with Python, and the use of a context manager (the with statement) for file processing is discussed. I have been using Python for text processing for the last three to four years, but only after reading this chapter did I find that there is something called "fileinput" in the Python standard library for multiple-file access. The chapter also discusses how to access remote files and the StringIO module, and at the end there is a discussion about IO in Python 3 too.
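
To give a taste of that module, here is a tiny sketch of my own (the file names are made up); fileinput treats several files as one continuous stream of lines.

import fileinput

# Iterate over the lines of several files as one stream; the module
# keeps track of which file and line number we are currently on.
for line in fileinput.input(['notes1.txt', 'notes2.txt']):
    print fileinput.filename(), fileinput.filelineno(), line.rstrip()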

The third chapter is about Python string services. It deals with string formatting, templating, modulo formatting, etc. Every concept is explained with the necessary mini projects, which follow on from chapter two. The chapter gives a comprehensive view of advanced string services in Python.
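
For example, the two formatting styles the chapter covers look roughly like this (a small sketch of my own, not an example from the book):

from string import Template

# Templating with the standard library string.Template class...
greeting = Template('Hello, $name!')
print greeting.substitute(name='World')

# ...and the newer str.format() style available from Python 2.6.
print '{0} is covered in chapter {1}'.format('String formatting', 3)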

The fourth chapter is entitled Text Processing Using the Standard Library. It deals with topics like reading and writing CSV files, playing with application config files (.ini files), and working with JSON. The examples are a bit long but worth practicing for better understanding.
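
A minimal sketch of the kind of task this chapter teaches, reading a CSV file with a header row and re-emitting it as JSON (the file name is hypothetical):

import csv
import json

# Each CSV row becomes a dict keyed by the header fields,
# which json.dumps() then serializes as a list of records.
with open('data.csv', 'rb') as f:
    rows = list(csv.DictReader(f))
print json.dumps(rows, indent=2)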

The fifth chapter deals with one of the key aspects of text processing: regular expressions. It teaches the basic syntax of regular expressions in Python and also discusses advanced processing like regex grouping and look-ahead and look-behind assertions. The look-behind operation is the most tricky part of dealing with regexes; I think only masters of regex can do it effectively ;-). The chapter discusses the basics of Unicode regular expressions too, and is filled with enough examples for each and every concept discussed.
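
To illustrate the tricky construct mentioned above, a small look-behind sketch of my own:

import re

# Positive look-behind: match 'bar' only when it is immediately
# preceded by 'foo'; the 'foo' itself is not part of the match.
print re.findall(r'(?<=foo)bar', 'foobar bar foobar')
# prints: ['bar', 'bar']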

The sixth chapter deals with markup languages. It discusses XML and HTML processing with Python libraries; xml.dom.minidom, SAX, lxml, and BeautifulSoup are covered with illustrative examples.
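
As a flavour of the simplest of those, a minimal xml.dom.minidom sketch (my own example):

from xml.dom import minidom

# Parse a small XML fragment and pull out an element's text content.
doc = minidom.parseString('<book><title>Text Processing</title></book>')
print doc.getElementsByTagName('title')[0].firstChild.data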

The seventh chapter is entitled Creating Templates. "Templating involves the creation of text files, or templates, that contain special markup. When a specialized parser encounters this markup, it replaces it with a computed value." The templating concept was quite new to me, but I got a good insight into the topic from this chapter, which discusses libraries like Mako for the templating task.
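
The Mako basics look roughly like this; a minimal sketch of mine, assuming the mako package is installed.

from mako.template import Template

# The ${...} markup is replaced with computed values at render time.
template = Template('Hello, ${name}! You have ${count} new messages.')
print template.render(name='World', count=3)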

The eighth chapter deals with localization (l10n) and encodings. If you are working with non-English data, this chapter is a must-read for you. It discusses character encodings, Unicode processing, and Python 3 as well; apart from the mere Python stuff, it gives a good general insight into character encodings too.
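
The core idea of the chapter, the decode/encode round trip, looks like this in Python 2 (my own sketch):

raw = 'caf\xc3\xa9'          # a UTF-8 encoded byte string ("cafe" with an accent)
text = raw.decode('utf-8')   # decode the bytes into a unicode object
print text.encode('utf-8')   # encode back to UTF-8 bytes for output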


The ninth chapter, Advanced Output Formats, is quite useful if you are trying to create output in PDF, CSV, or Excel format. It discusses ReportLab, a PDF generation library for Python; the only disadvantage I found in ReportLab is its lack of complete Unicode support. The chapter also discusses creating Excel files with the xlwt module, and finally it deals with handling the OpenDocument format with the ODFPy module too. I used to only read Excel files from Python, but after going through this book I am able to write Excel output as well.
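
A minimal xlwt sketch of the kind the chapter teaches (the file and sheet names are my own):

import xlwt

# Build a tiny two-row spreadsheet and save it as an .xls file.
book = xlwt.Workbook()
sheet = book.add_sheet('words')
sheet.write(0, 0, 'word')
sheet.write(0, 1, 'count')
sheet.write(1, 0, 'python')
sheet.write(1, 1, 42)
book.save('words.xls')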

The tenth chapter deals with advanced parsing and grammars: creating custom grammars for parsing specific data. This is one of the key skills required of people doing text processing in Python. Throughout my career I have spent a lot of time training engineers to understand parsing and BNF grammars, and this time I got a good pointer for my people to start with BNF and Python programming. The chapter also discusses some parsing modules in NLTK, my favorite Python library, and some advanced topics in PyParsing as well.
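
A toy PyParsing grammar, to give a feel for the style (my own sketch, assuming the pyparsing package):

from pyparsing import Word, alphas, nums

# A grammar for a label followed by an integer, e.g. "page 42";
# the named results make the parsed pieces accessible as attributes.
grammar = Word(alphas)('label') + Word(nums)('number')
result = grammar.parseString('page 42')
print result.label, result.number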

The eleventh and last chapter is the most interesting one in the book. It deals with searching and indexing. PyLucene is the best-known search indexing library for Python, but it is a wrapper around Apache Lucene; this chapter instead discusses another Python tool, Nucular, with practical examples for creating a search index and querying it. This was the first time I used the Nucular tool. I find it nice and easy compared to PyLucene, but I don't think it is superior to Lucene. I will play more with this tool and write about it in another blog post.

There are two appendices. The first gives pointers to Python resources; the second gives the answers to the pop quizzes in the chapters.

I give this book 9 out of 10. If you are doing rigorous text processing, it is a must-have reference for you.


New book by Packt: 'Python Text Processing with NLTK 2.0 Cookbook'

Packt Publishing has released a new book, 'Python Text Processing with NLTK 2.0 Cookbook' by Jacob Perkins (http://streamhacker.com). I received a review copy of the book today and will put a review here soon. The book comes with a lot of practical examples and tips.

An extracted chapter from the book is available for download at https://www.packtpub.com/sites/default/files/3609-chapter-3-creating-custom-corpora.pdf



New book by Packt: MySQL for Python

Packt Publishing provided an e-book of their new title "MySQL for Python" by Albert Lukaszewski. I will be reading the book during my Deepavali holidays, and I will put a review of it here.

They also provided a link to a free chapter (Chapter 4: Exception Handling). If you are interested in Python and MySQL, feel free to read it.



Speech Recognition with Python

Recently I saw a talk listed on the CMU Sphinx site about speech recognition with Python and PocketSphinx. I downloaded the video and replicated the experiments successfully. To play with Python and speech recognition, you have to install the following packages:
python-pocketsphinx
pocketsphinx-lm-wsj
pocketsphinx-hmm-wsj

These three packages are available in the Ubuntu repo, so you can install them with apt-get.

If you are using Fedora, install the following packages with yum:

pocketsphinx-devel
pocketsphinx-libs
pocketsphinx-plugin
pocketsphinx-python
pocketsphinx

The language model and HMM files are not available in the Fedora repo, so I downloaded the source files from http://packages.ubuntu.com/en/maverick/pocketsphinx-lm-wsj and http://packages.ubuntu.com/fi/maverick/pocketsphinx-lm-wsj .

Then, following the instructions in the video, I prepared a Python script. To test the code I used WAV files from the NLTK data (TIMIT corpus).

The Python code which I used is given below. It is also available in my Bitbucket repo: http://bitbucket.org/jaganadhg/blog/src/tip/asr/asr.py

#!/usr/bin/env python
import sys, os

def decodeSpeech(hmmd, lmdir, dictp, wavfile):
    """
    Decode a speech file with PocketSphinx.
    """
    try:
        import pocketsphinx as ps
        import sphinxbase
    except ImportError:
        print """Pocketsphinx and sphinxbase are not installed
on your system. Please install them with your package manager."""
        sys.exit(1)
    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    wavFile = file(wavfile, 'rb')
    wavFile.seek(44)  # skip the 44-byte WAV header; decode_raw() expects raw PCM
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()
    return result[0]  # the best hypothesis string

if __name__ == "__main__":
    hmdir = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/hmm/wsj1"
    lmd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP"
    dictd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.dic"
    wavfile = "/home/jaganadhg/Desktop/Docs_New/kgisl/sa1.wav"
    recognised = decodeSpeech(hmdir, lmd, dictd, wavfile)
    print "%" * 57
    print recognised
    print "%" * 57

Happy Hacking

WordNet sense similarity with NLTK: some basics


The Natural Language Toolkit (NLTK) has a WordNet module which enables a developer to play around with WordNet in different ways. Recently I saw a question regarding word sense disambiguation with NLTK, so I planned to try finding sense similarity. NLTK has implementations of the following sense similarity algorithms:

1) Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

2) Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

3) Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

4) Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

5) Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

6) Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

(Descriptions taken from the NLTK help guide)

I wrote a small piece of code to find the Path Distance Similarity and Wu-Palmer Similarity. It returns all the scores, comparing all the synsets of the two given words.

The code is available in my Bitbucket repo.
#!/usr/bin/env python

from nltk.corpus import wordnet as wn

def getSenseSimilarity(worda, wordb):
    """
    Find similarity between the word senses of two words.
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)
    synsetnamea = [wn.synset(str(syns.name)) for syns in wordasynsets]
    synsetnameb = [wn.synset(str(syns.name)) for syns in wordbsynsets]

    # Compare every sense of the first word with every sense of the second
    for sseta, ssetb in [(sseta, ssetb) for sseta in synsetnamea
                         for ssetb in synsetnameb]:
        pathsim = sseta.path_similarity(ssetb)
        wupsim = sseta.wup_similarity(ssetb)
        if pathsim is not None:
            print "Path Sim Score: ", pathsim, " WUP Sim Score: ", wupsim, \
                "\t", sseta.definition, "\t", ssetb.definition

if __name__ == "__main__":
    #getSenseSimilarity('cat', 'walk')
    getSenseSimilarity('cricket', 'score')



PyLucene in Action - Part I

PyLucene is a Python wrapper around Java Lucene. The goal of the tool is to use Lucene's text indexing and searching capabilities from Python, and it is compatible with the latest version of Java Lucene. PyLucene embeds a Java VM running Lucene into a Python process. More details on PyLucene can be found at http://lucene.apache.org/pylucene/.

In this blog post I am going to demonstrate how to build a search index and query it with PyLucene. You can see the installation instructions for PyLucene in my previous blog post.
1) Creating an index with PyLucene

I am using the code given below to create an index with PyLucene.

=== Python code BEGIN ===

#!/usr/bin/env python
import os, sys, glob
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
    StandardAnalyzer, IndexWriter, Version
"""
Example of indexing with PyLucene 3.0
"""

def luceneIndexer(docdir, indir):
    """
    Index all .txt documents from a directory.
    """
    lucene.initVM()
    DIRTOINDEX = docdir
    INDEXIDR = indir
    indexdir = SimpleFSDirectory(File(INDEXIDR))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    index_writer = IndexWriter(indexdir, analyzer, True, \
        IndexWriter.MaxFieldLength(512))
    for tfile in glob.glob(os.path.join(DIRTOINDEX, '*.txt')):
        print "Indexing: ", tfile
        document = Document()
        content = open(tfile, 'r').read()
        # Store the full text so it can be retrieved at search time
        document.add(Field("text", content, Field.Store.YES, \
            Field.Index.ANALYZED))
        index_writer.addDocument(document)
        print "Done: ", tfile
    index_writer.optimize()
    print index_writer.numDocs()
    index_writer.close()

=== Python code END ===

You have to supply two parameters to luceneIndexer():
a) the path to the directory where the documents to be indexed are stored, and
b) the path to the directory where the index should be saved.

2) Querying an index with PyLucene

The code given below queries an index with PyLucene.

=== Python code BEGIN ===

#!/usr/bin/env python
import sys
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser
"""
Simple PyLucene retrieval example
"""

INDEXDIR = "./MyIndex"

def luceneRetriver(query):
    lucene.initVM()
    indir = SimpleFSDirectory(File(INDEXDIR))
    lucene_analyzer = StandardAnalyzer(Version.LUCENE_30)
    lucene_searcher = IndexSearcher(indir)
    my_query = QueryParser(Version.LUCENE_30, "text", \
        lucene_analyzer).parse(query)
    MAX = 1000  # maximum number of hits to retrieve
    total_hits = lucene_searcher.search(my_query, MAX)
    print "Hits: ", total_hits.totalHits
    for hit in total_hits.scoreDocs:
        print "Hit Score: ", hit.score, "Hit Doc:", hit.doc, \
            "Hit String:", hit.toString()
        doc = lucene_searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")

luceneRetriver("really cool restaurant")

=== Python code END ===

In the code I have hard-coded the index directory (INDEXDIR = "./MyIndex"). Instead, one can receive the index directory as a command-line parameter (via sys.argv) too.

When using the function luceneRetriver() you have to pass a query string as the parameter.

The source code is available in bitbucket https://bitbucket.org/jaganadhg/pyluceneia

Happy Hacking!