Opinion Mining and Sentiment Analysis papers from Computational Linguistics Open Access Journal

 




Taming Text : Review

    We are living in the era of the Information Revolution. Every day a vast amount of information is created and disseminated over the World Wide Web (WWW). Even though each piece of information published on the web is useful in some way, we often need to identify and extract only the relevant or useful parts. Such information extraction includes identifying person names, organization names and so on, finding the category of a text, identifying the sentiment of a tweet, and more. Processing large amounts of text data from the web is a challenging task because of information overflow. As more information appears, there is a growing demand for smart, intelligent processing of text data. The field of text analytics has attracted the attention of developers around the globe, and many practical as well as theoretical books have been published on the topic.

This book, "Taming Text", written by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris, is an excellent resource for developers and researchers who are interested in learning Text Analytics. The book focuses on practical techniques such as classification, clustering, string matching, searching and entity identification, and provides easy-to-follow examples using well-known Open Source tools like Apache Mahout, Apache Lucene, Apache Solr and OpenNLP. The entire book is based on the authors' experience contributing to the relevant Open Source tools, their hands-on work and their industry exposure. It is a must-read for Text Analytics developers and researchers. Given the increasing importance of Text Analytics, the book can serve as a handbook for budding Text Analytics developers and industry professionals, and it can certainly be used in Natural Language Processing, Machine Learning and Computational Linguistics courses.

Chapter 1: Getting Started Taming Text
The first chapter of the book introduces what taming text means. The authors give a list of challenges in text processing with brief explanations. The chapter is mostly introductory material.

Chapter 2: Foundations of Taming Text
This chapter gives a quick warm-up of your high school English grammar. Starting from words, the authors present the essential linguistic concepts required for text processing. I think "Taming Text" may be the first technical book which gives a good warm-up on the basics of language and grammar. The chapter gives a detailed introduction to words, parts of speech, phrases and morphology, which is sufficient to capture the essential linguistic aspects of text processing for a developer. The second part of the chapter deals with basic text processing tasks such as tokenization, sentence splitting, Part of Speech (POS) tagging and parsing, with code snippets for each task. All the code examples are built on
OpenNLP. The chapter also covers the basics of handling different file formats using Apache Tika. Overall it gives a step-by-step introduction to the preliminaries of text processing.
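(The book's own examples use OpenNLP from Java; purely to illustrate the same preliminary steps in Python, here is a minimal sketch of my own with NLTK, assuming its tokenizer and tagger data packages are installed. It is not code from the book.)

import nltk

text = "Taming text is hard. Tokenization and tagging are the first steps."
for sent in nltk.sent_tokenize(text):     # sentence splitting
    tokens = nltk.word_tokenize(sent)     # tokenization
    print(nltk.pos_tag(tokens))           # part-of-speech tagging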

Chapter 3: Searching
This chapter introduces the art of search. It gives a brief but narrative description of the search mechanism and of what happens behind the curtains, and discusses the basics of search with the help of
Apache Solr. There is an interesting discussion on search evaluation, search performance enhancements and page rank too. The chapter gives a detailed list of Open Source search engines, but I think the authors forgot to add the "Elasticsearch" library to the list; I hope it will be added in the final print version of the book.

Chapter 4: Fuzzy String Matching
Everybody has probably wondered how the "Did you mean:" feature in Google or any other search engine works. Long ago I saw a question on Stack Overflow asking about the availability of source code for the "Did you mean:" feature (something similar, I think). If you wonder how this feature works, this chapter will give you enough knowledge to implement something similar. There is a simple discussion of different fuzzy string matching algorithms with code samples, and there are practical examples of how to implement the "Did you mean:" and type-ahead (auto-suggest) utilities with Apache Solr. Overall this chapter gives a solid introduction and hands-on experience with fuzzy string matching.
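The book's examples for this chapter are built on Solr and Java; just to illustrate the edit-distance idea behind such "Did you mean:" features, here is a small pure-Python sketch of my own (not taken from the book):

def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("tamming", "taming"))  # prints 1

A spell suggester can rank candidate words by this distance (or a faster approximation of it) against the user's query.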

Chapter 5: Identifying People, Places and Things
Diving deeper into the text processing ocean, the authors introduce many deeper concepts starting from this chapter. Its main focus is Named Entity Recognition (NER), one of the fundamental tasks in Information Extraction and Retrieval. The chapter gives a good introduction to the task along with code samples using OpenNLP, which will help you get your hands dirty. There is a section on how to train OpenNLP to adapt it to a new domain, which will be one of the most useful tips for working professionals. The only thing I feel is missing is a mention of
GATE and Apache UIMA; both tools are well known for their ability to accomplish the NER task.
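(Again, the book's code is OpenNLP and Java; as a rough Python illustration of the same NER idea, here is a tiny NLTK sketch of mine, assuming the NLTK tagger and named entity chunker data packages are installed.)

import nltk

sentence = "Grant Ingersoll works on Apache Lucene in the United States."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
# ne_chunk wraps recognised entities in subtrees labelled PERSON, ORGANIZATION, GPE, ...
print(nltk.ne_chunk(tagged))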

Chapter 6: Clustering Text
The sixth chapter mainly deals with clustering. "Clustering is an unsupervised (i.e. no human intervention required) task that can automatically put related content into buckets" [taken from the book "Taming Text"]. The initial part of the chapter describes clustering with reference to real world applications, and there is a decent discussion of clustering techniques and clustering evaluation. Code examples for clustering are given using
Apache Solr, Apache Mahout and Carrot2.

Chapter 7: Classification, Categorization and Tagging
The seventh chapter deals with document classification. As in the other chapters, there is a reasonable discussion of document classification techniques, and the chapter teaches you how to perform classification with Apache Lucene, Apache Solr, Apache Mahout and OpenNLP. There is an interesting project called a 'tag recommender' in this chapter. The only hiccup I faced was the "TT_HOME" environment variable which is used throughout the book; I think the authors forgot to mention how to set it. I was familiar with Apache Mahout, so there was no issue with the MAHOUT_HOME environment variable, but a total newbie will find it difficult to spot the TT_HOME and MAHOUT_HOME used in the code samples. A little guidance on setting these variables would help readers a lot. I think this will be included in the final copy (I am reading a MEAP version).

Chapter 8: An Example Application: Question Answering

This chapter gives hands-on experience in taming text. The entire chapter is dedicated to building a Question Answering project using the techniques discussed in the earlier chapters; a simple get-your-hands-dirty-with-Taming-Text chapter. Here also you will be caught by the TT_HOME ghost.

Chapter 9: Untamed Text: Exploring the Next Frontier

The last chapter, "Untamed Text: Exploring the Next Frontier", mentions other areas of text processing such as semantics, pragmatics and sentiment analysis. A brief narration of each of these fields is included, along with lots of pointers to useful tools for advanced text processing tasks like text summarisation and relation extraction.

Conclusion
Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have done a nice job of authoring this book, with lucid explanations and practical examples for different text processing challenges. With the help of simple, narrative examples the authors demonstrate how to solve real world text processing problems using Free and Open Source tools. The algorithm discussions in the book are so simple that even a newbie can follow the concepts without many hiccups. It is a good desktop reference for people who would like to start with text processing, and it provides comprehensive, hands-on experience. So grab a copy soon and be ready for Big Data analysis.

Free and Open Source Tools Discussed in the Book
Apache Solr
Apache Lucene
Apache Mahout
Apache OpenNLP
Carrot2

Disclaimer: I received a review copy of the book from Manning.

Related Entries:
Mahout in Action: Review

Using Yahoo! Term Extractor web service with Python

Yesterday I was listening to Van Lindberg's PyCon US 2011 talk about patent mining. In the talk he mentioned the Yahoo! Term Extractor web service. Some time ago I had heard that the service was no longer available, but I checked the web site again and found that it is working now. I played with the web service using a simple Python script, and after seeing that it works fine I created a quick-and-dirty Python API for it. I am sharing the code and documentation here:
Code:
https://bitbucket.org/jaganadhg/yahootermextract/overview
Sample :
https://bitbucket.org/jaganadhg/yahootermextract/wiki/Home
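For the curious, a raw call to the service looks roughly like the sketch below. The endpoint and parameter names (appid, context) are written from memory of the old V1 API and may have changed, so treat them as assumptions; the maintained wrapper is in the Bitbucket repo above.

import urllib
import xml.etree.ElementTree as ET

API_URL = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

def extract_terms(appid, text):
    """POST text to the term extraction service and return the key phrases."""
    params = urllib.urlencode({"appid": appid, "context": text})
    response = urllib.urlopen(API_URL, params).read()
    tree = ET.fromstring(response)
    # Each extracted phrase comes back as a <Result> element (namespace-qualified).
    return [node.text for node in tree.iter() if node.tag.endswith("Result")]

print(extract_terms("YOUR_APP_ID", "Italian sculptors and painters of the renaissance"))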

After finishing the work I searched the net and found that some similar scripts are already available :-(
 http://kwc.org/blog/archives/2005/2005-04-04.yahoo_term_extraction_examples.html

http://effbot.org/zone/yahoo-term-extraction.htm

Happy Hacking !!!!!!!!!!!!

Related Entries:
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'

Book Review: Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins

Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins is one of the latest books published by Packt in its Open Source series. The book is meant for people who have started learning and practicing the Natural Language Toolkit (NLTK). NLTK is an Open Source Python library for learning, practicing and implementing Natural Language Processing techniques; the software is licensed under the Apache Software License. It is one of the most widely recommended toolkits for beginners in NLP to get their hands dirty, and it is part of the syllabus at many institutions around the globe where Natural Language Processing / Computational Linguistics courses are offered. Perkins' work is the second book published on NLTK. The first was written by the core developers of NLTK, Steven Bird, Ewan Klein and Edward Loper, and published by O'Reilly; it is a comprehensive introduction to the toolkit, including basic Python lessons. People who have gone through that book will definitely like the new book by Perkins. The book is a must-have desktop reference for students, professionals and faculty members interested in NLP, Computational Linguistics and NLTK. Perkins handles the topic in an elegant way. Most people who have searched for NLTK tips will have come across the author's blog; he maintains the same simplicity, explanation style and hands-on approach throughout the book, which makes the topic easy to digest. The book is a collection of practical, working recipes built around NLTK.

The first chapter of the book, "Tokenizing Text and WordNet Basics", deals with tokenizing text into words, sentences and paragraphs. The chapter also covers tips and tricks for the WordNet module in NLTK, and Perkins discusses Word Sense Disambiguation (WSD) techniques here as well. The one piece I missed in the WordNet section is the use of WordNet's 'ic' (information content) function. Tips for extracting collocations from a corpus are also included in the first chapter. The second chapter, "Replacing and Correcting Words", discusses stemming, lemmatization and spelling correction. He introduces another Python module, PyEnchant, while discussing the spell checking technique. The chapter also covers techniques like replacing negations with antonyms and correcting repeated characters.

The third chapter deals with corpora. It mainly discusses how to load user-generated corpora into NLTK with the corpus readers implemented in NLTK. The most attractive part of this chapter is the discussion of a MongoDB backend for an NLTK corpus reader; MongoDB is a document-oriented database which belongs to the NoSQL family. This part will be very useful for students of NLP and for working professionals. The fourth chapter deals with POS tagging techniques. It mainly discusses training different POS taggers and using them. It is also quite useful for people who would like to extend the functionality of NLTK for their projects and for people who are interested in building POS taggers for languages other than English. Some of this chapter's content was published on the author's blog a year ago.

Chapter five of the book deals with chunking and chinking techniques in NLTK. Named entity identification and extraction techniques are also discussed in this chapter, and it gives good insight into training the NLTK chunking module for custom chunking tasks. With the help of this chapter I was able to create a small named entity extraction script with some Indian names. The sixth chapter, "Transforming Chunks and Trees", deals with verb form correction, plural to singular correction, word filtering, and playing with tree structures. Many times I have seen people raise questions about handling tree data in NLTK; this chapter gives good insight into working with NLTK parse tree data.

The seventh chapter deals with the most wanted topic of the time, text classification. Some of this chapter appeared as posts on Perkins' blog. There have been many requests on freelancing web sites for text classification with NLTK; I found that some of them were never even bid on. The chapter discusses the task of text classification in detail with all the classification implementations available in NLTK. Training an NLTK classifier is discussed very clearly, and apart from classifier training and classification the chapter covers classifier evaluation and tuning too. The eighth chapter is a revolutionary one, dealing with distributed data processing and handling large-scale data with NLTK. I was not able to fully work through the code in this chapter (I did work through the code in the other chapters, and it was quite exciting; it contributed to my professional life too). This chapter will be really helpful for industry people who are looking to adopt NLTK in NLP projects. Some basic insights into the contents of this chapter were also published on Perkins' blog. After Nitin Madnani's talk at the US Python Conference on corpus processing with Dumbo and NLTK, I think this is the only existing resource for practical large-scale data processing with NLTK.

The ninth and last chapter is about parsing specific kinds of data with Python. It deals with some Python modules other than NLTK, discussing URL extraction, timezone lookup, character conversion and so on. This chapter is good for people who work with web data processing, such as harvesting. There is an appendix on the Penn Treebank tagset, which lists all the tags with their frequencies in the treebank corpus.
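To give a flavour of the recipes (this is my own minimal illustration, not code from the book), the stemming and lemmatization material boils down to calls like these, assuming NLTK and its WordNet data are installed:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically; lemmatization maps to a dictionary form.
for word in ["cooking", "cookbooks", "recipes"]:
    print(word + " -> stem: " + stemmer.stem(word) + ", lemma: " + lemmatizer.lemmatize(word))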

For the last three or four years I have been using NLTK to teach and to develop prototypes of NLP applications, and I was very much impressed as I went through each of the recipes in this book. The author provides UML diagrams for the modules in NLTK which help the reader get good insight into the functionality of each module. This will be a good book not only for students and practitioners but also for people who would like to contribute to the NLTK project. It will also help students in NLP and Computational Linguistics do their projects with NLTK and Python. I give the book 9 out of 10. Natural Language Processing students, teachers and professionals: hurry and grab a copy of this book.

Thanks to Packt Publishing for the review copy of the book.


New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'

Packt Publishing has released a new book, 'Python Text Processing with NLTK 2.0 Cookbook' by Jacob Perkins (http://streamhacker.com). I received a review copy of the book today and will post a review here soon. The book comes with a lot of practical examples and tips.

An extracted chapter from the book is available for download at https://www.packtpub.com/sites/default/files/3609-chapter-3-creating-custom-corpora.pdf


Related Entries:
NLTK and Indian Language corpus processing - Part-II
Finding bigrams with NLTK.
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei
New book by Packt: MySQL for Python
WordNet sense similarity with NLTK: some basics

"Blekko" a new Search Engine launches tomorrow (1st Nov. 2010)

Starting a new search start-up is a bit difficult and requires extraordinary preparation, both technical and mental. Within the past few years we have seen many new search engines. "Cuil" was one of them; it was considered one of the successful start-ups in Silicon Valley and was started by a group of ex-Google employees. But on September 17, 2010, Cuil disappeared from the World Wide Web :-( . We have also seen the rise of search engines with a difference, like Wolfram|Alpha and Hakia. I don't know how many search engine users in India are using Guruji, an India-specific search engine; its Alexa rank is very poor :-( .

Let's come to the brand new search engine "Blekko"!! Blekko started the ground work for its search engine about 2.5 years ago. One of the co-founders of Blekko is a well-known person, Rich Skrenta, who created the first computer virus, "Elk Cloner". In June 2008 Techcrunch published an article about Blekko titled "The Next Google Search Challenger: Blekko". Since then people have been waiting anxiously for the release of the Blekko service. It is going to happen on 1st November 2010.

Blekko remains in private alpha up to midnight tonight. I got an opportunity to use the system in its private beta stage. Thanks to the Blekko team for providing access to the system.

Is Blekko different?

Like any typical search engine, Blekko is a full search engine. I don't know whether they can beat Google in index size, search speed and relevancy; the company says they are on par with Google and Bing for almost all queries. I just searched my name and the results were nice. When I search Google with the keyword "jaganadh" it displays results for "jagannath", but Blekko returns results for "jaganadh". Is that good or bad? I am a bit confused, because there are obvious reasons for calling it both good and bad. My name is purely Indian in nature, can be written with many spellings, and is the name of a famous god too, so it may be difficult for a search engine to guess the intention of the user. At the same time I am happy that Blekko gives exact results for a given spelling, because I was searching for mentions of my name. For a person who is searching about the god, I don't know :-) :-(. I tried many other keywords and I was satisfied with the results. I created a couple of slashtags too :-)

The new feature they introduce is called the 'slashtag'. Users can create their own slashtags based on a group of URLs. The slogan they put on the front page of Blekko is "get ready to slash the web". Let's wait a few more hours and see whether Blekko can "slash" the web or not. I know that there is nothing like a "Google killer", but I am still eager to see whether a search engine war will start or not :-)

Speech Recognition with Python

Recently I saw a talk listed on the CMU Sphinx site about speech recognition with Python and PocketSphinx. I downloaded the video and replicated the experiments; it was successful. To play with Python and speech recognition we have to install the following packages.
python-pocketsphinx
pocketsphinx-lm-wsj
pocketsphinx-hmm-wsj

These three packages are available in the Ubuntu repositories, so you can install them with apt-get.

If you are using Fedora, install the following packages with yum:

pocketsphinx-devel
pocketsphinx-libs
pocketsphinx-plugin
pocketsphinx-python
pocketsphinx

The language model and HMM files are not available in the Fedora repo, so I downloaded the source files from http://packages.ubuntu.com/en/maverick/pocketsphinx-lm-wsj and http://packages.ubuntu.com/fi/maverick/pocketsphinx-lm-wsj .

Then, following the instructions in the video, I prepared the Python code. To test the code I used wav files from the NLTK data (TIMIT corpus).

The Python code which I used is given below. It is also available at my Bitbucket: http://bitbucket.org/jaganadhg/blog/src/tip/asr/asr.py

#!/usr/bin/env python
import sys, os


def decodeSpeech(hmmd, lmdir, dictp, wavfile):
    """
    Decodes a speech file
    """
    try:
        import pocketsphinx as ps
        import sphinxbase
    except ImportError:
        print """pocketsphinx and sphinxbase are not installed
        on your system. Please install them with your package manager.
        """
        sys.exit(1)
    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    wavFile = file(wavfile, 'rb')
    wavFile.seek(44)  # skip the 44-byte WAV header; decode_raw expects raw samples
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()

    return result[0]


if __name__ == "__main__":
    hmdir = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/hmm/wsj1"
    lmd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP"
    dictd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.dic"
    wavfile = "/home/jaganadhg/Desktop/Docs_New/kgisl/sa1.wav"
    recognised = decodeSpeech(hmdir, lmd, dictd, wavfile)
    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
    print recognised
    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"

Happy Hacking
Related Entries:
Using Yahoo! Term Extractor web service with Python
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei

WordNet sense similarity with NLTK: some basics


The Natural Language Toolkit (NLTK) has a WordNet module which enables a developer to play around with it in different ways. Recently I saw a question regarding Word Sense Disambiguation with NLTK, so I planned to try finding sense similarity. NLTK has implementations of the following sense similarity algorithms.

1) Wu-Palmer Similarity:
Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

2) Resnik Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

3) Path Distance Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

4) Lin Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

5) Leacock-Chodorow Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

6) Jiang-Conrath Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

(Description taken from the NLTK help guide)

I just wrote a small piece of code to find Path Distance Similarity and Wu-Palmer Similarity. It returns all the scores by comparing all the synsets of the two given words.

The code is available at my bitbucket repo
#!/usr/bin/env python

from nltk.corpus import wordnet as wn


def getSenseSimilarity(worda, wordb):
    """
    Find similarity between the word senses of two words
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)
    synsetnamea = [wn.synset(str(syns.name)) for syns in wordasynsets]
    synsetnameb = [wn.synset(str(syns.name)) for syns in wordbsynsets]

    for sseta, ssetb in [(sseta, ssetb) for sseta in synsetnamea
                         for ssetb in synsetnameb]:
        pathsim = sseta.path_similarity(ssetb)
        wupsim = sseta.wup_similarity(ssetb)
        if pathsim is not None:
            print "Path Sim Score: ", pathsim, " WUP Sim Score: ", wupsim, \
                "\t", sseta.definition, "\t", ssetb.definition


if __name__ == "__main__":
    #getSenseSimilarity('cat','walk')
    getSenseSimilarity('cricket', 'score')


Related Entries:
NLTK and Indian Language corpus processing Part - I
NLP with Python NLTK Book
New book by Packt:'Python Text Processing with NLTK2.0 Cookbook'
Graphical works with NLTK
NLTK and Indian Language corpus processing Part-III

PyLucene in Action - Part I

PyLucene is a Python wrapper around Java Lucene. The goal of the tool is to make Lucene's text indexing and searching capabilities usable from Python, and it is compatible with the latest version of Java Lucene. PyLucene embeds a Java VM with Lucene into the Python process. More details on PyLucene can be found at http://lucene.apache.org/pylucene/.

In this blog post I am going to demonstrate how to build a search index and query it with PyLucene. You can find the installation instructions for PyLucene in my previous blog post.
1) Creating an index with PyLucene

I am using the code given below to create an index with PyLucene.

===Code Python BEGIN ===========================================

#!/usr/bin/env python
import os, sys, glob
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
    StandardAnalyzer, IndexWriter, Version

"""
Example of Indexing with PyLucene 3.0
"""

def luceneIndexer(docdir, indir):
    """
    Index documents from a directory
    """
    lucene.initVM()
    DIRTOINDEX = docdir
    INDEXIDR = indir
    indexdir = SimpleFSDirectory(File(INDEXIDR))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    index_writer = IndexWriter(indexdir, analyzer, True, \
        IndexWriter.MaxFieldLength(512))
    for tfile in glob.glob(os.path.join(DIRTOINDEX, '*.txt')):
        print "Indexing: ", tfile
        document = Document()
        content = open(tfile, 'r').read()
        document.add(Field("text", content, Field.Store.YES, \
            Field.Index.ANALYZED))
        index_writer.addDocument(document)
        print "Done: ", tfile
    index_writer.optimize()
    print index_writer.numDocs()
    index_writer.close()
==== Code Python END ============================================

You have to supply two parameters to luceneIndexer():
a) a path to the directory where the documents to be indexed are stored
b) a path to the directory where the index should be saved
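For example, a call might look like this (the paths below are just placeholders for wherever your .txt files and index should live):

luceneIndexer("/path/to/my/documents", "/path/to/my/index")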

2) Querying an index with PyLucene

The code given below queries an index with PyLucene.

======= Code Begin Python =======================================
#!/usr/bin/env python
import sys
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
    StandardAnalyzer, IndexSearcher, Version, QueryParser

"""
PyLucene retriever simple example
"""

INDEXDIR = "./MyIndex"

def luceneRetriver(query):
    lucene.initVM()
    indir = SimpleFSDirectory(File(INDEXDIR))
    lucene_analyzer = StandardAnalyzer(Version.LUCENE_30)
    lucene_searcher = IndexSearcher(indir)
    my_query = QueryParser(Version.LUCENE_30, "text", \
        lucene_analyzer).parse(query)
    MAX = 1000
    total_hits = lucene_searcher.search(my_query, MAX)
    print "Hits: ", total_hits.totalHits
    for hit in total_hits.scoreDocs:
        print "Hit Score: ", hit.score, "Hit Doc:", hit.doc, "Hit String:", hit.toString()
        doc = lucene_searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")

luceneRetriver("really cool restaurant")
===============================================================

In the code I have manually specified the index directory (INDEXDIR = "./MyIndex"). Instead, one could accept the index directory as a command line parameter (via sys.argv) too.

When calling the function luceneRetriver() you have to pass the query as a parameter.

The source code is available in bitbucket https://bitbucket.org/jaganadhg/pyluceneia

Happy Hacking !!!!!!!!!!
Related Entries:
Using Yahoo! Term Extractor web service with Python
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNei

Historic paper on the origin and development of Indian Language Technologies

Recently I came across an interesting research paper on Indian Language technology. The title of the paper is "A Journey from Indian Scripts Processing to Indian Language Processing" by Dr. R.M.K. Sinha of IIT Kanpur. The article appeared in 'IEEE Annals of the History of Computing'. The author is a pioneer and leading researcher in Indian Language Processing; without his contributions Indian Language Technologies might not have matured this much. I am not going to comment on his writing, because I am a humble disciple of Dr. Sinha.

The article is a fairly long one (24 pages), so I will just give the main points discussed in the paper.
The article begins with a short and informative introduction to Indian languages. In the introductory section he discusses the digital divide and its impact in a multilingual country like India. Then he proceeds directly to the features of Indian scripts. In this section he discusses the logic of conjunct formation in Indian languages, and the similarities and dissimilarities among Indic scripts. The 'script composition grammar' of Indic languages is also discussed here. In the next section he covers the history of printing in India and the developments from printing to computers; the development of the typewriter for Devanagari is discussed in detail. Then he moves to the efforts in the computer processing of Indian languages by IIT Kanpur and others. He narrates the history of the Integrated Devanagari Computer (IDC) and GIST technology, and the evolution of the ISCII, ISFOC and InScript standards; a detailed history of GIST/ISM technology is given in the article. A brief description of the history of Indian language word processors is also there, along with an account of OCR technology for Indian languages. He then proceeds to the development of other Natural Language Processing activities in India.

The article is truly a historical account of Indian Language Technology, one which every person working in Indian Language Technology should read.

Bibliographic Details:

A Journey from Indian Scripts Processing to Indian Language Processing
Sinha, R.M.K.
IEEE Annals of the History of Computing, Volume 31, Issue 1, Jan.-March 2009, pages 8-31
DOI: 10.1109/MAHC.2009.1
