
"Blekko" a new Search Engine launches tomorrow (1st Nov. 2010)

"Blekko" a new Search Engine launches tomorrow (1st Nov. 2010)

Starting a new search start-up is a bit difficult and requires extraordinary preparation, both technical and mental. Within the past few years we have met many new search engines. "Cuil" was one of them; it was considered one of the promising startups in Silicon Valley, started by a group of ex-Google employees. But on September 17, 2010, Cuil disappeared from the World Wide Web :-( . We have also seen the rise of search engines with a difference, like Wolfram|Alpha and Hakia. I don't know how many search engine users in India are using Guruji, an India-specific search engine; Guruji's Alexa rank is very poor :-( .

Let's come to the brand new search engine "Blekko"! Blekko started the groundwork for the search engine about 2.5 years ago. One of the co-founders of Blekko is a well-known person: Rich Skrenta, who created the first computer virus, "Elk Cloner". In June 2008 TechCrunch published an article about Blekko titled "The Next Google Search Challenger: Blekko". Since then people have been waiting anxiously for the release of the Blekko service. It is going to happen on 1st November 2010.

Blekko remains in private alpha until midnight tonight. I got an opportunity to use the system in its private alpha stage. Thanks to the Blekko team for providing access to the system.

Is Blekko different?

Like any typical search engine, Blekko is a full web search engine. I don't know whether they can beat Google on index size, search speed, and relevancy; the company says they are on par with Google and Bing for almost all queries. I just searched my name and the results were nice. When I search Google with the keyword "jaganadh" it displays results for "jagannath", but Blekko returns results for "jaganadh". If you ask whether that is good or bad, I am a bit confused, because there are obvious reasons for calling it both at the same time. My name is Indian in nature, can be written with many spellings, and is the name of a famous god too, so it may be difficult for a search engine to guess the user's intention. At the same time I am happy that Blekko gives exact results for a given spelling, because I was searching for mentions of my name. For a person who is searching about the god, I don't know :-) :-(. I tried many other keywords and I was satisfied with the results. I created a couple of slashtags too :-)

The new feature they introduce is called the "slashtag". Users can create their own slashtags based on a group of URLs, and a query such as "python tutorial /blogs" (where /blogs is a hypothetical user-created slashtag) would restrict the results to the sites grouped under that tag. The slogan they put on the front page of "Blekko" is "get ready to slash the web". Let's wait a few more hours and see whether Blekko can "slash" the web or not. I know that there is nothing like a "Google killer". Still, I am eager to see whether a search engine war will start or not :-)

Speech Recognition with Python
Recently I saw a talk listed on the CMU Sphinx site about speech recognition with Python and PocketSphinx. I downloaded the video and replicated the experiments; it was successful. To play with Python and speech recognition, we have to install the following packages:
python-pocketsphinx
pocketsphinx-lm-wsj
pocketsphinx-hmm-wsj

These three packages are available in the Ubuntu repositories, so you can install them with apt-get, as shown below.
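
For example, a single command should do it (package names as listed above):

sudo apt-get install python-pocketsphinx pocketsphinx-lm-wsj pocketsphinx-hmm-wsj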

If you are using Fedora, install the following packages with yum (an example command follows the list):

pocketsphinx-devel
pocketsphinx-libs
pocketsphinx-plugin
pocketsphinx-python
pocketsphinx
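
For example, something like:

sudo yum install pocketsphinx pocketsphinx-devel pocketsphinx-libs pocketsphinx-plugin pocketsphinx-python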

The language model and HMM files are not available in the Fedora repos, so I downloaded the source files from http://packages.ubuntu.com/en/maverick/pocketsphinx-lm-wsj and http://packages.ubuntu.com/fi/maverick/pocketsphinx-lm-wsj .

Then, following the instructions in the video, I prepared a Python script. To test the code I used wav files from the NLTK data (TIMIT corpus).

The Python code which I used is given below. It is also available at my bitbucket: http://bitbucket.org/jaganadhg/blog/src/tip/asr/asr.py

#!/usr/bin/env python
import sys

def decodeSpeech(hmmd, lmdir, dictp, wavfile):
    """
    Decode a speech file with PocketSphinx.

    hmmd    : path to the acoustic model (HMM) directory
    lmdir   : path to the language model file
    dictp   : path to the pronunciation dictionary
    wavfile : path to the wav file to decode
    """
    try:
        import pocketsphinx as ps
        import sphinxbase
    except ImportError:
        print """PocketSphinx and sphinxbase are not installed
on your system. Please install them with your package manager."""
        sys.exit(1)
    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    wavFile = open(wavfile, 'rb')
    # Skip the 44-byte wav header; decode_raw() expects raw PCM samples
    wavFile.seek(44)
    speechRec.decode_raw(wavFile)
    # get_hyp() returns (hypothesis string, utterance id, score)
    result = speechRec.get_hyp()

    return result[0]

if __name__ == "__main__":
    hmdir = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/hmm/wsj1"
    lmd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP"
    dictd = "/home/jaganadhg/Desktop/Docs_New/kgisl/model/lm/wsj/wlist5o.dic"
    wavfile = "/home/jaganadhg/Desktop/Docs_New/kgisl/sa1.wav"
    recognised = decodeSpeech(hmdir, lmd, dictd, wavfile)
    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
    print recognised
    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"

Happy Hacking
Related Entries:
Using Yahoo! Term Extractor web service with Python
Python workshop at Kongu Engineering College, Perundurai
FOSS Workshop at PSR Engineering College Sivakasi
CSV to CouchDB data importing, a Python hack
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNeil

Fedora 14 "Laughlin" is coming soon!

Fedora 14, code-named "Laughlin", will be released in 17 days.
Free and Open Source, Free Software, GNU/Linux, Fedora

WordNet sense similarity with NLTK: some basics

The Natural Language Toolkit (NLTK) has a WordNet module which lets a developer play around with WordNet in different ways. Recently I saw a question about Word Sense Disambiguation with NLTK, so I planned to try finding sense similarity. NLTK has implementations of the following sense similarity algorithms.

1) Wu-Palmer Similarity:
Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

2) Resnik Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

3) Path Distance Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

4) Lin Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

5) Leacock-Chodorow Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

6) Jiang-Conrath Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

(Descriptions taken from the NLTK help guide.)
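
The three IC-based measures (Resnik, Lin and Jiang-Conrath) take an information content dictionary as an extra argument. Here is a minimal sketch, assuming the 'wordnet_ic' data package has been downloaded with nltk.download():

#!/usr/bin/env python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content statistics computed from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print "Resnik score: ", dog.res_similarity(cat, brown_ic)
print "Lin score: ", dog.lin_similarity(cat, brown_ic)
print "Jiang-Conrath score: ", dog.jcn_similarity(cat, brown_ic)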

I just wrote a small piece of code to find Path Distance Similarity and Wu-Palmer Similarity. It returns all the scores, comparing all the synsets of the two given words.

The code is available at my bitbucket repo
#!/usr/bin/env python

from nltk.corpus import wordnet as wn

def getSenseSimilarity(worda, wordb):
    """
    Find similarity between the word senses of two words.
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)

    # Compare every sense of the first word with every sense of the second
    for sseta in wordasynsets:
        for ssetb in wordbsynsets:
            pathsim = sseta.path_similarity(ssetb)
            wupsim = sseta.wup_similarity(ssetb)
            # path_similarity() returns None when no connecting path exists
            if pathsim is not None:
                print "Path Sim Score: ", pathsim, " WUP Sim Score: ", wupsim, \
                    "\t", sseta.definition, "\t", ssetb.definition

if __name__ == "__main__":
    #getSenseSimilarity('cat','walk')
    getSenseSimilarity('cricket', 'score')

AI, Computational Linguistics, Natural Language Processing, NLTK, Python
Related Entries:
NLTK and Indian Language corpus processing Part - I
NLP with Python NLTK Book
New book by Packt: 'Python Text Processing with NLTK 2.0 Cookbook'
Graphical works with NLTK
NLTK and Indian Language corpus processing Part-III