Taming Text : Review

We are living in the era of the Information Revolution. Every day a vast amount of information is created and disseminated over the World Wide Web (WWW). Even though each piece of information published on the web is useful in some way, we often need to identify and extract the relevant parts. Such information extraction includes identifying person names and organization names, finding the category of a text, identifying the sentiment of a tweet, and so on. Processing large amounts of text data from the web is a challenging task because of information overload. As more information appears, there is a growing demand for smart and intelligent processing of text data. The field of text analytics has attracted the attention of developers around the globe, and many practical as well as theoretical books have been published on the topic.

This book, "Taming Text", written by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris, is an excellent source for developers and researchers who are interested in learning Text Analytics. The book focuses on practical Text Analytics techniques such as classification, clustering, string matching, searching and entity identification. It provides easy-to-follow examples using well-known Open Source Text Analytics tools such as Apache Mahout, Apache Lucene, Apache Solr and OpenNLP. The entire book is based on the authors' experience in contributing to the relevant Open Source tools, their hands-on experience and their industry exposure. It is a must-read for Text Analytics developers and researchers. Given the increasing importance of Text Analytics, this book can serve as a handbook for budding Text Analytics developers and industry people. It can definitely be used in Natural Language Processing, Machine Learning and Computational Linguistics courses.

Chapter 1: Getting Started Taming Text
The first chapter of the book introduces what taming text means. The authors give a list of challenges in text processing with brief explanations. The chapter is mostly introductory material.

Chapter 2: Foundations of Taming Text
This chapter gives a quick refresher on high school English grammar. Starting from words, the authors present the essential linguistic concepts required for text processing. I think "Taming Text" may be the first technical book that gives a good warm-up on the basics of language and grammar. The chapter gives a detailed introduction to words, parts of speech, phrases and morphology, which is sufficient to capture the essential linguistic aspects of text processing for a developer. The second part of the chapter deals with basic text processing tasks such as tokenization, sentence splitting, Part of Speech (POS) tagging and parsing. Code snippets are given for each task, and all of the code examples use OpenNLP. The chapter also covers the basics of handling different file formats using Apache Tika. Overall, it gives a step-by-step introduction to the preliminaries of text processing.

Chapter 3: Searching
This chapter introduces the art of search. It gives a brief but descriptive account of the search mechanism and of what goes on behind the scenes. The chapter discusses the basics of search with the help of Apache Solr. There is an interesting discussion on search evaluation, search performance enhancements and PageRank too. The chapter gives a detailed list of Open Source search engines, but I think the authors forgot to add "Elasticsearch" to the list. I hope it will be added in the final print version of the book.

Chapter 4: Fuzzy String Matching
Everybody might have wondered how the "Did you mean:" feature in Google or any other search engine works. Long ago I saw a question on Stack Overflow asking whether source code for a "Did you mean:" feature was available (or something similar, I think). If you wonder how this feature works, this chapter will give you enough knowledge to implement something similar. There is a simple discussion of different fuzzy string matching algorithms with code samples, and there are practical examples of how to implement the "Did you mean" and type-ahead (auto-suggest) utilities on Apache Solr. Overall, this chapter gives a solid introduction to, and hands-on experience with, fuzzy string matching.
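To give a feel for the idea, here is a minimal Python sketch of one common fuzzy matching technique, Levenshtein edit distance, used to pick "Did you mean" candidates from a small vocabulary. It is not taken from the book (whose examples use Solr); the function names `edit_distance` and `did_you_mean`, the sample vocabulary and the `max_distance` threshold are my own assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def did_you_mean(query, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first.

    The threshold of 2 is an illustrative assumption, not a recommendation.
    """
    scored = [(edit_distance(query, word), word) for word in vocabulary]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


if __name__ == "__main__":
    # Hypothetical vocabulary, just for the demo.
    print(did_you_mean("solt", ["solr", "mahout", "lucene"]))
```

A real search engine would not compare the query against every term like this; it would use an index (for example n-grams) to narrow the candidates first, and rank them by how often each term occurs.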

Chapter 5: Identifying People, Places and Things
Diving deeper into the text processing ocean, the authors cover many more advanced concepts in text processing starting from this chapter. The main focus of this chapter is Named Entity Identification, also known as Named Entity Recognition (NER), one of the core tasks in Information Extraction and Retrieval. The chapter gives a good introduction to the task along with code samples using OpenNLP. The code samples will help you get your hands dirty. There is a section that deals with how to train OpenNLP to adapt to a new domain, which will be one of the most useful tips for working professionals. The only thing I feel is missing is a mention of GATE and Apache UIMA. Both tools are well known for their ability to accomplish the NER task.

Chapter 6: Clustering Text
The sixth chapter deals mainly with clustering. "Clustering is an unsupervised (i.e. no human intervention required) task that can automatically put related content into buckets" (taken from the book "Taming Text"). The initial part of the chapter explains clustering with reference to real-world applications. There is also a decent discussion of clustering techniques and clustering evaluation. Code examples for clustering are given in this chapter. Apache Solr, Apache Mahout and Carrot2 are used to give practical examples of clustering.

Chapter 7: Classification, Categorization and Tagging
The seventh chapter deals with document classification. As in the other chapters, there is a reasonable discussion of document classification techniques. This chapter will teach you how to perform document classification with Apache Lucene, Apache Solr, Apache Mahout and OpenNLP. There is an interesting project called 'tag recommender' in this chapter. The only hiccup I faced with this chapter is the "TT_HOME" environment variable, which is used throughout the book. I think the authors forgot to mention how to set TT_HOME. I was familiar with Apache Mahout, so there was no issue with the MAHOUT_HOME environment variable. A total newbie will find it difficult to work out the TT_HOME and MAHOUT_HOME variables used in the code samples. A little light on setting these variables would help readers a lot. I think this will be included in the final copy (I am reading a MEAP version).
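Until that is documented, a small check like the following can at least tell you which of the two variables is missing before you run the samples. This is only a sketch of mine, not something from the book: the helper name `missing_env_vars` is hypothetical, and the script deliberately does not guess what the variables should point to, since the MEAP copy does not say.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # TT_HOME and MAHOUT_HOME are the two variables named in this review.
    missing = missing_env_vars(["TT_HOME", "MAHOUT_HOME"])
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("Environment looks ready.")
```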

Chapter 8: An Example Application: Question Answering

This chapter gives hands-on experience in taming text. The entire chapter is dedicated to building a Question Answering project using the techniques discussed in the earlier chapters. It is a simple get-your-hands-dirty chapter. Here, too, you will be caught by the TT_HOME ghost.

Chapter 9: Untamed Text: Exploring the Next Frontier

The last chapter, "Untamed Text: Exploring the Next Frontier", mentions other areas in text processing such as semantics, pragmatics and sentiment analysis. A brief account of each of these fields is included in the chapter. There are lots of pointers to useful tools for advanced text processing tasks such as text summarisation and relation extraction.

Conclusion
Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have done a nice job in authoring this book, with lucid explanations and practical examples for different text processing challenges. With the help of simple, well-explained examples the authors demonstrate how to solve real-world text processing challenges using Free and Open Source tools. The algorithm discussions in the book are so simple that even a newbie can follow the concepts without many hiccups. It is a good desktop reference for people who would like to start with text processing, and it provides comprehensive, hands-on experience. So grab a copy soon and be ready for Big Data analysis.

Free and Open Source Tools Discussed in the Book
Apache Solr
Apache Lucene
Apache Mahout
Apache OpenNLP
Carrot2

Disclaimer: I received a review copy of the book from Manning.


Seven years of 'humanity to others' through a Free Operating System

On 20th Oct. 2004 Canonical Ltd. announced the first release of the world's most popular and sexy operating system, Ubuntu. The first release was code-named "Warty Warthog". From then onwards Ubuntu has been a grand success, with more than 20 million users across the globe. Within seven years, a totally free and open source operating system has attained more popularity than any similar proprietary operating system. This is a remarkable achievement by the people behind the Ubuntu operating system. The people who dedicated their free time to writing code, testing and using the operating system, together with Canonical Ltd., made it possible. And they continue the journey to serve humanity with a better operating system that satisfies the computing needs of the "Common Man". Kudos to the entire team behind Ubuntu !!!


Using Yahoo! Term Extractor web service with Python

Yesterday I was listening to Van Lindberg's talk at PyCon US 2011 about patent mining. In his talk he mentioned the Yahoo! Term Extractor Web Service. Some time ago I had heard that the service was no longer available. I checked the web site again and found that it is working now. I played with the web service using a simple Python script. Soon after seeing that it was working fine, I created a quick and dirty Python API for the web service. I am sharing the code and documentation here:
Code:
https://bitbucket.org/jaganadhg/yahootermextract/overview
Sample :
https://bitbucket.org/jaganadhg/yahootermextract/wiki/Home

After finishing the work I searched the net and found that some similar scripts are already available :-(
http://kwc.org/blog/archives/2005/2005-04-04.yahoo_term_extraction_examples.html

http://effbot.org/zone/yahoo-term-extraction.htm

Happy Hacking !!!!!!!!!!!!


FOSS Workshop at PSR Engineering College Sivakasi

A one-day workshop on Free and Open Source Software was conducted at PSR Engineering College, Sevalpatti, Sivakasi, Tamilnadu. I was invited to give an introduction to Python at the workshop. Mr. Chidambaresan, an alumnus of PSR Engineering College, chose me for the workshop. I reached Sattur at 4.30 in the morning and Chidambaresan took me to a lodge to freshen up. By 7.30 A.M. Chidambaresan and his friend arrived at the lodge and we set off for the college. On the way we picked up Suthan, HOD, MCA, Sivanthi Engineering College, from Kovilpatti. After having breakfast at Kovilpatti we headed towards the college. During the journey we discussed FOSS and the engineering syllabus, ILUGC, ILUGCBE and a bit of politics too ;-). We reached the college by 09.30 A.M. and met the HOD, the faculty members and the Principal of the college. We had a nice discussion about students and their attitude to learning, and about the necessity of motivating students to learn FOSS and contribute to it. The college is located in a very nice, pleasant village called Sevalpatti. Most of the students are from nearby villages or towns.

We started the workshop at 10.10 A.M. There was a small meeting that introduced Suthan and me to the students. After the introduction Suthan started his session on an introduction to FOSS and the FOSS philosophy. He began the talk with nice examples and explained the concept of FOSS and its history. He even explained it in Tamil, so that the message would reach the hearts of the students. After the talk he gave a demo on how to install Ubuntu. The students came up with lots of questions about FOSS. After lunch we started the second session, on Python programming. I (Jaganadh G) gave an interactive lecture on the basics of Python programming. The students were surprised to learn such a simple and powerful language. Then Chidambaresan gave a short and inspiring talk on the need for students to learn and contribute to FOSS. After the Python session there was a valedictory function too. The HOD and faculty members of the IT department were present at the function. The HOD of the IT department distributed certificates and Ubuntu CDs to the participants. Some of the students came forward to give feedback on the workshop. This is the first time FOSS has been introduced in the college, and Chidambaresan took great pains to make it happen. We hope that some of the students will follow the message of FOSS.

The workshop ended at 4.00 PM and we returned to Sattur by car, hoping that FOSS will flourish in the Sivakasi region.

Some snaps from the workshop

PSR_Sivakasi_Workshop


CSV to CouchDB data importing, a Python hack

Last month I was playing with Apache CouchDB. Just some introductory stuff, map reduce etc... Soon I received some linguistic data in .csv format as part of a project I was managing. There was a need to analyze it. Usually we used MySQL or spreadsheets to store and analyze the data. Suddenly I thought, why can't I do it with CouchDB ?? There was no direct option for importing CSV data into CouchDB. I searched the web and ended up with a hint. Manilal, a friend of mine, also pointed me to the same hint: http://www.apacheserver.net/Load-CSV-file-into-couchdb-at1056996.htm .

Soon I created a small script to do the job, that is, to load a CSV file into CouchDB. The script is available in my Bitbucket repo: https://bitbucket.org/jagan/misc/src/84cefb61c86a/csv2couch.py . It is a quick solution. Maybe you have a better version !!! I thought putting it on the web might help somebody else.
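For readers who just want the general idea, here is a minimal sketch of one way to do it using only the Python standard library. It is not the script from the repo above. It turns each CSV row into a JSON document keyed by the header row and posts the whole batch to CouchDB's `_bulk_docs` endpoint. The function names, the sample data and the database URL in the usage comment are assumptions for illustration.

```python
import csv
import io
import json
import urllib.request


def csv_to_docs(csv_file):
    """Turn each CSV row into a dict keyed by the header row."""
    return [dict(row) for row in csv.DictReader(csv_file)]


def bulk_payload(docs):
    """Encode documents in the {"docs": [...]} shape that _bulk_docs expects."""
    return json.dumps({"docs": docs}).encode("utf-8")


def upload(docs, db_url):
    """POST the documents to <db_url>/_bulk_docs and return CouchDB's reply."""
    request = urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=bulk_payload(docs),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Made-up sample data; a real run would open the .csv file instead.
    sample = io.StringIO("word,pos\nrun,VERB\ntext,NOUN\n")
    docs = csv_to_docs(sample)
    print(docs)
    # Assumed local database URL; uncomment only with a CouchDB instance running.
    # print(upload(docs, "http://localhost:5984/linguistic_data"))
```

Every value arrives as a string, because the csv module does not infer types; convert numeric columns yourself if your views need them.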


Happy Hacking !!!


My Village comes to Openstreetmap

My village, Kamukumchury (in Kollam District, Kerala State), has now come to OpenStreetMap too. Last week I paid a visit to my village. I brought a GPS device with me and mapped some parts of the village. I also traced some roads and marked them. Some of the roads that pass through my village were shown as mere straight lines, so I corrected them based on the GPS traces.

My village in Openstreetmap



Playing with Openstreetmap

We (Biju B, Anand Ganeshan, and I) have started working on OpenStreetMap. Kenneth Gonsalvas offered a GPS device (GDL-3204 by Sparc System) for mapping. Initially we started by marking shops, buildings and some routes. My team has now become familiar with the JOSM editor and Potlatch.

We started by mapping the Ganapathy and Lakshmi Mills areas in Coimbatore. Some of our edits can be viewed below.

Lakshmi Mills Jn.



Ganapathy Area



Fedora 14 "Laughlin" is coming soon

Fedora 14, code-named "Laughlin", is coming soon !!!
Fedora 14 Laughlin released in 17 days.

Swathanthra Malayalam Computing Localization Camp, Palakkad, 10 and 11 July 2010

Palakkad
July 8, 2010
Under the leadership of Swathanthra Malayalam Computing, and with the cooperation of Zyxware Technologies, the Palakkad Libre Software Users Society and the Swathanthra Janadhipathya Sakhyam (Free Democratic Alliance), a two-day localization camp is being held at Big Bazaar School (Valiyangadi School) on July 10 and 11 (Saturday and Sunday). The two-day camp aims to involve ordinary people in the work of making free software available in Malayalam and to give them the necessary training. The camp also aims to add the school where it is held, and the roads near it, to OpenStreetMap, the free map project. Swathanthra Malayalam Computing is a collective of volunteers working in the field of Malayalam language computing based on free software.
There are no conditions for taking part in the camp. Anyone who is interested in using a computer in Malayalam, in learning about the possibilities of Malayalam computing, and in joining the activities can participate. Admission is free. Those attending should either register on the website given below or call and inform the volunteers listed below. Camps have already been completed successfully in six places: Kozhikode, Pune, Thiruvananthapuram, Angamaly, Kochi and Kuttippuram.
The first day's programme includes training on how to use Malayalam on a computer, an explanation of the technical aspects involved, and a discussion on the importance of Malayalam computing. Along with introducing the systems and procedures for making software available in Malayalam, the camp aims to translate some free software into Malayalam collaboratively. There will also be a discussion on the need for cultural localization of the artistic side of computer use, such as wallpapers and screensavers, and on its technical aspects. Students of Irumpanam VHSS have already set an example in this area by adding the flowers of Kerala to Tux Paint, a drawing program.
The second day's programme will begin with the translation and quality assurance of free software philosophy articles. The Malayalam translation of the games in the KDE software collection will also continue on the second day. For more information about Swathanthra Malayalam Computing and the camp, and to register for the camp, visit the website http://www.smc.org.in or contact the numbers given below.

Fedora-13 "Goddard" coming soon !!!!!!

Fedora 13, code-named "Goddard", is coming soon !!!
