Taming Text : Review

    We are living in the era of the Information Revolution. Every day a vast amount of information is created and disseminated over the World Wide Web (WWW). Even though each piece of information published on the web is useful in some way, we often need to identify and extract only the relevant or useful parts. Such extraction includes identifying person names, organization names and so on, finding the category of a text, identifying the sentiment of a tweet, etc. Processing large amounts of text data from the web is a challenging task because of information overload, and as more information appears there is a growing demand for smart, intelligent processing of text data. The field of text analytics has attracted the attention of developers around the globe, and many practical as well as theoretical books have been published on the topic.

This book, "Taming Text", written by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris, is an excellent source for text analytics developers and researchers who are interested in learning the field. The book focuses on practical techniques such as classification, clustering, string matching, searching and entity identification, and provides easy-to-follow examples using well-known open source tools like Apache Mahout, Apache Lucene, Apache Solr and OpenNLP. The entire book is based on the authors' experience contributing to the relevant open source projects, their hands-on work and their industry exposure. It is a must-read for text analytics developers and researchers. Given the increasing importance of text analytics, the book can serve as a handbook for budding developers and industry practitioners, and it can certainly be used in Natural Language Processing, Machine Learning and Computational Linguistics courses.

Chapter 1: Getting Started Taming Text
The first chapter of the book introduces what "taming text" means. The authors give a list of challenges in text processing with brief explanations. The chapter is mostly introductory material.

Chapter 2: Foundations of Taming Text
This chapter gives a quick warm-up of your high-school English grammar. Starting from words, the authors present the essential linguistic concepts required for text processing. I think "Taming Text" may be the first technical book that gives such a good warm-up on the basics of language and grammar. The chapter gives a detailed introduction to words, parts of speech, phrases and morphology, which is sufficient for a developer to capture the essential linguistic aspects of text processing. The second part of the chapter deals with basic text processing tasks such as tokenization, sentence splitting, Part of Speech (POS) tagging and parsing. Code snippets for each task are given, all narrated with the tool OpenNLP. The chapter also covers the basics of handling different file formats using Apache Tika. Overall it is a step-by-step introduction to the preliminaries of text processing.
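The book's examples for these steps are in Java with OpenNLP; as a rough Python analogue (my own sketch, not code from the book, and it assumes the required NLTK tokenizer and tagger models have been downloaded), the same preliminary pipeline looks like this:

    import nltk  # assumes the punkt tokenizer and POS tagger models are installed

    text = "Taming text is hard. It is also fun."
    for sentence in nltk.sent_tokenize(text):    # sentence splitting
        tokens = nltk.word_tokenize(sentence)    # tokenization
        print nltk.pos_tag(tokens)               # part of speech tagging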

Chapter 3: Searching
This chapter introduces the art of search. It gives a brief but narrative description of the search mechanism and what happens behind the curtains. The basics of search are discussed with the help of Apache Solr. There is an interesting discussion on search evaluation, search performance enhancements and page rank too. The chapter gives a detailed list of open source search engines, but I think the authors forgot to add the "Elasticsearch" library to the list; I hope it will be added in the final print version of the book.

Chapter 4: Fuzzy String Matching
Everybody has probably wondered how the "Did you mean:" feature in Google or any other search engine works. Long ago I saw a question on Stack Overflow asking about the availability of source code for the "Did you mean:" feature (or something similar, I think). If you wonder how this feature works, this chapter will give you enough knowledge to implement something like it. There is a simple discussion of different fuzzy string matching algorithms with code samples, along with practical examples of how to implement "Did you mean" and type-ahead (auto-suggest) features on top of Apache Solr. Overall this chapter gives a solid introduction and hands-on experience with fuzzy string matching.
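To give a flavour of what is under the hood, here is the classic Levenshtein edit distance in plain Python. This is a generic illustration of the kind of algorithm the chapter discusses, not code from the book; NLTK users can also reach for the ready-made edit_distance() in nltk.metrics.

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        m, n = len(a), len(b)
        table = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            table[i][0] = i
        for j in range(n + 1):
            table[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                table[i][j] = min(table[i - 1][j] + 1,         # deletion
                                  table[i][j - 1] + 1,         # insertion
                                  table[i - 1][j - 1] + cost)  # substitution
        return table[m][n]

    print edit_distance("recieve", "receive")   # prints 2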

Chapter 5: Identifying People, Places and Things
Diving deeper into the text processing ocean, the authors introduce more advanced concepts starting from this chapter. Its main focus is Named Entity Recognition (NER), one of the central tasks in information extraction and retrieval. The chapter gives a good introduction to the task along with code samples using OpenNLP that will help you get your hands dirty. There is a section on how to train OpenNLP to adapt to a new domain, which will be one of the most useful tips for working professionals. The only thing I feel is missing is a mention of GATE and Apache UIMA, both of which are well known for their NER capabilities.
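For readers who would like to try NER from Python rather than Java, NLTK exposes a simple chunker. This is a sketch of my own, not code from the book, and it assumes the NLTK named entity chunker models have been downloaded:

    import nltk

    sentence = "Grant Ingersoll works on Apache Lucene in the United States."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = nltk.ne_chunk(tagged)   # marks PERSON, ORGANIZATION, GPE chunks
    print tree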

Chapter 6: Clustering Text
The sixth chapter mainly deals with clustering. "Clustering is an unsupervised task (i.e. no human intervention is required) that can automatically put related content into buckets" [taken from the book "Taming Text"]. The initial part of the chapter explains clustering with reference to real-world applications, and there is a decent discussion of clustering techniques and clustering evaluation. Code examples are given using Apache Solr, Apache Mahout and Carrot2.

Chapter 7: Classification, Categorization and Tagging
The seventh chapter deals with document classification. As in the other chapters, there is a reasonable discussion of document classification techniques, and the chapter teaches you how to perform classification with Apache Lucene, Apache Solr, Apache Mahout and OpenNLP. There is an interesting 'tag recommender' project in this chapter. The only hiccup I faced was the "TT_HOME" environment variable used throughout the book; I think the authors forgot to mention how to set it. I was familiar with Apache Mahout, so the MAHOUT_HOME environment variable was no issue, but a total newbie will find it difficult to work out the TT_HOME and MAHOUT_HOME variables used in the code samples. A little light on setting these variables would help readers a lot. I expect this will be included in the final copy (I am reading a MEAP version).

Chapter 8: An Example Application: Question Answering

This chapter gives hands-on experience in taming text. The entire chapter is dedicated to building a Question Answering project using the techniques discussed in the previous chapters: a simple get-your-hands-dirty exercise. Here too you will be caught by the TT_HOME ghost.

Chapter 9: Untamed Text: Exploring the Next Frontier

The last chapter, "Untamed Text: Exploring the Next Frontier", mentions other areas of text processing such as semantics, pragmatics and sentiment analysis. A brief narration of each of these fields is included, along with lots of pointers to useful tools for advanced tasks like text summarisation and relation extraction.

Conclusion
Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have done a nice job of authoring this book, with lucid explanations and practical examples for different text processing challenges. With the help of simple, narrative examples the authors demonstrate how to solve real-world text processing problems using free and open source tools. The algorithm discussions are simple enough that even a newbie can follow the concepts without many hiccups. It is a good desktop reference for people who would like to start with text processing, and it provides comprehensive, hands-on experience. So grab a copy soon and be ready for Big Data analysis.

Free and Open Source Tools Discussed in the Book
Apache Solr
Apache Lucene
Apache Mahout
Apache OpenNLP
Carrot2

Disclaimer : I received a review copy of the book from Manning


Using Yahoo! Term Extractor web service with Python

Yesterday I was listening to Van Lindberg's talk at PyCon US 2011 about patent mining. In his talk he mentioned the Yahoo! Term Extractor web service. Some time ago I had heard that the service was no longer available, but when I checked the web site again I found that it is working now. I played with the web service using a simple Python script, and soon after seeing that it works fine I created a quick-and-dirty Python API for it. I am sharing the code and documentation here.
Code:
https://bitbucket.org/jaganadhg/yahootermextract/overview
Sample :
https://bitbucket.org/jaganadhg/yahootermextract/wiki/Home
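For the curious, here is roughly what a call to the service looks like. This is a stripped-down sketch written from memory, not the bitbucket API linked above, so the endpoint URL, parameter names and response format should be treated as assumptions and checked against Yahoo!'s documentation:

    import urllib
    import urllib2
    from xml.etree import ElementTree

    # assumed V1 endpoint of the Term Extraction service
    YTE_URL = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

    def extract_terms(appid, text):
        """POST some text to the service and return the extracted terms."""
        params = urllib.urlencode({"appid": appid, "context": text})
        response = urllib2.urlopen(YTE_URL, params).read()
        root = ElementTree.fromstring(response)
        # each extracted term is assumed to come back as a <Result> element
        return [node.text for node in root.getiterator() if node.tag.endswith("Result")]

    print extract_terms("YOUR_APP_ID", "italian sculptors and painters of the renaissance")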

After finishing the work I searched the net and found that some similar scripts are already available :-(
 http://kwc.org/blog/archives/2005/2005-04-04.yahoo_term_extraction_examples.html

http://effbot.org/zone/yahoo-term-extraction.htm

Happy Hacking !!!!!!!!!!!!


WordNet sense similarity with NLTK: some basics


The Natural Language Toolkit (NLTK) has a WordNet module which enables a developer to play around with WordNet in different ways. Recently I saw a question regarding Word Sense Disambiguation with NLTK, so I planned to try finding sense similarity. NLTK implements the following sense similarity algorithms.

1) Wu-Palmer Similarity:
Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
2) Resnik Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

3) Path Distance Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.
4) Lin Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.
5) Leacock Chodorow Similarity:
Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.
6) Jiang-Conrath Similarity:
Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

(Description taken from the NLTK help guide)

I wrote a small piece of code to find the Path Distance Similarity and Wu-Palmer Similarity. It returns all the scores by comparing every synset of the two given words.

The code is available at my bitbucket repo
#!/usr/bin/env python

from nltk.corpus import wordnet as wn


def getSenseSimilarity(worda, wordb):
    """
    Find the similarity between the word senses (synsets) of two words.
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)
    # on newer NLTK versions .name and .definition are methods, not attributes
    synsetnamea = [wn.synset(str(syns.name)) for syns in wordasynsets]
    synsetnameb = [wn.synset(str(syns.name)) for syns in wordbsynsets]

    # compare every synset of worda with every synset of wordb
    for sseta, ssetb in [(sseta, ssetb) for sseta in synsetnamea
                         for ssetb in synsetnameb]:
        pathsim = sseta.path_similarity(ssetb)
        wupsim = sseta.wup_similarity(ssetb)
        if pathsim is not None:
            print "Path Sim Score: ", pathsim, " WUP Sim Score: ", wupsim, \
                "\t", sseta.definition, "\t", ssetb.definition


if __name__ == "__main__":
    #getSenseSimilarity('cat', 'walk')
    getSenseSimilarity('cricket', 'score')

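The Resnik, Lin and Jiang-Conrath measures listed above additionally need an information content file. Here is a minimal sketch, assuming the 'wordnet_ic' data package has been installed through nltk.download():

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # IC counts computed from the Brown corpus

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    print "Resnik        :", dog.res_similarity(cat, brown_ic)
    print "Lin           :", dog.lin_similarity(cat, brown_ic)
    print "Jiang-Conrath :", dog.jcn_similarity(cat, brown_ic)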


Historic paper on the origin and development of Indian Language Technologies

Recently I came across an interesting research paper on Indian language technology: "A Journey from Indian Scripts Processing to Indian Language Processing" by Dr. R.M.K. Sinha of IIT Kanpur, which appeared in the 'IEEE Annals of the History of Computing'. The author is a pioneer and leading researcher in Indian language processing; without his contributions Indian language technologies might not have matured this much. I am not going to comment on his writing, because I am a humble disciple of Dr. Sinha.

The article is a fairly long one (24 pages), so I will just give the main points discussed in the paper.
It begins with a short and informative introduction to Indian languages. In the introductory section he discusses the digital divide and its impact in a multilingual country like India, then proceeds to the features of Indian scripts: the logic of conjunct formation in Indian languages, and the similarities and dissimilarities among Indic scripts. The 'script composition grammar' of Indic languages is also discussed in this section. The next section covers the history of printing in India and the developments from printing to computers; the development of the Devanagari typewriter is discussed in detail. He then proceeds to the efforts in computer processing of Indian languages at IIT Kanpur and elsewhere, narrating the history of the Integrated Devanagari Computer (IDC) and GIST technology, and the evolution of the ISCII, ISFOC and InScript standards. A detailed history of GIST/ISM technology is given, along with a brief description of the history of Indian language word processors and an account of OCR technology for Indian languages. Finally he turns to the development of other Natural Language Processing activities in India.

The article is truly a historical account of Indian language technology, and every person working in the field should read it.

Bibliographic Details :

A Journey from Indian Scripts Processing to Indian Language Processing
Sinha, R.M.K.
IEEE Annals of the History of Computing, Volume 31, Issue 1, Jan.-March 2009, pp. 8-31
DOI: 10.1109/MAHC.2009.1


Graphical works with NLTK

[Image I: WordNet relation graph for 'bird']

[Image II: WordNet relation graph for 'language']

[Image III: WordNet relation graph for 'python']

Can you guess from which data these plots were generated? They are plots of WordNet relations generated with NLTK and the Python networkx library. I got the idea from the fourth chapter of the book Natural Language Processing with Python.

Image I was generated from the synset relations of the word 'bird', Image II from 'language' and Image III from 'python'. For more information read the book Natural Language Processing with Python.
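Here is a rough sketch of how such a picture can be produced (my own code, not the exact listing from the book): walk the hyponym tree of a synset and hand the edges to networkx for drawing.

    import networkx as nx
    import matplotlib.pyplot as plt
    from nltk.corpus import wordnet as wn

    def hyponym_graph(synset, graph=None):
        """Recursively add an edge for every hyponym below the given synset."""
        if graph is None:
            graph = nx.Graph()
        for child in synset.hyponyms():
            # .name is an attribute on older NLTK; use .name() on newer versions
            graph.add_edge(synset.name, child.name)
            hyponym_graph(child, graph)
        return graph

    g = hyponym_graph(wn.synset('bird.n.01'))
    nx.draw(g, node_size=30, with_labels=False)
    plt.show()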

Happy hacking !!!!!!


NLTK and Indian Language corpus processing Part-III

I think you enjoyed Part I and Part II of this tutorial. If you have any comments, suggestions or criticism please write to me. In Part III we will try some more work with the Indian language corpora in NLTK.

Generating word and POS bigrams and trigrams

To generate the word/POS bigrams and trigrams I selected the 'hindi.pos' file.
Here is the code to do that.

    =========== Code Begin ===========
    from nltk.corpus import indian
    from nltk import bigrams
    from nltk import trigrams

    hpos = indian.tagged_sents('hindi.pos')

    # Stores the POS tagged sentences from 'hindi.pos'

    wpos = []

    for sent in hpos:
        tojoin = sent
        for tagged in tojoin:
            wpos.append(" ".join(tagged))

    #Stores word and pos as single unit to a list called 'wpos'

    wpos_bigram = bigrams(wpos)
    # Generating word and POS bigram
    for wpb in wpos_bigram:
       print " ".join(wpb)

    # Prints the word and POS bigram

    wpos_trigram = trigrams(wpos)
    # Generating the Word and POS trigram
    for wpt in wpos_trigram:
        print " ".join(wpt)
    #Prints the word and POS trigram

    =========== Code End ===========

       

To generate word/POS n-grams from the other Indian language corpora, just replace 'hindi.pos' with the appropriate file id.
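If you also want to know how often each word/POS bigram occurs, a FreqDist over the generated bigrams will do. This is a small sketch of my own reusing the 'wpos' list built above; in NLTK 2 the keys of a FreqDist come back sorted by decreasing frequency.

    from nltk import FreqDist

    # count how often each word/POS bigram occurs
    bigram_freq = FreqDist(" ".join(b) for b in bigrams(wpos))

    # print the ten most frequent word/POS bigrams
    # (on newer NLTK versions use bigram_freq.most_common(10) instead)
    for pair in bigram_freq.keys()[:10]:
        print pair, bigram_freq[pair]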

Collocations and Concordance from Indian Language Corpora

Now let's try to build collocations from the Hindi corpus (hindi.pos).

    >>> import nltk
    >>> hw = nltk.corpus.indian.words('hindi.pos')
    >>> th = nltk.Text(hw)
    >>> th.collocations()
    Building collocations list
    है ।; के लिए; कहा कि; हैं ।;
    पारी खेली; है कि; रनों की;
    न्यू जीलैंड; युद्ध विराम;
    ने कहा; के हाथों; करते हुए;
    डेविस कप; की पारी; रहे हैं;
    खेली ।; रन पर; रन बनाये;
    हाथों लपकवाया; किए गए

Concordance from the Hindi corpus in NLTK


    >>> th.concordance('न्यू')
    Building index...
    Displaying 13 of 13 matches:
    वसीय मैच में न्यू जीलैंड को जी
    ��न से बाहर कर न्यू जीलैंड की टी
    ��न सकती हैं । न्यू जीलैंड ने पा
    ती डुनेडिन । न्यू जीलैंड ने पा
    -२ से जीत ली । न्यू जीलैंड ने पा
    ��त किया गया । न्यू जीलैंड की पा
    लपकवा दिया । न्यू जीलैंड की तर
    ीसरे मैच में न्यू जीलैंड को २८
    ��े हरा दिया । न्यू जीलैंड को जी
    ��त किया गया । न्यू जीलैंड की शु
    �� पारी खेली । न्यू जीलैंड के १५
    ��ी कर पाये और न्यू जीलैंड की पा
    ोगदान दिया । न्यू जीलैंड की तर
    >>>

Here is an example of computing the frequency distribution of some Hindi words in the 'hindi.pos' file.

    =========== Code Begin ===========
    # -*- coding: utf-8 -*-
    from nltk.corpus import indian
    from nltk import FreqDist

    hindi_text = indian.words('hindi.pos')
    freq_dist = FreqDist([w.strip() for w in hindi_text])

    # a few frequent Hindi words to look up
    words = ['की','है','हो','तो']

    for word in words:
        print word + " : ", freq_dist[word]

    =========== Code End ===========

The result is given below.

    की :  236
    है :  189
    हो :  28
    तो :  10


Happy Dipavali !!!
Happy Hacking



NLTK and Indian Language corpus processing - Part-II

In Part I we saw how to access the Indian language corpora in NLTK and how to play with them. Now let's see some more examples.

First let's see how to access each word with its associated POS tag. (Before proceeding, don't forget to do the imports from Part I.)

    >>> for sent in hpos:
    ...     for j in range(len(sent)):
    ...         print " ".join(sent[j])
    ...

It will print each word with its POS tag, like

    दो QFNUM
    विकेट NN
    लिये VFM
    । PUNC
    अनवर NNP
    को PREP
    विंसेट NNP
    ने PREP
    रन NNC
    आउट NN
    किया VFM
    । PUNC
    >>>

Let us see how to do parsing with the Indian language POS tagged corpus. For this purpose I am using the RegexpParser available in NLTK.


    >>> sentence = hpos[2]

I am taking the third sentence in the hindi.pos file for parsing

    >>> grammar = "NP: {<DT>?<JJ>*<NN>}"

This defines a simple NP grammar for the RegexpParser in NLTK.

    >>> cp = nltk.RegexpParser(grammar) # Creating the parser object and passing the grammar to it
    >>> result = cp.parse(sentence) # Do the parsing and store the result to 'result'
    >>> print result # Printing the result

It will produce the parse structure like

    (S
      इराक/NNP
      के/PREP
      विदेश/NNC
      (NP मंत्री/NN)
      ने/PREP
      अमरीका/NNP
      के/PREP
      उस/PRP
      (NP प्रस्ताव/NN)
      का/PREP
      मजाक/NVB
      उड़ाया/VFM
      है/VAUX
      ,/PUNC
      जिसमें/PRP
      अमरीका/NNP
      ने/PREP
      संयुक्त/NNC
      (NP राष्ट्र/NN)
      के/PREP
      (NP प्रतिबंधों/NN)
      को/PREP
      (NP इराकी/JJ नागरिकों/NN)
      के/PREP
      लिए/PREP
      कम/INTF
      हानिकारक/JJ
      बनाने/VNN
      के/PREP
      लिए/PREP
      कहा/VFM
      है/VAUX
      ।/PUNC)

If you would like to visualise the parse structure, just do this:

    >>> result.draw()

It will show a big parse tree. The tree is too big, so I am not attaching a screenshot.
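If you only want the NP chunks rather than the whole tree, you can walk its subtrees. A small sketch (on older NLTK versions the chunk label lives in the 'node' attribute; on newer ones use the label() method instead):

    >>> for subtree in result.subtrees():
    ...     if subtree.node == 'NP':
    ...         print " ".join(word for word, tag in subtree.leaves())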


Then what about generating bigrams from Indian Language corpora?

Here comes the code for that.

    >>> hinw = indian.words('hindi.pos') # Stores the words in 'hindi.pos' to hinw

    >>> hinbi = nltk.bigrams(hinw) # Generate the bigrams and store it in to hinbi

To print the bigrams

    >>> for i in hinbi:
    ...     print " ".join(i)

Here you can see some sample bigram

    चुके थे
    थे तथा
    तथा ३
    ३ बार
    बार विधानसभा
    विधानसभा के
    के सदस्य


   
Fine, then what about trigrams?
Hmmm, it is easy !!

First store words in the corpus to a list

    >>> hinw = indian.words('hindi.pos')

Then generate trigrams with nltk.trigrams() function

    >>> hintr = nltk.trigrams(hinw)

To print the trigrams
   
    >>> for j in hintr:
    ...    print " ".join(j)

Here is the sample output

    दो-दो तथा फ्रैंक्लीन
    तथा फ्रैंक्लीन और
    फ्रैंक्लीन और हैरिस
    और हैरिस ने
    हैरिस ने एक-एक
    ने एक-एक विकेट
    एक-एक विकेट लिये
    विकेट लिये ।


Finding the count of a particular word in the Indian language corpus

Store the words in a variable and use the count() function.

    >>> txt2 = indian.words('hindi.pos')
   
    >>> txt2.count('भारत')
    23
    >>>
Here I stored all the words in 'hindi.pos' and looked up the count of the word भारत.

Finding the percentage of the text taken up by a particular word

To find out the percentage of the text taken up by the word की:

    >>> 100 * txt2.count('की') /len(txt2)
    2
    >>>

Producing a lexical dispersion plot from the Indian language corpus

For that we have to play a small trick.

This is the command for plotting the lexical dispersion plot of भारत and की in the Hindi corpus:


    nltk.Text(txt2).dispersion_plot(['भारत','की'])

The Text() function converts the word list into an NLTK Text object, which makes the plotting job easy. In the plot you can't see the words themselves, because Unicode text is displayed as boxes in the plot.


Selecting words based on conditions from the Hindi corpus

For this I am using the example mentioned in the NLTK book.


    {w|w is a member of V and P(w)}

    [w for w in V if p(w)]

    >>> V = set(txt2)
    >>> my_word = [w for w in V if len(w) > 25]
    >>> sorted(my_word)
    >>> fd = nltk.FreqDist(txt2)
    >>> sorted([w for w in set(txt2) if len(w) > 5 and fd[w] > 25])

It will give the following list of words as output (truncated):

        इस
        एक
        और
        कर
        कहा
        का
        कि
        किया
        ......



Conditional frequency distribution for Indian Language corpora

Here is an example

    Step 1
    Generate bigrams

    >>> hinbi = nltk.bigrams(hinw)

    Step 2
    Generate Conditional Frequency Distribution

    >>> gd = nltk.ConditionalFreqDist(hinbi)

You can plot the CFD, but it will take some time to generate the plot, and you will see a bit of an animation effect.
   
    >>> gd.plot()


One can print the tabulated cfd also

    >>> gd.tabulate()
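A related trick, sketched here on the assumption that the Indian corpus reader's tagged_words() accessor is available, is to condition on the words themselves and look at which POS tags each word receives:

    >>> tagged = indian.tagged_words('hindi.pos')
    >>> word_tag_cfd = nltk.ConditionalFreqDist(tagged)
    >>> word_tag_cfd['भारत'].items()   # (tag, count) pairs for the word भारत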


More coming soon

Happy hacking !!!!!!!!



NLTK and Indian Language corpus processing Part - I

During my presentation at the Indian Python conference somebody asked about Indian language corpus processing in NLTK. Somehow I skipped the answer: I knew that an Indian language corpus collection ships with NLTK, but I had never tried to play with it. After the conference I did, and I am posting my experiments and results here. If it can be done in a better way, please tell me so that I can improve.

The Natural Language Toolkit contains some Indian language corpora. They are POS tagged and available for the Bangla, Hindi, Marathi and Telugu languages.

    Language     Words    Sentences
    Bangla       10281      899
    Hindi         9408      541
    Marathi      19066     1197
    Telugu        9999      994

Let's see how to access Indian Language corpora in NLTK and how to play with it.

    >>> from nltk.corpus import indian

    This imports the Indian language corpus reader from the NLTK data.

    >>> indian.fileids() # Shows files in Indian Language corpus collection in NLTK
    ['bangla.pos', 'hindi.pos', 'marathi.pos', 'telugu.pos']
    >>>


To find the number of characters in each language corpus:

    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.raw(f))

It will produce the following output

    bangla.pos
    209525
    hindi.pos
    175045
    marathi.pos
    429234
    telugu.pos
    251391

To find the number of words in each language corpus:

    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.words(f))
    ...
It will produce the following output

    bangla.pos
    10281
    hindi.pos
    9408
    marathi.pos
    19066
    telugu.pos
    9999

To find the number of sentences:
    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.sents(f))
    ...
It will produce the following output
   
    bangla.pos
    899
    hindi.pos
    541
    marathi.pos
    1197
    telugu.pos
    994

We can extract sentences from these corpora too.

To access sentences from the Hindi corpus:
    >>> hindi_sent = indian.sents('hindi.pos')

To print individual sentences

    >>> for hsen in hindi_sent:
    ...     print hsen

It will print each sentence as a list of words. Let's see how to store the sentences to a file.

    >>> it = open("jhi.txt",'w')
    >>> for hsen in hindi_sent:
    ...     it.write(" ".join(hsen) + "\n")   # one sentence per line
    ...
    >>> it.close()

The above piece of code converts each list of words back into an actual sentence and writes it to the specified file, one sentence per line.


To access words in a corpus

    >>> hin_word = indian.words('hindi.pos')

This piece of code stores all the words in the hindi.pos file in hin_word.


As we stored the sentences in a file, we can write the words to a file too.

    >>> hwo = open("hcwo.txt",'w') # Open a file to store the words
    >>> for hw in hin_word:
    ...     hwo.write(hw + "\n") # Write each word to the file, one per line
    ...
    >>> hwo.close()

To access the POS tagged sentences from a corpus:

    >>> hpos = indian.tagged_sents('hindi.pos')

    >>> for sent in hpos:
    ...    print sent

It will print the POS tagged sentences as lists of (word, tag) tuples.

More coming soon !!!!!

Happy hacking


New article on Machine Translation

I found an interesting article in the Communications of the ACM October 2009 issue: "Human Interaction for High-Quality Machine Translation" by Francisco Casacuberta, Jorge Civera, Elsa Cubel, Antonio L. Lagarda, Guy Lapalme, Elliott Macklovitch and Enrique Vidal.

A nice and informative one.
Don't miss it.
