NLTK and Indian Language corpus processing Part-III

I hope you enjoyed Part-I and Part-II of this tutorial. If you have any comments, suggestions or criticism, please write to me. In Part-III we will do some more work with the Indian Language corpora in NLTK.

Generating word and POS bigrams and trigrams

To generate word and POS bigrams and trigrams I selected the 'hindi.pos' file.
Here is the code to do that.

    =========== Code Begin ===========
    from nltk.corpus import indian
    from nltk import bigrams
    from nltk import trigrams

    hpos = indian.tagged_sents('hindi.pos')

    # Stores the POS tagged sentences from 'hindi.pos'

    wpos = []

    for sent in hpos:
        tojoin = sent
        for tagged in tojoin:
            wpos.append(" ".join(tagged))

    #Stores word and pos as single unit to a list called 'wpos'

    wpos_bigram = bigrams(wpos)
    # Generating word and POS bigram
    for wpb in wpos_bigram:
        print " ".join(wpb)

    # Prints the word and POS bigram

    wpos_trigram = trigrams(wpos)
    # Generating the Word and POS trigram
    for wpt in wpos_trigram:
        print " ".join(wpt)
    #Prints the word and POS trigram

    =========== Code End ===========

       

To generate word and POS n-grams from another Indian Language corpus, just replace 'hindi.pos' with the appropriate file id.
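For readers without the corpus at hand, here is a minimal sketch of the same idea in plain Python 3 (the post itself uses Python 2 and NLTK). The `ngrams` helper and the `sample_sentence` list are assumptions made for illustration; they are not part of NLTK or of the original code.

```python
# Hypothetical sample: (word, tag) pairs in the shape returned by
# indian.tagged_sents(); the real corpus is not loaded here.
sample_sentence = [("दो", "QFNUM"), ("विकेट", "NN"), ("लिये", "VFM"), ("।", "PUNC")]


def ngrams(items, n):
    """Return the list of n-item tuples of consecutive elements of `items`."""
    return list(zip(*(items[i:] for i in range(n))))


# Join each (word, tag) pair into one "word TAG" unit, as the post does.
units = [" ".join(pair) for pair in sample_sentence]

word_pos_bigrams = ngrams(units, 2)
word_pos_trigrams = ngrams(units, 3)

for gram in word_pos_bigrams:
    print(" ".join(gram))
```

Four units give three bigrams and two trigrams; changing `n` gives longer n-grams in the same way.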

Collocations and concordance from Indian Language corpora

Now let's try to build collocations from the Hindi corpus (hindi.pos).

    >>> import nltk
    >>> from nltk.text import Text
    >>> hw = nltk.corpus.indian.words('hindi.pos')
    >>> th = Text(hw)
    >>> th.collocations()
    Building collocations list
    है ।; के लिए; कहा कि; हैं ।;
    पारी खेली; है कि; रनों की;
    न्यू जीलैंड; युद्ध विराम;
    ने कहा; के हाथों; करते हुए;
    डेविस कप; की पारी; रहे हैं;
    खेली ।; रन पर; रन बनाये;
    हाथों लपकवाया; किए गए

Concordance from the Hindi corpus in NLTK


    >>> th.concordance('न्यू')
    Building index...
    Displaying 13 of 13 matches:
    वसीय मैच में न्यू जीलैंड को जी
    न से बाहर कर न्यू जीलैंड की टी
    न सकती हैं । न्यू जीलैंड ने पा
    ती डुनेडिन । न्यू जीलैंड ने पा
    -२ से जीत ली । न्यू जीलैंड ने पा
    त किया गया । न्यू जीलैंड की पा
    लपकवा दिया । न्यू जीलैंड की तर
    ीसरे मैच में न्यू जीलैंड को २८
    े हरा दिया । न्यू जीलैंड को जी
    त किया गया । न्यू जीलैंड की शु
    पारी खेली । न्यू जीलैंड के १५
    ी कर पाये और न्यू जीलैंड की पा
    ोगदान दिया । न्यू जीलैंड की तर
    >>>

Here is an example that builds a frequency distribution and looks up some Hindi words from the 'hindi.pos' file.

    ========== Code begin =====================
    # -*- coding: utf-8 -*-
    from nltk.corpus import indian
    from nltk import FreqDist
    hindi_text = indian.words('hindi.pos')
    freq_dist = FreqDist([w.strip() for w in hindi_text])
    modals = ['की','है','हो','तो']

    for modal in modals:
        print modal + " : " , freq_dist[modal]

    ====== Code End =================

The result is given below.

    की :  236
    है :  189
    हो :  28
    तो :  10
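If NLTK is not installed, the same kind of count can be sketched with `collections.Counter` from the standard library (Python 3 here). The `tokens` list below is a hypothetical stand-in for `indian.words('hindi.pos')`, so the numbers it prints are not the corpus counts shown above.

```python
from collections import Counter

# Hypothetical token list standing in for indian.words('hindi.pos').
tokens = ["की", "है", "की ", "तो", "है", "की"]

# Strip stray whitespace before counting, as the post does.
freq = Counter(w.strip() for w in tokens)

for word in ["की", "है", "हो", "तो"]:
    # A Counter returns 0 for words it has never seen.
    print(word, ":", freq[word])
```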


Happy Dipavali !!!
Happy Hacking



NLTK and Indian Language corpus processing - Part-II

In Part-I we saw how to access the Indian Language corpora in NLTK and how to play with them. Now let's see some more examples.

First we can see how to access each word with its associated POS tag. (Before proceeding, don't forget to do the imports done in Part-I.)

    >>> for sent in hpos:
    ...     for tagged in sent:
    ...         print " ".join(tagged)

It prints each word with its POS tag, like this:

    दो QFNUM
    विकेट NN
    लिये VFM
    । PUNC
    अनवर NNP
    को PREP
    विंसेट NNP
    ने PREP
    रन NNC
    आउट NN
    किया VFM
    । PUNC
    >>>

Let us see how to do parsing with the Indian Language POS tagged corpus. For this purpose I am using the RegexpParser available in NLTK.


    >>> import nltk
    >>> sentence = hpos[2]

I am taking the third sentence in the hindi.pos file for parsing.

    >>> grammar = "NP: {<DT>?<JJ>*<NN>}"

This defines the grammar for parsing with RegexpParser in NLTK: an NP chunk is an optional determiner (DT), any number of adjectives (JJ), and then a noun (NN).

    >>> cp = nltk.RegexpParser(grammar) # Create the parser object and pass the grammar to it
    >>> result = cp.parse(sentence) # Do the parsing and store the result in 'result'
    >>> print result # Print the result

It will produce the parse structure like

    (S
      इराक/NNP
      के/PREP
      विदेश/NNC
      (NP मंत्री/NN)
      ने/PREP
      अमरीका/NNP
      के/PREP
      उस/PRP
      (NP प्रस्ताव/NN)
      का/PREP
      मजाक/NVB
      उड़ाया/VFM
      है/VAUX
      ,/PUNC
      जिसमें/PRP
      अमरीका/NNP
      ने/PREP
      संयुक्त/NNC
      (NP राष्ट्र/NN)
      के/PREP
      (NP प्रतिबंधों/NN)
      को/PREP
      (NP इराकी/JJ नागरिकों/NN)
      के/PREP
      लिए/PREP
      कम/INTF
      हानिकारक/JJ
      बनाने/VNN
      के/PREP
      लिए/PREP
      कहा/VFM
      है/VAUX
      ।/PUNC)
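To see why only some of the nouns end up inside an NP, it can help to apply the same tag pattern by hand. The following is a rough Python 3 sketch that uses the standard `re` module on the tag sequence; it is an illustration of the pattern, not NLTK's actual implementation, and the names `tagged`, `tag_string` and `np_pattern` are made up for this example.

```python
import re

# A few (word, tag) pairs copied from the parse output above.
tagged = [("को", "PREP"), ("इराकी", "JJ"), ("नागरिकों", "NN"), ("के", "PREP"),
          ("लिए", "PREP"), ("कम", "INTF"), ("हानिकारक", "JJ"), ("बनाने", "VNN")]

# Encode the tag sequence as "<PREP><JJ><NN>..." so a regex can scan it.
tag_string = "".join("<%s>" % tag for _, tag in tagged)

# Same pattern as the grammar: optional DT, any number of JJ, then one NN.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for match in np_pattern.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<".
    start = tag_string.count("<", 0, match.start())
    end = start + match.group().count("<")
    chunks.append([word for word, _ in tagged[start:end]])

print(chunks)
```

Tags such as NNP, NNC and VNN do not match `<NN>` exactly, which is why words like विदेश/NNC stay outside the NP chunks in the output above.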

If you would like to visualise the parse structure, just do this:

    >>> result.draw()

It will show a big parse tree. It is too big, so I am not attaching a screenshot.


Then what about generating bigrams from Indian Language corpora?

Here comes the code for that.

    >>> hinw = indian.words('hindi.pos') # Stores the words in 'hindi.pos' to hinw

    >>> hinbi = nltk.bigrams(hinw) # Generate the bigrams and store it in to hinbi

To print the bigrams

    >>> for i in hinbi:
    ...     print " ".join(i)

Here are some sample bigrams

    चुके थे
    थे तथा
    तथा ३
    ३ बार
    बार विधानसभा
    विधानसभा के
    के सदस्य


   
Fine, then what about trigrams?
Hmmm, it is easy!!

First store words in the corpus to a list

    >>> hinw = indian.words('hindi.pos')

Then generate trigrams with nltk.trigrams() function

    >>> hintr = nltk.trigrams(hinw)

To print the trigrams
   
    >>> for j in hintr:
    ...    print " ".join(j)

Here is the sample output

    दो-दो तथा फ्रैंक्लीन
    तथा फ्रैंक्लीन और
    फ्रैंक्लीन और हैरिस
    और हैरिस ने
    हैरिस ने एक-एक
    ने एक-एक विकेट
    एक-एक विकेट लिये
    विकेट लिये ।


Finding count of a particular word in Indian Language corpus.

Store the words to some variable and use the count() function.

    >>> txt2 = indian.words('hindi.pos')
   
    >>> txt2.count('भारत')
    23
    >>>
Here I stored all the words in 'hindi.pos' and looked up the count of the word भारत.

Finding the percentage of text taken up by a particular word

To find out the percentage of the text taken up by the word की

    >>> 100 * txt2.count('की') /len(txt2)
    2
    >>>
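The result is a bare 2 because `/` between two integers truncates in Python 2. A small Python 3 sketch of the difference follows; the figures 236 and 9408 are the count and corpus size quoted elsewhere in these posts and are used here only as illustrative inputs.

```python
# Figures quoted elsewhere in these posts: 236 occurrences of the word,
# 9408 words in hindi.pos. Treat them as illustrative inputs.
count = 236
total = 9408

truncated = 100 * count // total   # floor division, like "/" on ints in Python 2
exact = 100.0 * count / total      # float division keeps the fraction

print(truncated)
print(round(exact, 2))
```

In Python 2 the same effect is obtained by writing `100.0 * txt2.count(...) / len(txt2)`.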

Producing lexical dispersion plot from Indian Language corpus

For that we have to play a small trick.

This is the command for drawing the lexical dispersion plot of भारत and की in the Hindi corpus:


    >>> from nltk.text import Text
    >>> Text(txt2).dispersion_plot(['भारत','की'])

Text() converts the word list into an NLTK Text object, which makes the plotting job easy. In the plot you can't read the words, because the Unicode text is displayed as boxes.


Selecting words based on conditions from the Hindi corpus

For this I am taking the example mentioned in the NLTK book.


    {w | w is a member of V and P(w)}

    [w for w in V if p(w)]

    >>> from nltk import FreqDist
    >>> V = set(txt2)
    >>> my_word = [w for w in V if len(w) > 25]
    >>> sorted(my_word)
    >>> fd = FreqDist(txt2)
    >>> sorted([w for w in set(txt2) if len(w) > 5 and fd[w] > 25])

The last expression gives the following list of words as output

        इस
        एक
        और
        कर
        कहा
        का
        कि
        किया
        ......
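It may look odd that two-letter words such as इस pass the `len(w) > 5` test. A likely explanation, assuming the corpus reader hands back UTF-8 byte strings under Python 2 (an assumption, not something verified here), is that `len()` was counting bytes and not characters. The Python 3 sketch below shows the two lengths side by side.

```python
word = "इस"

# Number of Unicode characters (what len() reports on a Python 3 str).
char_len = len(word)

# Number of UTF-8 bytes (what len() reports on a Python 2 byte string).
byte_len = len(word.encode("utf-8"))

print(char_len, byte_len)
```

Each Devanagari character takes three bytes in UTF-8, so a length threshold written for English text needs adjusting when the words are byte strings.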



Conditional frequency distribution for Indian Language corpora

Here is an example

    Step 1
    Generate bigrams

    >>> hinbi = nltk.bigrams(hinw)

    Step 2
    Generate Conditional Frequency Distribution

    >>> gd = nltk.ConditionalFreqDist(hinbi)

You can plot the CFD, but it takes some time to generate the plot, and you will see an animation-like effect while it is drawn.
   
    >>> gd.plot()


You can also print the CFD as a table.

    >>> gd.tabulate()
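What the conditional frequency distribution holds can be sketched with the standard library alone: each bigram is read as a (condition, outcome) pair, and outcomes are counted per condition. This Python 3 sketch uses `defaultdict` and `Counter` as stand-ins for `nltk.ConditionalFreqDist`; the word list is taken from the sample bigram output earlier in this post.

```python
from collections import Counter, defaultdict

# Words from the sample bigram output earlier in this post.
words = ["चुके", "थे", "तथा", "३", "बार", "विधानसभा", "के", "सदस्य"]

# Each bigram (w1, w2) is treated as (condition, outcome).
cfd = defaultdict(Counter)
for first, second in zip(words, words[1:]):
    cfd[first][second] += 1

# How often "तथा" follows "थे", and the most frequent word after "के".
print(cfd["थे"]["तथा"])
print(cfd["के"].most_common(1))
```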


More coming soon

Happy hacking !!!!!!!!



NLTK and Indian Language corpus processing Part - I

During my presentation at the Indian Python conference somebody asked about Indian Language corpus processing in NLTK. Somehow I skipped the answer, because I only knew that an Indian Language corpus is there in NLTK and had never tried to play with it. After the conference I did some work on that too. I am posting my experiments and results here. If it can be done in a better way, please tell me so that I can improve.

The Natural Language Toolkit contains some Indian language corpora. The corpus is POS tagged. It is available for the Bangla, Hindi, Marathi and Telugu languages.

    Language    Words    Sentences
    Bangla      10281    899
    Hindi       9408     541
    Marathi     19066    1197
    Telugu      9999     994

Let's see how to access Indian Language corpora in NLTK and how to play with it.

    >>> from nltk.corpus import indian

This imports the Indian Language corpus reader from the NLTK data.

    >>> indian.fileids() # Shows files in Indian Language corpus collection in NLTK
    ['bangla.pos', 'hindi.pos', 'marathi.pos', 'telugu.pos']
    >>>


To find the number of characters in each language corpus

    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.raw(f))

It will produce the following output

    bangla.pos
    209525
    hindi.pos
    175045
    marathi.pos
    429234
    telugu.pos
    251391

To find number of words in each language corpus

    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.words(f))
    ...
It will produce the following output

    bangla.pos
    10281
    hindi.pos
    9408
    marathi.pos
    19066
    telugu.pos
    9999

To find the number of sentences

    >>> for f in indian.fileids():
    ...     print f
    ...     print len(indian.sents(f))
    ...
It will produce the following output
   
    bangla.pos
    899
    hindi.pos
    541
    marathi.pos
    1197
    telugu.pos
    994


We can extract sentences from these corpora too.

To access sentences from the Hindi corpus

    >>> hindi_sent = indian.sents('hindi.pos')

To print individual sentences

    >>> for hsen in hindi_sent:
    ...     print hsen

It will print each sentence as a list of words. Let's see how to store the sentences in a file.

    >>> it = open("jhi.txt",'w')
    >>> for hsen in hindi_sent:
    ...     it.write(" ".join(hsen) + "\n")
    ...
    >>> it.close()

The above piece of code joins each list of words into an actual sentence and writes it to the specified file, one sentence per line.
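On Python 3 the same step is usually written with an explicit encoding and a `with` block, so the file is closed even if a write fails. This is only a sketch: the `sentences` list is a hypothetical stand-in for `indian.sents('hindi.pos')`, and the temporary directory is used just to keep the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical stand-in for indian.sents('hindi.pos'): a list of word lists.
sentences = [["दो", "विकेट", "लिये", "।"], ["रन", "आउट", "किया", "।"]]

# The file name is arbitrary; a temporary directory keeps the example tidy.
path = os.path.join(tempfile.mkdtemp(), "jhi.txt")

# Write one sentence per line, encoded as UTF-8.
with io.open(path, "w", encoding="utf-8") as out:
    for sent in sentences:
        out.write(" ".join(sent) + "\n")

# Read the file back to confirm what was written.
with io.open(path, encoding="utf-8") as check:
    lines = check.read().splitlines()

print(len(lines))
```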


To access words in a corpus

    >>> hin_word = indian.words('hindi.pos')

This piece of code will store all the words in the hindi.pos file in hin_word.


As we stored the sentences in a file, we can write the words to a file as well.

    >>> hwo = open("hcwo.txt",'w') # Open a file to store the words
    >>> for hw in hin_word:
    ...     hwo.write(hw + "\n") # Write each word to the file, one per line
    ...
    >>> hwo.close()

To access the POS tagged sentences from a corpus

    >>> hpos = indian.tagged_sents('hindi.pos')

    >>> for sent in hpos:
    ...    print sent

It will print each POS tagged sentence as a list of tuples.

More coming soon !!!!!

Happy hacking


Python3 ZWJ and Malayalam; some doubts

I tried something else in Python3, but it gave some strange results.
See the code given below.

    ==== Code Begin =========

    def ജഗന്‍ ():
        print("എന്റെ പേര് ജഗന്‍ എന്നാണ്")
        ജഗന്‍  = "ഞാന്‍"
        print(ജഗന്‍  )

    ജഗന്‍ ()

    ==== Code End ===========

When I tried to execute it, Python threw an error.

    ~/pypract$ python3 tes2.py
      File "tes2.py", line 2
        def ജഗന്‍ ():
                          ^
    SyntaxError: invalid character in identifier

    ~/pypract$

I thought it might be due to the 'ZWJ' (zero width joiner) character in the function and variable names. So I decided to rewrite the same code without the 'ZWJ' character. The code is given below.

    ==== Code Begin =====
    def ജഗന്‍():
        print("എന്റെ പേര് ജഗന്‍ എന്നാണ്")
        ജഗന്‍= "ഞാന്‍"
        print(ജഗന്‍)

    ജഗന്‍()
    ==== Code End =======

This code executed without any error. What I did was replace the ന്‍ with its Unicode 5.1 equivalent, the atomic chillu character.
The output is

    /pypract$ python3 tester.py
    എന്റെ പേര് ജഗന്‍ എന്നാണ്
    ഞാന്‍

I can't understand what is happening. Is it a logical mistake in my program?
Or is it a problem related to ZWJ and Python?
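One way to test the ZWJ hypothesis is to ask Python 3 directly whether each spelling is a valid identifier. The sketch below builds both spellings from escape sequences so that the invisible character cannot get lost in copy and paste; the variable names are made up for this example, and the code points are assumed to be the ones involved in the two programs above.

```python
import unicodedata

ZWJ = "\u200d"        # ZERO WIDTH JOINER
CHILLU_N = "\u0d7b"   # MALAYALAM LETTER CHILLU N, added in Unicode 5.1

# Old-style chillu: NA + VIRAMA + ZWJ. New-style: a single code point.
name_with_zwj = "\u0d1c\u0d17\u0d28\u0d4d" + ZWJ
name_with_chillu = "\u0d1c\u0d17" + CHILLU_N

# General category of ZWJ, then the identifier check for each spelling.
zwj_category = unicodedata.category(ZWJ)
print(zwj_category)
print(name_with_zwj.isidentifier())
print(name_with_chillu.isidentifier())
```

ZWJ is a format character (category Cf), and format characters are not among the characters Python 3 accepts in identifiers, so the first spelling is rejected while the atomic chillu letter is accepted. That is consistent with the SyntaxError above being caused by the ZWJ and not by the logic of the program.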


New article on Machine Translation

I came across an interesting article in the October 2009 issue of Communications of the ACM: "Human Interaction for High-Quality Machine Translation" by Francisco Casacuberta, Jorge Civera, Elsa Cubel, Antonio L. Lagarda, Guy Lapalme, Elliott Macklovitch and Enrique Vidal.

It is a nice and informative one.
Don't miss it.


Python3 is wonderful


See the Python code given below. What do you think? Will it execute without throwing errors or not?

####Code begin#####################
    import sys

    def പണിയെടുക്കൂ ( പാഠം):
        for വരി in പാഠം:
            print( വരി )


    വരവ് = sys.argv[1]

    മൊത്തം = open(വരവ്,'r').readlines()

    പണിയെടുക്കൂ(മൊത്തം)

######## Code End ######################

Don't scratch your head: it will, if you use Python3 to run the code.

Save the code as test.py. Install Python3. Run the program as python3 test.py <your file>

I just saw some new Python documentation for Python3 with similar examples. Thanks to Santhosh Thottingal (SMC) for pointing me to the link. Then I decided to experiment with it.

Wow, great: in Python3 you can write variable names and function names in your local language. But you won't get the Python reserved words in your language. Python is not the first programming language to allow this (Java, for example, has long accepted Unicode identifiers), but it is a great facility.
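The split between names and reserved words can be checked from Python 3 itself with `str.isidentifier()` and the standard `keyword` module. This is a small sketch; the two names are taken from the program above, and the `results` dictionary is just a convenience for this example.

```python
import keyword

# Two names taken from the program above.
names = ["പാഠം", "വരി"]

# For each name: is it a valid identifier, and is it a reserved word?
results = {name: (name.isidentifier(), keyword.iskeyword(name)) for name in names}

for name, (is_ident, is_kw) in results.items():
    print(name, is_ident, is_kw)

# The reserved words themselves stay in English.
print(keyword.iskeyword("for"), keyword.iskeyword("def"))
```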

