I hope you enjoyed Part-I and Part-II of this tutorial. If you have any comments, suggestions or criticism, please write to me. In Part-III we will do some more work with the Indian Language Corpora in NLTK.
Generating word and POS bigrams and trigrams
To generate word and POS bigrams and trigrams, I selected the 'hindi.pos' file.
Here is the code to do that.
=========== Code Begin ===========
from nltk.corpus import indian
from nltk import bigrams
from nltk import trigrams

# Stores the POS tagged sentences from 'hindi.pos'
hpos = indian.tagged_sents('hindi.pos')

# Stores each word and its POS tag as a single unit in a list called 'wpos'
wpos = []
for sent in hpos:
    for tagged in sent:
        wpos.append(" ".join(tagged))

# Generates and prints the word and POS bigrams
wpos_bigram = bigrams(wpos)
for wpb in wpos_bigram:
    print " ".join(wpb)

# Generates and prints the word and POS trigrams
wpos_trigram = trigrams(wpos)
for wpt in wpos_trigram:
    print " ".join(wpt)
=========== Code End ===========
To generate word and POS bigrams and trigrams from another Indian Language corpus, just replace 'hindi.pos' with the appropriate file id.
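The flattening step can also be sketched without NLTK. This is only a minimal illustration in plain Python 3, assuming a hand-made tagged sentence in place of indian.tagged_sents() (the four tokens also appear in the sample output in Part-II); zip() stands in for nltk.bigrams() and nltk.trigrams(), and the variable names are made up for the example.

```python
# A tiny hand-made stand-in for one sentence of indian.tagged_sents('hindi.pos').
tagged_sent = [("दो", "QFNUM"), ("विकेट", "NN"), ("लिये", "VFM"), ("।", "PUNC")]

# Join each (word, tag) pair into a single "word TAG" unit.
units = [" ".join(pair) for pair in tagged_sent]

# Pair every unit with its successor(s); zip() stops at the shortest input.
unit_bigrams = list(zip(units, units[1:]))
unit_trigrams = list(zip(units, units[1:], units[2:]))

for bigram in unit_bigrams:
    print(" ".join(bigram))
```

A list of n units yields n-1 bigrams and n-2 trigrams, so the four tokens here give three bigrams and two trigrams.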
Collocations and Concordance from Indian Language Corpora
Now let's try to build collocations from the Hindi corpus (hindi.pos).
>>> import nltk
>>> from nltk.text import Text
>>> hw = nltk.corpus.indian.words('hindi.pos')
>>> th = Text(hw)
>>> th.collocations()
Building collocations list
है ।; के लिए; कहा कि; हैं ।;
पारी खेली; है कि; रनों की;
न्यू जीलैंड; युद्ध विराम;
ने कहा; के हाथों; करते हुए;
डेविस कप; की पारी; रहे हैं;
खेली ।; रन पर; रन बनाये;
हाथों लपकवाया; किए गए
Concordance from the Hindi corpus in NLTK
>>> th.concordance('न्यू')
Building index...
Displaying 13 of 13 matches:
वसीय मैच में न्यू जीलैंड को जी
न से बाहर कर न्यू जीलैंड की टी
न सकती हैं । न्यू जीलैंड ने पा
ती डुनेडिन । न्यू जीलैंड ने पा
-२ से जीत ली । न्यू जीलैंड ने पा
त किया गया । न्यू जीलैंड की पा
लपकवा दिया । न्यू जीलैंड की तर
ीसरे मैच में न्यू जीलैंड को २८
े हरा दिया । न्यू जीलैंड को जी
त किया गया । न्यू जीलैंड की शु
पारी खेली । न्यू जीलैंड के १५
ी कर पाये और न्यू जीलैंड की पा
ोगदान दिया । न्यू जीलैंड की तर
>>>
Here is an example that computes the frequency distribution of some Hindi words in the 'hindi.pos' file.
========== Code begin =====================
# -*- coding: utf-8 -*-
from nltk.corpus import indian
from nltk import FreqDist
hindi_text = indian.words('hindi.pos')
freq_dist = FreqDist([w.strip() for w in hindi_text])
modals = ['की','है','हो','तो']
for modal in modals:
    print modal + " : " , freq_dist[modal]
====== Code End =================
The result is given below.
की : 236
है : 189
हो : 28
तो : 10
Happy Dipavali !!!
Happy Hacking
In Part-I we saw how to access the Indian Language corpora in NLTK and how to play with them. Now let's see some more examples.
First we can see how to access each word with its associated POS tag. (Before proceeding, don't forget to do the imports from Part-I and to load hpos = indian.tagged_sents('hindi.pos').)
>>> for sent in hpos:
...     for tagged in sent:
...         print " ".join(tagged)
It will print each word with its POS tag, like
दो QFNUM
विकेट NN
लिये VFM
। PUNC
अनवर NNP
को PREP
विंसेट NNP
ने PREP
रन NNC
आउट NN
किया VFM
। PUNC
>>>
Let us see how to do parsing with the Indian Language POS tagged corpus. For this purpose I am using the RegexpParser available in NLTK.
>>> import nltk
>>> sentence = hpos[2]
I am taking the third sentence in the hindi.pos file for parsing.
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
This defines the grammar for RegexpParser: an NP chunk is an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN).
>>> cp = nltk.RegexpParser(grammar) # Creating the parser object and passing the grammar to it
>>> result = cp.parse(sentence) # Do the parsing and store the result in 'result'
>>> print result # Printing the result
It will produce a parse structure like
(S
इराक/NNP
के/PREP
विदेश/NNC
(NP मंत्री/NN)
ने/PREP
अमरीका/NNP
के/PREP
उस/PRP
(NP प्रस्ताव/NN)
का/PREP
मजाक/NVB
उड़ाया/VFM
है/VAUX
,/PUNC
जिसमें/PRP
अमरीका/NNP
ने/PREP
संयुक्त/NNC
(NP राष्ट्र/NN)
के/PREP
(NP प्रतिबंधों/NN)
को/PREP
(NP इराकी/JJ नागरिकों/NN)
के/PREP
लिए/PREP
कम/INTF
हानिकारक/JJ
बनाने/VNN
के/PREP
लिए/PREP
कहा/VFM
है/VAUX
।/PUNC)
If you would like to visualise the parse structure, just do this
>>> result.draw()
It will show a big parse tree. It is too big, so I am not attaching a screenshot.
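To make the grammar NP: {<DT>?<JJ>*<NN>} concrete, here is a rough pure-Python 3 sketch of that single rule: an optional DT, any number of JJ, then one NN. It is not how RegexpParser is implemented, the function name np_chunks is made up for this illustration, and the tokens are copied by hand from the tail of the parse output above.

```python
def np_chunks(tagged):
    """Return the spans matching <DT>?<JJ>*<NN>, scanning left to right."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        # Optional determiner.
        if tagged[j][1] == "DT":
            j += 1
        # Any number of adjectives.
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        # The rule only matches if a noun follows.
        if j < len(tagged) and tagged[j][1] == "NN":
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1
    return chunks

tokens = [("के", "PREP"), ("इराकी", "JJ"), ("नागरिकों", "NN"), ("के", "PREP"),
          ("लिए", "PREP"), ("कम", "INTF"), ("हानिकारक", "JJ"), ("बनाने", "VNN")]

for chunk in np_chunks(tokens):
    print(" ".join(word + "/" + tag for word, tag in chunk))
```

The adjective followed by a noun forms one chunk, while the adjective before the VNN token is left outside any NP, which agrees with the parse output shown earlier.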
Then what about generating bigrams from Indian Language corpora?
Here comes the code for that.
>>> hinw = indian.words('hindi.pos') # Stores the words in 'hindi.pos' in hinw
>>> hinbi = nltk.bigrams(hinw) # Generates the bigrams and stores them in hinbi
To print the bigrams
>>> for i in hinbi:
...     print " ".join(i)
Here you can see some sample bigrams
चुके थे
थे तथा
तथा ३
३ बार
बार विधानसभा
विधानसभा के
के सदस्य
Fine, then what about trigrams?
Hmmm, it is easy!!
First store the words in the corpus in a list
>>> hinw = indian.words('hindi.pos')
Then generate the trigrams with the nltk.trigrams() function
>>> hintr = nltk.trigrams(hinw)
To print the trigrams
>>> for j in hintr:
...     print " ".join(j)
Here is the sample output
दो-दो तथा फ्रैंक्लीन
तथा फ्रैंक्लीन और
फ्रैंक्लीन और हैरिस
और हैरिस ने
हैरिस ने एक-एक
ने एक-एक विकेट
एक-एक विकेट लिये
विकेट लिये ।
Finding the count of a particular word in an Indian Language corpus
Store the words in a variable and use the count() function.
>>> txt2 = indian.words('hindi.pos')
>>> txt2.count('भारत')
23
>>>
Here I stored all the words in 'hindi.pos' and looked up the count of the word भारत .
Finding the percentage of text taken by a particular word
To find out the percentage of text taken by the word की
>>> 100 * txt2.count('की') / len(txt2)
2
>>>
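The result 2 is a whole number because, under Python 2, / between two integers drops the fractional part. The arithmetic can be checked in Python 3 with the counts reported elsewhere in this tutorial (236 occurrences of की in Part-III, 9408 words in hindi.pos in Part-I); that these are exactly the values behind the session above is an assumption, and the variable names are only illustrative.

```python
count_ki = 236      # frequency of 'की' as reported in Part-III
total_words = 9408  # word count of 'hindi.pos' as reported in Part-I

# Floor division reproduces the Python 2 behaviour of "/" on integers.
percent_floor = 100 * count_ki // total_words
print(percent_floor)

# True division keeps the fraction.
percent_exact = round(100 * count_ki / total_words, 2)
print(percent_exact)
```

Under Python 3 the plain / operator gives the fractional value, so // is needed to reproduce the whole-number result.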
Producing a lexical dispersion plot from an Indian Language corpus
For that we have to use a small trick.
This is the command for producing the lexical dispersion plot of भारत and की in the Hindi corpus
>>> from nltk.text import Text
>>> Text(txt2).dispersion_plot(['भारत','की'])
Text() converts the word list to an NLTK Text object, which makes the plotting job easy. In the plot you can't read the words, because the Unicode text is displayed as boxes.
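Selecting words based on parameters from the Hindi corpus
For this I am taking the example mentioned in the NLTK book.
{w | w is a member of V and P(w)}
[w for w in V if p(w)]
>>> from nltk import FreqDist
>>> V = set(txt2)
>>> my_word = [w for w in V if len(w) > 25]
>>> sorted(my_word)
>>> fd = FreqDist(txt2)
>>> sorted([w for w in set(txt2) if len(w) > 5 and fd[w] > 25])
It will give the following list of words as output. (The listed words are only two to four characters long, which suggests that len() was counting the bytes of the UTF-8 encoded words here, three per Devanagari character, rather than characters.)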
इस
एक
और
कर
कहा
का
कि
किया
......
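The filter len(w) > 5 returns words that look much shorter than six characters, so it is worth checking what len() counts. Whether it counts characters or bytes depends on the string type. A quick Python 3 check with two words from the list above shows the difference; that the original session held UTF-8 byte strings is an assumption, not something the output proves.

```python
for word in ["इस", "किया"]:
    n_chars = len(word)                  # Unicode code points
    n_bytes = len(word.encode("utf-8"))  # bytes in the UTF-8 encoding
    print(word, n_chars, n_bytes)
```

Each Devanagari code point takes three bytes in UTF-8, so even a two-character word passes a byte-based len(w) > 5 test.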
Conditional frequency distribution for Indian Language corpora
Here is an example
Step 1
Generate bigrams
>>> hinbi = nltk.bigrams(hinw)
Step 2
Generate Conditional Frequency Distribution
>>> gd = nltk.ConditionalFreqDist(hinbi)
You can plot the CFD, but it takes some time to generate the plot, and you will see something like an animation while it is drawn.
>>> gd.plot()
You can also print the CFD as a table
>>> gd.tabulate()
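For small inputs, what a conditional frequency distribution over bigrams holds can be sketched with the standard library: the first word of each bigram is the condition, and the second word is counted under it. This is a plain Python 3 approximation, not NLTK's implementation; the word list is the sentence from the parsing example in this part, typed in by hand.

```python
from collections import Counter, defaultdict

words = ("इराक के विदेश मंत्री ने अमरीका के उस प्रस्ताव का मजाक उड़ाया है , "
         "जिसमें अमरीका ने संयुक्त राष्ट्र के प्रतिबंधों को इराकी नागरिकों के लिए "
         "कम हानिकारक बनाने के लिए कहा है ।").split()

# condition -> Counter of the words that follow it
cfd = defaultdict(Counter)
for first, second in zip(words, words[1:]):
    cfd[first][second] += 1

# Followers of the word at index 1 ('के'), most frequent first.
print(cfd[words[1]].most_common(2))
```

On the full corpus the table has one row per distinct word, which is why plotting or tabulating it takes a while.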
More coming soon
Happy hacking !!!!!!!!
During my presentation at the Indian Python conference, somebody asked about Indian Language corpus processing in NLTK. Somehow I skipped the answer: I knew that an Indian Language corpus is included in NLTK, but I had never tried to play with it. After the conference I did some experiments with it, and I am posting them with the results here. If something can be done in a better way, please tell me so that I can improve.
The Natural Language Toolkit contains a POS tagged corpus of Indian languages. It is available for Bangla, Hindi, Marathi and Telugu.
Total number of words:
Bangla 10281
Hindi 9408
Marathi 19066
Telugu 9999
Total number of sentences:
Bangla 899
Hindi 541
Marathi 1197
Telugu 994
Let's see how to access Indian Language corpora in NLTK and how to play with it.
>>> from nltk.corpus import indian
It will import Indian Language corpus from NLTK data.
>>> indian.fileids() # Shows files in Indian Language corpus collection in NLTK
['bangla.pos', 'hindi.pos', 'marathi.pos', 'telugu.pos']
>>>
To find the number of characters in each language corpus
>>> for f in indian.fileids():
...     print f
...     print len(indian.raw(f))
It will produce the following output
bangla.pos
209525
hindi.pos
175045
marathi.pos
429234
telugu.pos
251391
To find the number of words in each language corpus
>>> for f in indian.fileids():
...     print f
...     print len(indian.words(f))
...
It will produce the following output
bangla.pos
10281
hindi.pos
9408
marathi.pos
19066
telugu.pos
9999
To find the number of sentences
>>> for f in indian.fileids():
...     print f
...     print len(indian.sents(f))
...
It will produce the following output
bangla.pos
899
hindi.pos
541
marathi.pos
1197
telugu.pos
994
We can extract sentences from these corpora too.
To access the sentences from the Hindi corpus
>>> hindi_sent = indian.sents('hindi.pos')
To print individual sentences
>>> for hsen in hindi_sent:
...     print hsen
...
It will print each sentence as a list of words. Let's see how to store the sentences in a file.
>>> it = open("jhi.txt",'w')
>>> for hsen in hindi_sent:
...     it.write(" ".join(hsen) + "\n") # Write one sentence per line
...
>>> it.close()
The above piece of code converts each list of words into an actual sentence and stores it in the specified file.
To access the words in a corpus
>>> hin_word = indian.words('hindi.pos')
This piece of code will store all the words in the hindi.pos file in hin_word.
Just as we stored the sentences in a file, we can write the words to a file too.
>>> hwo = open("hcwo.txt",'w') # Open a file to store the words
>>> for hw in hin_word:
...     hwo.write(hw + "\n") # Write each word on its own line
...
>>> hwo.close()
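One detail of str.join is worth checking before writing words to a file: it joins the items of whatever iterable it receives, so a list of words becomes a sentence, while a single word (itself a sequence of characters) comes back with the separator between its characters. Here is a small Python 3 sketch; the file name words_demo.txt and the temporary directory are made up for the illustration.

```python
import os
import tempfile

sentence = ["विकेट", "लिये", "।"]

joined_sentence = " ".join(sentence)  # words separated by spaces
joined_word = " ".join("abc")         # characters separated by spaces

# Write one word per line, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "words_demo.txt")
with open(path, "w", encoding="utf-8") as out:
    for word in sentence:
        out.write(word + "\n")

with open(path, encoding="utf-8") as src:
    lines_back = src.read().splitlines()

print(joined_word)
print(lines_back == sentence)
```

Writing word + "\n" keeps every word intact on its own line, and the with blocks close the files automatically.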
To access the POS tagged sentences from a corpus
>>> hpos = indian.tagged_sents('hindi.pos')
>>> for sent in hpos:
...     print sent
It will print each POS tagged sentence as a list of tuples.
More coming soon !!!!!
Happy hacking
[ Python ]
by jaganadhg
@ 06.10.2009 15:47 GMT
I tried something in Python3 again, but it gave some strange results.
See the code given below.
==== Code Begin =========
def ജഗന് ():
    print("എന്റെ പേര് ജഗന് എന്നാണ്")
    ജഗന് = "ഞാന്"
    print(ജഗന് )
ജഗന് ()
==== Code End ===========
When I tried to execute this, it threw an error.
~/pypract$ python3 tes2.py
File "tes2.py", line 2
def ജഗന് ():
^
SyntaxError: invalid character in identifier
~/pypract$
I thought that it may be due to the 'ZWJ' (Zero Width Joiner) character in the function and variable names I used. So I decided to rewrite the same code without the 'ZWJ' character. The code is given below
==== Code Begin =====
def ജഗന്():
    print("എന്റെ പേര് ജഗന് എന്നാണ്")
    ജഗന്= "ഞാന്"
    print(ജഗന്)
ജഗന്()
==== Code End =======
This code executed without any error. What I did was replace the ന് with its Unicode 5.1 equivalent.
The output is
/pypract$ python3 tester.py
എന്റെ പേര് ജഗന് എന്നാണ്
ഞാന്
I can't understand what is happening. Is it a logical mistake I made in my program?
Or is it a problem related to ZWJ and Python?
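One way to investigate is to print the code points hiding in a name. The sketch below, for Python 3, builds two spellings of the final ന് from escape sequences: NA + VIRAMA + ZERO WIDTH JOINER, and the single chillu code point added in Unicode 5.1. That the two files differed in exactly this way is an assumption, and describe is a made-up helper name. Zero Width Joiner is a format character (category Cf), which could explain the complaint about an invalid character in the identifier, though this has not been verified against the Python 3 release used above.

```python
import unicodedata

with_zwj = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZWJ (assumed older spelling)
atomic_chillu = "\u0d7b"         # single chillu code point (Unicode 5.1)

def describe(text):
    """List (code point, category, name) for every character in text."""
    return [("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

for label, text in [("with ZWJ", with_zwj), ("atomic chillu", atomic_chillu)]:
    print(label)
    for row in describe(text):
        print("   ", *row)
```

Running describe() on the names copied from the two source files would show whether they really differ in this way.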
I found an interesting article in the October 2009 issue of Communications of the ACM: "Human Interaction for High-Quality Machine Translation" by Francisco Casacuberta, Jorge Civera, Elsa Cubel, Antonio L. Lagarda, Guy Lapalme, Elliott Macklovitch and Enrique Vidal.
It is a nice and informative one.
Don't miss it.
[ Python ]
by jaganadhg
@ 04.10.2009 15:35 GMT
See the Python code given below. What do you think? Will it execute without throwing errors or not?
####Code begin#####################
import sys
def പണിയെടുക്കൂ ( പാഠം):
    for വരി in പാഠം:
        print( വരി )
വരവ് = sys.argv[1]
മൊത്തം = open(വരവ്,'r').readlines()
പണിയെടുക്കൂ(മൊത്തം)
######## Code End ######################
Don't scratch your head: it will, if you use Python3 to run the code.
Save the code as test.py, install Python3, and run the program as python3 test.py <your file>
I just saw some new Python documentation for Python3 with some similar examples. Thanks to Santhosh Thottingal (SMC) for pointing me to the link. Then I decided to experiment with it.
Wow, great: in Python3 you can write variable names and function names in your local language. But you won't get the Python reserved words in your language. I think Python is the first programming language which provides such a great facility.
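This can be checked from Python 3 itself. str.isidentifier() reports whether a string is acceptable as a name, and the keyword module knows the reserved words, which stay in English. The first name below is taken from the program above; "2fast" is a made-up counter-example.

```python
import keyword

names = ["വരി", "sys", "2fast"]
results = {name: name.isidentifier() for name in names}
for name in names:
    print(name, results[name])

# Reserved words are fixed by the language and cannot be localised.
print(keyword.iskeyword("for"), keyword.iskeyword("വരി"))
```

A name can use letters from another script, but it still cannot start with a digit, and for, def and the other keywords must be typed as they are.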