I used to play a lot with text databases. Today I was thinking of migrating some of my data collections to CouchDB. I used the following script to convert one of my MySQL tables (almost all fields are TEXT) to a CouchDB collection.
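import couchdb
import MySQLdb as mdb

# Connect to the local CouchDB server and create the target collection
couch = couchdb.Server()
db = couch.create('YOUR_COLLECTION_NAME')

# DictCursor returns every row as a {column_name: value} dict
con = mdb.connect(host='HOST_NAME', user='YOU', passwd='YOUR_PASS', db='YOUR_DB')
cur = con.cursor(mdb.cursors.DictCursor)
cur.execute("SELECT * FROM YOUR_DB_TABLE")
results = cur.fetchall()
for result in results:
    db.save(result)  # one CouchDB document per table row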
The DictCursor in the Python MySQLdb API was a great help in creating fields and values in the CouchDB collection. As my table contained only text data, the operation was smooth and I was able to migrate about 1 GB of data to CouchDB. But !!! life is not easy: if your text data has encoding issues or junk values that can't be converted to Unicode, you are in trouble. Don't worry, here comes the solution: replace the last two lines of the code with the code given below.
for result in results:
    k = result.keys()
    v = result.values()
    v = [repr(i) for i in v]  # repr() escapes junk bytes into a plain string
    d = dict(zip(k, v))
    db.save(d)
Hmm, so far so good. But I tried the same code with a different table whose structure is like this:
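+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| ID    | int(11)      | NO   | PRI | NULL    | auto_increment |
| NAME  | varchar(30)  | NO   |     |         |                |
| PRICE | decimal(5,2) | NO   |     | 0.00    |                |
+-------+--------------+------+-----+---------+----------------+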
Now the code threw a big list of errors. Life is not easy !! I have to find a good solution for this ... Happy hacking !!!!
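One possible direction, as a minimal sketch only. It assumes the errors come from column types such as DECIMAL, which MySQLdb normally returns as decimal.Decimal objects that a JSON encoder cannot serialize; the post does not show the actual traceback, so this is a guess. The helper names to_json_safe and row_to_doc are made up for illustration, and the sketch is written for Python 3.

```python
import datetime
import decimal
import json


def to_json_safe(value):
    """Convert one column value into something a JSON encoder accepts."""
    if isinstance(value, decimal.Decimal):
        return str(value)  # keep the exact digits as text
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    if isinstance(value, bytes):
        # Assumption: undecodable bytes are replaced rather than raising
        return value.decode("utf-8", errors="replace")
    return value


def row_to_doc(row):
    """Build a CouchDB-ready document from one DictCursor row."""
    return {key: to_json_safe(value) for key, value in row.items()}


# Hypothetical row shaped like the table above
sample_row = {"ID": 1, "NAME": "tea", "PRICE": decimal.Decimal("0.00")}
print(json.dumps(row_to_doc(sample_row)))
```

In the migration loop this would mean calling db.save(row_to_doc(result)) instead of saving the raw row. Storing the price as a string keeps it exact; converting it to float would be shorter but can change the value.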
Python Testing Cookbook by Greg L. Turnquist is one of the latest books from Packt Publishing. It is the second book on Python testing; the first was Python Testing: Beginner's Guide by Daniel Arbuckle. The Python Testing Cookbook is a collection of useful, easy-to-learn tips and tricks. Even though the book is labeled a "cookbook", it is quite useful for newbies in Python testing too. All the essential tips for getting started with Python testing are given in the introductory chapters with illustrative examples. The book can serve as a good resource for training Python newbies in the art of software testing.
The first chapter of the book deals with the basics of unittest in Python. It gives step-by-step examples for understanding unit testing.
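To give a flavour of what basic unittest code looks like, here is a minimal sketch. It is not taken from the book; the add function and the test names are hypothetical and exist only for illustration.

```python
import unittest


def add(a, b):
    """Hypothetical function under test."""
    return a + b


class AddTest(unittest.TestCase):
    def test_adds_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_rejects_mixed_types(self):
        # int + str raises TypeError in Python
        with self.assertRaises(TypeError):
            add(2, "3")


def run_tests():
    """Run AddTest programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved in a file, the same test case can also be run from the command line with python -m unittest.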
The second chapter helps you put your nose into testing with the Python "nose" framework. It gives insights on how to write nose plugins to make the tester's life easy. The third chapter deals with doctest. It gives a good introduction to writing docstrings for doctest, and the art of doctest is well covered. The fourth chapter deals with testing in behavior-driven development and introduces the Mock, mockito and Lettuce testing tools. The fifth chapter deals with acceptance testing with the Pyccuracy and Robot tools, and gives some insight into Selenium too. The sixth chapter speaks about test automation with Continuous Integration (CI) and introduces Jenkins and NoseXUnit. The chapter is very useful for people who follow the waterfall model in software development. The seventh chapter discusses test coverage. It is a bit complicated for beginners, with some play with databases, Spring Python etc. In some examples I feel that the element of testing is missed out ;-). The eighth chapter deals with smoke and load testing in Python and introduces the tool Pyro too. The ninth chapter is a collection of general advice for automated testers. After getting your hands dirty with testing, you can relax and clear your doubts with the advice in this chapter.
The book comes with extensive code samples. Even though the book is all about testing, I found one bad coding practice throughout the examples: import * . This makes the learner scratch his head to understand what comes from where. I think it is worth buying and reading the book to get good insights on test automation with Python. It is a good book for beginners to learn testing and a good reference book for experienced professionals too.
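To illustrate the import * complaint, here is a small sketch. It is not from the book; the functions hypotenuse and report_path are hypothetical. With star imports the reader cannot tell which module a name came from, while explicit imports make the origin visible at the point of use.

```python
# Star imports hide where names come from:
#   from math import *
#   from os.path import *
# After those two lines, is `join` from os.path? Where does `sqrt` live?

# Explicit imports keep the origin of every name visible.
import math
from os.path import join


def hypotenuse(a, b):
    """math.sqrt is clearly the standard library square root."""
    return math.sqrt(a * a + b * b)


def report_path(directory, name):
    """join is clearly os.path.join, thanks to the explicit import."""
    return join(directory, name + ".txt")
```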
Disclaimer: I received a free eBook from Packt for review.
Packt Publishing has released a new Python book, "Python Testing Cookbook" by Greg L. Turnquist. I received a review copy of the book today. I will put a review of the book here soon. A sample chapter from the book is available on the Packt web page.
Language : English
Paperback : 364 pages [ 235mm x 191mm ]
Release Date : May 2011
ISBN : 1849514666
ISBN 13 : 978-1-84951-466-8
Author(s) : Greg L. Turnquist
Yesterday I was listening to Van Lindberg's talk at PyCon US 2011 about patent mining. In his talk he mentioned the Yahoo! Term Extractor Web Service. Some time ago I had heard that the service was no longer available. I checked the web site again and found that it is working now. I played with the web service using a simple Python script. Soon after seeing that it worked fine, I created a quick and dirty Python API for the web service. I am sharing the code and documentation here.
Sample : https://bitbucket.org/jaganadhg/yahootermextract/wiki/Home
After finishing the work I searched the net and found that some similar scripts are already available :-(
Happy Hacking !!!!!!!!!!!!
Last Thursday (17th March 2011) I conducted a Python training session at Kongu Engineering College, Perundurai, Erode as part of the Infocruise program. I got the opportunity through the ILUGC mailing list. Thanks, Sreenivasan T.
I started my journey from Coimbatore on Thursday morning in a TNSTC bus. I reached Perundurai by 10.00 A.M. Maharajan, a student from Kongu, and his friend came to the bus stand to pick me up. By 10.15 we reached the college. The college is located in a nice, calm and cool place. It is a green campus. After a small refreshment we went to the workshop venue. Around 60 students from different colleges participated in the workshop. The workshop was only partially hands-on: there were some power issues in the lab, so we were not able to do a full-fledged hands-on session on Python. The students were excited to learn such a simple and elegant programming language. They came up with a couple of interesting questions regarding Python job opportunities, application development etc.
The workshop ended by 4.00 PM. Maharajan dropped me at the Perundurai bus stand. I caught a bus to Coimbatore and reached the office by 06.30 P.M.
Some snaps (Thanks to Maharajan for sharing the photos)
A one-day workshop on Free and Open Source Software was conducted at PSR Engineering College, Sevalpatti, Sivakasi, Tamilnadu. I was invited to give an introduction to Python in the workshop. Mr. Chidambaresan, an alumnus of PSR Engineering College, picked me for the workshop. I reached Sattur by 4.30 in the morning and Chidambaresan took me to a lodge for refreshments. By 7.30 A.M. Chidambaresan and his friend arrived at the lodge and we started for the college. On the way we picked up Suthan (HOD, MCA, Sivanthi Engineering College) from Kovilpatti. After having breakfast at Kovilpatti we headed towards the college. During the journey we discussed FOSS and the engineering syllabus, ILUGC, ILUGCBE and a bit of politics too ;-) . We reached the college by 09.30 AM and met the HOD, faculty members and Principal of the college. We had a nice discussion about students and their learning mentality, and the necessity of motivating students to learn FOSS and contribute to FOSS. The college is located in a very nice and pleasant village called Sevalpatti. Most of the students are from nearby villages or towns.
We started the workshop by 10.10 A.M. There was a small meeting which introduced Suthan and me to the students. After the introduction Suthan started his session on the introduction to FOSS and FOSS philosophy. He started the talk with nice examples and explained the concept of FOSS and its history. He even explained it in Tamil too, so that the message reached the hearts of the students. After the talk he gave a demo on how to install Ubuntu. The students came up with lots of doubts about FOSS. After lunch we started the second session, on Python programming. Jaganadh G (me) gave an interactive lecture on the basics of Python programming. The students were surprised to learn such a simple and powerful language. Then Chidambaresan gave a short and inspiring talk on the necessity for students to learn and contribute to FOSS. After the Python session there was a valedictory function too. The HOD and faculty members of the IT department were present at the function. The HOD of the IT department distributed certificates and Ubuntu CDs to the participants. Some of the students came forward to give feedback on the workshop. This is the first time FOSS has been introduced in the college. Chidambaresan took the pains to make it happen. We hope that some of the students will follow the message of FOSS.
The workshop ended by 4.00 PM and we returned to Sattur by car, hoping that FOSS will flourish in the Sivakasi region.
Some snaps from the workshop
Last month I was playing with Apache CouchDB. Just some introductory stuff, map reduce etc. Soon I received some linguistic data in .csv format as part of a project I was managing. There was a need to analyze it. Usually we used MySQL or spreadsheets to store and analyze the data. Suddenly I thought, why can't I do it with CouchDB ?? There was no direct option for importing CSV data into CouchDB. I searched the web and ended up with a hint. Manilal, a friend of mine, also pointed to the same hint: http://www.apacheserver.net/Load-CSV-file-into-couchdb-at1056996.htm .
Soon I created a small script to do the job, i.e. load a CSV file into CouchDB. The script is available in my Bitbucket repo: https://bitbucket.org/jagan/misc/src/84cefb61c86a/csv2couch.py . It is a quick solution. Maybe you have a better version !!! I thought putting it on the web might help somebody else.
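For readers who cannot reach the repo, the general idea can be sketched as below. This is not the csv2couch.py script itself, just a minimal illustration under some assumptions: the CSV file has a header row, each row becomes one document, and the function names csv_to_docs and load_csv are made up. The save step is passed in as a callable so the sketch runs without a CouchDB server.

```python
import csv
import io


def csv_to_docs(stream):
    """Yield one dict per CSV row, using the header row as field names."""
    for row in csv.DictReader(stream):
        yield dict(row)


def load_csv(stream, save):
    """Pass every row to `save` and return the number of rows stored."""
    count = 0
    for doc in csv_to_docs(stream):
        save(doc)
        count += 1
    return count


# In-memory stand-ins so the sketch runs without a CouchDB server
sample = io.StringIO("word,pos\nrun,VB\ndog,NN\n")
stored = []
load_csv(sample, stored.append)

# With couchdb-python and a running server this would be something like:
#   db = couchdb.Server()['YOUR_COLLECTION_NAME']
#   with open('data.csv', newline='') as handle:
#       load_csv(handle, db.save)
```

All values arrive as strings, because the csv module does no type conversion; numeric columns would need an explicit conversion before saving.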
Happy Hacking !!!
Python 2.6 Text Processing Beginner's Guide by Jeff McNeil is one of the latest books from Packt Publishing. I received the review copy of this book about one and a half months ago. Due to a busy schedule I was not able to finish the review. Finally I got enough time to review it. The book gives good insight into different technical aspects of text processing and the use of the Python standard library and third-party libraries for it. It is filled with lots of examples and practical projects. When I started my career in the Natural Language Processing domain, I think it took me almost a year to gather the knowledge covered in this book. I am giving a somewhat detailed review of the book here.
The first chapter of this book gives some practical and interesting exercises like implementing a cipher and some basic tricks with HTML. It also discusses how to set up a Python virtual environment for working with the examples in the book. The section on setting up the virtual environment is a nice and well-written one. It gives a clear idea of how to set up virtual environments.
The second chapter deals with Python IO. It narrates the basic file operations in Python. The use of a context manager (the with statement) for file processing is discussed in this chapter. I have been using Python for text processing for the last three to four years, but only after reading this chapter did I find that there is something called "fileinput" in Python for multiple-file access. The chapter discusses how to access remote files and the StringIO module in Python. At the end of the chapter there is a discussion about IO in Python 3 too.
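As a small illustration of what fileinput offers, here is a minimal sketch. It is not from the book; it targets Python 3 and the function name count_lines is hypothetical. It reads several files as one continuous stream while still knowing which file each line came from.

```python
import fileinput
import os
import tempfile


def count_lines(paths):
    """Count lines per file while reading all files as one stream."""
    counts = {}
    with fileinput.input(files=paths) as stream:
        for line in stream:
            name = stream.filename()  # file the current line came from
            counts[name] = counts.get(name, 0) + 1
    return counts


# Build two small throwaway files so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "a.txt")
    second = os.path.join(folder, "b.txt")
    with open(first, "w") as handle:
        handle.write("one\ntwo\n")
    with open(second, "w") as handle:
        handle.write("three\n")
    totals = count_lines([first, second])
    summary = {os.path.basename(path): n for path, n in totals.items()}
```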
The third chapter is about Python String Services. It deals with string formatting, templating, modulo formatting etc. Every concept is explained with the necessary mini projects, which follow on from chapter two. The chapter gives a comprehensive view of advanced string services in Python.
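The three styles mentioned above can be compared side by side in a short sketch. This is not an example from the book; the variable names and the sample sentence are made up for illustration.

```python
from string import Template

name, count = "corpus.txt", 3

# Modulo (percent) formatting
percent_style = "%s has %d lines" % (name, count)

# str.format with positional fields
format_style = "{0} has {1} lines".format(name, count)

# string.Template with named placeholders
template_style = Template("$name has $count lines").substitute(
    name=name, count=count
)
```

All three produce the same sentence; Template is the most restrictive, which makes it a reasonable choice when the pattern comes from users rather than from the program itself.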
The fourth chapter is entitled Text Processing Using the Standard Library. It deals with topics like reading and writing CSV files, playing with application config files (.ini files), and working with JSON. The examples are a bit long but worth practicing for better understanding.
The fifth chapter deals with one of the key aspects of text processing, "Regular Expressions". The chapter teaches the basic syntax of regular expressions in Python. It also discusses advanced processing like regex grouping and look-ahead and look-behind assertions. The look-behind operation is the trickiest part of dealing with regex. I think only masters of regex can do it effectively ;-) . The chapter discusses the basics of Unicode regular expressions too. It is filled with enough examples for each and every concept discussed.
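Since look-ahead and look-behind are the tricky part, a tiny sketch may help. It is not taken from the book, and the sample text is made up. Both assertions check the surrounding context without including it in the match.

```python
import re

text = "price: $40, weight: 40kg, discount: $5"

# Look-behind: digits that are preceded by a dollar sign
dollars = re.findall(r"(?<=\$)\d+", text)

# Look-ahead: digits that are followed by "kg"
kilos = re.findall(r"\d+(?=kg)", text)

# Negative look-behind: digits not preceded by "$" or by another digit
not_dollars = re.findall(r"(?<![$\d])\d+", text)
```

The extra \d in the negative look-behind matters: without it the pattern would still match the "0" of "$40", because that digit is preceded by "4" and not by "$".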
The sixth chapter deals with Markup Languages. It discusses XML and HTML processing with Python standard libraries. The xml.dom.minidom, SAX, lxml and BeautifulSoup packages are discussed with illustrative examples.
The seventh chapter is entitled Creating Templates. "Templating involves the creation of text files, or templates, that contain special markup. When a specialized parser encounters this markup, it replaces it with a computed value". The templating concept was quite new to me, but I got good insight into the topic from this chapter. The chapter discusses libraries like "Mako" for templating tasks.
The eighth chapter deals with localization (l10n) and encoding. If you are working with non-English data, this chapter is a must-read for you. The chapter discusses character encoding, Unicode processing and Python 3 too. Apart from the mere Python stuff, this chapter gives good insight into character encoding in general.
The ninth chapter, Advanced Output Formats, is quite useful if you are trying to create output in PDF, CSV or Excel format. The chapter discusses ReportLab, a PDF generation library for Python. The only disadvantage I found in ReportLab is its lack of complete Unicode support. The chapter also discusses creating Excel files with the xlwt module. Finally, it deals with handling the OpenDocument format with the ODFPy module too. I used to only read Excel files from Python, but after going through this book I am able to write Excel output too.
The tenth chapter deals with Advanced Parsing and Grammars. Creating custom grammars for parsing specific data is one of the key skills required of people doing text processing in Python. Throughout my career I have spent a lot of time training engineers to understand parsing and BNF grammar. This time I got a good pointer for my people to start with BNF and Python programming. The chapter also discusses some parsing modules in NLTK, my favorite Python library. Some advanced topics in PyParsing are also discussed.
The eleventh and last chapter is the most interesting one in the book. It deals with Searching and Indexing. PyLucene is the best-known search indexing library in Python, but it is a wrapper around Apache Lucene. This chapter discusses another Python tool, Nucular. Practical examples for creating a search index etc. are given in this chapter. This is the first time I have used Nucular. I find it a nice and easy tool compared to PyLucene, but I don't think it is superior to Lucene. I will play more with this tool and write about it in another blog post.
There are two appendices. The first gives pointers to Python resources. The second has the answers to the pop quizzes in the chapters.
I give this book 9 out of 10. If you are dealing with rigorous text processing, this book is a must-have reference for you.
Packt Publishing has released a new book, "Python 2.6 Text Processing Beginner's Guide" by Jeff McNeil. I received a review copy of the book today. I will put a review of the book here soon. The book comes with a lot of practical examples and tips.
Language : English
Paperback : 380 pages [ 235mm x 191mm ]
Release Date : December 2010
ISBN : 1849512124
ISBN 13 : 978-1-84951-212-1
Author(s) : Jeff McNeil
Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins is one of the latest books published by Packt in its Open Source series. The book is meant for people who have started learning and practicing the Natural Language Toolkit (NLTK). NLTK is an open source Python library for learning, practicing and implementing Natural Language Processing techniques. The software is licensed under the Apache Software License. It is one of the most widely recommended toolkits for beginners in NLP to get their hands dirty. The toolkit is part of the syllabus at many institutions around the globe where Natural Language Processing or Computational Linguistics courses are offered. Perkins's book is the second book published on NLTK. The first book was written by core developers of NLTK, Steven Bird, Ewan Klein, and Edward Loper, and published by O'Reilly. The book by Bird et al. is a comprehensive introduction to the toolkit with basic Python lessons. People who have gone through that book will definitely like the new book by Perkins. The book is a must-have desktop reference for students, professionals, and faculty members interested in NLP, Computational Linguistics and NLTK. Perkins handles the topic in an elegant way. Most people who have searched for NLTK tips will have come across the author's blog. He maintains the same simplicity, explanation style and hands-on approach throughout the book, which makes the topic easy for the reader to digest. The book is a collection of practical, working recipes related to NLTK.
The first chapter of the book, "Tokenizing Text and WordNet Basics", deals with tokenizing text into words, sentences and paragraphs. The chapter also gives tips and tricks for the WordNet module in NLTK. Perkins discusses Word Sense Disambiguation (WSD) techniques in this chapter. The missing part in the WordNet coverage is the use of the WordNet 'ic' function. Tips for extracting collocations from a corpus are also included in the first chapter. The second chapter, "Replacing and Correcting Words", discusses stemming, lemmatization and spelling correction. He introduces another Python module, Python-Enchant, when discussing spell checking. The chapter also discusses techniques like replacing negations with antonyms and replacing repeating characters. The third chapter deals with corpora. It mainly discusses how to load user-generated corpora into NLTK with the corpus readers implemented in NLTK. The most attractive part of this chapter is the discussion of a MongoDB backend for corpus readers in NLTK. MongoDB is a document-oriented database which belongs to the NoSQL family. This part will be very useful for students of NLP and working professionals. The fourth chapter deals with POS tagging techniques. It mainly discusses training different POS taggers and using them. It is also quite useful for people who would like to extend the functionality of NLTK for their projects and for people who are interested in extending POS taggers to languages other than English. Some of this chapter's content was published on the author's blog a year ago. Chapter five deals with chunking and chinking techniques in NLTK. Named entity identification and extraction techniques are also discussed in this chapter. It gives good insight into training the NLTK chunking module for custom chunking tasks. With the help of this chapter I was able to create a small named entity extraction script with some Indian names.
The sixth chapter is named "Transforming Chunks and Trees" and deals with verb form correction, plural to singular correction, word filtering, and playing with tree structures. Many times I have seen people raise questions about handling tree data in NLTK. I think they can refer to this chapter to get good insight into working with NLTK parse tree data. The seventh chapter deals with the most wanted topic of the time, "Text Classification". Some parts of this chapter appeared as blog posts on Perkins's blog. There have been many requests on freelancing web sites for text classification with NLTK, and I found that some of them had not even received bids. The chapter discusses the task of text classification in detail, with all the classifier implementations available in NLTK. Training an NLTK classifier is explained very clearly. Apart from classifier training and classification, the chapter discusses classifier evaluation and tuning too. The eighth chapter is a revolutionary one, which deals with distributed data processing and handling large-scale data with NLTK. I was not able to fully play with all the code in this chapter (yes, I worked through the code in the other chapters and it was quite exciting; it contributed to my professional life too). This chapter will be really helpful for industry people who are looking to adopt NLTK in NLP projects. Some basic insights from this chapter were also published on Perkins's blog. After Nitin Madnani's talk at the US Python Conference on corpus processing with Dumbo and NLTK, I think this is the only existing resource for practical large-scale data processing with NLTK. The ninth and last chapter is about parsing specific data with Python. It deals with some Python modules rather than NLTK itself, and discusses URL extraction, timezone look-up, character conversion etc. This chapter is good for people who work on web data processing such as harvesting.
The book has an appendix on the "Penn Treebank" tag set. It gives a list of all the tags with their frequencies in the treebank corpus.
For the last three or four years I have been using NLTK to teach and to develop prototypes of NLP applications. I was very much excited when I went through each of the recipes in this book. The author provides UML diagrams for the modules in NLTK, which help the reader get good insight into the functionality of each module. This will be a good book not only for students and practitioners but also for people who would like to contribute to the NLTK project. The book will also help students of NLP and Computational Linguistics do their projects with NLTK and Python. I give the book 9 out of 10. Natural Language Processing students, teachers and professionals, hurry and bag a copy of this book.
Thanks to Packt Publishing for the review copy of the book.