Quick MySQL to CouchDB migration with Python

I used to play a lot with text databases. Today I was thinking of migrating some of my data collection to CouchDB. I used the following script to convert one of my DB tables (almost all fields are TEXT) to a CouchDB database.

#!/usr/bin/env python
import couchdb
import MySQLdb as mdb

# Connect to the local CouchDB server and create the target database
couch = couchdb.Server()
db = couch.create('YOUR_COLLECTION_NAME')

# Connect to MySQL and fetch every row as a dictionary (column name -> value)
con = mdb.connect(host='HOST_NAME', user='YOU', passwd='YOUR_PASS', db='YOUR_DB')
cur = con.cursor(mdb.cursors.DictCursor)
cur.execute("SELECT * FROM YOUR_DB_TABLE")
results = cur.fetchall()

# Each row dictionary becomes one CouchDB document
for result in results:
    db.save(result)

The DictCursor in the Python MySQLdb API was a great help in creating the fields and values of the CouchDB documents. As my table contained only text data, the operation was smooth and I was able to migrate about 1 GB of data to CouchDB. But life is not that easy: if your text data has encoding issues or junk values that can't be converted to Unicode, you are in trouble. Don't worry, here comes a workaround; replace the last two lines of the code with the snippet below.

for result in results:
    k = result.keys()
    v = result.values()
    # repr() sidesteps Unicode conversion errors by storing a printable representation
    v = [repr(i) for i in v]
    d = dict(zip(k, v))
    db.save(d)

Hmm, so far so good. But then I tried the same code with a different table, whose structure looks like this:

+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| ID    | int(11)      | NO   | PRI | NULL    | auto_increment |
| NAME  | varchar(30)  | NO   |     |         |                |
| PRICE | decimal(5,2) | NO   |     | 0.00    |                |
+-------+--------------+------+-----+---------+----------------+

Now the code threw a big list of errors. Life is not easy !! I have to find a good solution for this ... Happy hacking !!!!
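
I have not cracked this yet, but one direction that might work is coercing every value into a JSON-friendly type before handing it to CouchDB. The sketch below is untested; it reuses the db and results objects from the first snippet and assumes the byte strings are meant to be UTF-8.

from decimal import Decimal
import datetime

def to_jsonable(value):
    """Coerce a MySQL value into something the CouchDB client can serialize."""
    if isinstance(value, Decimal):
        return float(value)  # or str(value) to keep the exact precision
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, str):
        # Byte strings from MySQL: decode, replacing anything that is not valid UTF-8
        return value.decode('utf-8', 'replace')
    return value

for result in results:
    doc = dict((key, to_jsonable(val)) for key, val in result.items())
    db.save(doc)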


Hadoop Database access experiment

Over the last couple of weeks I have been reading and practicing with the book "Hadoop in Action". After getting some insight into Hadoop and MapReduce, I worked out a couple of examples from the book as well as some example problems I created myself. Then I was discussing the features of Hadoop with some of my colleagues over a cup of tea. One of them asked a question about accessing a database from Hadoop and processing the data. I had seen some discussions related to Hadoop and database access somewhere on the internet, and finally I dug out the article "Database Access with Hadoop" from the Cloudera blog. After reading it, I decided to work through a sample problem.

To work out the Hadoop database access sample program I needed some data. Some time ago I had extracted a bunch of Tweets related to Gmail's new look and feel for some social media analysis practice. The extraction was done using the Twitter4j API, and the data is stored in a MySQL database. The database contains one table, called NewGamil, with the following structure:
        +---------+--------------+------+-----+---------+----------------+
        | Field   | Type         | Null | Key | Default | Extra          |
        +---------+--------------+------+-----+---------+----------------+
        | TweetId | int(11)      | NO   | PRI | NULL    | auto_increment |
        | Tweet   | varchar(240) | YES  |     | NULL    |                |
        +---------+--------------+------+-----+---------+----------------+

The problem I selected to work out is to fetch all the tweets from the table 'NewGamil' and perform a word count, with the result stored in HDFS. In fact there are ways to write the data back to the database itself, but I decided to first experiment with reading from the database ;-).

Hadoop provides a handy API for accessing databases: the DBInputFormat API. It allows us to read data from an RDBMS like MySQL, PostgreSQL or Oracle. To access the data from the DB we have to create a class that defines the data we are going to fetch (and, if needed, write back to the DB). In my project I created a class named GetTweets to accomplish this.

    public static class GetTweets implements Writable, DBWritable {
        String strTweet;

        public GetTweets() {

        }

        // Deserialize the tweet when Hadoop moves the record between stages
        public void readFields(DataInput in) throws IOException {
            this.strTweet = Text.readString(in);
        }

        // Pull the single 'Tweet' column out of the JDBC result set
        public void readFields(ResultSet resultSet) throws SQLException {
            this.strTweet = resultSet.getString(1);
        }

        // Left empty: this job only reads from the database
        public void write(DataOutput out) throws IOException {

        }

        // Left empty: this job only reads from the database
        public void write(PreparedStatement stmt) throws SQLException {

        }

    }

Since I am accessing only one field from the table, I defined just that in the readFields() method. The write() methods are kept blank because the project does not aim to write data back to the DB; I'll experiment with writing data and post about it soon. In the readFields() method we define how the data has to be extracted from the DB table. Since 'Tweet', the field I extract for processing, is a VARCHAR(), I read it as a string. This GetTweets class will be used as the input value type in our Mapper class.

Now let's write our Mapper class:

    public static class TweetWordCountMapper extends MapReduceBase implements
            Mapper<LongWritable, GetTweets, Text, IntWritable> {
        private final static IntWritable intTwordsCount = new IntWritable(1);
        private Text strTwoken = new Text();

        public void map(LongWritable key, GetTweets value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            GetTweets tweets = new GetTweets();
            tweets.strTweet = value.strTweet;
            // Split the tweet text into "twokens" and emit each one with a count of 1
            TwitterTokenizer twokenizer = new TwitterTokenizer();
            List<String> twokens = twokenizer.twokenize(value.strTweet
                    .toString());

            for (int i = 0; i < twokens.size(); i++) {
                output.collect(new Text(twokens.get(i)), intTwordsCount);
            }

        }

    }

In the Mapper class 'TweetWordCountMapper' I used the 'GetTweets' class as the input value type, so each call to map() receives one tweet fetched from the database; the data is accessed through the GetTweets object passed in.
NB: The code for TwitterTokenizer is taken from https://github.com/vinhkhuc/Twitter-Tokenizer.

Now we can write our reducer class:

    public static class TweetWordCountReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int intTwokenCount = 0;
            // Sum all the 1s emitted for this token
            while (values.hasNext()) {
                intTwokenCount += values.next().get();
            }
            output.collect(key, new IntWritable(intTwokenCount));
        }
    }

This reducer is responsible for summing the word counts and producing the final output.

After this we have to configure the job with the database connection details and the JDBC driver class.

        JobConf twokenJobConf = new JobConf(TweetWordCount.class);
        twokenJobConf.setJobName("twoken_count");

        twokenJobConf.setInputFormat(DBInputFormat.class); //Set input format here
        twokenJobConf.setOutputFormat(TextOutputFormat.class);// Sets the output format

        Path out = new Path("twokens"); // HDFS output directory for the word counts

        twokenJobConf.setMapperClass(TweetWordCountMapper.class);
        twokenJobConf.setCombinerClass(TweetWordCountReducer.class);
        twokenJobConf.setReducerClass(TweetWordCountReducer.class);

        twokenJobConf.setOutputKeyClass(Text.class);
        twokenJobConf.setOutputValueClass(IntWritable.class);

        DBConfiguration.configureDB(twokenJobConf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/GmailTrend", "jaganadhg", "jagan123"); //Specifies the DB configuration

        String[] fields = { "Tweet" }; //Specifies the Fields to be fetched from DB
        DBInputFormat.setInput(twokenJobConf, GetTweets.class, "NewGamil",
                null /* conditions */, "Tweet", fields); // Specifies the DB table and fields

        SequenceFileOutputFormat.setOutputPath(twokenJobConf, (Path) out);

        JobClient.runJob(twokenJobConf);


Before compiling and running the program we have to do some additional setup in the Hadoop ecosystem. The MySQL connector library has to be placed in the $HADOOP_HOME/lib folder. To download the connector .jar file, go to the MySQL Connector/J download page; I used the mysql-connector-java-3.1.14-bin.jar file in my program. After putting the jar in $HADOOP_HOME/lib, restart the Hadoop daemons. Voila! Now you are ready to run the program: package the code as a .jar file and run it.

The complete project is available in my Bitbucket repository.

Happy hacking !!!!!!!!!!!!


Book Review: MySQL for Python (Packt) by Albert Lukaszewski


https://www.packtpub.com/sites/default/files/imagecache/productview/0189OS_MockupCover_0.jpg


MySQL for Python by Albert Lukaszewski is a must-have for all Python programmers who work with a MySQL database. It provides a comprehensive overview of MySQL programming in Python. Until now there was a lack of such a good book on MySQL + Python: developers and newbies interested in working with MySQL and Python had to rely on scattered blog posts as reference material. Thankfully we now have a comprehensive book on MySQL programming with Python. If you are a Python programmer and a novice in MySQL, this book will definitely help you gain good knowledge of MySQL too. I am giving 4.5 out of 5 stars for this book.

The first chapter of the book deals with installation and configuration of the MySQLdb Python module. It is well written; even a newbie can do the installation properly after reading it. Installation methods for different GNU/Linux distros and for Microsoft Windows are explained. It is the first time I have read such an extensive installation introduction in any book. The chapter also discusses the basics of MySQL programming with Python.

The second chapter deals with query formation, passing queries to MySQL, dynamic query processing, and applying user-defined variables in MySQL queries. If you created the MySQL database described in the first chapter and follow along, you may run into trouble here. The DB you created is named 'menu' and contains a table 'fish', and this DB is also used for the second chapter's examples, but instead of "SELECT * FROM fish" the author uses "SELECT * FROM menu". I am not sure whether this is an error or not.
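
For readers who have not seen parameter passing with MySQLdb before, here is a minimal sketch; the connection details and the 'name' column are illustrative assumptions loosely following the book's 'menu'/'fish' example.

import MySQLdb

# Illustrative connection details; adjust for your own setup
con = MySQLdb.connect(host='localhost', user='YOU', passwd='YOUR_PASS', db='menu')
cur = con.cursor()

# Let the driver quote and escape the value instead of building the SQL string by hand
cur.execute("SELECT * FROM fish WHERE name = %s", ("tuna",))
for row in cur.fetchall():
    print row

cur.close()
con.close()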

The third chapter deals with data insertion (INSERT) into a MySQL database. Together, these chapters give a basic introduction to the MySQL database and to accessing it from a Python program. The fourth chapter deals with exception handling in MySQL Python programming; this chapter can be downloaded from the Packt website for free. It contains a concise description of the error handling mechanism in the python-mysqldb module, discussing six types of database errors and how to handle them.

The fifth chapter deals with fetching and processing records one at a time and in chunks. The sixth chapter discusses how to handle large data sets in MySQL Python programming and gives good examples of how to execute multiple INSERT statements rapidly. Together, the fifth and sixth chapters give a concise idea of dynamic data insertion and retrieval from MySQL with Python, including how to use the executemany() function in the python-mysqldb module and the contexts in which it should not be used.

The seventh chapter gives a detailed picture of how to create and drop databases. It also teaches you to manage database instances with Python and to automate database table creation. The eighth chapter is about user management in MySQL; it discusses user creation and access control in a MySQL database with Python. Handling date and time values is discussed in chapter nine, with illustrative examples of frequently used functions for managing dates and times with Python in MySQL.
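
As a rough illustration of the executemany() idea from chapter six, here is a minimal, untested sketch; the 'fish' table layout and the sample rows are assumptions made for the example.

import MySQLdb

# Illustrative connection details; adjust for your own setup
con = MySQLdb.connect(host='localhost', user='YOU', passwd='YOUR_PASS', db='menu')
cur = con.cursor()

rows = [("halibut", "21.95"), ("sole", "18.50"), ("cod", "9.75")]
# One executemany() call instead of a separate execute() per row
cur.executemany("INSERT INTO fish (name, price) VALUES (%s, %s)", rows)
con.commit()

cur.close()
con.close()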

Aggregate functions like COUNT(), MAX() and GROUP BY are discussed in chapter ten. The eleventh chapter discusses the use of WHERE, HAVING and so on, and joining tables; the examples are illustrated with the help of the 'sakila' DB. String processing in MySQL is discussed in chapter twelve, which covers the SUBSTRING() and CONCAT() statements.

The 13th and 14th chapters discuss some more advanced database programming concepts: the 13th covers accessing MySQL metadata and the 14th covers database backup and recovery techniques.

Each chapter contains mini projects on the concepts discussed, which helps the reader get their hands dirty with MySQL Python programming. The language and narration style of the book is very pleasant. Python-MySQL developers no longer need to search the web for tips and help.

The book touches on almost all MySQL database programming techniques except ORM. All the examples and projects are practically oriented; I personally worked through most of the code in the book.

The final word:
1) This book is a must-have reference on your desk if you are a Python programmer who works with a MySQL database.
2) If you are a Python programmer and a newbie in MySQL, you can gain knowledge of MySQL as well as of Python MySQL programming.
3) Experienced MySQL programmers can skip many parts of each chapter; even so, the book gives better tips than other blogs and resources on MySQL Python programming.
4) In short, it is a bible for Python-MySQL programmers.

Some issues I found in the book are:
1) The installation instructions for .rpm-based systems use "sudo yum install ...". 'sudo' is not used by many experienced programmers on those systems; it seems to be a command more associated with Debian-based systems like Ubuntu.
2) In the first chapter we create a DB called 'menu' with a table 'fish', but in subsequent chapters queries are introduced as "SELECT * FROM menu" etc. I am not sure if this is an error.
Go and get a copy of the book today.

New book by Packt: MySQL for Python

Packt Publishing provided me with an e-book of their new title "MySQL for Python" by Albert Lukaszewski. I will be reading the book during my Deepavali holidays and will post a review of it here.

They also provided a link to a free chapter (Chapter No. 4 - Exception Handling). If you are interested in Python and MySQL, feel free to read it.


https://www.packtpub.com/sites/default/files/imagecache/productview/0189OS_MockupCover_0.jpg