

Life update

It has been quite a long time since I updated anything on this blog. The reason is that FreeFlux had apparently ended its service and I was too lazy to manage a server on my own. Today I realized that FreeFlux is still operational, so let me restart with one big update in life:

We have been blessed with a baby boy, and we have named him Vaidyanadh. I am spending my leisure time with him :-)

I will be updating more technical stuff here soon.


Opinion Mining and Sentiment Analysis papers from Computational Linguistics Open Access Journal




HBase Administration Cookbook by Yifeng Jiang : Review

Packt Publishing has announced a new book, HBase Administration Cookbook by Yifeng Jiang. I think this is the first big-data book from Packt. As the name suggests, the book is essentially for people who are working with HBase and would like to dive deeper into HBase administration essentials. It covers various essential topics in HBase administration, from installation to performance tuning, and primarily targets big-data administration professionals. The author presents the art and science of HBase administration in nine systematically arranged chapters.

The first chapter deals with installing HBase on an Amazon EC2 instance and discusses various settings, ending with highly available master settings. The second chapter deals with migrating data to HBase, including a detailed discussion of how to migrate MySQL data; this may be interesting for people who plan to move existing data to HBase. The third chapter mainly gives an overview of HBase administration tools. Data backup and restoration are key concerns in data management, and the fourth chapter covers data backup, restoration, and replication in HBase. The fifth chapter deals with HBase cluster monitoring and diagnosis, and comes with handy scripts for reporting cluster status. Security aspects of HBase are discussed in chapter six, where security essentials for HBase and Hadoop with Kerberos are covered with detailed examples. Necessary troubleshooting aspects of HBase administration are discussed in chapter seven, while performance tuning and advanced configuration are discussed in chapters eight and nine.

The author presents each topic in a lucid and digestible manner. Necessary examples and explanations are provided throughout the book, which helps the reader gain hands-on experience in HBase administration. Even though some books are available on general aspects of HBase, this is the first book that deals with HBase administration in detail. It will be helpful for budding HBase and big-data administrators. Although the cluster setup and installation are discussed based on Amazon EC2, smart administrators can manage the same on their non-Amazon clusters.
Related Entries:
Hadoop Database access experiment
Hadoop Comic by Maneesh Varshney
Mahout in Action: Review
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNeil
New book by Packt: 'Python Text Processing with NLTK 2.0 Cookbook'

HBase Administration Cookbook by Yifeng Jiang new book from Packt

Packt Publishing has announced a new book, "HBase Administration Cookbook" by Yifeng Jiang.

I am reading it now; will bring a review soon.


Quick MySQL to CouchDB migration with Python

I used to play a lot with text databases. Today I was thinking of migrating some of my data collections to CouchDB. I used the following script to convert one of my DB tables (almost all fields are TEXT) to a CouchDB collection.

#!/usr/bin/env python
import couchdb
import MySQLdb as mdb

couch = couchdb.Server()
db = couch.create('YOUR_COLLECTION_NAME')

con = mdb.connect(host='HOST_NAME', user='YOU', passwd='YOUR_PASS', db='YOUR_DB')
cur = con.cursor(mdb.cursors.DictCursor)

cur.execute("SELECT * FROM YOUR_DB_TABLE")
results = cur.fetchall()

# DictCursor returns each row as a dict, which can be stored directly as a CouchDB document
for result in results:
    db.save(result)

The DictCursor in the Python MySQLdb API was a great help in creating the fields and values of the CouchDB documents. As my table contained only text data, the operation was smooth and I was able to migrate about 1 GB of data to CouchDB. But!!! life is not easy: if your text data has encoding issues or junk values that can't be converted to Unicode, you are in trouble. Don't worry, here comes the solution; replace the last two lines in the code with the code given below.

for result in results:
    k = result.keys()
    v = result.values()
    # repr() gives a safe string representation for values that fail Unicode conversion
    v = [repr(i) for i in v]
    d = dict(zip(k, v))
    db.save(d)

Hmm, so far so good. But then I tried the same code with a different table whose structure looks like this:

| Field | Type         | Null | Key | Default | Extra          |
| ID    | int(11)      | NO   | PRI | NULL    | auto_increment |
| NAME  | varchar(30)  | NO   |     |         |                |
| PRICE | decimal(5,2) | NO   |     | 0.00    |                |

Now the code threw a big list of errors. Life is not easy!! I have to find a good solution for this ... Happy hacking!!!!


Hadoop Database access experiment

Over the last couple of weeks I have been reading and practicing with the book "Hadoop in Action". After getting some insight into Hadoop and MapReduce, I worked out a couple of examples from the book as well as some example problems I created myself. Then I was discussing the features of Hadoop with some of my colleagues over a cup of tea, and one of them asked a question about accessing a database from Hadoop and processing the data. I had seen some discussions related to Hadoop and database access somewhere on the internet, and finally I dug out the article "Database Access with Hadoop" from the Cloudera blog. After reading it, I decided to work through a sample problem.

To work out a Hadoop database access sample program, I needed some data. Some time ago I had extracted a bunch of tweets related to Gmail's new look and feel for some social media analysis practice. The extraction was done using the Twitter4J API, and the data is stored in a MySQL database. The database contains one table, called NewGamil, with the following structure:
        | Field   | Type         | Null | Key | Default | Extra          |
        | TweetId | int(11)      | NO   | PRI | NULL    | auto_increment |
        | Tweet   | varchar(240) | YES  |     | NULL    |                |

The problem I selected to work out is: fetch all the tweets from the table 'NewGamil' and perform a word count, storing the result in HDFS. In fact there are ways to write data back to the database itself, but I decided to first experiment with reading from the database ;-).

Hadoop provides a handy API for accessing databases: DBInputFormat. The API allows us to read data from an RDBMS such as MySQL, PostgreSQL, or Oracle. To access the data from the DB we have to create a class that defines the data we are going to fetch from (and write back to) the DB. In my project I created a class named GetTweets to accomplish this.

    public static class GetTweets implements Writable, DBWritable {
        String strTweet;

        public GetTweets() {
        }

        public void readFields(DataInput in) throws IOException {
            this.strTweet = Text.readString(in);
        }

        public void readFields(ResultSet resultSet) throws SQLException {
            this.strTweet = resultSet.getString(1);
        }

        public void write(DataOutput out) throws IOException {
            // Left empty: this experiment does not write data back to the DB
        }

        public void write(PreparedStatement stmt) throws SQLException {
            // Left empty: this experiment does not write data back to the DB
        }
    }

Since I am accessing only one field from the table, I define only that field in the readFields() methods. The write() methods are kept empty because the project does not aim to write data back to the DB; I'll experiment with writing data and post about it soon. In the readFields(ResultSet) method we define how the data has to be extracted from the DB table. Since 'Tweet', the data I am extracting for processing, is a VARCHAR, I read it as a string and handle it as Text data in Hadoop. This class GetTweets will be used in our Mapper and Reducer classes.

Now let's write our Mapper class:

    public static class TweetWordCountMapper extends MapReduceBase implements
            Mapper<LongWritable, GetTweets, Text, IntWritable> {
        private final static IntWritable intTwordsCount = new IntWritable(1);
        private Text strTwoken = new Text();

        public void map(LongWritable key, GetTweets value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            GetTweets tweets = new GetTweets();
            tweets.strTweet = value.strTweet;
            TwitterTokenizer twokenizer = new TwitterTokenizer();
            List<String> twokens = twokenizer.twokenize(value.strTweet);

            // Emit each token with a count of one
            for (int i = 0; i < twokens.size(); i++) {
                output.collect(new Text(twokens.get(i)), intTwordsCount);
            }
        }
    }


In the mapper class TweetWordCountMapper I use the GetTweets class to fetch the values for processing; we can access the data by creating an object of that class inside the Mapper.
NB: The code for TwitterTokenizer is taken from
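
The original tokenizer is not reproduced here, so purely for illustration here is a hypothetical minimal stand-in with the same twokenize() signature used in the mapper; it only lowercases the tweet and splits on whitespace, which is far cruder than a real tweet tokenizer (it needs the java.util.ArrayList and java.util.List imports).

    // Hypothetical minimal stand-in for TwitterTokenizer (illustration only):
    // lowercase the tweet and split on whitespace.
    public static class TwitterTokenizer {
        public List<String> twokenize(String strTweet) {
            List<String> twokens = new ArrayList<String>();
            for (String token : strTweet.toLowerCase().split("\\s+")) {
                if (token.length() > 0) {
                    twokens.add(token);
                }
            }
            return twokens;
        }
    }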

Now we can write our Reducer class:

    public static class TweetWordCountReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int intTwokenCount = 0;
            // Sum the counts emitted by the mapper for this token
            while (values.hasNext()) {
                intTwokenCount += values.next().get();
            }
            output.collect(key, new IntWritable(intTwokenCount));
        }
    }

This Reducer is responsible for summing the word counts and producing the final output.

After this we have to configure the job with the database connection details and the driver class.

        JobConf twokenJobConf = new JobConf(TweetWordCount.class);

        twokenJobConf.setInputFormat(DBInputFormat.class); // Set the input format here
        twokenJobConf.setOutputFormat(TextOutputFormat.class); // Set the output format

        Path out = new Path("twokens");

        twokenJobConf.setMapperClass(TweetWordCountMapper.class);
        twokenJobConf.setReducerClass(TweetWordCountReducer.class);

        twokenJobConf.setOutputKeyClass(Text.class);
        twokenJobConf.setOutputValueClass(IntWritable.class);

        DBConfiguration.configureDB(twokenJobConf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/GmailTrend", "jaganadhg", "jagan123"); // Specify the DB configuration

        String[] fields = { "Tweet" }; // Fields to be fetched from the DB
        DBInputFormat.setInput(twokenJobConf, GetTweets.class, "NewGamil",
                null /* conditions */, "Tweet", fields); // Specify the DB table and fields

        SequenceFileOutputFormat.setOutputPath(twokenJobConf, out);

        JobClient.runJob(twokenJobConf);


Before compiling and running the program we have to do some additional setup in the Hadoop ecosystem: the MySQL connector library has to be put in the $HADOOP_HOME/lib folder. To get the connector .jar file, go to the MySQL Connector/J download page; I used the mysql-connector-java-3.1.14-bin.jar file in my program. After putting the jar in $HADOOP_HOME/lib, restart the Hadoop daemons. Voila!! Now you are ready to run the program: package the code into a .jar file and run it.

The complete project is available in my bitbucket repository .

Happy hacking !!!!!!!!!!!!

Related Entries:
HBase Administration Cookbook by Yifeng Jiang : Review
Hadoop Comic by Maneesh Varshney
Mahout in Action: Review
New book by Packt: MySQL for Python

Hadoop Comic by Maneesh Varshney

There was a discussion about a Hadoop comic on the Apache Hadoop mailing list. I found that Maneesh Varshney has created a wonderful comic strip describing the entire Hadoop Distributed File System (HDFS). It is quite useful for understanding HDFS in an easy way. I am sharing the comic on my Slideshare account.
Kudos to Maneesh Varshney for the wonderful and creative work.

Here it is.


Related Entries:
HBase Administration Cookbook by Yifeng Jiang : Review
Hadoop Database access experiment
Mahout in Action: Review

The Mozilla Story: the story of how Mozilla helped shape the web we know today


Experiments with NoSQL databases: CouchDB

I started reading about NoSQL databases a long time ago. Occasionally I have used NoSQL databases like Apache CouchDB and Apache Cassandra for analytics purposes (some minor projects) with Python. This time I thought: why not try something with Java + NoSQL? So I created a small project to play with. The idea of the project is to store Twitter search results in CouchDB. I used the following operating system, programming language, and libraries in this project:

        Operating System     : Fedora 16 (Verne)
        Programming Language : Java (JDK 1.6.0_29)
        IDE                  : Eclipse 3.7.1
        Apache CouchDB       : 1.0.
        External Libraries   : Couchdb4J;
                               Apache Commons httpclient, logging, codec, collections, beanutils;
                               Json-lib, ezmorph

Installing CouchDB
To install CouchDB, fire up a terminal and type the command:
    $su -c 'yum -y install couchdb'

After successful installation, start the CouchDB server by issuing this command in the terminal:
    $su -c '/etc/init.d/couchdb start'

Now your CouchDB instance will be up and running. You can check this by opening CouchDB Futon in the browser and navigating to http://localhost:5984/_utils/. If everything is fine you will see the Futon interface.

Let's start our project.
First, create a function that connects to the CouchDB instance and creates and returns a database with a given name. If the database already exists, it should simply return that database.

    /**
     * @param strDBName
     * @return dbCouchDB
     */
    public static Database connectCouchDB(String strDBName) {
        Database dbCouchDB = null;
        Session dbCouchDBSession = new Session("localhost", 5984);
        List<String> databases = dbCouchDBSession.getDatabaseNames();
        if (databases.contains(strDBName)) {
            dbCouchDB = dbCouchDBSession.getDatabase(strDBName);
        } else {
            // Create the database when it does not exist yet
            dbCouchDB = dbCouchDBSession.createDatabase(strDBName);
        }
        return dbCouchDB;
    }



Now we can create a function that queries Twitter Search and returns the tweets.

    /**
     * @param strQuery
     * @return queryResult
     * @throws TwitterException
     */
    public static QueryResult getTweets(String strQuery)
            throws TwitterException {
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(strQuery);
        QueryResult queryResult = twitter.search(query);
        return queryResult;
    }


To insert the tweets into the CouchDB document collection (database), each tweet has to be converted to a document. Let's create a function that converts an individual tweet to a CouchDB Document.

    /**
     * @param tweet
     * @return couchDocument
     */
    public static Document tweetToCouchDocument(Tweet tweet) {
        Document couchDocument = new Document();

        couchDocument.put("Tweet", tweet.getText().toString());
        couchDocument.put("UserName", tweet.getFromUser().toString());
        couchDocument.put("Time", tweet.getCreatedAt().toGMTString());
        couchDocument.put("URL", tweet.getSource().toString());

        return couchDocument;
    }


Now we can try to write the Twitter Search results to the CouchDB document collection with the following function.

    /**
     * @param strTweetQury
     * @param strdbName
     * @throws TwitterException
     */
    public static void writeTweetToCDB(String strTweetQury, String strdbName)
            throws TwitterException {
        QueryResult tweetResults = getTweets(strTweetQury);
        Database dbInstance = connectCouchDB(strdbName);
        for (Tweet tweet : tweetResults.getTweets()) {
            Document document = tweetToCouchDocument(tweet);
            // Save each converted tweet into the CouchDB database
            dbInstance.saveDocument(document);
        }
    }


Now it is time to execute our project. Add the following lines to the main() and run the project.

        String query = "java";
        String dbName = "javatweets";
        writeTweetToCDB(query, dbName);
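
For completeness, a minimal main() wrapping these calls (placed in the same class as the helper methods above) might look like the sketch below; the try/catch is just one way to handle the TwitterException declared by writeTweetToCDB().

    public static void main(String[] args) {
        String query = "java";
        String dbName = "javatweets";
        try {
            writeTweetToCDB(query, dbName);
        } catch (TwitterException te) {
            // Twitter4J signals search failures with TwitterException
            te.printStackTrace();
        }
    }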

That is all !!!!!! .

The entire code is available at my bitbucket repo

Happy Hacking !!!!!!!!


Lucene Index Writer API changes from 2.x to 3.x

The 3.x version of Lucene introduces a lot of changes in its API. In 2.x we used the IndexWriter API like this:

        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

I used the same code with the 3.x version for one of my projects. The tool was working fine, but my IDE (Eclipse) told me that some of these calls are deprecated. Hmm... I decided to dig into the new API and found that the code given above has to be changed to this:

        Directory indexDir = FSDirectory.open(new File(strDirName));
        IndexWriterConfig confIndexWriter = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        writer = new IndexWriter(indexDir, confIndexWriter);

If you would like the equivalent of "IndexWriter.MaxFieldLength.UNLIMITED", the IndexWriterConfig should look like this:
        IndexWriterConfig idxconfa = new IndexWriterConfig(Version.LUCENE_30, new LimitTokenCountAnalyzer(new StandardAnalyzer(Version.LUCENE_30), 1000000000));

The int 1000000000 is set as the maximum token limit here; Integer.MAX_VALUE is the largest value you can set.
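
To put the pieces together, here is a small self-contained sketch of opening a 3.x IndexWriter and indexing one document. The index path and the field name/value are placeholders I chose for illustration, not taken from my project.

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SimpleIndexer {
        public static void main(String[] args) throws Exception {
            // Open (or create) the index directory
            Directory indexDir = FSDirectory.open(new File("/tmp/lucene-demo-index"));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_30,
                    new StandardAnalyzer(Version.LUCENE_30));
            IndexWriter writer = new IndexWriter(indexDir, config);

            // Add one document with a stored, analyzed text field
            Document doc = new Document();
            doc.add(new Field("content", "Hello Lucene 3.x", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);

            writer.close();
        }
    }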