HBase Administration Cookbook by Yifeng Jiang : Review

Packt Publishing has announced a new book, HBase Administration Cookbook by Yifeng Jiang. I think this is the first big-data book from Packt. As the name suggests, the book is essentially for people who are working with HBase and would like to dive deep into HBase administration essentials. It covers the essential topics in HBase administration, from installation to performance tuning, and primarily targets big-data administration professionals. The author presents the art and science of HBase administration in nine systematically arranged chapters.

The initial chapter deals with the installation of HBase on an Amazon EC2 instance and walks through the various settings, ending with high-availability master settings. The second chapter deals with migrating data to HBase, including a detailed discussion of how to migrate MySQL data to HBase, which will interest people who plan to move existing data into HBase. The third chapter mainly gives an overview of the HBase administration tools. Data backup and restoration is one of the key concerns in data management, and the fourth chapter deals with data backup, restoration and replication in HBase. The fifth chapter covers HBase cluster monitoring and diagnosis, and comes with useful scripts for reporting cluster status. Security aspects of HBase are discussed in chapter six, including security essentials for HBase and Hadoop with Kerberos, with detailed examples. The necessary troubleshooting aspects of HBase administration are covered in chapter seven, while performance tuning and advanced configuration are discussed in chapters eight and nine.


The author presents each topic in the book in a lucid and digestible manner. The necessary examples and explanations are provided throughout, which helps the reader gain hands-on experience in HBase administration. Even though some books are available on the general aspects of HBase, this is the first book that deals with HBase administration in detail. It will be helpful for budding HBase and big-data administrators. And even though the cluster setup and installation in the book are based on Amazon EC2, smart administrators can manage the same on their non-Amazon clusters.
Related Entries:
Hadoop Database access experiment
Hadoop Comic by Maneesh Varshney
Mahout in Action: Review
Book Review: Python 2.6 Text Processing Beginner's Guide by Jeff McNeil
New book by Packt: 'Python Text Processing with NLTK 2.0 Cookbook'

Hadoop Database access experiment

Over the last couple of weeks I have been reading and practicing with the book "Hadoop in Action". After getting some insight into Hadoop and MapReduce, I worked out a couple of examples from the book, plus some example problems which I created myself. Then I was discussing the features of Hadoop with some of my colleagues over a cup of tea, and one of them asked about accessing a database from Hadoop and processing its data. I remembered seeing some discussions related to Hadoop and database access somewhere on the internet, and finally I dug out the article "Database Access with Hadoop" from the Cloudera blog. After reading it I decided to work through a sample problem.

Some time back I had extracted a bunch of tweets related to Gmail's new look and feel, for some social media analysis practice; I used that data to work out the Hadoop database access sample program. The extraction was done using the Twitter4j API and the data is stored in a MySQL database, which contains a single table called NewGamil with the following structure.
        +---------+--------------+------+-----+---------+----------------+
        | Field   | Type         | Null | Key | Default | Extra          |
        +---------+--------------+------+-----+---------+----------------+
        | TweetId | int(11)      | NO   | PRI | NULL    | auto_increment |
        | Tweet   | varchar(240) | YES  |     | NULL    |                |
        +---------+--------------+------+-----+---------+----------------+
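
Before wiring this table into Hadoop it is worth making sure the JDBC connection works on its own. A quick stand-alone sanity check might look like the sketch below; the class name NewGamilCheck is just something I made up, and the driver, URL and credentials are the same ones used later in the job configuration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NewGamilCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/GmailTrend", "jaganadhg", "jagan123");
            Statement stmt = conn.createStatement();
            // Count the rows we expect the Hadoop job to read.
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM NewGamil");
            while (rs.next()) {
                System.out.println("Tweets in NewGamil: " + rs.getInt(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }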

The problem I selected was to fetch all the tweets from the table 'NewGamil' and perform a word count, with the result stored in HDFS. In fact there are ways to write data back to the database itself, but I decided to first experiment with reading from the database ;-).

Hadoop provides a handy API for accessing databases: the DBInputFormat API. It allows us to read data from an RDBMS such as MySQL, PostgreSQL or Oracle. To access the data from the DB we have to create a class that defines the data we are going to fetch from (and write back to) the DB. In my project I created a class named GetTweets to accomplish this.

    public static class GetTweets implements Writable, DBWritable {
        String strTweet;

        public GetTweets() {

        }

        public void readFields(DataInput in) throws IOException {
            // Deserialize the tweet text when Hadoop moves the record around.
            this.strTweet = Text.readString(in);
        }

        public void readFields(ResultSet resultSet) throws SQLException {
            // Read the 'Tweet' column (the only field we asked for) from the DB row.
            this.strTweet = resultSet.getString(1);
        }

        public void write(DataOutput out) throws IOException {
            // Left empty: this experiment only reads from the database.
        }

        public void write(PreparedStatement stmt) throws SQLException {
            // Left empty: this experiment only reads from the database.
        }

    }

Since I am accessing only one field from the table, I define just that field in the readFields() methods. The write() methods are left blank because this project does not aim to write data back to the DB; I'll experiment with writing data and post about it soon. In the readFields(ResultSet) method we define how the data is extracted from the DB table: since 'Tweet', the field I am processing, is a VARCHAR, I read it as a String from the ResultSet and use Hadoop's Text helpers for serialization. This class, GetTweets, will be used in our Mapper class.
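
For reference, if you do want to write the tweet back out through DBOutputFormat later, the two empty write() methods would need bodies roughly like the sketch below. This is only my guess at a symmetric write path (a single VARCHAR column, mirroring the read path above); it is not part of this project:

    public void write(DataOutput out) throws IOException {
        // Mirror of readFields(DataInput): serialize the tweet text for Hadoop.
        Text.writeString(out, strTweet);
    }

    public void write(PreparedStatement stmt) throws SQLException {
        // Bind the tweet text to the first '?' of the INSERT statement that
        // DBOutputFormat prepares (assumes a single-column output table).
        stmt.setString(1, strTweet);
    }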

Now let's write our Mapper class:

    public static class TweetWordCountMapper extends MapReduceBase implements
            Mapper<LongWritable, GetTweets, Text, IntWritable> {
        private final static IntWritable intTwordsCount = new IntWritable(1);
        private Text strTwoken = new Text();

        public void map(LongWritable key, GetTweets value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            GetTweets tweets = new GetTweets();
            tweets.strTweet = value.strTweet;
            // Split the tweet into "twokens" and emit (token, 1) for each one.
            TwitterTokenizer twokenizer = new TwitterTokenizer();
            List<String> twokens = twokenizer.twokenize(value.strTweet
                    .toString());

            for (int i = 0; i < twokens.size(); i++) {
                output.collect(new Text(twokens.get(i)), intTwordsCount);
            }

        }

    }

In the mapper class 'TweetWordCountMapper' the input value is a 'GetTweets' object, so the tweet text is available directly inside the map() method. Each tweet is tokenized with the TwitterTokenizer, and the mapper emits a count of 1 for every token.
NB: The code for TwitterTokenizer is taken from https://github.com/vinhkhuc/Twitter-Tokenizer.
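
If you would rather not pull in the external Twitter-Tokenizer code, a crude whitespace tokenizer is enough to get the pipeline running end to end. The class below is a simplified stand-in I sketched for that purpose (it is not the TwitterTokenizer used in the listings above; to use it, swap the class name in the mapper):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified stand-in for TwitterTokenizer: lower-cases the tweet and
    // splits it on whitespace. No special handling of @mentions, URLs or emoticons.
    public class SimpleTwitterTokenizer {
        public List<String> twokenize(String tweet) {
            List<String> twokens = new ArrayList<String>();
            for (String token : tweet.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    twokens.add(token);
                }
            }
            return twokens;
        }
    }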

Now we can write our reducer class:

    public static class TweetWordCountReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int intTwokenCount = 0;
            while (values.hasNext()) {
                intTwokenCount += values.next().get();
            }
            output.collect(key, new IntWritable(intTwokenCount));
        }
    }

This reducer is responsible for summing the counts for each word and producing the final output.

After this we have to configure the job with the database connection details and the JDBC driver class.

        JobConf twokenJobConf = new JobConf(TweetWordCount.class);
        twokenJobConf.setJobName("twoken_count");

        twokenJobConf.setInputFormat(DBInputFormat.class); //Set input format here
        twokenJobConf.setOutputFormat(TextOutputFormat.class);// Sets the output format

        Path out = new Path("twokens"); // HDFS output directory

        twokenJobConf.setMapperClass(TweetWordCountMapper.class);
        twokenJobConf.setCombinerClass(TweetWordCountReducer.class);
        twokenJobConf.setReducerClass(TweetWordCountReducer.class);

        twokenJobConf.setOutputKeyClass(Text.class);
        twokenJobConf.setOutputValueClass(IntWritable.class);

        DBConfiguration.configureDB(twokenJobConf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/GmailTrend", "jaganadhg", "jagan123"); //Specifies the DB configuration

        String[] fields = { "Tweet" }; //Specifies the Fields to be fetched from DB
        DBInputFormat.setInput(twokenJobConf, GetTweets.class, "NewGamil",
                null /* conditions */, "Tweet", fields); // Specifies the DB table and fields

        SequenceFileOutputFormat.setOutputPath(twokenJobConf, out); // Sets the HDFS output path

        JobClient.runJob(twokenJobConf);
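
For anyone assembling the snippets above into a single file, the skeleton below shows how they might fit together. The class name TweetWordCount matches the JobConf call above; the flat (package-free) layout and the exact import list are my assumptions, based on the old org.apache.hadoop.mapred API used in the snippets:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class TweetWordCount {

        // GetTweets, TweetWordCountMapper and TweetWordCountReducer
        // (shown earlier) live here as static nested classes.

        public static void main(String[] args) throws IOException {
            // Job configuration exactly as in the snippet above.
            JobConf twokenJobConf = new JobConf(TweetWordCount.class);
            // ... set formats, mapper, reducer, DB configuration, input fields, output path ...
            JobClient.runJob(twokenJobConf);
        }
    }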


Before compiling and running the program we have to do some additional setup in the Hadoop ecosystem. The MySQL connector library has to be placed in the $HADOOP_HOME/lib folder; the .jar can be downloaded from the MySQL Connector/J download page. I used the mysql-connector-java-3.1.14-bin.jar file in my program. After putting the jar in $HADOOP_HOME/lib, restart the Hadoop daemons. Voila! Now you are ready to run the program: package the code as a .jar file and run it.

The complete project is available in my Bitbucket repository.

Happy hacking !!!!!!!!!!!!

Related Entries:
HBase Administration Cookbook by Yifeng Jiang : Review
Hadoop Comic by Maneesh Varshney
Mahout in Action: Review
New book by Packt: MySQL for Python

Hadoop Comic by Maneesh Varshney

There was a discussion about a Hadoop comic on the Apache Hadoop mailing list, where I found that Maneesh Varshney has created a wonderful comic strip describing the Hadoop Distributed File System (HDFS). It is quite useful for understanding HDFS in an easy way. I am sharing the comic on my Slideshare account. Kudos to Maneesh Varshney for the wonderful and creative work.

Here it is: the "Hdfs" deck, shared from my Slideshare account (jaganadhg).

Related Entries:
HBase Administration Cookbook by Yifeng Jiang : Review
Hadoop Database access experiment
Mahout in Action: Review

Mahout in Action: Review

Apache Mahout is an Open Source, scalable Machine Learning library in Java. It is designed to handle large data sets. More than a dozen Machine Learning and Data Mining algorithms are available in Mahout, and they are implemented on top of Apache Hadoop. The framework is distributed under the commercially friendly Apache License. It helps researchers and companies build scalable, practical products based on Machine Learning and Data Mining principles. A wide range of big companies as well as startups are using Apache Mahout in their products.

The Apache Mahout project is focused on three interesting Machine Learning problems: 1) recommendation systems, 2) clustering and 3) classification. The project addresses real-world practical problems, and the tool makes the life of Machine Learning developers much more enjoyable. The book "Mahout in Action" by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman introduces the wonderful world of creating scalable, real-world machine learning projects with Apache Mahout. It is written in such lucid language that a beginner in Machine Learning can understand the concepts and kick-start work on classification, clustering or recommendation projects. Even though the detailed algorithmic background of the underlying algorithms in Mahout is not described, the logic (common sense) behind each system is explained very well with the help of code examples and practical projects. I am giving a chapter-wise overview of the book "Mahout in Action" below. A sample chapter is available for download at http://www.manning.com/free/green_owen.html
 

Chapter 1 of the book introduces you to Mahout. Through this chapter you get to know the history of the Mahout project, its algorithms, its capabilities and its configuration.

Chapter 2 of the book introduces recommendation systems to the reader and teaches how to build a basic recommender system with Apache Mahout. The examples given to illustrate the technique are very clear and understandable to all.
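
Just to give a flavour of how compact such a basic recommender can be, here is my own minimal sketch against the Mahout Taste API (it is not an excerpt from the book; the input file name intro.csv and the neighbourhood size of 2 are assumptions):

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BasicRecommenderSketch {
        public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,preference rows; the file name is an assumption.
            DataModel model = new FileDataModel(new File("intro.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(2, similarity, model);
            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top recommendation for user 1.
            List<RecommendedItem> recommendations = recommender.recommend(1, 1);
            for (RecommendedItem item : recommendations) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }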

Chapter 3 of the book discusses data representation for building a recommender engine. The discussion extends to some of Mahout's own data structures, and there is some material on using MySQL to store the data behind a recommender engine.

Chapter 4 of the book gives more insight into building scalable recommender systems. It introduces user-based as well as item-based recommendation engines. The examples are very clear and help practitioners build better prototypes much faster. The chapter is written in such a lucid way that anybody can understand the common sense behind recommender engines.

The fifth chapter of the book deals with producing a full-fledged recommender system with Apache Mahout. The discussion and examples extend up to deploying a web-based recommender engine. Once you have covered this chapter, you can be confident of building a good production-quality recommender engine for your client.

Chapter 6 of the book discusses how to build a scalable, distributed recommendation system with Mahout and the Hadoop framework. The chapter gives an illustrative example of the task with the Wikipedia data set, and the authors spend some pages explaining the MapReduce concept in a very lucid way. There is a discussion on running the recommender on a cloud platform too. This chapter is definitely a helping hand for professionals who want to kick-start their recommender projects with less pain.

Chapters 7 to 12 of the book discuss clustering techniques using Apache Mahout. Chapter 7 gives a brief introduction to clustering with practical examples and contains discussions of the different clustering algorithms available in Mahout.

Chapter 8 of the book deals with preparing and representing data for the clustering task. Tips and tricks for converting raw data into vectors for clustering are discussed in a very lucid manner in this chapter.

The 9th chapter of the book discusses the clustering algorithms in Mahout in detail. The major algorithms covered in this chapter are K-Means clustering, centroid generation using Canopy clustering, Fuzzy K-Means clustering, Dirichlet clustering, and topic modeling using LDA as a variant of clustering. There is a small case study on clustering news items using Apache Mahout; one of my project students undertook such a project for his MSc in Computer Science.

The 10th chapter is focused on the evaluation of clustering systems. It discusses inspecting clustering output, evaluating the quality of clustering, and improving the quality of clusters.

The 11th chapter deals with producing a scalable clustering system with Mahout and gives good insight into the art of content clustering through two case studies. The 12th chapter discusses some use cases of clustering with code examples, including clustering Twitter users and playing with last.fm data.

Chapters 13 through 16 of the book discuss the technique of classification. Chapter 13 gives an introduction to classification, explaining it step by step with examples; the illustrations in the chapter make the content more enjoyable and understandable for the reader. Chapter 14 deals with training a classifier system. It explains the task of training with a publicly available data set, the 20 Newsgroups corpus, and there is a discussion on selecting an algorithm for the classification task too. Ever since I came to know about Mahout, its classification techniques and algorithms are what I have used most. Chapter 16 has a wonderful discussion on the deployment of classification systems. The section gives practical insight into the pros and cons of developing and deploying a scalable classification system that can be benchmarked against the existing best-performing systems.

The 17th chapter needs special mention. It is a case study named "Case study: Shop It To Me", and the discussion shows the real power of Apache Mahout with the help of a practical project.

There are two appendices provided with the book. Appendix A deals with JVM tuning tips and tricks for deploying Hadoop/Mahout-based projects; it is useful even for core Java programmers. Appendix B gives insight into "Mahout Math" and some of the deeper math-related parts of Mahout.

The book is available from the Manning MEAP site. Three excerpts are available on the web site along with sample code. This is a must-read for all Machine Learning and NLP developers and researchers. It is an excellent book, and I am very happy to have read, practiced and understood Apache Mahout in such detail. Kudos to Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman.

For code samples and sample chapters visit http://www.manning.com/free/green_owen.html

Related Entries:
HBase Administration Cookbook by Yifeng Jiang : Review
Hadoop Database access experiment
Hadoop Comic by Maneesh Varshney
Taming Text : Review
Practical Machine Learning. My talk slides at BarCamp Kerala 9