BlogGalleryAbout meContact
My Resume
WebDOCPDF
RTFODTTXT
powered by emurse
Jaganadh's bookshelf: read

Python Text Processing with NTLK 2.0 CookbookPython 2.6 Text Processing Beginners Guide

More of Jaganadh's books »
Jaganadh Gopinadhan's  book recommendations, reviews, quotes, book clubs, book trivia, book lists
Ubuntu GNU/Linux I am nerdier than 94% of all people. Are you a nerd? Click here to take the Nerd Test, get nerdy images and jokes, and write on the nerd forum! Python SMC

Coimbatore

Hadoop Database access experiment

Over a couple of weeks I was reading and practicing the book "Hadoop in Action". After getting some insight on Hadoop and Map Reduce I worked out a couple of examples from the book and some example problems which I created too. Then I was discussing about features of Hadoop with some of my colleagues over a cup of tea. One of the guy asked a question regarding accessing database from Hadoop and process the data. I saw some discussions related to Hadoop and database access some where in the internet. Finally I digged-out the article "Database Access with Hadoop" for Cloudera blog. After reading the same I decided to work with a sample problem.

To workout the Hadoop database access sample program. Before some times I extracted a bunch of Tweets related to Gmail's new look and feel. I extracted the Tweets for some social media analysis practice. The extraction was done using Twitter4j API. The data is stored in MySQL database. The database table contains one table called NewGamil with following structure.
        +-----------------+--------------+------+-----+---------+----------------+
        | Field           | Type         | Null | Key | Default | Extra |
        +-----------------+--------------+------+-----+---------+----------------+
        | TweetId        | int(11)      | NO   | PRI | NULL    | auto_increment |
        | Tweet           | varchar(240) | YES  |     | NULL    |                |
        +-----------------+--------------+------+-----+---------+----------------+

The problem which selected to workout is fetch all the tweets from the table 'NewGamil' and perform a word count. The word count result has to be stored in HDFS. In-fact there are ways to write data back to database itself. But I decided first experiment with read from database ;-).

Hadoop provides a handy API for accessing database; the DBInputformat API. The API allows us to read data from RDBMS like MySQL, PostgreSQL of Oracle . To access the data from DB we have to create a class to define the data which we are going to fetch and write back to DB.  In my project I created a class namely GetTweets to accomplish the same.

    public static class GetTweets implements Writable, DBWritable {
        String strTweet;

        public GetTweets() {

        }

        public void readFields(DataInput in) throws IOException {

            this.strTweet = Text.readString(in);
        }

        public void readFields(ResultSet resultSet) throws SQLException {
            // this.id = resultSet.getLong(1);
            this.strTweet = resultSet.getString(1);
        }

        public void write(DataOutput out) throws IOException {

        }

        public void write(PreparedStatement stmt) throws SQLException {

        }

    }

Since I am accessing only one field from the table I defined the same in readFields() method. The write() methods are kept blank because the project does not aims to write back the data to DB. I'll experiment with writing data and post it soon.  In the readFileds() method we have to define how the data had to be extracted from the DB table. Since 'Tweet'  the data which I extractes for processing is VARCHAR() I am reading it as string and casting it to Text() data in hadoop. This class "GetTweets" will be used in our Mapper and Reducer class.

Now lets write our Mapper class:

    public static class TweetWordCountMapper extends MapReduceBase implements
            Mapper<LongWritable, GetTweets, Text, IntWritable> {
        private final static IntWritable intTwordsCount = new IntWritable(1);
        private Text strTwoken = new Text();

        public void map(LongWritable key, GetTweets value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            GetTweets tweets = new GetTweets();
            tweets.strTweet = value.strTweet;
            TwitterTokenizer twokenizer = new TwitterTokenizer();
            List<String> twokens = twokenizer.twokenize(value.strTweet
                    .toString());

            for (int i = 0; i < twokens.size(); i++) {
                output.collect(new Text(twokens.get(i)), intTwordsCount);
            }

        }

    }

In the mapper class 'TweetWordCountMapper' I used the 'GetTweets' class to fetch the values for processing. Then we can access the data by creating object of the class inside the Mapper class.
NB: The code for TwitterTokenizer is taken from https://github.com/vinhkhuc/Twitter-Tokenizer.

Now we can write our reducer class :

    public static class TweetWordCountReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int intTwokenCount = 0;
            while (values.hasNext()) {
                intTwokenCount += values.next().get();
            }
            output.collect(key, new IntWritable(intTwokenCount));
        }
    }

This reducer is responsible to sum the word count and produce the final output.

After this we have to configure the job with database connection details and driver class.

        JobConf twokenJobConf = new JobConf(TweetWordCount.class);
        twokenJobConf.setJobName("twoken_count");

        twokenJobConf.setInputFormat(DBInputFormat.class); //Set input format here
        twokenJobConf.setOutputFormat(TextOutputFormat.class);// Sets the output format

        Object out = new Path("twokens");

        twokenJobConf.setMapperClass(TweetWordCountMapper.class);
        twokenJobConf.setCombinerClass(TweetWordCountReducer.class);
        twokenJobConf.setReducerClass(TweetWordCountReducer.class);

        twokenJobConf.setOutputKeyClass(Text.class);
        twokenJobConf.setOutputValueClass(IntWritable.class);

        DBConfiguration.configureDB(twokenJobConf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/GmailTrend", "jaganadhg", "jagan123"); //Specifies the DB configuration

        String[] fields = { "Tweet" }; //Specifies the Fields to be fetched from DB
        DBInputFormat.setInput(twokenJobConf, GetTweets.class, "NewGamil",
                null /* conditions */, "Tweet", fields); // Specifies the DB table and fields

        SequenceFileOutputFormat.setOutputPath(twokenJobConf, (Path) out);

        JobClient.runJob(twokenJobConf);


Before compiling and running the program we have to some additional setup in the Hadoop ecosystem. The MySQL connector library has to be put in $HADOOP_HOME/lib folder. To download the connector .jar file go to MySQL Connector/J download folder. I used the mysql-connector-java-3.1.14-bin.jar file in my program. After putting the jar in $HADOOP_HOME/lib restart the hadoop ecosystem. Viola !! now you are ready to run the program. Convert the code to .jar file and run it.

The complete project is available in my bitbucket repository .

Happy hacking !!!!!!!!!!!!

Related Entries:
Hadoop Comic by Maneesh Varshney
Mahout in Action: Review
New book by Packt: MySQL for Python
Comments (2)  Permalink

Hadoop Comic by Maneesh Varshney

There was a discussion on Hadoop Comic in Apache Hadoop Mailing list. I found that Maneesh Varshney created a wonderful comic strip to describe the entire Hadoop Distributed File System (HDFS). It is quite useful to understand the HDFS in easy way.
I am sharing the comic at my
Slideshare account.
Kudos to Maneesh Varshney for the wonderful and creative work.

Here it is .

Hdfs
View more documents from jaganadhg.

Related Entries:
Hadoop Database access experiment
Mahout in Action: Review
 Permalink

The Mozilla Story: the story of how Mozilla helped shape the web we know today

Comments (1)  Permalink

Experiments with NoSQL databases: CouchDB

I started reading about NoSQL databases for a long time. Occasionally  I used some NoSQL databases like Apache CouchDB and Apache Cassandra for some analytics purpose(Some minor projects) with Python. This time I just thought why can't try something on Java + NoSQL. I created a small for project to play with. The idea of this project is: store Twitter search result to CouchDB.   I used the following Operating System, Programming Languages and Libraries in this project.

        Operating System                  :     Fedora 16 (verne)
        Programming Language     :     Java (JDK 1.6.0_29)
        IDE                                            :     Eclipse 3.7.1
        Apache CouchDB                   :    1.0.
        External Libraries                   :     Couchdb4J
                                                                Twitter4J
                                                              Apache Commons httpclient, logging, codec,commons,collections, beanutils
                                                              Jsonlib, ezmorph   

Installing CouchDB
To install CouchDB fire the terminal and type the command
    $su -c 'yum -y install couchdb'

After succesful installation start the CoucbDB server by issuing the command in the terminal
    $su -c '/etc/init.d/couchdb start'

Now your CouchDB instance will be up and running. You can check this by opening CouchDB Futon in the broswer by navigating to http://localhost:5984/_utils/. If everything will fine you will see the Funton Interface.

Let's start out project.
First create a function to connect to the CouchDB instance,create and retrun a database with given name. If the database already exits it has to return the database.

    /**
     * @param strDBName
     * @return dbCouchDB
     */

    public static Database connectCouchDB(String strDBName) {
        Database dbCouchDB = null;
        Session dbCouchDBSession = new Session("localhost", 5984);
        List<String> databases = dbCouchDBSession.getDatabaseNames();
        if (databases.contains(strDBName)) {
            dbCouchDB = dbCouchDBSession.getDatabase(strDBName);
        } else {
            dbCouchDBSession.createDatabase(strDBName);
            dbCouchDB = dbCouchDBSession.getDatabase(strDBName);
        }

        return dbCouchDB;

    }

   

Now we can create a function to search in Twitter Search and return the tweets.

    /**
     * @param strQuery
     * @throws TwitterException
     * @return queryResult
     */

    public static QueryResult getTweets(String strQuery)
            throws TwitterException {
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(strQuery);
        QueryResult queryResult = twitter.search(query);
        return queryResult;

    }


To insert the tweets to the CouchDB document collection(database) it has to be converted to a document. Lets create a function to convert individual tweets to CouchDB document.

    /**
     * @param tweet
     * @return couchDocument
     */

    @SuppressWarnings("deprecation")
    public static Document tweetToCouchDocument(Tweet tweet) {

        Document couchDocument = new Document();

        couchDocument.setId(String.valueOf(tweet.getId()));
        couchDocument.put("Tweet", tweet.getText().toString());
        couchDocument.put("UserName", tweet.getFromUser().toString());
        couchDocument.put("Time", tweet.getCreatedAt().toGMTString());
        couchDocument.put("URL", tweet.getSource().toString());

        return couchDocument;

    }


Now we can try to write the Twitter Search results to the CouchDB document collection with the following function.

    /**
     * @param tweetQury
     * @param dbName
     * @throws TwitterException
     */

    public static void writeTweetToCDB(String strTweetQury, String strdbName)
            throws TwitterException {
        QueryResult tweetResults = getTweets(strTweetQury);
        Database dbInstance = connectCouchDB(strdbName);
        dbInstance.getAllDocuments();
        for (Tweet tweet : tweetResults.getTweets()) {
            Document document = tweetToCouchDocument(tweet);
            dbInstance.saveDocument(document);
        }

    }

Now it is time to execute our project. Add the following lines to the main() and run the project.

        String query = "java";
        String dbName = "javatweets";
        System.out.println("Started");
        writeTweetToCDB(query, dbName);
        System.out.println("Finished");


That is all !!!!!! .

The entire code is available at my bitbucket repo

Happy Hacking !!!!!!!!

 Permalink

Lucene Index Writer API changes from 2.x to 3.x

The 3.x version of Lucene introduces lots of changes in its API. In 2.x we used IndexWriter API like this:
       

         Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,new StandardAnalyzer(Version.LUCENE_30),true,IndexWriter.MaxFieldLength.UNLIMITED);


I used the same code with 3.x version for one of my project. The tool was working fine. But my IDE(eclipse) told that some of the things are deprecated hmm...... I decided to dig the new API and I found that the above given code has to be changed to this :

        Directory indexDir =  FSDirectory.open(new File(strDirName));
        IndexWriterConfig confIndexWriter = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        writer = new IndexWriter(indexDir, confIndexWriter);


If you would like to use the "IndexWriter.MaxFieldLength.UNLIMITED" the IndexWriterConfig should be like:
        IndexWriterConfig idxconfa = new IndexWriterConfig(Version.LUCENE_30, new LimitTokenCountAnalyzer(new StandardAnalyzer(Version.LUCENE_30), 1000000000));

The int '1000000000' is set as maximum limit here. max(int) is the maximum you can set in IndexWriterConfig.

 Permalink

Taming Text : Review

    We are living in the era of Information Revolution. Everyday wast amount of information is being created and disseminated over World Wide Web(WWW). Even though each piece of information published in the web is useful in some way; we may require to identify and extract relevant/useful information.Such kind of information extraction includes identifying Person Names, Organization Names etc.. ,finding category of a text, identifying sentiment of a tweet etc ... Processing large amount text data from web is a challenging task, because there is an information overflow. As more information appears there is a demand for smart and intelligent processing and text data. The very field of text analytics has been attracted attention of developers around the glob. Many practical as well as theoretical books has been published on the topic.

This book, "
Taming Text", written by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris is an excellent source for Text Analytics Developers and Researchers who is interested to learn Text Analytics. The book focuses on practical Text Analytics techniques like Classification,Clustering, String Matching, Searching and Entity Identification. The book provides easy-to follow examples in using well-known Open Source Text Analytics tools like Apache Mahout, Apache Lucece, Apache Solr, OpenNLP etc.. The entire book is based on the author's experience in contributing to relevant Open Source tools, hands on experience and their industry exposure. It is a must-read for Text Analytics developers and Researchers. Given the increasing importance of Text Analytics this book can be served as a hand book for budding Text Analytics Developers and Industry People. Definitely it can be used in Natural Language Processing, Machine Learning and Computational Linguistics courses.

Chapter 1: Getting Started Taming Text
The first chapter of the book introduces what is Taming Text? The authors gives list of challenges in text processing with brief explanations. The chapter is mostly an introductory stuff.

Chapter 2: Foundations of Taming Text
This chapter gives a quick warm up of your high school English grammar. Starting from words, the authors presents essential linguistic concepts required for text processing.  I think "Taming Text" will be the first technical book which gives a good warm up on basics of Language and grammar. The chapter gives a detailed introduction to words, parts of speech, phrases and morphology. This introduction is sufficient enough to capture the essential linguistic aspects of Text Processing for a developer. The second part of this chapter deals with basic text processing tasks like, tokenization, sentence splitting, Part of Speech Tagging (POS Tagging) and Parsing. Code snippets for each of the task has been given in the chapter. All the code examples are narrated with the tool
OpenNLP . The chapter gives some basic of handling different file formats using Apache Tika. This chapter gives a step by step intro to the preliminaries of Text Processing.

Chapter 3: Searching
This chapter introduces the art of Search. It gives a brief but narrative description of the Search mechanism and scene behind the curtains. The chapter discusses the basics of Search with the help of
Apache Solr. There is an interesting discussion on search evaluation and search performance enhancements and page rank too. The chapter gives a detailed list of Open Source search engines. But I think the authors forgot to add the "Elasticsearch" library  to the list. I hope that it may be added in the final print version of the book.

Chapter 4: Fuzzy String Matching
Everybody might have wondered how the "Did you mean:" feature in Google or any other search engine works. Long ago I saw a question in Stackoverflow; querying about the availability of source code for  "Did you mean:" feature !!! (something similar I think). If you wonder how this feature is working this chapter will give you enough knowledge to implement something similar. There is a simple discussion on different fuzzy string matching algorithms with code samples. There is practical examples on how to implement the "Did you Mean" and type ahead (auto suggest) utility on Apache Solr. Over all this chapter gives a solid introduction and hands on experience on Fuzzy String Matching.

Chapter 5: Identifying People, Places and Things
Diving deeper into text processing ocean, the authors narrates many deeper concepts in Text Processing starting from this chapter. The main focus of this chapter is Named Entity Identification (NER), one of the trivial tasks in Information Extraction and Retrieval. The chapter gives a good introduction to the task on Named Entity Identification along with code samples using OpenNLP. The code samples will help you to make your hands dirty. There is a section which deals with how to train OpenNLP to adopt a new domain. This will be one of the most useful tip for working professionals. The only thing which I feels to be missing is a mention about
GATE and Apache UIMA. Both of the tools are famous for their capability to accomplish the NER task.

Chapter 6: Clustering Text
The sixth chapter mainly deals with Clustering. Clustering is an unsupervised (i.e. no human intervention required) task that can automatically put related content into buckets.[Taken from the book "Taming Text"]. The initial part of this chapter narrates clustering with reference to real world applications. A decent discussion on clustering techniques and clustering evaluation is also there. Code examples for clustering is given in this chapter.
Apache Solr, Apache Mahout and Carrot are used to give practical examples for clustering.

Chapter 7: Classification, Categorization and Tagging
Seventh chapter deals with document classification. As like in the other chapters there is a reasonable discussion on document classification techniques. This chapter will teach you how to perform document classification with Apache Lucene, Apache Solr, Apache Mahout and OepnNLP. There is interesting project called 'tag recommender' in this chapter. The only hiccup which I faced with this chapter is the "TT_HOME" environment variable which used through out the book. I think the authors forgot to mention how to set TT_HOME. I was familiar with Apache Mahout so ther was no issue with MAHOUT_HOME environment variable. A totally newbie will find it difficult to spot the TT_HOME and MAHOUT_HOME used in the code samples. A little bit light on setting these variables may help reader a lot. I think this will be included in the final copy(I am reading a MEAP version).

Chapter 8: An Example Application: Question Answering

This chapter gives a hands on experience in Taming Text. The entire chapter is dedicated for building a Question Answering project using the techniques discussed in all the chapters. A simple make your hands dirty by Taming Text chapter. Here also you will be caught with the TT_HOME ghost.

Chapter 9: Untamed Text: Exploring the Next Frontier

The last chapter "Untamed Text: Exploring the Next Frontier" mentions other ares in Text processing such as Semantics Pragmatics and Sentiment Analysis etc.. Brief narration on each of these field are included in this chapter. There are a lots of pointers to some useful tools for advanced Text processing tasks like Text Summarisation and Relation Extraction etc ..

Conclusion
Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have done a nice job by authoring this book with lucid explanations and practical examples for different Text Processing Challenges. With the help of simple and narrative examples the authors demonstrates how to solve real world text processing challenges using Free and Open Source Tools. The algorithm discussions in the book is so simple; even a newbie can follow the concepts without much hiccups. It is a good desktop reference for people who would like to start with Text Processing. It provides comprehensive and hands-on experience in Text Processing. So grab a copy soon and be ready for Big Data Analysis.

Free and Open Source Tools Discussed in the Book
Apache Solr
Apache Lucene
Apache Mahout
Apache OpenNLP
Carrot2.

Disclaimer : I received a review copy of the book from Manning

Related Entries:
Mahout in Action: Review
Comments (2)  Permalink

Seven years of 'humanity to others' through a Free Operating System

 On  20th Oct. 2004 Canonical Ltd announced the first release of worlds most popular and sexy operating system Ubuntu. The first release was code named as "Warty Warthog".  From their onwards Ubuntu was a grant success with more than 20 million users across the glob. A totally free and open source operating system attains much popularity than any other similar proprietary operating systems with in seven years. This is a remarkable achievement by the humanity behind Ubuntu operating system. People who dedicated their free times to write code, test and use the operating system and Cannonical Ltd. made it possible. And they continues the journey to serve the humanity with better Operating System that satisfies the computing needs of "Common Man".  Kudos to the entire team behind Ubuntu !!!

blog comments powered by Disqus
Comments (0)  Permalink

Changing Life Patterns !!!

 Permalink

Python Testing Cookbook, by Greg L Turnquist : Review

  Python Testing Cookbook, by Greg L Turnquist is one of the latest books from Packt Publishing. It is the second book on Python Testing; the first one was Python Testing: Beginners Guide by Daniel. The Python Testing Cookbook is a collection of useful easy to learn tips and tricks. Even though this book is labeled as "cookbook" it is quite useful for the newbies in Python testing too. All the essential tips to play with Python testing is given in the introductory chapters with illustrative examples. The book can serve as a good resource to train Python newbies in the art of Software Testing.

The first chapter of the book deals with basics of unittest with Python. It gives step by step examples to understand unit-testing.

The second chapter helps you to put your nose in to testing with Python "nose" framework. The chapter gives insights on how to write nose-plugins to make the testers life easy. Third chapter of the book deals with doctest. The chapter gives good intro to writing docstring for doc-test. The art of doc-test is well covered in this chapter.  The fourth chapter deals with testing behavior driven development. The chapter introduces Mock, mockito and Lettuce testing tools. The fifth chapter deals with Acceptance Testing with Pyccuracy and Robot tools. This chapter gives some insight on selinium too. The sixth chapter speaks about test automation with Continuous Integration(CI). The chapter introduces Jenkins and NoseXunit. The chapter is very useful for the people who follows the waterfall model in Software Development. The seventh chapter discusses about test coverage.The chapter is bit complicated for beginners. Some play with database, springpython etc are there. In some examples I feel that the element of testing missed out ;-). The eighth chapter deals with Smoke and Load testing in Python. This chapter introduces the tool Pyro too.  The 9th chapter is a collection of general advices for automated testers. After making your hands dirty with testing u can relax and clear the doubts with advices in this chapter.

The book comes with extensive code samples. Even if the book is all about testing I found one bad practice in coding through out the examples; that is import * . This makes the learner to scratch his head to understand what comes from where. I think it is worth to buy and read the book to get good insights on Test Automation with Python. It is a good book for beginners to learn testing and good reference book for experienced professionals too.

disclaimer: I received a free eBook from Packt for review





Comments (0)  Permalink

New Book from Packt: Python Testing Cookbook by Greg L. Turnquist

  Packt Publishing releases a new Python book "Python Testing Cookbook" by Greg L. Turnquist. I received review copy of the book today. I will put a review of the book here soon. A sample chapter from the book is available at Packt Web page.
Book Info
Language : English
Paperback : 364 pages [ 235mm x 191mm ]
Release Date : May 2011
ISBN : 1849514666
ISBN 13 : 978-1-84951-466-8
Author(s) : Greg L. Turnquist

Comments (0)  Permalink
Next1-10/110