Taming Text : Review

We are living in the era of the Information Revolution. Every day a vast amount of information is created and disseminated over the World Wide Web (WWW). Even though each piece of information published on the web is useful in some way, we often need to identify and extract only the relevant information. Such information extraction includes identifying person names and organization names, finding the category of a text, identifying the sentiment of a tweet, and so on. Processing large amounts of text data from the web is a challenging task because of this information overflow. As more information appears, there is a growing demand for smart and intelligent processing of text data. The field of text analytics has attracted the attention of developers around the globe, and many practical as well as theoretical books have been published on the topic.

This book, "Taming Text", written by Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris, is an excellent resource for developers and researchers who want to learn Text Analytics. The book focuses on practical Text Analytics techniques such as Classification, Clustering, String Matching, Searching and Entity Identification. It provides easy-to-follow examples using well-known Open Source Text Analytics tools such as Apache Mahout, Apache Lucene, Apache Solr and OpenNLP. The entire book draws on the authors' experience in contributing to the relevant Open Source tools, their hands-on experience and their industry exposure. It is a must-read for Text Analytics developers and researchers. Given the increasing importance of Text Analytics, this book can serve as a handbook for budding Text Analytics developers and industry people. It can definitely be used in Natural Language Processing, Machine Learning and Computational Linguistics courses.

Chapter 1: Getting Started Taming Text
The first chapter of the book introduces what taming text means. The authors give a list of challenges in text processing with brief explanations. The chapter is mostly introductory material.

Chapter 2: Foundations of Taming Text
This chapter gives a quick refresher on your high school English grammar. Starting from words, the authors present the essential linguistic concepts required for text processing. I think "Taming Text" may be the first technical book to give a good warm-up on the basics of language and grammar. The chapter gives a detailed introduction to words, parts of speech, phrases and morphology. This introduction is sufficient for a developer to capture the essential linguistic aspects of Text Processing. The second part of this chapter deals with basic text processing tasks such as tokenization, sentence splitting, Part of Speech Tagging (POS Tagging) and parsing. Code snippets for each of these tasks are given in the chapter. All the code examples are illustrated with OpenNLP. The chapter also covers the basics of handling different file formats using Apache Tika. Overall, it gives a step-by-step introduction to the preliminaries of Text Processing.
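
To make the two simplest tasks concrete, here is a minimal sketch of sentence splitting and tokenization in Python. It is not the book's code and not how OpenNLP does it (OpenNLP uses trained models); the function names and the regular expressions are my own assumptions, and the rules are deliberately naive, so abbreviations such as "Dr. Smith" will be split wrongly.

```python
import re

def split_sentences(text):
    """Naively split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return word tokens (keeping simple contractions) and single punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

if __name__ == "__main__":
    for sentence in split_sentences("Taming text is fun. It is also hard!"):
        print(tokenize(sentence))
```

A rule-based splitter like this is fine for a quick experiment; the model-based approach described in the chapter is what handles the messy cases.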

Chapter 3: Searching
This chapter introduces the art of search. It gives a brief but descriptive account of the search mechanism and what goes on behind the curtain. The chapter discusses the basics of search with the help of Apache Solr. There is an interesting discussion on search evaluation, search performance enhancements and PageRank too. The chapter gives a detailed list of Open Source search engines, but I think the authors forgot to add "Elasticsearch" to the list. I hope that it will be added in the final print version of the book.
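
The core data structure behind most search engines is the inverted index. The sketch below is my own toy illustration in Python, not code from the book and not how Solr or Lucene are implemented; the function names are assumptions, there is no ranking, and the "analysis" step is just lowercasing and whitespace splitting.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

if __name__ == "__main__":
    docs = {1: "apache solr search", 2: "apache mahout clustering", 3: "search with lucene"}
    print(search(build_index(docs), "apache search"))
```

Real engines add analysis (stemming, stop words), relevance scoring and on-disk index formats on top of this idea.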

Chapter 4: Fuzzy String Matching
Everybody might have wondered how the "Did you mean:" feature in Google or any other search engine works. Long ago I saw a question on Stack Overflow asking about the availability of source code for the "Did you mean:" feature (or something similar, I think). If you wonder how this feature works, this chapter will give you enough knowledge to implement something similar. There is a simple discussion of different fuzzy string matching algorithms with code samples. There are practical examples of how to implement the "Did you mean" and type-ahead (auto-suggest) utilities on Apache Solr. Overall, this chapter gives a solid introduction to and hands-on experience with Fuzzy String Matching.
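
One classic fuzzy matching measure is the Levenshtein edit distance. As a rough sketch (my own Python illustration, not the book's Solr-based implementation), a tiny "Did you mean" helper can pick the closest word from a vocabulary. The names `did_you_mean` and `max_distance` are assumptions, and a real system would also weigh word frequency and avoid comparing against the whole vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def did_you_mean(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance edits, or None."""
    best = min(vocabulary, key=lambda w: edit_distance(word, w), default=None)
    if best is not None and edit_distance(word, best) <= max_distance:
        return best
    return None

if __name__ == "__main__":
    print(did_you_mean("lucece", ["lucene", "solr", "mahout"]))
```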

Chapter 5: Identifying People, Places and Things
Diving deeper into the text processing ocean, the authors cover more advanced concepts in Text Processing starting from this chapter. The main focus of this chapter is Named Entity Identification, also known as Named Entity Recognition (NER), one of the fundamental tasks in Information Extraction and Retrieval. The chapter gives a good introduction to the task along with code samples using OpenNLP. The code samples will help you get your hands dirty. There is a section on how to train OpenNLP to adapt to a new domain, which will be one of the most useful tips for working professionals. The only thing I feel is missing is a mention of GATE and Apache UIMA. Both tools are well known for their ability to accomplish the NER task.

Chapter 6: Clustering Text
The sixth chapter mainly deals with clustering. Clustering is an unsupervised (i.e. no human intervention required) task that can automatically put related content into buckets [taken from the book "Taming Text"]. The initial part of this chapter describes clustering with reference to real-world applications. There is also a decent discussion of clustering techniques and clustering evaluation. Code examples for clustering are given in this chapter. Apache Solr, Apache Mahout and Carrot2 are used to give practical examples of clustering.

Chapter 7: Classification, Categorization and Tagging
The seventh chapter deals with document classification. As in the other chapters, there is a reasonable discussion of document classification techniques. This chapter will teach you how to perform document classification with Apache Lucene, Apache Solr, Apache Mahout and OpenNLP. There is an interesting project called 'tag recommender' in this chapter. The only hiccup I faced with this chapter is the "TT_HOME" environment variable, which is used throughout the book. I think the authors forgot to mention how to set TT_HOME. I was familiar with Apache Mahout, so there was no issue with the MAHOUT_HOME environment variable. A total newbie will find it difficult to work out what TT_HOME and MAHOUT_HOME in the code samples should point to. A little light on setting these variables would help readers a lot. I think this will be included in the final copy (I am reading a MEAP version).
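
Until the book explains it, a quick sanity check before running the samples can save some head-scratching. The helper below is my own hypothetical Python sketch, not something from the book: it only reports which of the two variables are unset or empty. What each variable should point to is an assumption on my part (presumably the directory where the book's code and Mahout are unpacked); on a Unix-like shell such variables are typically set with `export NAME=value`.

```python
import os

# Variable names taken from the book's code samples.
REQUIRED_VARS = ("TT_HOME", "MAHOUT_HOME")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments checks the current process environment; an empty list means both variables are set.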

Chapter 8: An Example Application: Question Answering

This chapter gives hands-on experience in taming text. The entire chapter is dedicated to building a Question Answering project using the techniques discussed in all the previous chapters. It is a simple "get your hands dirty by taming text" chapter. Here too you will be haunted by the TT_HOME ghost.

Chapter 9: Untamed Text: Exploring the Next Frontier

The last chapter, "Untamed Text: Exploring the Next Frontier", mentions other areas in Text Processing such as Semantics, Pragmatics and Sentiment Analysis. A brief description of each of these fields is included in this chapter. There are lots of pointers to useful tools for advanced Text Processing tasks such as Text Summarisation and Relation Extraction.

Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have done a nice job in authoring this book, with lucid explanations and practical examples for different Text Processing challenges. With the help of simple, descriptive examples, the authors demonstrate how to solve real-world text processing challenges using Free and Open Source tools. The algorithm discussions in the book are so simple that even a newbie can follow the concepts without many hiccups. It is a good desktop reference for people who would like to start with Text Processing, and it provides comprehensive, hands-on experience. So grab a copy soon and be ready for Big Data Analysis.

Free and Open Source Tools Discussed in the Book
Apache Solr
Apache Lucene
Apache Mahout
Apache OpenNLP

Disclaimer: I received a review copy of the book from Manning.


Mahout in Action: Review

Apache Mahout is an Open Source, scalable Machine Learning library written in Java. It is designed to handle large data sets. More than a dozen Machine Learning and Data Mining algorithms are available in Mahout, and many of them are implemented on top of Apache Hadoop. The framework is distributed under the commercially friendly Apache License. It helps researchers and companies build scalable, practical products based on Machine Learning and Data Mining principles. A wide range of big companies as well as startups use Apache Mahout in their products.

The Apache Mahout project focuses on three interesting Machine Learning problems: 1) recommendation systems, 2) clustering and 3) classification. The project addresses real-world practical problems, and the tool makes the life of Machine Learning developers much more enjoyable. The book "Mahout in Action" by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman introduces the wonderful world of creating scalable, real-world machine learning projects with Apache Mahout. It is written in lucid language so that a beginner in Machine Learning can understand the concepts and kick-start work on classification, clustering or recommendation projects. Even though the detailed algorithmic background of the underlying algorithms in Mahout is not described, the logic (common sense) behind the system is explained very well with the help of code examples and practical projects. I am giving a chapter-wise overview of the book "Mahout in Action" below. A sample chapter is available for download at

Chapter 1 of the book introduces you to Mahout. Through this chapter you get to know the history of the Mahout project, its algorithms, its capabilities and its configuration.

Chapter 2 of the book introduces recommendation systems to the reader. The chapter teaches how to build a basic recommender system with Apache Mahout. The examples used to explain the technique are very clear and understandable to all.

Chapter 3 of the book discusses data representation for building a recommender engine. The discussion in this chapter extends to some of the basic data structures in Mahout. There is also some discussion of using MySQL to store the data for recommender engines.

Chapter 4 of the book gives more insight into building scalable recommender systems. It introduces user-based as well as item-based recommendation engines. The examples are very clear and help practitioners build better prototypes much faster. The chapter is written in such a lucid way that anybody can understand the common sense behind recommender engines.
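
The common sense behind a user-based recommender fits in a few lines: find users whose ratings resemble yours, then suggest items they rated that you have not seen. The sketch below is my own plain-Python illustration and not Mahout's API (Mahout is a Java library). The choice of cosine similarity, the similarity-weighted average and all names here are assumptions made for the example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two {item: rating} dicts (0.0 if nothing in common)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by the similarity-weighted average of others' ratings."""
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        if sim <= 0:
            continue  # ignore users with no useful overlap
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]
```

An item-based recommender flips the idea around: it compares items by the users who rated them instead of comparing users by the items they rated.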

The fifth chapter of the book deals with producing a full-fledged recommender system with Apache Mahout. The discussion and examples in this chapter extend to deploying a web-based recommender engine. Once you have covered this chapter, you should be able to build a good production-quality recommender engine for your client.

Chapter 6 of the book discusses how to build a scalable, distributed recommendation system with Mahout and the Hadoop framework. The chapter gives an illustrative example of the task using a Wikipedia data set. The authors spend some pages explaining the MapReduce concept in a very lucid way. There is a discussion of running the recommender on a cloud platform too. This chapter is definitely a helpful starting point for professionals who want to kick-start their recommender projects with less pain.
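
For readers new to MapReduce, the idea can be simulated in a single process: a map step emits key/value pairs, a shuffle step groups the values by key, and a reduce step combines each group. The sketch below is my own Python illustration using the usual word-count example; it is not Hadoop's API and not the book's recommender job, and all function names are assumptions.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Add up the counts for one word."""
    return sum(values)

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

In a real Hadoop job the map and reduce steps run on many machines, and the shuffle moves data across the network between them.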

Chapters 7 to 12 of the book discuss clustering techniques using Apache Mahout. Chapter seven gives a brief introduction to clustering with practical examples. The chapter contains discussions of the different clustering algorithms available in Mahout.

Chapter eight of the book deals with preparing and representing data for the clustering task. Tips and tricks for converting raw data to vectors for clustering are discussed in a very lucid manner in this chapter.

The 9th chapter of the book discusses the clustering algorithms in Mahout in detail. The major algorithms covered in this chapter are K-Means clustering, centroid generation using Canopy clustering, Fuzzy K-Means clustering, Dirichlet clustering, and topic modeling using LDA as a variant of clustering. There is a small case study on clustering news items using Apache Mahout. One of my project students has undertaken such a project for his MSc in CS.
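
To show what K-Means does at its core, here is a small single-machine sketch in Python. It is my own illustration, not Mahout's distributed implementation: points are plain tuples, distance is squared Euclidean, and the initial centroids are simply the first k points (an assumption made to keep the example deterministic; real implementations choose seeds more carefully, for instance with Canopy clustering as the chapter describes).

```python
def kmeans(points, k, iterations=10):
    """Cluster tuples of numbers into k groups; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments will not change any more
        centroids = new_centroids
    return centroids, clusters
```

Fuzzy K-Means relaxes the assignment step so that each point belongs to every cluster with some degree of membership instead of exactly one.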

The 10th chapter focuses on evaluating a clustering system. The chapter discusses inspecting clustering output, evaluating clustering quality and improving the quality of clusters.

The 11th chapter deals with producing a scalable clustering system with Mahout. It gives good insight into the art of content clustering with two case studies. The 12th chapter discusses some use cases of clustering with code examples, including Twitter user clustering.

Chapters 13 to 16 of the book discuss the technique of classification. Chapter 13 gives an introduction to classification and explains it step by step with examples. The illustrations in the chapter make the content more enjoyable and understandable for the reader. Chapter 14 deals with training a classifier system. It explains the task of training with a publicly available data set called the 20 newsgroups data set. There is a discussion on selecting an algorithm for the classification task too. When I first came to know about Mahout, I used its classification techniques and algorithms. Chapter 16 has a wonderful discussion on deploying a classification system. The section gives practical insight into the pros and cons of developing and deploying a scalable classification system that can be benchmarked against the best-performing existing systems.

The 17th chapter needs special mention. The chapter is a case study named "Case study: Shop It To Me". The discussion in this chapter shows the real power of Apache Mahout with the help of a practical project.

There are two appendices in the book. Appendix A deals with JVM tuning tips and tricks for deploying Hadoop/Mahout based projects; it is useful even for core Java programmers. Appendix B gives insight into "Mahout Math" and some of the deeper math-related material in Mahout.

The book is available from the Manning MEAP site. Three excerpts are available on the web site along with sample code. This is a must-read for all Machine Learning and NLP developers and researchers. It is an excellent book, and I am very happy to have read, practised and understood Apache Mahout in such detail. Kudos to Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman.

For code samples and sample chapters visit
