
Apache Mahout 0.2 Released – Now classify, cluster and generate recommendations!

For the past two years, I have been working with an amazing bunch of people on a project called Mahout, while being paid by Google through its Summer of Code program. As the name says, the project is trying to tame the young beast known as Hadoop. I have received a lot from the community. Being part of the project has given me real exposure to Java, data mining and machine learning, and hands-on experience with distributed systems like Hadoop, HBase and Pig. The project is still in its infancy, but its ambitions are sky-high. I am happy to announce the second release of the project, and proud to be a part of it. I hope people will adopt it in their projects and that it becomes the de facto standard machine learning library, the way Lucene and Hadoop have in their respective areas.

If you are already excited and want to take it for a ride, read Grant’s article on IBM developerWorks here.
The release announcement follows below.

Apache Mahout 0.2 has been released and is now available for public download at

Up to date maven artifacts can be found in the Apache repository at

Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license.

Mahout is a machine learning library meant to scale: Scale in terms of community to support anyone interested in using machine learning. Scale in terms of business by providing the library under a commercially friendly, free software license. Scale in terms of computation to the size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problem settings like clustering, collaborative filtering and classification over terabytes of data on thousands of computers.

Implemented with scalability in mind, the latest release brings many performance optimizations so that the library performs well even in a single-node setup.

The complete changelist can be found here:

New Mahout 0.2 features include

  • Major performance enhancements in Collaborative Filtering, Classification and Clustering
  • New: Latent Dirichlet Allocation (LDA) implementation for topic modelling
  • New: Frequent Itemset Mining for mining top-k patterns from a list of transactions
  • New: Decision Forests implementation for Decision Tree classification (In Memory & Partial Data)
  • New: HBase storage support for Naive Bayes model building and classification
  • New: Generation of vectors from Text documents for use with Mahout Algorithms
  • Performance improvements in various Vector implementations
  • Tons of bug fixes and code cleanup

Getting started: New to Mahout?

For more information on Apache Mahout, see

Compile Java code from Java application dynamically

Recently, I came across the problem of adding a scripting interface to an application that we were developing as part of our term project. Let me describe the project and the problem in more detail. The project is all about generating random graphs and analysing them. A little research into libraries that would simplify our task yielded an open source gem, JUNG. As this post is neither about JUNG nor about open source, I will say as little as possible about their coolness. Some serious effort over a couple of days from my beloved friend Praveen resulted in this Java beast.

The app has more tools for network analysis than this simple visualization, such as plots of degree distribution vs rank, Pk vs k, clustering coefficient vs rank, and betweenness centrality vs rank. Despite this much effort, we are not confident of getting the best possible grade due to the lack of real innovation 🙂 (I guess I am talking big here). So, here comes the small innovation. The networks that we analyse are generated using standard models like the Barabási–Albert (BA) model, the Erdős–Rényi model, etc. This limits the application, which finds little use because these models have already been analysed from every possible angle. So the application needs to support new models. But writing code for each and every model is too complex and inefficient. The obvious path, then, is to let the user somehow express a network. The simplest way for the end user would be to express the network in English :). But poor computers do not yet understand English completely. So the alternative is to ask the user to express the network in a language that computers can understand. In essence, we need a scripting interface for expressing networks in our application. A single problem can have more than one solution, and with open source that is even more true. After all, open source is about choice. We had a couple of alternatives for this problem too: there are lots of embeddable scripting languages at our disposal.

We finally decided to pick Java for this purpose. The choice might seem odd, as Java comes second to last in the scriptometer rankings. But it has its own benefits, the obvious one being the simplicity and elegance of the solution.

  1. There is no need for us to learn a new language
  2. There are no integration costs. (If you have missed the point, we programmed the whole application in Java itself.)

Once we had decided that the end user expresses the network in Java, we needed to chalk out a solution for running the user's network model to generate the set of vertices and edges. The problem boils down to two things:

  1. Compile the code that the user has written from within the application itself.
  2. Execute the compiled code to generate the network graph.

So how do you compile Java code from a Java application? There are two solutions to this problem too. The first is to use an undocumented Java class. The other is to use a more standard, documented API available from Java 1.6 onwards. We chose the former, as the final application has to run on a machine with only Java 1.5, though that means using undocumented and unsupported functions. Time for some real Java:

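// Main here is com.sun.tools.javac.Main from the JDK's tools.jar
import com.sun.tools.javac.Main;
import java.io.*;

try {
    // javac options followed by the source files to compile
    String[] optionsAndSources = {""};
    // Compiler messages are written to this file
    PrintWriter out = new PrintWriter(new FileWriter("C:\\out.txt"));
    int status = Main.compile(optionsAndSources, out);
    out.close();
    System.out.println("status: " + status);
} catch (IOException e) {
    System.err.println("The file cannot be opened " + e);
}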

Remember to include the tools.jar shipped with the JDK in the classpath. Main.compile is basically a wrapper over javac (the javac executable itself is not required on the machine running this application). It does the real job of compiling the Java source code and producing the class file. The first argument holds the javac options and source file names; the second is a writer that receives the compiler's messages.
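For comparison, here is a minimal sketch of the second route mentioned above, the documented javax.tools API available from Java 1.6 onwards. This is not the code we used in the project; the class name StandardCompileSketch and the compile helper are made up for illustration, and the sketch assumes the application runs on a JDK (on a plain JRE no system compiler is available).

```java
import java.io.OutputStream;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical helper class, named here only for illustration.
class StandardCompileSketch {

    /**
     * Compiles one source file with the documented javax.tools API.
     * Returns the compiler's status (0 on success), or -1 when no
     * system compiler is available, e.g. when running on a plain JRE.
     */
    static int compile(String sourcePath, OutputStream diagnostics) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            return -1;
        }
        // Passing null for in/out falls back to System.in and System.out;
        // compiler error messages go to the supplied diagnostics stream.
        return compiler.run(null, null, diagnostics, sourcePath);
    }
}
```

A call such as StandardCompileSketch.compile("UserModel.java", System.err) would then play the same role as Main.compile, without depending on an unsupported class.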

Once the class file is generated, it has to be executed. This process is much simpler and doesn't require any undocumented classes (aah, relief!!). Java has a class called Class. Instances of the class Class represent classes and interfaces in a running Java application. One can create a new instance of any class using the member function newInstance(). Once a new instance is created, it has to be cast to a type that the compiler knows at compile time so that we can invoke its methods. So we have an abstract base class called Model that represents all network models. Any network model defined by the user inherits from this base class and implements the abstract function generate that it declares. Once a new instance of the user’s network model is created, it is cast to the base class Model and the generate method is called on the object to generate the network. More code follows …

Model obj = null;
try {
    // Look up the freshly compiled class by name; it must be on the classpath
    Class<?> c = Class.forName("UserModel");
    obj = (Model) c.newInstance();
} catch (Exception e) {
    /* Catching Exception for simplicity. */
    System.err.println("error while loading class " + e);
}
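To make the pieces concrete, here is a rough sketch of what the base class and a user-written model could look like. The signature of generate, the edge representation, the ring example and the ModelLoader helper are all assumptions made up for this illustration, not the actual code of our application.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the abstract base class described above.
abstract class Model {
    /** Returns the edges of the generated network as {from, to} vertex pairs. */
    abstract List<int[]> generate(int vertexCount);
}

// Example of a model an end user might write: a simple ring network.
class UserModel extends Model {
    @Override
    List<int[]> generate(int vertexCount) {
        List<int[]> edges = new ArrayList<int[]>();
        for (int v = 0; v < vertexCount; v++) {
            // Connect each vertex to the next one, wrapping around at the end.
            edges.add(new int[] {v, (v + 1) % vertexCount});
        }
        return edges;
    }
}

// Hypothetical helper that loads a model by class name via reflection.
class ModelLoader {
    static Model load(String className) {
        try {
            Class<?> c = Class.forName(className);
            // The cast is what lets the application call generate() on a
            // class it knew nothing about when it was itself compiled.
            return (Model) c.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load model " + className, e);
        }
    }
}
```

The application would call something like ModelLoader.load("UserModel").generate(100) and hand the resulting edges to the graph library. The sketch uses getDeclaredConstructor().newInstance() because Class.newInstance() is deprecated in recent Java versions, and ReflectiveOperationException needs Java 7; on Java 1.5 the plain newInstance() call with a broader catch does the same job.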

I will try to write more about creating an applet from this application, and maybe link to the applet that we created, in my next blog post.