Friday, January 29, 2010

Maven plugin for Hadoop - 0.20.0 released

As anyone who has worked with M-R jobs in the Hadoop framework will know, a job is expected to be packaged as a single jar file and submitted to the master (JobTracker), which is responsible for propagating it to the slaves (TaskTrackers) that perform the job.

I used to use a glorified shell script that bundled the project class files and the dependencies together to create that jar. In the Ant world it is relatively simple to write the packaging script by hand, but with mvn it gets a little trickier, short of falling back on mvn exec:exec. Increasingly, though, I find mvn far easier than Ant for bootstrapping a project, with all due respect to the latter.

So I wrote this initial goal, pack, which creates a single jar file containing the classes and resources of the project.

The dependencies of the project are placed in the ./lib directory of the jar, where M-R can pick them up on the classpath of the job when it starts. The alternative is to flatten out the dependency jars and stitch their contents together with the project resources, which is neither intuitive nor comfortable.

Warning: This comes with absolutely NO WARRANTY whatsoever and is released under the Apache License. The mojo is just a glorified script with minimal error checking and has a lot of room for improvement. So use it at your own risk!

Installation


<plugin>
  <groupId>com.github.maven-hadoop.plugin</groupId>
  <artifactId>maven-hadoop-plugin</artifactId>
  <version>0.20.0</version>
  <configuration>
    <hadoopHome>/opt/software/hadoop</hadoopHome>
  </configuration>
</plugin>

Set hadoopHome to match the Hadoop installation that you use.

Usage:
=====

Currently a single goal, pack, is available. It creates the jar file to be submitted to the Hadoop job engine.

$ mvn hadoop:pack



The jar contains a directory called ./lib that holds all the dependency artifacts of the current project, alongside the classes of the current project itself.
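The layout described above can be illustrated with a small, self-contained sketch. This is not the plugin's actual implementation; it only uses the standard java.util.jar classes to show project files at the jar root and dependency jars kept intact under lib/. The class name JobJarLayout and all file names are hypothetical, and the byte arrays are placeholders for real file contents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

public class JobJarLayout {

    /** Packs project files at the jar root and dependency jars under lib/. */
    public static byte[] pack(Map<String, byte[]> projectFiles, Map<String, byte[]> dependencyJars) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : projectFiles.entrySet()) {
                addEntry(jar, file.getKey(), file.getValue());
            }
            for (Map.Entry<String, byte[]> dep : dependencyJars.entrySet()) {
                // Dependencies are not flattened: each one stays a whole jar nested under lib/.
                addEntry(jar, "lib/" + dep.getKey(), dep.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    private static void addEntry(JarOutputStream jar, String name, byte[] content) throws IOException {
        jar.putNextEntry(new JarEntry(name));
        jar.write(content);
        jar.closeEntry();
    }

    /** Lists the entry names of a jar in the order they were written. */
    public static List<String> entryNames(byte[] jarBytes) {
        List<String> names = new ArrayList<>();
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            for (JarEntry entry = in.getNextJarEntry(); entry != null; entry = in.getNextJarEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical project contents; the single bytes stand in for real files.
        Map<String, byte[]> projectFiles = new LinkedHashMap<>();
        projectFiles.put("com/example/WordCount.class", new byte[] {0});
        projectFiles.put("job.properties", new byte[] {0});

        Map<String, byte[]> dependencyJars = new LinkedHashMap<>();
        dependencyJars.put("example-parser-1.0.jar", new byte[] {0});

        for (String name : entryNames(pack(projectFiles, dependencyJars))) {
            System.out.println(name);
        }
    }
}
```

Running main prints one entry name per line, with the dependency appearing as lib/example-parser-1.0.jar rather than as unpacked class files.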



The jar is created at $basedir/target/hadoop-deploy/${ant.project.name}-hdeploy.jar .

Caveat: The dependencies copied to the lib directory of the jar file are a subset of the project dependencies. To be precise, they are the full list of transitive dependencies of the project, minus Hadoop and Hadoop's own transitive dependencies, since those are already present on the classpath when Hadoop's RunJar is launched.

So the jars in the lib directory of $basedir/target/hadoop-deploy/${ant.project.name}-hdeploy.jar are ( A - B ): the dependencies of the current project (A) minus Hadoop and its transitive dependencies (B), which avoids classpath pollution.
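The ( A - B ) selection is plain set difference, sketched below. This is only an illustration of the idea and not the mojo's code. It assumes dependencies are compared by a groupId:artifactId key, ignoring versions, which may differ from what the plugin actually does. The class name LibSelection and the dependency lists are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class LibSelection {

    /** Returns A - B: the project dependencies that Hadoop does not already provide. */
    public static Set<String> bundled(Set<String> projectDeps, Set<String> hadoopDeps) {
        Set<String> result = new LinkedHashSet<>(projectDeps); // copy, so the inputs stay untouched
        result.removeAll(hadoopDeps);
        return result;
    }

    public static void main(String[] args) {
        // A: hypothetical transitive dependencies of the current project.
        Set<String> a = new LinkedHashSet<>(Arrays.asList(
                "org.apache.hadoop:hadoop-core",
                "commons-logging:commons-logging",
                "com.example:example-parser"));
        // B: Hadoop plus a hypothetical sample of its transitive dependencies.
        Set<String> b = new LinkedHashSet<>(Arrays.asList(
                "org.apache.hadoop:hadoop-core",
                "commons-logging:commons-logging",
                "commons-cli:commons-cli"));

        System.out.println(bundled(a, b)); // prints [com.example:example-parser]
    }
}
```

Only com.example:example-parser would end up in lib here, because the other two entries of A are also in B and are therefore already on the classpath.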


Once created, the jar can be submitted to Hadoop with the hadoop jar command:

$ $HADOOP_HOME/bin/hadoop jar $basedir/target/hadoop-deploy/${ant.project.name}-hdeploy.jar job.launching.mainClass

