Tuesday, January 31, 2012

Hadoop 1.0

There is finally a 1.0 version of Hadoop. One of my biggest complaints about Hadoop, which I have used since version 0.10, is that something breaks with every release. The usual retort was that compatibility would come with 1.0, and for years 1.0 seemed to be just around the corner. I haven't been using Hadoop as much for the last six months, so I didn't notice the 1.0 release until today, even though it was announced towards the end of last year. As a user, this news gives me some hope that there might be a Hadoop upgrade that just works. We'll have to see as new versions come out.

That said, it is a little disappointing to see some of the cruft that is still around in the APIs. In particular, the org.apache.hadoop.mapred and org.apache.hadoop.mapreduce packages are both still around. So if I want to write a map reduce job, which API should I use? Even worse, the javadocs still don't provide much clarity on which of these APIs should be preferred, and neither appears to be deprecated. Maybe there are good reasons for keeping both, but this adds a lot of confusion for users and, in my selfish opinion, should have been cleaned up before a 1.0 release.
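
To make the duplication concrete, here is a rough sketch of the same word-splitting mapper written in both styles. The types below are simplified stand-ins I made up so the example runs without Hadoop on the classpath. They only imitate the general shape of the two packages (an interface that emits through an OutputCollector in mapred, a base class that emits through a Context in mapreduce) and leave out details such as the Reporter argument, the Writable types, and checked exceptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoMapperApis {

    // Hypothetical stand-ins for the org.apache.hadoop.mapred shape:
    // the mapper is an interface and output goes through a collector.
    interface OutputCollector<K, V> {
        void collect(K key, V value);
    }

    interface OldStyleMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, OutputCollector<K2, V2> output);
    }

    // Hypothetical stand-ins for the org.apache.hadoop.mapreduce shape:
    // the mapper is a base class and output goes through a context.
    interface Context<K, V> {
        void write(K key, V value);
    }

    abstract static class NewStyleMapper<K1, V1, K2, V2> {
        protected abstract void map(K1 key, V1 value, Context<K2, V2> context);
    }

    // The same logic, written once per style: emit (word, 1) for each token.
    static class OldTokenMapper implements OldStyleMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line, OutputCollector<String, Integer> output) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.collect(word, 1);
                }
            }
        }
    }

    static class NewTokenMapper extends NewStyleMapper<Long, String, String, Integer> {
        @Override
        protected void map(Long offset, String line, Context<String, Integer> context) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(word, 1);
                }
            }
        }
    }

    // Small drivers that record the emitted pairs as "word=count" strings.
    public static List<String> runOld(String line) {
        List<String> out = new ArrayList<>();
        new OldTokenMapper().map(0L, line, (k, v) -> out.add(k + "=" + v));
        return out;
    }

    public static List<String> runNew(String line) {
        List<String> out = new ArrayList<>();
        new NewTokenMapper().map(0L, line, (k, v) -> out.add(k + "=" + v));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runOld("to be or not"));
        System.out.println(runNew("to be or not"));
    }
}
```

The body of the loop is identical in both versions. Only the plumbing around it changes, which is exactly why it is frustrating that the docs don't say which one to pick.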

The other big problem I've had with Hadoop is trying to keep a stack of Hadoop, Pig, Oozie, HBase, etc. all working together, and being able to update individual components without worrying too much about whether the rest of the stack will play nicely. This is much easier to do if Hadoop provides clean, simple, and well-documented APIs that these other tools can build on. At first glance, the 1.0 apidocs look like someone just slapped a 1.0 label on the current pile of crud without removing any of the old garbage and deprecated APIs.

If they actually maintain compatibility between 1.x releases, it will still be a win for users, but hopefully for 2.0 they will focus on a clean, simple API for users and get rid of a bunch of old cruft. It would also be nice if we didn't have to wait 6 years for 2.0.