As of today, Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked as ready for production (neither 0.21 nor 0.22 is). Unfortunately, the Hadoop 0.20.2 release is not compatible with the latest stable version of HBase: if you run HBase on top of Hadoop 0.20.2, you risk losing data! Hence HBase users are required to build their own Hadoop 0.20.x version if they want to run HBase on a production Hadoop cluster. In this article, I describe how to build such a production-ready version of Hadoop 0.20.x that is compatible with HBase 0.90.2.
Before we start
The examples below use git (not svn)
In the following sections, I will use git as the version control system to work on the Hadoop source code. Why? Because I am much more comfortable with git than svn, so please bear with me.
If you are using Subversion, feel free to adapt the git commands described below. You are invited to write a comment to this article about your SVN experience so that other SVN users can benefit, too!
Hadoop 0.20.2 versus 0.20.203.0
Hadoop is covered. What about HBase then?
In this article, I focus solely on building a Hadoop 0.20.x version (see the Background section below) that is compatible with HBase 0.90.2. In a future article, I may describe how to actually install and set up HBase 0.90.2 on the Hadoop 0.20.x version we create here.
Version of Hadoop 0.20-append used in this article
The instructions below use the latest version of
branch-0.20-append. As of this writing, the latest commit to the
append branch is git commit
df0d79cc aka Subversion
rev 1057313. For reference, the corresponding commit
message is “HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.” from January 10, 2011.
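Once you have checked out branch-0.20-append (as described in the next section), you can confirm which commit your working tree is based on:

```shell
# Show the most recent commit on the currently checked-out branch.
# On branch-0.20-append at the time of writing, this reports commit
# df0d79cc, the HDFS-1554 commit mentioned above.
git log --pretty=oneline -1
```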
That said, the steps should also work for newer versions of Hadoop and HBase.
Which versions to pick for production clusters?
Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked ready for production. Unfortunately, the latest stable release of Apache HBase, i.e. HBase 0.90.2, is not compatible with Hadoop 0.20.2: If you try to run HBase 0.90.2 on an unmodified version of Hadoop 0.20.2 release, you might lose data!
The following lines are taken, slightly modified, from the HBase documentation:
This version of HBase [0.90.2] will only run on Hadoop 0.20.x. It will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch.
Here is a quick overview:
| Hadoop version | HBase version | Compatible? |
|----------------|---------------|-------------|
| 0.20.2 release | 0.90.2 | NO |
| 0.20-append | 0.90.2 | YES |
| 0.21.0 release | 0.90.2 | NO |
| 0.22.x (in development) | 0.90.2 | NO |
To be honest, it took me quite some time to get up to speed with the various requirements, dependencies, project statuses, etc. for marrying Hadoop 0.20.x and HBase 0.90.2. Hence I want to contribute back to the Hadoop and HBase communities by writing this article.
Alternatives to what we are doing here
Another option for getting HBase up and running on Hadoop – rather than building Hadoop 0.20-append yourself – is to use Cloudera’s CDH3 distribution. CDH3 includes the Hadoop 0.20-append patches needed to add a durable sync, i.e. to make Hadoop 0.20.x compatible with HBase 0.90.2.
A word of caution and a Thank You
First, a warning: while I have taken great care to compile and describe the steps in the following sections, I still cannot give you any guarantees. If in doubt, join our discussions on the HBase mailing list.
Second, I am only stitching together the pieces of the puzzle here; the heavy lifting has been done by others. Hence I would like to thank Michael Stack for his great feedback while I was preparing the information for this article, and both him and the rest of the HBase developers for their help on the HBase mailing list. It’s much appreciated!
Building Hadoop 0.20-append from branch-0.20-append
Retrieve the Hadoop 0.20-append sources
Hadoop 0.20.x is not yet separated into the Common, HDFS, and MapReduce components as versions >= 0.21.0 are. Hence you will find all the required code in the Hadoop Common repository.
So the first step is to check out the Hadoop Common repository. However, a plain git clone leaves you on the tip of development for Hadoop Common (trunk), and we are only interested in the code tree for Hadoop 0.20-append, i.e. the branch-0.20-append branch. Because git by default does not create local branches for the remote branches of a cloned repository, we must instruct it explicitly to do so:
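A minimal sketch of the required commands; the repository URL below is the Apache git mirror of Hadoop Common as it existed at the time of writing (an assumption – adjust it if the mirror has moved):

```shell
# Clone the Hadoop Common repository (URL assumed; adjust to the current mirror).
git clone git://git.apache.org/hadoop-common.git
cd hadoop-common

# Create a local tracking branch for the remote append branch and switch to it.
git checkout -t origin/branch-0.20-append
```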
Hadoop 0.20.2 release vs. Hadoop 0.20-append
Up to now, you might have asked yourself what the difference between the 0.20.2 release of Hadoop and its append branch actually is. Here’s the answer: The Hadoop 0.20-append branch is effectively a superset of Hadoop 0.20.2 release. In other words, there is not a single “real” commit in Hadoop 0.20.2 release that is not also in Hadoop 0.20-append. This means that Hadoop 0.20-append brings all the goodies that Hadoop 0.20.2 release has, great!
Run the following
git command to verify this:
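One way to check this – assuming your clone carries the 0.20.2 release tag under the name release-0.20.2 – is to list the commits reachable from one ref but not the other:

```shell
# Commits in the 0.20.2 release that are NOT in branch-0.20-append.
# Expect to see only the two tagging commits discussed below.
git log --oneline branch-0.20-append..release-0.20.2

# The reverse direction lists everything the append branch adds on top.
git log --oneline release-0.20.2..branch-0.20-append
```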
As you can see, there are only two commits in 0.20.2 release that are not in
branch-0.20-append, namely the
commits “Hadoop 0.20.2 release” and “Hadoop 0.20.2-rc4”. Both of these commits are simple tagging commits, i.e. they
are just used for release management but do not introduce any changes to the content of the Hadoop source code.
Run the build process
First, we have to create the build.properties file (see full instructions). Here are the contents of mine:
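A sketch of what the file looks like, based on the Git and Hadoop build instructions and the JAR names used later in this article; the version string 0.20-append-for-hbase is my own label, so feel free to pick a different one:

```properties
# This resolver setting is required so the Ivy-based build resolves
# dependencies correctly.
resolvers=internal

# Our own version label; it ends up in the generated JAR file names,
# e.g. hadoop-core-0.20-append-for-hbase.jar.
version=0.20-append-for-hbase
project.version=${version}
hadoop.version=${version}
```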
The build.properties file should be placed (or be available) in the hadoop-common top directory, i.e. hadoop-common/build.properties. You can either place the file there directly or follow the recommended approach, where you place the file in a parent directory and create a symlink to it. The latter approach is convenient if you have also checked out the repositories of hadoop-hdfs and hadoop-mapreduce and thus want to use the same build.properties file for all three sub-projects.
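The symlink variant could look like this (the parent directory path is illustrative):

```shell
# Keep one shared build.properties in the parent directory of your checkouts...
cd ~/src            # illustrative directory that contains hadoop-common etc.
touch build.properties

# ...and point each project at it via a symlink.
ln -s ../build.properties hadoop-common/build.properties
```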
Now we are ready to compile Hadoop from source with
ant. I used the command
ant mvn-install as described on
Git and Hadoop. The build itself should only take a few minutes. Be
sure to run
ant test as well (or only
ant test-core if you’re lazy) but be aware that the tests take much
longer than the build (two hours on my 3-year old MacBook Pro, for instance).
The build test fails, now what?
Now comes the more delicate part: If you run the build tests via
ant test, you will notice that the build test
process always fails! One consistent test error is reported by
TestFileAppend4 and logged to the file
build/test/TEST-org.apache.hadoop.hdfs.TestFileAppend4.txt. Here is a short excerpt of the test’s output:
Fortunately, this error
does not mean that the build is not working. From what we know
this is a problem of the unit tests in
branch-0.20-append themselves (see also
Michael Stack’s comment on HBASE-3285).
Occasionally, you might run into other build failures and/or build errors; on my machine, for instance, a few other tests have also failed intermittently. I do not know what causes these occasional errors – maybe it is a problem with the machine I am running the tests on. I am still looking into it.
Frankly, what I wrote above may sound discomforting to you. At least it does to me. Still, the feedback I have received on the HBase mailing list indicates that the Hadoop 0.20-append build as done above is indeed correct.
Locate the build output (Hadoop JAR files)
By default, the build run via ant mvn-install places the generated Hadoop JAR files in your local Maven repository. You can find the actual JAR files with the following command:
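Assuming the default local Maven repository location under $HOME/.m2, something like the following will list them:

```shell
# List all Hadoop JAR files that 'ant mvn-install' put into the local
# Maven repository (default location under $HOME/.m2 assumed).
find "$HOME/.m2/repository" -name 'hadoop-*.jar'
```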
Install your Hadoop 0.20-append build in your Hadoop cluster
The only thing left to do now is to install the Hadoop 0.20-append build in your cluster. This step is easy: simply
replace the Hadoop JAR files of your existing installation of Hadoop 0.20.2 release with the ones you just created
above. You will also have to replace the Hadoop core JAR file in your HBase 0.90.2 installation (i.e. $HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar) with the Hadoop core JAR file you created above (named hadoop-core-0.20-append-for-hbase.jar if you followed the instructions above).
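As a sketch – assuming $HBASE_HOME points at your HBase installation and the freshly built core JAR sits in the current directory:

```shell
# Remove the Hadoop core JAR bundled with HBase 0.90.2...
rm "$HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar"

# ...and drop in the Hadoop core JAR we built ourselves.
cp hadoop-core-0.20-append-for-hbase.jar "$HBASE_HOME/lib/"
```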
Rename the build JAR files if you run Hadoop 0.20.2
Hadoop 0.20.2 release names its JAR files in the form hadoop-0.20.2-examples.jar, whereas the build process above uses the scheme hadoop-examples-0.20-append-for-hbase.jar. You might therefore want to rename the JAR files you created in the previous section so that they match the naming scheme of Hadoop 0.20.2 release (otherwise Hadoop's bin/hadoop script will not be able to add the Hadoop core JAR file to its CLASSPATH, and command examples such as hadoop jar hadoop-*-examples.jar pi 50 1000 in the Hadoop docs will not work as is).
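One way to do the renaming in bulk – assuming the JARs carry the version label 0.20-append-for-hbase from our build.properties:

```shell
# Rename hadoop-PACKAGE-0.20-append-for-hbase.jar to the release-style
# name hadoop-0.20.2-PACKAGE.jar for every matching JAR in this directory.
for jar in hadoop-*-0.20-append-for-hbase.jar; do
  pkg=${jar#hadoop-}                        # strip the leading "hadoop-"
  pkg=${pkg%-0.20-append-for-hbase.jar}     # strip the version suffix -> package name
  mv "$jar" "hadoop-0.20.2-${pkg}.jar"
done
```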
In contrast, HBase uses the
hadoop-PACKAGE-VERSION.jar scheme. So when you replace the Hadoop core JAR file
shipped with HBase 0.90.2 in
$HBASE_HOME/lib, you can opt to leave the name of the newly built Hadoop core JAR file as is.
Maintaining your own version of Hadoop 0.20-append
If you must integrate additional patches into Hadoop 0.20.2 and/or Hadoop 0.20-append (normally in the form of backports of patches for Hadoop 0.21 or 0.22), you can create a local branch based on the Hadoop version you are interested in. Yes, this creates some effort on your behalf, so be sure to weigh the pros and cons of doing so.
Imagine, for instance, that you use Hadoop 0.20-append based on branch-0.20-append because you also want to run the latest stable release of HBase on your Hadoop cluster. During benchmarking and stress testing of your cluster, you have unfortunately discovered a problem that you could track down to HDFS-611. Now a patch is actually available (you might have to do some tinkering to backport it properly), but it is not in the version of Hadoop you are running, i.e. it is not in branch-0.20-append.
What you can do is to create a local git branch based on your Hadoop version (here:
branch-0.20-append) where you
can integrate and test any relevant patches you need. Please understand that I will only describe the basic approach here – I will not go into the details of how to stay current with any changes to the Hadoop version you are tracking after you have followed the steps below. There are plenty of splendid git introductions, such as the Git Community Book, that explain this much better and more thoroughly than I could.
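Creating such a local branch is a one-liner:

```shell
# Create (and switch to) a local branch based on the remote append branch.
git checkout -b branch-0.20-append-yourbranch -t origin/branch-0.20-append
```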
Verify that the two append branches are identical up to now.
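If the two branches are identical, both of the following commands print nothing:

```shell
# Commits in one branch but not the other -- empty output means "identical".
git log --oneline branch-0.20-append..branch-0.20-append-yourbranch
git log --oneline branch-0.20-append-yourbranch..branch-0.20-append
```

(Use origin/branch-0.20-append instead if you did not create a local append branch.)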
Apply the relevant patch to your branch. In the example below, I apply a backport of the HDFS-611 patch to branch-0.20-append via the file HDFS-611.branch-0.20-append.v1.patch. Note that this backport is not available on the HDFS-611 page – I created it myself based on the HDFS-611 patch for the Hadoop 0.20.2 release.
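A sketch of the patch-and-commit steps; the patch level (-p0 here) is an assumption that depends on how the patch file was created:

```shell
# Make sure we are on our own branch, not the pristine append branch.
git checkout branch-0.20-append-yourbranch

# Apply the backported patch (patch level -p0 assumed) and commit it.
patch -p0 < HDFS-611.branch-0.20-append.v1.patch
git add -A
git commit -m "Backport of HDFS-611 for branch-0.20-append"
```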
Verify that your patched branch is one commit ahead of the original (remote) append branch.
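After committing, the patched branch should be exactly one commit ahead of the remote append branch:

```shell
# Should list exactly one commit: our HDFS-611 backport.
git log --oneline origin/branch-0.20-append..branch-0.20-append-yourbranch
```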
And by the way, if you want to see the commit differences between Hadoop 0.20.2 release, the official
branch-0.20-append and your own, patched
branch-0.20-append-yourbranch, run the following git command:
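git show-branch gives a compact side-by-side view (the release tag name release-0.20.2 is an assumption; adjust it to your clone):

```shell
# Compare which commits are in the 0.20.2 release, the official append
# branch, and our patched branch.
git show-branch release-0.20.2 branch-0.20-append branch-0.20-append-yourbranch
```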
I hope this article helps you build a Hadoop 0.20.x version on which you can run HBase 0.90.2 in a production environment! Your feedback and comments are, as always, appreciated.