<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Michael G. Noll]]></title>
  <link href="http://www.michael-noll.com/atom.xml" rel="self"/>
  <link href="http://www.michael-noll.com/"/>
  <updated>2013-04-30T20:36:09+02:00</updated>
  <id>http://www.michael-noll.com/</id>
  <author>
    <name><![CDATA[Michael G. Noll]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Reading and Writing Avro Files from the Command Line]]></title>
    <link href="http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/"/>
    <updated>2013-03-17T18:59:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line</id>
    <content type="html"><![CDATA[<p><a href="http://avro.apache.org/">Apache Avro</a> is becoming one of the most popular data serialization formats nowadays, and
this holds true particularly for Hadoop-based big data platforms because tools like Pig, Hive and of course Hadoop itself
natively support reading and writing data in Avro format.  Many users seem to enjoy Avro but I have heard many
complaints about not being able to conveniently read or write Avro files with command line tools – “Avro is nice,
but why do I have to write Java or Python code just to quickly see what’s in a binary Avro file, or discover at least
its Avro schema?”</p>

<p>To those users it comes as a surprise that Avro actually ships with exactly such command line tools, but apparently they
are not prominently advertised or documented as such.  In this short article I will show a few hands-on examples of how to
read, write, compress and convert data from and to binary Avro using Avro Tools 1.7.4.</p>

<!-- more -->

<h1 id="what-we-want-to-do">What we want to do</h1>

<p>Here is an overview of what we want to do:</p>

<ul>
  <li>We will start with an example Avro schema and a corresponding data file in plain-text JSON format.</li>
  <li>We will use Avro Tools to convert the JSON file into binary Avro, with and without compression (Snappy), and from binary
Avro back to JSON.</li>
</ul>

<h1 id="getting-avro-tools">Getting Avro Tools</h1>

<p>You can get a copy of the latest stable Avro Tools jar file from the
<a href="http://avro.apache.org/releases.html#Download">Avro Releases</a> page.  The actual file is in the <code>java</code> subdirectory
of a given Avro release version.  Here is a direct link to
<a href="http://www.us.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar">avro-tools-1.7.4.jar</a>
(11 MB) on the US Apache mirror site.</p>

<p>Save <code>avro-tools-1.7.4.jar</code> to a directory of your choice.  I will use <code>~/avro-tools-1.7.4.jar</code> for the
examples shown below.</p>

<h1 id="tools-included-in-avro-tools">Tools included in Avro Tools</h1>

<p>Just run Avro Tools without any parameters to see what’s included:</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar
[...snip...]
Available tools:
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
      recodec  Alters the codec of a data file.
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, one record per line.
       totext  Converts an Avro data file to a text file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
</code></pre>

<p>Likewise run any particular tool without parameters to see its usage/help output.  For example, here is the help of the
<code>fromjson</code> tool:</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar fromjson
Expected 1 arg: input_file
Option                                  Description
------                                  -----------
--codec                                 Compression codec (default: null)
--schema                                Schema
--schema-file                           Schema File
</code></pre>

<p>Note that most of the tools write to <code>STDOUT</code>, so you will normally want to redirect their output to a file via the Bash
<code>&gt;</code> redirection operator (particularly when working with large files).</p>

<h1 id="example-data">Example data</h1>

<p>In the next sections I will use the following example data to demonstrate Avro Tools.  You can also download the example
files from <a href="https://github.com/miguno/avro-cli-examples">https://github.com/miguno/avro-cli-examples</a>.</p>

<h2 id="avro-schema">Avro schema</h2>

<p>The schema below defines a tuple of <code>(username, tweet, timestamp)</code> as the format of our example data records.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Avro schema (twitter.avsc)  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span>
</span><span class="line">  <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;record&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;twitter_schema&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;namespace&quot;</span> <span class="p">:</span> <span class="s2">&quot;com.miguno.avro&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;fields&quot;</span> <span class="p">:</span> <span class="p">[</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;username&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Name of the user account on Twitter.com&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;tweet&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;The content of the user&#39;s Twitter message&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;timestamp&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;long&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Unix epoch time in seconds&quot;</span>
</span><span class="line">  <span class="p">}</span> <span class="p">],</span>
</span><span class="line">  <span class="nt">&quot;doc&quot;</span> <span class="p">:</span> <span class="s2">&quot;A basic schema for storing Twitter messages&quot;</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="data-records-in-json-format">Data records in JSON format</h2>

<p>And here is some corresponding example data with two records that follow the schema defined in the previous section.
We store this data in the file <code>twitter.json</code>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Example data in JSON format (twitter.json)  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;miguno&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Rock: Nerf paper, scissors is fine.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366150681</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;BlizzardCS&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Works as intended.  Terran is IMBA.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366154481</span> <span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
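<p>Before converting anything it can help to check that every JSON record actually has the fields the schema asks for.  The following is a minimal sketch in Python, not a real Avro validator: the helper name <code>check_record</code> and the type mapping are assumptions made for this illustration, the <code>doc</code> attributes of the schema are left out for brevity, and only the two primitive types used in this example (<code>string</code> and <code>long</code>) are handled.</p>

```python
import json

# The example schema from above, without the "doc" attributes.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "tweet", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# The two example records from twitter.json, one JSON document per line.
RECORDS = """\
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }
"""

# Assumption for this sketch: only the two primitive Avro types used here are mapped.
PYTHON_TYPES = {"string": str, "long": int}

def check_record(record, schema):
    """Return a list of problems found in one record (an empty list means it looks fine)."""
    problems = []
    expected = {field["name"]: field["type"] for field in schema["fields"]}
    for name, avro_type in expected.items():
        if name not in record:
            problems.append("missing field: " + name)
        elif not isinstance(record[name], PYTHON_TYPES[avro_type]):
            problems.append("wrong type for field: " + name)
    for name in record:
        if name not in expected:
            problems.append("unexpected field: " + name)
    return problems

schema = json.loads(SCHEMA_JSON)
for line in RECORDS.splitlines():
    record = json.loads(line)
    print(record["username"], check_record(record, schema) or "ok")
```

<p>A check like this only catches obvious mismatches such as a missing field or a quoted timestamp.  The authoritative check is still the <code>fromjson</code> tool itself.</p>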

<h1 id="converting-to-and-from-binary-avro">Converting to and from binary Avro</h1>

<h2 id="json-to-binary-avro">JSON to binary Avro</h2>

<p>Without compression:</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar fromjson --schema-file twitter.avsc twitter.json &gt; twitter.avro
</code></pre>

<p>With Snappy compression:</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json &gt; twitter.snappy.avro
</code></pre>

<div class="note">
Note for Mac OS X users: If you run into <tt>SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY]</tt> when trying to compress the
data with Snappy make sure you use JDK 6 and not JDK 7.
</div>
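<p>If you want to drive the two <code>fromjson</code> commands above from a script rather than from Bash, the same redirection of <code>STDOUT</code> to a file can be sketched in Python.  The helper names <code>build_fromjson_command</code> and <code>run_to_file</code> are made up for this illustration.  The only Avro Tools options used are <code>--codec</code> and <code>--schema-file</code>, as listed in the help output of <code>fromjson</code>.</p>

```python
import os
import subprocess

def build_fromjson_command(jar_path, schema_file, json_file, codec=None):
    """Build the argument list for the Avro Tools fromjson tool (hypothetical helper)."""
    command = ["java", "-jar", os.path.expanduser(jar_path), "fromjson"]
    if codec is not None:
        # --codec is optional; according to the help output it defaults to "null".
        command += ["--codec", codec]
    command += ["--schema-file", schema_file, json_file]
    return command

def run_to_file(command, output_path):
    """Run the command and write its STDOUT to output_path, like "command > output_path" in Bash."""
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if the tool exits with a non-zero status.
        subprocess.run(command, stdout=out, check=True)

if __name__ == "__main__":
    # Only prints the command; call run_to_file(cmd, "twitter.snappy.avro") to execute it.
    cmd = build_fromjson_command("~/avro-tools-1.7.4.jar", "twitter.avsc", "twitter.json", codec="snappy")
    print(" ".join(cmd))
```

<p>Passing the arguments as a list avoids shell quoting issues, and opening the output file in binary mode matters because the tool writes binary Avro data.</p>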

<h2 id="binary-avro-to-json">Binary Avro to JSON</h2>

<p>The same command will work on both uncompressed and compressed data.</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.avro &gt; twitter.json
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.snappy.avro &gt; twitter.json
</code></pre>

<div class="note">
Note for Mac OS X users: If you run into <tt>SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY]</tt> when trying to decompress
the data with Snappy make sure you use JDK 6 and not JDK 7.
</div>
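<p>The JSON written by <code>tojson</code> may not be byte-for-byte identical to a hand-written input file (whitespace, for instance, can differ), so a plain text diff is not a reliable round-trip check.  Here is a small sketch in Python that compares the parsed records instead.  The function names are hypothetical, and the second string below is an assumed re-serialized form of the first record, not captured tool output.</p>

```python
import json

def parse_json_lines(text):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def same_records(original_text, roundtrip_text):
    """True if both texts hold the same records in the same order, ignoring whitespace and key order."""
    return parse_json_lines(original_text) == parse_json_lines(roundtrip_text)

original = '{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }\n'
# Assumed compact form of the same record; the exact formatting produced by tojson may differ.
roundtrip = '{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp":1366150681}\n'

print(same_records(original, roundtrip))
```

<p>Note that the commands above write to <code>twitter.json</code> and therefore overwrite the original input file.  Redirect to a different file name if you want to compare the two.</p>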

<h2 id="retrieve-avro-schema-from-binary-avro">Retrieve Avro schema from binary Avro</h2>

<p>The same command will work on both uncompressed and compressed data.</p>

<pre><code>$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.avro &gt; twitter.avsc
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.snappy.avro &gt; twitter.avsc
</code></pre>

<h1 id="known-issues-of-snappy-with-jdk-7-on-mac-os-x">Known Issues of Snappy with JDK 7 on Mac OS X</h1>

<p>If you happen to use JDK 7 on Mac OS X 10.8 you will run into the error below when trying to run the Snappy-related
commands above.  In that case make sure to explicitly use JDK 6.  On Mac OS X 10.8 the JDK 6 <code>java</code> binary is by default
available at <tt>/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java</tt>.</p>

<p>The cause of this problem is documented in the bug report
<a href="https://github.com/xerial/snappy-java/issues/6">Native (Snappy) library loading fails on openjdk7u4 for mac</a>.  This bug
is already fixed in the latest Snappy-Java 1.0.5 milestone releases, but Avro 1.7.4 still depends on the latest stable
release of Snappy-Java, which is 1.0.4.1 (see <code>lang/java/pom.xml</code> in the Avro source code).</p>

<p>I also found that one way to fix this problem when writing your own Java code is to explicitly require a Snappy-Java
1.0.5 milestone release.  Here is the relevant dependency declaration for <code>build.gradle</code> in case you are using Gradle.  This seems to solve
the problem, but I have yet to confirm whether this is safe for production scenarios.</p>

<pre><code>// Required to fix a Snappy native library error on OS X when trying to compress Avro files with Snappy;
// Avro 1.7.4 uses the latest stable release of Snappy, 1.0.4.1 (see avro/lang/java/pom.xml) that still contains
// the original bug described at https://github.com/xerial/snappy-java/issues/6.
//
// Note that in a production setting we do not care about OS X, so we could use Snappy 1.0.4.1 as required by
// Avro 1.7.4 as is.
//
compile group: 'org.xerial.snappy', name: 'snappy-java', version: '1.0.5-M4'
</code></pre>

<p>Detailed error message:</p>

<pre><code>$ uname -a
Darwin mac.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64

$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json &gt; twitter.snappy.avro

java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
	at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
	at org.xerial.snappy.Snappy.&lt;clinit&gt;(Snappy.java:44)
	at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
	at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
	at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
	at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
	at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
	at org.apache.avro.tool.Main.run(Main.java:80)
	at org.apache.avro.tool.Main.main(Main.java:69)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
	at java.lang.Runtime.loadLibrary0(Runtime.java:845)
	at java.lang.System.loadLibrary(System.java:1084)
	at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
	... 15 more
Exception in thread "main" org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
	at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
	at org.xerial.snappy.Snappy.&lt;clinit&gt;(Snappy.java:44)
	at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
	at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
	at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
	at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
	at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
	at org.apache.avro.tool.Main.run(Main.java:80)
	at org.apache.avro.tool.Main.main(Main.java:69)
</code></pre>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<p>The example commands above show just a few variants of how to use Avro Tools to read, write and convert Avro files.
The Avro Tools library is documented at:</p>

<ul>
  <li><a href="http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/tool/package-summary.html">Java API docs of org.apache.avro.tool</a></li>
</ul>

<p>That said, I did not find those docs very helpful (the sources are, however).  I’d recommend simply running the command
line tools without parameters and having a look at the usage instructions that they print to <code>STDOUT</code>.  Normally
this is enough to understand how they should be used.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node]]></title>
    <link href="http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/"/>
    <updated>2013-03-13T18:59:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node</id>
    <content type="html"><![CDATA[<p>In this article I describe how to install, configure and run a multi-broker Apache Kafka 0.8 (trunk) cluster on a
single machine.  The final setup consists of one local ZooKeeper instance and three local Kafka brokers.  We will
test-drive the setup by sending messages to the cluster via a console producer and receiving those messages via a console
consumer.  I will also describe how to build Kafka for Scala 2.9.2, which makes it much easier to integrate Kafka with
other Scala-based frameworks and tools that require Scala 2.9 instead of Kafka’s default Scala 2.8.</p>

<!-- more -->

<h1 id="what-we-want-to-do">What we want to do</h1>

<p>Here is an overview of what we want to do:</p>

<ul>
  <li>Build Kafka 0.8-trunk for Scala 2.9.2.
    <ul>
      <li>I also provide instructions for the default 2.8.0, just in case.</li>
    </ul>
  </li>
  <li>Use a single machine for this Kafka setup.</li>
  <li>Run 1 ZooKeeper instance on that machine.</li>
  <li>Run 3 Kafka brokers on that machine.</li>
  <li>Create a Kafka topic called “zerg.hydra” and send/receive messages for that topic via the console.  The topic
will be configured to use 3 partitions and 2 replicas per partition.</li>
</ul>

<p>The purpose of this article is not to present a production-ready configuration of a Kafka cluster.  However it should
get you started with using Kafka as a distributed messaging system in your own infrastructure.</p>

<h1 id="installing-kafka">Installing Kafka</h1>

<h2 id="background-why-kafka-and-scala-29">Background: Why Kafka and Scala 2.9?</h2>

<p>Personally I’d like to use Scala 2.9.2 for Kafka – which is still built for Scala 2.8.0 by default as of today –
because many related software packages that are of interest to me (such as Finagle, Kestrel) are based on Scala 2.9. 
Also, the current versions of many development and build tools (e.g. IDEs, sbt) for Scala require at least version 2.9. 
If you are working in a similar environment you may want to build Kafka for Scala 2.9 just like I did – otherwise you can
expect to run into issues such as Scala version conflicts.</p>

<h2 id="option-1-preferred-kafka-08-trunk-with-scala-292">Option 1 (preferred): Kafka 0.8-trunk with Scala 2.9.2</h2>

<p>Unfortunately the current trunk of Kafka has problems building against Scala 2.9.2 out of the box.  I created a
<a href="https://github.com/miguno/kafka/">fork of Kafka 0.8-trunk</a> that includes the
<a href="https://github.com/miguno/kafka/commit/9445d563e31967da5e4a62933a3407675be64693">required fix</a> (a change of one file)
in the <a href="https://github.com/miguno/kafka/tree/scala-2.9.2">branch “scala-2.9.2”</a>.  The fix ties the Scala version used by
Kafka’s shell scripts to 2.9.2 instead of 2.8.0.</p>

<p>The following instructions will use this fork to download, build and install Kafka for Scala 2.9.2:</p>

<pre><code>$ cd $HOME
$ git clone git@github.com:miguno/kafka.git
$ cd kafka
# this branch includes a patched bin/kafka-run-class.sh for Scala 2.9.2
$ git checkout -b scala-2.9.2 remotes/origin/scala-2.9.2
$ ./sbt update
$ ./sbt "++2.9.2 package"
</code></pre>

<h2 id="option-2-kafka-08-trunk-with-scala-280">Option 2: Kafka 0.8-trunk with Scala 2.8.0</h2>

<p>If you are fine with Scala 2.8 you can build and install Kafka as follows.</p>

<pre><code>$ cd $HOME
$ git clone git@github.com:apache/kafka.git
$ cd kafka
$ ./sbt update
$ ./sbt package
</code></pre>

<h1 id="configuring-and-running-kafka">Configuring and running Kafka</h1>

<p>Unless noted otherwise all commands below assume that you are in the top level directory of your Kafka installation.
If you followed the instructions above, this directory is <code>$HOME/kafka/</code>.</p>

<h2 id="start-zookeeper">Start ZooKeeper</h2>

<p>Kafka ships with a reasonable default ZooKeeper configuration for our simple use case.  The following command launches
a local ZooKeeper instance.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start ZooKeeper  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/zookeeper-server-start.sh config/zookeeper.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>By default the ZooKeeper server will listen on <code>*:2181/tcp</code>.</p>

<h2 id="configure-and-start-the-kafka-brokers">Configure and start the Kafka brokers</h2>

<p>We will create 3 Kafka brokers, whose configurations are based on the default <code>config/server.properties</code>.  Apart from
the settings below the configurations of the brokers are identical.</p>

<p>The first broker:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Create the config file for broker 1  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>cp config/server.properties config/server1.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Edit <code>config/server1.properties</code> and replace the existing config values as follows:</p>

<pre><code>broker.id=1
port=9092
log.dir=/tmp/kafka-logs-1
</code></pre>

<p>The second broker:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Create the config file for broker 2  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>cp config/server.properties config/server2.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Edit <code>config/server2.properties</code> and replace the existing config values as follows:</p>

<pre><code>broker.id=2
port=9093
log.dir=/tmp/kafka-logs-2
</code></pre>

<p>The third broker:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Create the config file for broker 3  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>cp config/server.properties config/server3.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Edit <code>config/server3.properties</code> and replace the existing config values as follows:</p>

<pre><code>broker.id=3
port=9094
log.dir=/tmp/kafka-logs-3
</code></pre>
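<p>Editing three copies by hand is fine for a one-off setup.  If you expect to repeat it, the per-broker overrides can also be generated.  The sketch below is only an illustration in Python: <code>BASE_PROPERTIES</code> is a stand-in for the relevant lines of <code>config/server.properties</code> (the real file contains many more settings and its default values may differ), and the function name <code>broker_properties</code> is made up for this example.</p>

```python
# Stand-in for the relevant lines of config/server.properties (assumed default values).
BASE_PROPERTIES = """\
broker.id=0
port=9092
log.dir=/tmp/kafka-logs
"""

def broker_properties(base_text, broker_id, first_port=9092):
    """Return base_text with broker.id, port and log.dir set for the given broker (1, 2, 3, ...)."""
    overrides = {
        "broker.id": str(broker_id),
        # Broker 1 gets first_port, broker 2 the next port, and so on.
        "port": str(first_port + broker_id - 1),
        "log.dir": "/tmp/kafka-logs-" + str(broker_id),
    }
    lines = []
    for line in base_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in overrides:
            line = key + "=" + overrides[key]
        # Comments and all other settings are passed through unchanged.
        lines.append(line)
    return "\n".join(lines) + "\n"

for broker_id in (1, 2, 3):
    print("# config/server%d.properties" % broker_id)
    print(broker_properties(BASE_PROPERTIES, broker_id))
```

<p>The important point is that each broker on the same machine needs its own <code>broker.id</code>, its own port and its own log directory.  Everything else can stay identical.</p>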

<p>Now you can start each Kafka broker <strong>in a separate console</strong>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start the first broker in its own terminal session  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>env <span class="nv">JMX_PORT</span><span class="o">=</span>9999  bin/kafka-server-start.sh config/server1.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start the second broker in its own terminal session  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>env <span class="nv">JMX_PORT</span><span class="o">=</span>10000 bin/kafka-server-start.sh config/server2.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start the third broker in its own terminal session  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>env <span class="nv">JMX_PORT</span><span class="o">=</span>10001 bin/kafka-server-start.sh config/server3.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here is a summary of the configured network interfaces and ports that the brokers will listen on:</p>

<pre><code>        Broker 1     Broker 2      Broker 3
----------------------------------------------
Kafka   *:9092/tcp   *:9093/tcp    *:9094/tcp
JMX     *:9999/tcp   *:10000/tcp   *:10001/tcp
</code></pre>

<h2 id="excursus-topics-partitions-and-replication-in-kafka">Excursus: Topics, partitions and replication in Kafka</h2>

<p>In a nutshell Kafka partitions incoming messages for a topic, and assigns those partitions to the available Kafka 
brokers.  The number of partitions is configurable and can be set per-topic and per-broker.</p>

<blockquote><p>First the stream [of messages] is partitioned on the brokers into a set of distinct partitions.  The semantic meaning of these partitions is left up to the producer and the producer specifies which partition a message belongs to.  Within a partition messages are stored in the order in which they arrive at the broker, and will be given out to consumers in that same order.</p><footer><strong>Kafka Design</strong> <cite><a href="http://kafka.apache.org/design.html">kafka.apache.org/design.html/&hellip;</a></cite></footer></blockquote>
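<p>To make the idea of producer-side partitioning more concrete, here is a minimal sketch in Python of one common strategy: hash a message key and take the result modulo the number of partitions.  This is an illustration only and not Kafka's own partitioner.  The function name <code>choose_partition</code> and the choice of CRC32 as the hash function are assumptions made for this example.</p>

```python
import zlib

def choose_partition(key, num_partitions):
    """Map a message key to a partition index in the range 0 .. num_partitions - 1."""
    # CRC32 is used because it gives the same value on every run, unlike Python's
    # built-in hash() for strings, which is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

for key in ("miguno", "BlizzardCS", "miguno"):
    print(key, choose_partition(key, 3))
```

<p>Because the mapping is deterministic, all messages with the same key end up in the same partition and are therefore consumed in the order in which they arrived at the broker.</p>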

<p>A new feature of Kafka 0.8 is that those partitions will now be replicated across Kafka brokers to make the cluster
more resilient against host failures:</p>

<blockquote><p>Partitions are now replicated. Previously the topic would remain available in the case of server failure, but individual partitions within that topic could disappear when the server hosting them stopped.  If a broker failed permanently any unconsumed data it hosted would be lost.  Starting with 0.8 all partitions have a replication factor and we get the prior behavior as the special case where replication factor = 1. Replicas have a notion of committed messages and guarantee that committed messages won&#8217;t be lost as long as at least one replica survives.  Replica are byte-for-byte identical across replicas.</p><p>Producer and consumer are replication aware. When running in <tt>sync</tt> mode, by default, the producer <tt>send()</tt> request blocks until the messages sent is committed to the active replicas. As a result the sender can depend on the guarantee that a message sent will not be lost.  Latency sensitive producers have the option to tune this to block only on the write to the leader broker or to run completely async if they are willing to forsake this guarantee. The consumer will only see messages that have been committed.</p><footer><strong>Kafka 0.8 Quickstart</strong> <cite><a href="https://cwiki.apache.org/KAFKA/kafka-08-quick-start.html">cwiki.apache.org/KAFKA/&hellip;</a></cite></footer></blockquote>

<p>The following diagram illustrates the relationship between topics, partitions and replicas.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/kafka-topics-partitions-replicas.png" title="The relationship between topics, partitions and replicas in Kafka." /></p>

<div class="caption">The relationship between topics, partitions and replicas in Kafka.</div>

<p>Logically this relationship is very similar to how Hadoop manages blocks and replication in HDFS.</p>

<p>When a topic is created in Kafka 0.8, Kafka determines how each replica of a
partition is mapped to a broker.  In general Kafka tries to spread the replicas across all brokers
(<a href="http://grokbase.com/t/kafka/users/131vv7c7r9/are-topics-and-partitions-dynamic">source</a>).  Messages are first sent to
the first replica of a partition (i.e. to the current “leader” broker of that partition)  before they are replicated to
the remaining brokers.  Message producers may choose from different strategies for sending messages (e.g. synchronous
mode, asynchronous mode).  Producers discover the available brokers in a cluster and the number of partitions on each
by <a href="http://kafka.apache.org/design.html">registering watchers in ZooKeeper</a>.</p>

<p>If you wonder how to configure the number of partitions per topic/broker, here’s feedback from LinkedIn developers:</p>

<blockquote><p>At LinkedIn, some of the high volume topics are configured with more than 1 partition per broker.  Having more partitions increases I/O parallelism for writes and also increases the degree of parallelism for consumers (since partition is the unit for distributing data to consumers).  On the other hand, more partitions adds some overhead: (a) there will be more files and thus more open file handlers; (b) there are more offsets to be checkpointed by consumers which can increase the load of ZooKeeper. So, you want to balance these tradeoffs.</p><footer><strong>[Kafka-users] Number of Partitions Per Broker</strong> <cite><a href="http://grokbase.com/t/kafka/users/131fk15cvr/number-of-partitions-per-broker">grokbase.com/t/kafka/users/&hellip;</a></cite></footer></blockquote>

<h2 id="create-a-kafka-topic">Create a Kafka topic</h2>

<p>In Kafka 0.8, there are 2 ways of creating a new topic:</p>

<ol>
  <li>Turn on the <code>auto.create.topics.enable</code> option on the broker.  When the broker receives the first message for a
new topic, it creates that topic with <code>num.partitions</code> partitions and a replication factor of <code>default.replication.factor</code>.</li>
  <li>Use the admin command <code>bin/kafka-topics.sh</code>.</li>
</ol>

<p>We will use the latter approach.  The following command creates a new topic “zerg.hydra”.  The topic is configured to
use 3 partitions and a replication factor of 2.  Note that in a production setting we’d rather set the replication
factor to 3, but a value of 2 is better for illustrative purposes (i.e. we intentionally use different values for the
number of partitions and the replication factor to better see the effects of each setting).</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Create the &#8220;zerg.hydra&#8221; topic  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/kafka-topics.sh --zookeeper localhost:2181 <span class="se">\</span>
</span><span class="line">    --create --topic zerg.hydra --partitions 3 --replication-factor 2
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This has the following effects:</p>

<ul>
  <li>Kafka will create 3 logical partitions for the topic.</li>
  <li>Kafka will create a total of two replicas (copies) per partition.  For each partition it will pick two brokers that
will host those replicas.  For each partition Kafka will elect a “leader” broker.</li>
</ul>
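<p>As a quick sanity check of these numbers, here is a minimal sketch in plain bash.  The helper <code>topic_footprint</code> is
hypothetical (it is not part of Kafka); it only multiplies the two settings and rejects a replication factor larger than
the number of brokers, since a broker can host at most one replica of a given partition.</p>

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of Kafka: derive the replica count of a topic.
# Usage: topic_footprint PARTITIONS REPLICATION_FACTOR BROKERS
topic_footprint() {
  local partitions=$1 rf=$2 brokers=$3
  # A broker hosts at most one replica per partition, so the replication
  # factor cannot exceed the number of brokers.
  if [ "$rf" -gt "$brokers" ]; then
    echo "error: replication factor $rf exceeds $brokers brokers"
    return 1
  fi
  echo "partitions=$partitions replicas_total=$((partitions * rf))"
}

topic_footprint 3 2 3   # prints: partitions=3 replicas_total=6
```

<p>For our “zerg.hydra” topic this gives 6 replicas in total, spread across the 3 brokers.</p>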

<p>Ask Kafka for a list of available topics.  The list should include the new <code>zerg.hydra</code> topic:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>List the available topics in the Kafka cluster  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/kafka-topics.sh --zookeeper localhost:2181 --list
</span><span class="line">&lt;snipp&gt;
</span><span class="line">zerg.hydra
</span><span class="line">&lt;/snipp&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>You can also inspect the configuration of the topic as well as the currently assigned brokers per partition and
replica.  Because a broker can only host a single replica per partition, Kafka has opted to use a broker’s ID also as
the corresponding replica’s ID.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Describe the &#8220;zerg.hydra&#8221; topic  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic zerg.hydra
</span><span class="line">&lt;snipp&gt;
</span><span class="line">zerg.hydra
</span><span class="line">    configs:
</span><span class="line">    partitions: 3
</span><span class="line">        partition 0
</span><span class="line">        leader: 1 <span class="o">(</span>192.168.0.153:9092<span class="o">)</span>
</span><span class="line">        replicas: 1 <span class="o">(</span>192.168.0.153:9092<span class="o">)</span>, 2 <span class="o">(</span>192.168.0.153:9093<span class="o">)</span>
</span><span class="line">        isr: 1 <span class="o">(</span>192.168.0.153:9092<span class="o">)</span>, 2 <span class="o">(</span>192.168.0.153:9093<span class="o">)</span>
</span><span class="line">        partition 1
</span><span class="line">        leader: 2 <span class="o">(</span>192.168.0.153:9093<span class="o">)</span>
</span><span class="line">        replicas: 2 <span class="o">(</span>192.168.0.153:9093<span class="o">)</span>, 3 <span class="o">(</span>192.168.0.153:9094<span class="o">)</span>
</span><span class="line">        isr: 2 <span class="o">(</span>192.168.0.153:9093<span class="o">)</span>, 3 <span class="o">(</span>192.168.0.153:9094<span class="o">)</span>
</span><span class="line">        partition 2
</span><span class="line">        leader: 3 <span class="o">(</span>192.168.0.153:9094<span class="o">)</span>
</span><span class="line">        replicas: 3 <span class="o">(</span>192.168.0.153:9094<span class="o">)</span>, 1 <span class="o">(</span>192.168.0.153:9092<span class="o">)</span>
</span><span class="line">        isr: 3 <span class="o">(</span>192.168.0.153:9094<span class="o">)</span>, 1 <span class="o">(</span>192.168.0.153:9092<span class="o">)</span>
</span><span class="line">&lt;snipp&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In this example output the first broker (with <code>broker.id</code> = 1) happens to be the designated leader for partition 0
at the moment.  Similarly, the second and third brokers are the leaders for partitions 1 and 2, respectively.</p>
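<p>The layout in this example output can be reproduced with a simple round-robin placement.  The bash sketch below is only
an illustration under that assumption; it is <em>not</em> Kafka’s actual assignment algorithm, and the function name
<code>assign_replicas</code> is made up for this example.  The first broker listed for a partition plays the role of the leader.</p>

```bash
#!/usr/bin/env bash
# Simplified illustration only: NOT Kafka's actual replica assignment logic.
# Usage: assign_replicas PARTITIONS REPLICATION_FACTOR BROKER_ID...
assign_replicas() {
  local partitions=$1 rf=$2
  shift 2
  local brokers=("$@")
  local n=${#brokers[@]}
  local p r line
  for ((p = 0; p != partitions; p++)); do
    line="partition $p:"
    # Place replica r of partition p on the broker that is r steps after
    # the partition's first broker, wrapping around the broker list.
    for ((r = 0; r != rf; r++)); do
      line+=" ${brokers[(p + r) % n]}"
    done
    echo "$line"
  done
}

assign_replicas 3 2 1 2 3
# partition 0: 1 2
# partition 1: 2 3
# partition 2: 3 1
```

<p>In a real cluster the assignment may look different, so always check the <code>--describe</code> output.</p>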

<p>The following diagram illustrates the setup (and also includes the producer and consumer that we will run shortly).</p>

<p><img src="http://www.michael-noll.com/blog/uploads/kafka-cluster-overview.png" title="Overview of the Kafka setup that we will create in this article." /></p>

<div class="caption">
Overview of our Kafka setup including the current state of the partitions and replicas.
The colored boxes represent replicas of partitions.  &#8220;P0 R1&#8221; denotes the replica with ID 1 for partition 0.
A bold box frame means that the corresponding broker is the leader for the given partition.
</div>

<p>You can also inspect the local filesystem to see how the <code>--describe</code> output above matches actual files.  By default
Kafka persists topics as “log files” (Kafka terminology) in the <code>log.dir</code> directory.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Local files that back up the partitions of Kafka topics  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>tree /tmp/kafka-logs-<span class="o">{</span>1,2,3<span class="o">}</span>
</span><span class="line">/tmp/kafka-logs-1                   <span class="c"># first broker (broker.id = 1)</span>
</span><span class="line">├── zerg.hydra-0                    <span class="c"># replica of partition 0 of topic &quot;zerg.hydra&quot; (this broker is leader)</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">├── zerg.hydra-2                    <span class="c"># replica of partition 2 of topic &quot;zerg.hydra&quot;</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">└── replication-offset-checkpoint
</span><span class="line">
</span><span class="line">/tmp/kafka-logs-2                   <span class="c"># second broker (broker.id = 2)</span>
</span><span class="line">├── zerg.hydra-0                    <span class="c"># replica of partition 0 of topic &quot;zerg.hydra&quot;</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">├── zerg.hydra-1                    <span class="c"># replica of partition 1 of topic &quot;zerg.hydra&quot; (this broker is leader)</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">└── replication-offset-checkpoint
</span><span class="line">
</span><span class="line">/tmp/kafka-logs-3                   <span class="c"># third broker (broker.id = 3)</span>
</span><span class="line">├── zerg.hydra-1                    <span class="c"># replica of partition 1 of topic &quot;zerg.hydra&quot;</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">├── zerg.hydra-2                    <span class="c"># replica of partition 2 of topic &quot;zerg.hydra&quot; (this broker is leader)</span>
</span><span class="line">│   ├── 00000000000000000000.index
</span><span class="line">│   └── 00000000000000000000.log
</span><span class="line">└── replication-offset-checkpoint
</span><span class="line">
</span><span class="line">6 directories, 15 files
</span></code></pre></td></tr></table></div></figure></notextile></div>
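<p>As the listing shows, each replica lives in a directory named after its topic and partition.  The following bash sketch
splits such a directory name back into its two parts.  The helper name <code>parse_partition_dir</code> is hypothetical, and the
sketch assumes the naming pattern seen in the listing above.</p>

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a partition directory name such as
# "zerg.hydra-0" into topic and partition number.
parse_partition_dir() {
  local name
  name=$(basename "$1")
  # Everything before the last "-" is the topic, everything after it is
  # the partition number.
  echo "topic=${name%-*} partition=${name##*-}"
}

parse_partition_dir /tmp/kafka-logs-1/zerg.hydra-0
# topic=zerg.hydra partition=0
```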

<p><strong>Caveat</strong>: Deleting a topic via <code>bin/kafka-topics.sh --delete</code> will apparently not delete the corresponding local
files for that topic.  I am not sure whether this behavior is expected or not.</p>
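<p>If you want to check for such leftover files yourself, a small bash sketch like the following may help.  The function
<code>list_topic_dirs</code> is a made-up helper, and the log directories are the ones used in this article; adjust them to
your own <code>log.dir</code> settings.  It only lists directories and deletes nothing.</p>

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the local partition directories of a topic.
# Usage: list_topic_dirs TOPIC LOG_DIR...
list_topic_dirs() {
  local topic=$1
  shift
  local log_dir dir name
  for log_dir in "$@"; do
    for dir in "$log_dir/$topic"-*; do
      name=$(basename "$dir")
      if [ -d "$dir" ]; then
        # Skip other topics that merely share the same prefix.
        if [ "${name%-*}" = "$topic" ]; then
          echo "$dir"
        fi
      fi
    done
  done
}

list_topic_dirs zerg.hydra /tmp/kafka-logs-1 /tmp/kafka-logs-2 /tmp/kafka-logs-3
```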

<h2 id="start-a-producer">Start a producer</h2>

<p>Start a console producer in <code>sync</code> mode:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start a console producer in sync mode  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --sync <span class="se">\</span>
</span><span class="line">    --topic zerg.hydra
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Example producer output:</p>

<pre><code>[...] INFO Verifying properties (kafka.utils.VerifiableProperties)
[...] INFO Property broker.list is overridden to localhost:9092,localhost:9093,localhost:9094 (...)
[...] INFO Property compression.codec is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property key.serializer.class is overridden to kafka.serializer.StringEncoder (...)
[...] INFO Property producer.type is overridden to sync (kafka.utils.VerifiableProperties)
[...] INFO Property queue.buffering.max.messages is overridden to 10000 (...)
[...] INFO Property queue.buffering.max.ms is overridden to 1000 (kafka.utils.VerifiableProperties)
[...] INFO Property queue.enqueue.timeout.ms is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.required.acks is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.timeout.ms is overridden to 1500 (kafka.utils.VerifiableProperties)
[...] INFO Property send.buffer.bytes is overridden to 102400 (kafka.utils.VerifiableProperties)
[...] INFO Property serializer.class is overridden to kafka.serializer.StringEncoder (...)
</code></pre>

<p>You can now enter new messages, one per line.  Here we enter two messages “Hello, world!” and “Rock: Nerf Paper. 
Scissors is fine.”:</p>

<pre><code>Hello, world!
Rock: Nerf Paper. Scissors is fine.
</code></pre>
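<p>The console producer reads its messages from standard input, so you can also feed it from a script instead of typing.
The bash sketch below wraps the same command line as above.  The function name <code>send_messages</code> and the
<code>KAFKA_PRODUCER</code> variable are assumptions made for this example; they are not part of Kafka.</p>

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: send each argument as one message (one per line).
# KAFKA_PRODUCER is an assumed override for the path of the producer script.
# Usage: send_messages TOPIC MESSAGE...
send_messages() {
  local topic=$1
  shift
  printf '%s\n' "$@" | "${KAFKA_PRODUCER:-bin/kafka-console-producer.sh}" \
    --broker-list localhost:9092,localhost:9093,localhost:9094 --sync \
    --topic "$topic"
}

# Example (run from the Kafka installation directory):
#   send_messages zerg.hydra "Hello, world!" "Rock: Nerf Paper. Scissors is fine."
```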

<p>After the messages are produced, you should see the data being replicated to the log directories of the three
broker instances, i.e. <code>/tmp/kafka-logs-{1,2,3}/zerg.hydra-*/</code>.</p>

<h2 id="start-a-consumer">Start a consumer</h2>

<p>Start a console consumer that reads messages in <code>zerg.hydra</code> <em>from the beginning</em> (in a production setting you 
would usually NOT want to add the <code>--from-beginning</code> option):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Start a console consumer  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic zerg.hydra --from-beginning
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The consumer will see a new message whenever you enter a message in the producer above.</p>

<p>Example consumer output:</p>

<pre><code>&lt;snipp&gt;
[...] INFO [console-consumer-28434_panama.local-1363174829799-954ed29e], Connecting to zookeeper instance at localhost:2181 ...
[...] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[...] INFO Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT ...
[...] INFO Client environment:host.name=192.168.0.153 (org.apache.zookeeper.ZooKeeper)
&lt;snipp&gt;
[...] INFO Fetching metadata with correlation id 0 for 1 topic(s) Set(zerg.hydra) (kafka.client.ClientUtils$)
[...] INFO Connected to 192.168.0.153:9092 for producing (kafka.producer.SyncProducer)
[...] INFO Disconnecting from 192.168.0.153:9092 (kafka.producer.SyncProducer)
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-3], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 2, initOffset -1 to broker 3 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-2], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 1, initOffset -1 to broker 2 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-1], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 0, initOffset -1 to broker 1 with fetcherId 0 ...
</code></pre>

<p>And at the end of the output you will see the following messages:</p>

<pre><code>Hello, world!
Rock: Nerf Paper. Scissors is fine.
</code></pre>

<p>That’s it!</p>

<h1 id="a-note-when-using-kafka-with-storm">A note when using Kafka with Storm</h1>

<p>The maximum parallelism you can have on a KafkaSpout is the number of partitions of the corresponding Kafka topic.  The
following question-answer thread (I slightly modified the original text for clarification purposes) is from the Storm
user mailing list, but supposedly refers to Kafka pre-0.8 and thereby before the replication feature was added:</p>

<blockquote><p><strong>Question</strong>: Suppose the number of Kafka partitions per broker is configured as 1 and the number of hosts is 2.  If we set the spout parallelism as 10, then how does Storm handle the difference between the number of Kafka partitions and the number of spout tasks?  Since there are only 2 partitions, does every other spout task (greater than first 2) not read the data or do they read the same data?</p><p><strong>Answer (by Nathan Marz)</strong>: The remaining 8 (= 10 - 2) spout tasks wouldn&#8217;t read any data from the Kafka topic.</p><footer><strong>Relationship between spout parallelism and number of Kafka partitions</strong> <cite><a href="https://groups.google.com/forum/?fromgroups=#!topic/storm-user/mBA1e6Y1MYY">groups.google.com/forum/&hellip;</a></cite></footer></blockquote>

<p>My current understanding is that the <em>number of partitions</em> (i.e. regardless of replicas) is still the limiting factor
for the parallelism of a KafkaSpout.  Why?  Because
<a href="https://cwiki.apache.org/KAFKA/kafka-replication.html">Kafka does not allow consumers to read from replicas other than the (replica of the) leader of a partition</a>,
which simplifies concurrent access to data in Kafka.</p>
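<p>The arithmetic behind the quoted answer is simple enough to sketch in bash.  The helper <code>spout_utilization</code> is
hypothetical and only encodes the rule of thumb described above: at most one active spout task per partition, with any
surplus tasks sitting idle.</p>

```bash
#!/usr/bin/env bash
# Hypothetical helper encoding the rule of thumb from the text:
# the number of active spout tasks is min(tasks, partitions).
# Usage: spout_utilization PARTITIONS SPOUT_TASKS
spout_utilization() {
  local partitions=$1 tasks=$2
  local active=$tasks
  if [ "$tasks" -gt "$partitions" ]; then
    active=$partitions
  fi
  echo "active=$active idle=$((tasks - active))"
}

spout_utilization 2 10   # the quoted example, prints: active=2 idle=8
```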

<h1 id="a-note-when-using-kafka-with-hadoop">A note when using Kafka with Hadoop</h1>

<p>LinkedIn has published their Kafka-&gt;HDFS pipeline named <a href="https://github.com/linkedin/camus">Camus</a>.  It is a MapReduce job
that does distributed data loads out of Kafka.</p>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<p>The following documents provide plenty of information about Kafka that goes way beyond what I covered in this article:</p>

<ul>
  <li><a href="http://kafka.apache.org/design.html">Kafka Design</a></li>
  <li><a href="http://kafka.apache.org/configuration.html">Kafka Configuration</a></li>
  <li><a href="https://cwiki.apache.org/KAFKA/kafka-08-quick-start.html">Kafka 0.8 Quickstart</a></li>
  <li><a href="https://cwiki.apache.org/confluence/display/KAFKA/Operations">Kafka Operations</a> – how to run Kafka in production</li>
  <li><a href="https://cwiki.apache.org/KAFKA/kafka-replication.html">Kafka Replication</a> including
<a href="https://cwiki.apache.org/KAFKA/kafka-detailed-replication-design-v3.html">Detailed Replication Design v3 for Kafka 0.8</a></li>
  <li><a href="https://cwiki.apache.org/KAFKA/writing-a-driver-for-kafka.html">Writing a Driver for Kafka</a> – includes information
about topics, partitions, Kafka’s “log” files and message objects</li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bootstrapping a Java project with Gradle, TestNG, Mockito and Cobertura for Eclipse and Jenkins]]></title>
    <link href="http://www.michael-noll.com/blog/2013/01/25/bootstrapping-a-java-project-with-gradle/"/>
    <updated>2013-01-25T18:59:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/01/25/bootstrapping-a-java-project-with-gradle</id>
    <content type="html"><![CDATA[<p>When starting out with a fresh Java project one of the nuisances you have to deal with is setting up your build and
test environment.  It’s even more troublesome if you are trying to switch from Maven to Gradle for your builds.  In
this article I will provide you with a bootstrap Java project that is backed by Gradle, TestNG, Mockito, FEST-Assert 2
and Cobertura.  It took me quite a while to wire everything together and fix issues such as hidden dependencies and
early-bird Cobertura support, so I hope you find this information useful.  The included example code illustrates how
each of the previously mentioned packages is used.  Lastly I will cover how the bootstrap project integrates with
Eclipse and Jenkins.</p>

<!-- more -->

<h1 id="software-used-in-the-bootstrap-project">Software used in the bootstrap project</h1>

<ul>
  <li><a href="http://www.gradle.org/">Gradle</a> version 1.3 – build tool</li>
  <li><a href="http://testng.org/">TestNG</a> version 6.8 – unit testing framework</li>
  <li><a href="http://code.google.com/p/mockito/">Mockito</a> version 1.9.0 – mocking framework</li>
  <li><a href="https://github.com/alexruiz/fest-assert-2.x">FEST-Assert 2</a> version 2.0M8 – fluent interface for assertions that
allows you to write assertions that read more like natural language (unfortunately Java lacks something like the
awesome <a href="http://www.scalatest.org/">ScalaTest</a> framework)</li>
  <li><a href="https://github.com/Mapvine/gradle-cobertura-plugin">Cobertura plugin for Gradle</a> version 1.0 – allows Gradle to
generate Cobertura compatible test reports (mostly used for integrating test results with Jenkins)</li>
</ul>

<p>Packages only used for showcasing the functionality:</p>

<ul>
  <li><a href="http://code.google.com/p/guava-libraries/">Google Guava</a> version 13 – solely used to show how compile-time
dependencies are configured in Gradle</li>
</ul>

<p>The latest dependency information is always available in 
<a href="https://github.com/miguno/gradle-testng-mockito-bootstrap/blob/master/build.gradle">build.gradle</a> on GitHub.</p>

<h1 id="what-we-want-to-do">What we want to do</h1>

<p>We have two complementary goals:</p>

<ol>
  <li>You should be able to use <a href="http://www.eclipse.org/">Eclipse</a> as the IDE of choice to work with the code (e.g. run
the build and tests locally on your machine).</li>
  <li>You should be able to integrate the code with the <a href="http://jenkins-ci.org/">Jenkins</a> continuous integration server
(e.g. to let it run the build and tests for your team and publish the test results).</li>
</ol>

<p>The first goal covers your personal workflow as a software engineer with the code.  The second goal covers integrating
the code with your engineering team as a whole.</p>

<h1 id="how-to-use-the-bootstrap-project">How to use the bootstrap project</h1>

<ol>
  <li>Download the bootstrap project as described below.</li>
  <li>Configure Eclipse (optional).</li>
  <li>Configure Jenkins CI server (optional).</li>
  <li>Hack away!</li>
</ol>

<h1 id="download">Download</h1>

<p>You have the following two options to start your own project with the bootstrap project.</p>

<div class="software-link">
Homepage on GitHub: <a href="https://github.com/miguno/gradle-testng-mockito-bootstrap">https://github.com/miguno/gradle-testng-mockito-bootstrap</a>
</div>

<h2 id="option-1-you-do-not-have-a-github-account----clone-the-bootstrap-project">Option 1: You do not have a GitHub account – clone the bootstrap project</h2>

<p>If you don’t have a GitHub account, the simplest way is to just clone (i.e. checkout) the original bootstrap project.
The only requirement is a local installation of <a href="http://git-scm.com/">git</a> on your machine.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ git clone git://github.com/miguno/gradle-testng-mockito-bootstrap.git</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Now you can start hacking away!</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ cd gradle-testng-mockito-bootstrap
</span><span class="line"># ...start coding...</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="option-2-you-do-have-a-github-account-----fork-the-bootstrap-project">Option 2: You do have a GitHub account  – fork the bootstrap project</h2>

<p>If you do have a GitHub account, I recommend that you fork the bootstrap project.  Then start writing your own code
against your personal fork.</p>

<p>First, open the <a href="https://github.com/miguno/gradle-testng-mockito-bootstrap">bootstrap project on GitHub</a> and fork it.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-fork.png" title="Fork the Gradle Bootstrap project on GitHub" /></p>

<p>Then clone (i.e. checkout) your personal fork.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ git clone git@github.com:YOURUSERNAME/gradle-testng-mockito-bootstrap.git</span></code></pre></td></tr></table></div></figure></notextile></div>
<div class="note">
Note: Make sure to replace &#8220;YOURUSERNAME&#8221; in the git URL above with the name of your GitHub user account.
</div>

<p>Now you can start hacking away!</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ cd gradle-testng-mockito-bootstrap
</span><span class="line"># ...start coding...</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="about-the-actual-java-code-in-the-bootstrap-project">About the actual Java code in the bootstrap project</h1>

<p>The bootstrap project ships with only two classes:</p>

<ul>
  <li><a href="https://github.com/miguno/gradle-testng-mockito-bootstrap/blob/master/src/main/java/com/miguno/bootstrap/gtm/BobRoss.java">BobRoss.java</a>
– A simple class that implements a few features that we can write unit tests for.  We pretend to be the late 
painting instructor <a href="http://en.wikipedia.org/wiki/Bob_Ross">Bob Ross</a> who, well, is painting a picture with us.</li>
  <li><a href="https://github.com/miguno/gradle-testng-mockito-bootstrap/blob/master/src/test/java/com/miguno/bootstrap/gtm/BobRossTest.java">BobRossTest.java</a>
– This class tests the former class.  It illustrates the use of TestNG, Mockito and FEST-Assert 2 to write these
unit tests.  Don’t pay too much attention to the semantics of the actual tests; we’re just showcasing here.</li>
</ul>

<h1 id="using-gradle-on-the-command-line">Using Gradle on the command line</h1>

<h2 id="installing-gradle">Installing Gradle</h2>

<p>Before continuing you must <a href="http://www.gradle.org/downloads">download and install Gradle</a> if you haven’t done so already.</p>

<p>If you are on a Mac and have the <a href="http://mxcl.github.com/homebrew/">Homebrew</a> package manager installed, you just need to
run:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ brew install gradle</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="command-examples">Command Examples</h2>

<p>The bootstrap project is a normal Gradle project.  Have a look at the
<a href="http://www.gradle.org/documentation">Gradle documentation</a> to see what this allows you to do.  I will only list the most
important commands here.  If you want to see which Gradle tasks are available out of the box in the bootstrap project,
run <code>gradle tasks</code>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># General commands
</span><span class="line">$ gradle clean          # deletes the build directory
</span><span class="line">$ gradle clean test     # runs the unit tests (and compiles first if needed)
</span><span class="line">$ gradle clean build    # assembles and tests this project
</span><span class="line">
</span><span class="line"># Eclipse related
</span><span class="line">$ gradle cleanEclipse   # cleans all Eclipse files
</span><span class="line">$ gradle eclipse        # generates all Eclipse files</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>By default, executing the commands above will create output in the following locations:</p>

<ul>
  <li><code>build/</code> – this sub-directory is used by Gradle</li>
  <li><code>build/reports/cobertura/main/coverage.xml</code> – Cobertura test coverage report in XML format</li>
  <li><code>build/reports/tests/testng-results.xml</code> – TestNG Results in XML format</li>
  <li><code>bin/</code> – this sub-directory is used by Eclipse</li>
</ul>
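<p>If you want to verify that a build actually produced the two report files listed above (for instance before pointing
Jenkins at them), a small bash sketch like this one could do the job.  The function name <code>check_reports</code> is made up
for this example; the paths are the default locations listed above.</p>

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that the default report files exist.
# Usage: check_reports [PROJECT_DIR]   (defaults to the current directory)
check_reports() {
  local project_dir=${1:-.}
  local missing=0 report
  for report in \
    build/reports/cobertura/main/coverage.xml \
    build/reports/tests/testng-results.xml; do
    if [ -f "$project_dir/$report" ]; then
      echo "found:   $report"
    else
      echo "missing: $report"
      missing=1
    fi
  done
  # Non-zero exit status if at least one report is missing.
  return "$missing"
}

# Example: run "gradle clean build" first, then:
#   check_reports .
```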

<p>Feel free to browse the directory tree to find additional files that you might need.</p>

<h1 id="configuring-eclipse">Configuring Eclipse</h1>

<h2 id="importing-the-bootstrap-project">Importing the bootstrap project</h2>

<p>The following steps will import your local clone of the bootstrap project into Eclipse.</p>

<div class="note">
Note: Yes, there is a Gradle plugin for Eclipse (don&#8217;t confuse it with the Eclipse plugin for Gradle in <tt>build.gradle</tt>, imported via <tt>apply plugin: &#8216;eclipse&#8217;</tt>).  However in my personal experience it was not working that well, oftentimes reporting build errors when everything was actually fine.  I found the approach described below much more stable and reliable.  But your mileage may vary.
</div>

<p>First, you must generate the required project files for Eclipse via Gradle:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ gradle cleanEclipse eclipse</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then you import the bootstrap project into Eclipse as follows.</p>

<p>Open <code>File &gt; Import...</code>:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-import-01.png" title="Import the bootstrap project in Eclipse, step 1" /></p>

<p>Select <code>General &gt; Existing Projects into Workspace</code>:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-import-02.png" title="Import the bootstrap project in Eclipse, step 2" /></p>

<p>Navigate to your local copy of the bootstrap project.  In the dialogue window you can leave the other values at their
defaults and just click <code>Finish</code>.  Of course feel free to modify the default values as you see fit (e.g. to add the
project to a working set of your choice).</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-import-03.png" title="Import the bootstrap project in Eclipse, step 3" /></p>

<p>Now you should see the bootstrap project in the Package Explorer of Eclipse:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-import-04.png" title="Import the bootstrap project in Eclipse, step 4" /></p>

<h2 id="installing-the-testng-plugin-for-eclipse">Installing the TestNG plugin for Eclipse</h2>

<p>You will need the TestNG plugin for Eclipse so that you can conveniently run the included unit tests from within
Eclipse.</p>

<p>To install the plugin open <code>Help &gt; Eclipse Marketplace...</code> in Eclipse.  Search for “testng” and then install the
“TestNG for Eclipse” plugin by Cédric Beust.  Make sure to restart Eclipse after installing the plugin.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-testng-plugin.png" title="Installing the TestNG plugin for Eclipse" /></p>

<div class="caption">
Installing the TestNG plugin for Eclipse.  Note that the screenshot above actually shows an &#8220;Uninstall&#8221; button &#8211; this is only because on my machine the plugin is already installed.
</div>

<p>Now you can run the TestNG unit tests, for instance, by right-clicking on the <code>BobRossTest</code> class in the Package
Explorer and selecting <code>Run as... &gt; TestNG Test</code>.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-testng-plugin-running-tests.png" title="Running TestNG tests from Eclipse" /></p>

<p>If you give it a go, you should see the following results in the TestNG <em>view</em> in Eclipse.  If you don’t find the
TestNG view in your Eclipse, make sure it is enabled via <code>Window &gt; Show View &gt; TestNG</code>.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-eclipse-testng-plugin-test-results.png" title="TestNG test results in Eclipse" /></p>

<div class="caption">Results of a TestNG run in Eclipse as shown in the TestNG view</div>

<h1 id="configuring-jenkins">Configuring Jenkins</h1>

<p>This section assumes that you are familiar with Jenkins and have a Jenkins instance installed and running.</p>

<h2 id="installing-required-jenkins-plugins">Installing required Jenkins plugins</h2>

<p>Install the following Jenkins plugins via <code>Jenkins &gt; Manage Jenkins &gt; Manage Plugins</code>:</p>

<ul>
  <li><a href="https://wiki.jenkins-ci.org/display/JENKINS/Cobertura+Plugin">Cobertura plugin for Jenkins</a></li>
  <li><a href="https://wiki.jenkins-ci.org/display/JENKINS/Gradle+Plugin">Gradle plugin for Jenkins</a></li>
  <li><a href="https://wiki.jenkins-ci.org/display/JENKINS/testng-plugin">TestNG plugin for Jenkins</a></li>
  <li><a href="https://wiki.jenkins-ci.org/display/JENKINS/Git+Plugin">Git plugin for Jenkins</a></li>
</ul>

<h2 id="creating-a-new-jenkins-build-job">Creating a new Jenkins build job</h2>

<p>Now you can add a new Jenkins build job via <code>Jenkins &gt; New Job</code>.</p>

<ul>
  <li>Select “Build a free-style software project” and give your project a name.  Click <code>Ok</code>.</li>
  <li>In section “Source Code Management” select <code>Git</code> and enter your repository URL.</li>
  <li>In section “Build” click on “Add build step” and select “Invoke Gradle script”.
    <ul>
      <li>Enter <code>clean build javadoc</code> in the “Tasks” field.</li>
    </ul>
  </li>
  <li>In section “Post-build Actions” click on “Add post-build action” and select “Publish Cobertura Coverage Report”.
    <ul>
      <li>Enter <code>**/build/reports/cobertura/main/coverage.xml</code> in the “Cobertura xml report pattern” field.</li>
    </ul>
  </li>
  <li>In section “Post-build Actions” click again on “Add post-build action” and select “Publish TestNG Reports”.
    <ul>
      <li>Enter <code>./build/reports/tests/testng-results.xml</code> in the “TestNG XML report pattern” field.</li>
    </ul>
  </li>
  <li>Make sure you click on the <code>Save</code> button at the very bottom to actually save your new Jenkins build job!</li>
</ul>

<div class="note">
Note: I will not cover how to integrate Jenkins with your git repository in this article.  That said, it is pretty straightforward to configure Jenkins to use a self-hosted git repository or your GitHub repository &#8211; such as the fork of the bootstrap project that you might have created above if you picked download option 2 (see the StackOverflow thread <a href="http://stackoverflow.com/questions/5212304/authenticate-jenkins-ci-for-github-private-repository">Authenticate Jenkins CI for Github private repository</a>, for instance).  This might be a topic of a future blog post.
</div>

<p>Now you can start using Jenkins to execute the build and run the tests of your bootstrap project!</p>

<p>Here are some screenshots of what it will look like:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-jenkins-ui.png" title="Bootstrap project configured in Jenkins" /></p>

<div class="caption">The bootstrap project configured in Jenkins.</div>

<p><img src="http://www.michael-noll.com/blog/uploads/gradle-bootstrap-jenkins-ui-testng.png" title="Bootstrap project configured in Jenkins" /></p>

<div class="caption">The TestNG test report in Jenkins.</div>

<h2 id="what-else-for-jenkins">What else for Jenkins?</h2>

<p>You can also configure Jenkins to fetch the latest contents of your GitHub code repository when you run a new build
(or even trigger a build automatically when a new commit is pushed to the repository).  I will not cover this setup
in this article though.</p>

<h1 id="summary">Summary</h1>

<p>This bootstrap project should get you started quickly with your own Java code project backed by Gradle, TestNG &amp; Co.
If you come up with any improvements, feel free to write a comment here or to send me a pull request on the
<a href="https://github.com/miguno/gradle-testng-mockito-bootstrap">bootstrap GitHub repository</a>.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Implementing Real-Time Trending Topics With A Distributed Rolling Count Algorithm in Storm]]></title>
    <link href="http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/"/>
    <updated>2013-01-18T12:56:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm</id>
    <content type="html"><![CDATA[<p>A common pattern in real-time data workflows is performing rolling counts of incoming data points, also known as
sliding window analysis.  A typical use case for rolling counts is identifying trending topics in a user community –
such as on <a href="https://twitter.com">Twitter</a> – where a topic is considered trending when it has been among the top N
topics in a given window of time.  In this article I will describe how to implement such an algorithm in a distributed
and scalable fashion using the <a href="http://storm-project.net/">Storm</a> real-time data processing platform.  The same code
can also be used in other areas such as infrastructure and security monitoring.</p>

<!-- more -->

<h1 id="about-trending-topics-and-sliding-windows">About Trending Topics and Sliding Windows</h1>

<p>First, let me explain what I mean by “trending topics” so that we have a common understanding.  Here is an explanation
taken from Wikipedia:</p>

<blockquote><p><strong>Trending topics</strong></p><p>A word, phrase or topic that is tagged at a greater rate than other tags is said to be a trending topic.  Trending topics become popular either through a concerted effort by users or because of an event that prompts people to talk about one specific topic.  These topics help Twitter and their users to understand what is happening in the world.</p><footer><strong>Wikipedia page on Twitter</strong> <cite><a href="http://en.wikipedia.org/wiki/Twitter#Trending_topics">en.wikipedia.org/wiki/&hellip;</a></cite></footer></blockquote>

<p>In other words, it is a measure of “What’s hot?” in a user community.  Typically, you are interested in trending topics
for a given <em>time span</em>; for instance, the most popular topics in the past five minutes or the current day.  So
the question “What’s hot?” is more precisely stated as “What’s hot <em>today</em>?” or “What’s hot <em>this week</em>?”.</p>

<p>In this article we assume we have a system that uses the Twitter API to pull the latest tweets from the live Twitter
stream.  We assume further that we have a mechanism in place that extracts topical information in the form of words from
those tweets.  For instance, we could opt to use a simple pattern matching algorithm that treats #hashtags in tweets as
topics.  Here, we would consider a tweet such as</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">@miguno The #Storm project rocks for real-time distributed #data processing!</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>to “mention” the topics</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">storm
</span><span class="line">data</span></code></pre></td></tr></table></div></figure></notextile></div>
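<p>As a minimal sketch of such a pattern matching step, the following Java snippet extracts hashtags with a regular
expression.  The class name <code>HashtagTopics</code>, the method <code>extract</code> and the choice to lower-case
topics are assumptions made for illustration only; they are not part of storm-starter.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of storm-starter: treats #hashtags as topics.
final class HashtagTopics {

    // A '#' followed by one or more word characters (letters, digits, underscore).
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    private HashtagTopics() {
    }

    // Returns the lower-cased hashtags of a tweet in order of appearance.
    static List<String> extract(String tweet) {
        List<String> topics = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(tweet);
        while (matcher.find()) {
            topics.add(matcher.group(1).toLowerCase(Locale.ROOT));
        }
        return topics;
    }

    public static void main(String[] args) {
        String tweet = "@miguno The #Storm project rocks for real-time distributed #data processing!";
        System.out.println(extract(tweet)); // prints [storm, data]
    }
}
```

<p>Note that the user mention <code>@miguno</code> is ignored because only tokens starting with <code>#</code> match.</p>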

<p>We design our system so that it considers topic A more popular than topic B (for a given time span) if topic A has been
mentioned more often in tweets than topic B.  This means we only need to <em>count</em> the number of occurrences of topics in
tweets.</p>

<script type="math/tex; mode=display">
popularity(A) \ge popularity(B) \iff mentions(A) \ge mentions(B)
</script>

<p>For the context of this article we do not care how the topics are actually derived from user content or user activities
as long as the derived topics are represented as textual words.  Then, the Storm topology described in this article
will be able to identify in real-time the <em>trending</em> topics in this input data using a time-sensitive rolling count
algorithm (rolling counts are also known as <em>sliding windows</em>) coupled with a ranking step.  The former aspect takes
care of filtering user input by time span, the latter of ranking the most trendy topics at the top of the list.</p>

<p>Eventually we want our Storm topology to periodically produce the top N of trending topics similar to the following
example output, where <em>t0</em> to <em>t2</em> are different points in time:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">Rank @ t0   -----&gt;   t1   -----&gt;   t2
</span><span class="line">---------------------------------------------
</span><span class="line">1.    java   (33)   ruby   (41)   scala  (32)
</span><span class="line">2.    php    (30)   scala  (28)   python (29)
</span><span class="line">3.    scala  (21)   java   (27)   ruby   (24)
</span><span class="line">4.    ruby   (16)   python (21)   java   (21)
</span><span class="line">5.    python (15)   php    (14)   erlang (18)</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In this example we can see that over time “scala” has become the hottest trending topic.</p>
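<p>To make the count-then-rank idea concrete, here is a small single-process sketch in plain Java.  It is not the
distributed Storm implementation; the class <code>TopNTopics</code> and its methods are hypothetical names, and breaking
ties alphabetically is an assumption of this sketch.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process sketch: count topic mentions, then rank them.
final class TopNTopics {

    private TopNTopics() {
    }

    // Counts how often each topic occurs in the given list of mentions.
    static Map<String, Long> countMentions(List<String> mentions) {
        Map<String, Long> counts = new HashMap<>();
        for (String topic : mentions) {
            Long current = counts.get(topic);
            counts.put(topic, current == null ? 1L : current + 1L);
        }
        return counts;
    }

    // Returns up to n topics ordered by descending count; ties are broken alphabetically.
    static List<String> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> mentions = Arrays.asList("scala", "java", "scala", "ruby", "scala", "java");
        System.out.println(topN(countMentions(mentions), 2)); // prints [scala, java]
    }
}
```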

<h2 id="sliding-windows">Sliding Windows</h2>

<p>The last background aspect I want to cover is sliding windows aka rolling counts.  A picture is worth a thousand words:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_sliding-window.png" /></p>

<div class="caption">
Figure 1: As the sliding window advances, the slice of its input data changes.  In the example above the algorithm uses
the current sliding window data to compute the sum of the window&#8217;s elements.
</div>

<p>A formula might also be worth a bunch of words – ok, ok, maybe not a full <em>thousand</em> of them – so mathematically
speaking we could formalize such a sliding-window sum algorithm as follows:</p>

<script type="math/tex; mode=display">
\text{m-sized rolling sum} = \sum_{i=t}^{t+m-1} element(i)
</script>

<p>where <em>t</em> continually advances (most often with time) and <em>m</em> is the window size.</p>

<p><em>From size to time</em>: If the window is advanced with time, say every <em>N</em> minutes, then the individual elements in the
input represent data collected over the same interval of time (here: <em>N</em> minutes).  In that case the window size is
equivalent to <em>N x m</em> minutes.  Simply speaking, if <em>N=1</em> and <em>m=5</em>, then our sliding window algorithm emits the latest
five-minute aggregates every minute.</p>
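<p>A rolling sum over a window of <em>m</em> elements can be sketched in a few lines of Java.  The snippet below is only an
illustration under the assumption that each array element holds the count collected during one interval (e.g. one
minute); the class name <code>RollingSum</code> and the sample data are made up.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an m-sized rolling sum over per-interval counts.
final class RollingSum {

    private RollingSum() {
    }

    // For every start index t with a full window, returns element(t) + ... + element(t + m - 1).
    static List<Long> rollingSums(long[] elements, int m) {
        if (m <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        List<Long> sums = new ArrayList<>();
        long windowSum = 0;
        for (int i = 0; i < elements.length; i++) {
            windowSum += elements[i];          // element entering the window
            if (i >= m) {
                windowSum -= elements[i - m];  // element leaving the window
            }
            if (i >= m - 1) {
                sums.add(windowSum);           // the window ending at index i is complete
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] perMinuteCounts = {5, 2, 8, 1, 4, 3, 7};
        // With N=1 minute per element and m=5 this yields the five-minute aggregates.
        System.out.println(rollingSums(perMinuteCounts, 5)); // prints [20, 18, 23]
    }
}
```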

<p>Now that we have introduced <em>trending topics</em> and <em>sliding windows</em> we can finally start talking about writing code for
Storm that implements all this in practice – large-scale, distributed, in real time.</p>

<h1 id="before-we-start">Before We Start</h1>

<h2 id="about-storm-starter">About storm-starter</h2>

<p>The <a href="https://github.com/nathanmarz/storm-starter">storm-starter</a> project on GitHub provides example implementations of various
<a href="https://github.com/nathanmarz/storm">Storm</a> real-time data processing topologies such as a simple streaming WordCount
algorithm.  It also includes a Rolling Top Words topology that can be used for computing trending topics, the purpose of
which is exactly what I want to cover in this article.</p>

<p>When I began to tackle trending topic analysis with Storm I expected that I could re-use most if not all of the Rolling
Top Words code in <code>storm-starter</code>.  But I soon realized that the old code would need some serious redesigning
and refactoring before one could actually use it in a real-world environment – including being able to efficiently
maintain and augment the code in a team of engineers across release cycles.</p>

<p>In the next section I will briefly summarize the state of the Rolling Top Words topology before and after my
refactoring to highlight some important changes and things to consider when writing your own Storm code.  Then I will
continue with covering the most important aspects of the new implementation in further detail.  And of course I
<a href="https://github.com/nathanmarz/storm-starter/pull/27">contributed the new implementation back</a> to the Storm project.</p>

<h2 id="the-old-code-and-my-goals-for-the-new-code">The Old Code and My Goals for the New Code</h2>

<div class="note">
Just to be absolutely clear here: I am talking about the defects of the old code to highlight some typical pitfalls during
software development for a distributed system such as Storm.  My intention is to make other developers aware of these
gotchas so that we make fewer mistakes in our profession.  I am by no means implying that the authors of the old code
did a bad job (after all, the old code was perfectly adequate to get me started with trending topics in Storm) or that
the new implementation I came up with is the pinnacle of coding. :-)
</div>

<p>My initial reaction to the old code was that, frankly speaking, I had no idea what it was doing or how it did its job.  The
various logical responsibilities of the code were mixed together in the existing classes, clearly not abiding by the
<a href="http://en.wikipedia.org/wiki/Single_responsibility_principle">Single Responsibility Principle</a>.  And I am not talking
about academic treatments of SRP and such – I was hands-down struggling to wrap my head around the old code because of
this.</p>

<p><span class="pullquote-right" data-pullquote="In practice this dirty-write bug in the old rolling count implementation caused data corruption, i.e. the code was not carrying out its main responsibility correctly &#8211; that of counting objects.">
Also, I noticed a few <code>synchronized</code> statements and threads being launched manually, hinting at additional parallel
operations beyond what the Storm framework natively provides you with.  Here, I was particularly concerned with those
functionalities that interacted with the system time (calls to <code>System.currentTimeMillis()</code>).  I couldn’t help the
feeling that they looked prone to concurrency issues.  And my suspicions were eventually confirmed when I discovered a
dirty-write bug in the <code>RollingCountObjects</code> bolt code for the slot-based counting (using <code>long[]</code>) of object
occurrences.  In practice this dirty-write bug in the old rolling count implementation caused data corruption, i.e.
the code was not carrying out its main responsibility correctly – that of counting objects. That said I’d argue
that it would not have been trivial to spot this error in the old code prior to refactoring (where it was eventually
plain to see), so please don’t think it was just negligence on the part of the original authors.  With the new tick
tuple feature in Storm 0.8 I was feeling confident that this part of the code could be significantly simplified and
fixed.
</span></p>

<p>In general I figured that completely refactoring the code and untangling these responsibilities would not only make the
code more approachable and readable for me and others – after all the
<a href="https://github.com/nathanmarz/storm-starter">storm-starter</a>  code’s main purpose is to jumpstart Storm beginners –
but it would also allow me to write meaningful unit tests, which would have been very difficult to do with the old code.</p>

<table>
    <tr>
        <th>What</th>
        <th>Before refactoring</th>
        <th>After refactoring</th>
    </tr>
    <tr>
        <td>Storm Bolts</td>
        <td>
            <a href="https://github.com/nathanmarz/storm-starter/blob/6b1384b8c584df639fd6b5ac25627ee9a3b1bfd6/src/jvm/storm/starter/bolt/RollingCountObjects.java">RollingCountObjects</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/6b1384b8c584df639fd6b5ac25627ee9a3b1bfd6/src/jvm/storm/starter/bolt/RankObjects.java">RankObjects</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/6b1384b8c584df639fd6b5ac25627ee9a3b1bfd6/src/jvm/storm/starter/bolt/MergeObjects.java">MergeObjects</a>
        </td>
        <td>
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/bolt/RollingCountBolt.java">RollingCountBolt</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/bolt/IntermediateRankingsBolt.java">IntermediateRankingsBolt</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/bolt/TotalRankingsBolt.java">TotalRankingsBolt</a>
        </td>
    </tr>
    <tr>
        <td>Storm Spouts</td>
        <td>
            <a href="https://github.com/nathanmarz/storm/blob/master/src/jvm/backtype/storm/testing/TestWordSpout.java">TestWordSpout</a>
        </td>
        <td>
            <a href="https://github.com/nathanmarz/storm/blob/master/src/jvm/backtype/storm/testing/TestWordSpout.java">TestWordSpout</a> (not modified)
        </td>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>-</td>
        <td>
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/tools/SlotBasedCounter.java">SlotBasedCounter</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/tools/SlidingWindowCounter.java">SlidingWindowCounter</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/tools/Rankings.java">Rankings</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/tools/Rankable.java">Rankable</a>,
            <a href="https://github.com/nathanmarz/storm-starter/blob/f5bdc720f50a0c46e90f0085c10217f2a6a3249f/src/jvm/main/storm/starter/tools/RankableObjectWithFields.java">RankableObjectWithFields</a>
        </td>
    </tr>
    <tr>
        <td>Unit Tests</td>
        <td>-</td>
        <td>Every class has its own suite of tests.</td>
    </tr>
    <tr>
        <td>Additional Notes</td>
        <td>
            Uses manually launched background threads instead of native Storm features to execute periodic
            activities.
        </td>
        <td>Uses new tick tuple feature in Storm 0.8 to trigger periodic activities in Storm components.</td>
    </tr>
</table>

<div class="caption">
Table 1: The state of the trending topics Storm implementation before and after the refactoring.
</div>

<p>The design and implementation that I will describe in the following sections are the result of a number of refactoring
iterations.  I started with smaller code changes that served me primarily to understand the existing code better (e.g.
more meaningful variable names, splitting long methods into smaller logical units).  The more I felt comfortable
the more I started to introduce substantial changes.  Unfortunately the existing code was not accompanied by any unit
tests, so while refactoring I was in the dark, at risk of breaking something that I was not even aware of breaking.  I
considered writing unit tests for the existing code first and then going back to refactoring but I figured that this would
not be the best approach given the state of the code and the time I had available.</p>

<p>In summary my goals for the new trending topics implementation were:</p>

<ol>
  <li>The new code <a href="http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882">should be clean</a> and
easy to understand, both for the benefit of other developers when adapting or maintaining the code and for reasoning
about its correctness.  Notably, the code should decouple its data structures from the Storm sub-system and, if
possible, favor native Storm features for concurrency instead of custom approaches.</li>
  <li>The new code should be <a href="http://practicalunittesting.com/">covered by meaningful unit tests</a>.</li>
  <li>The new code should be good enough to contribute it back to the Storm project to help its community.</li>
</ol>

<h1 id="implementing-the-data-structures">Implementing the Data Structures</h1>

<p>Eventually I settled on the following core data structures for the new distributed Rolling Count algorithm.  As
you will see, an interesting characteristic is that these data structures are completely decoupled from any Storm
internals.  Our Storm bolts will make use of them, of course, but there is no dependency in the opposite direction from
the data structures to Storm.</p>

<ul>
  <li>Classes used for counting objects:
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/SlotBasedCounter.java">SlotBasedCounter</a>,
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/SlidingWindowCounter.java">SlidingWindowCounter</a></li>
  <li>Classes used for ranking objects by their count:
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankings.java">Rankings</a>,
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankable.java">Rankable</a>,
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/RankableObjectWithFields.java">RankableObjectWithFields</a></li>
</ul>

<p><span class="pullquote-right" data-pullquote="Eliminating direct calls to system time and manually started background threads makes the new code much simpler and more testable than before.">
Another notable improvement is that the new code no longer needs or uses any concurrency-related code such as
<code>synchronized</code> statements or manually started background threads.  Also, none of the data structures interact
with the system time.  Eliminating direct calls to system time and manually started background threads makes the new
code much simpler and more testable than before.
</span></p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>No more interacting with system time in the low level data structures, yay!  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// such code from the old RollingCountObjects bolt is not needed anymore</span>
</span><span class="line"><span class="kt">long</span> <span class="n">delta</span> <span class="o">=</span> <span class="n">millisPerBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">)</span>
</span><span class="line">               <span class="o">-</span> <span class="o">(</span><span class="n">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">()</span> <span class="o">%</span> <span class="n">millisPerBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">));</span>
</span><span class="line"><span class="n">Utils</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="n">delta</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="slotbasedcounter">SlotBasedCounter</h2>

<p>The
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/SlotBasedCounter.java">SlotBasedCounter</a>
class provides per-slot counts of the occurrences of objects.  The number of slots of a given counter instance is fixed.
The class provides four public methods:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>SlotBasedCounter API  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">incrementCount</span><span class="o">(</span><span class="n">T</span> <span class="n">obj</span><span class="o">,</span> <span class="kt">int</span> <span class="n">slot</span><span class="o">);</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">wipeSlot</span><span class="o">(</span><span class="kt">int</span> <span class="n">slot</span><span class="o">);</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">long</span> <span class="nf">getCount</span><span class="o">(</span><span class="n">T</span> <span class="n">obj</span><span class="o">,</span> <span class="kt">int</span> <span class="n">slot</span><span class="o">);</span>
</span><span class="line"><span class="c1">// get the *total* counts of all objects across all slots</span>
</span><span class="line"><span class="kd">public</span> <span class="n">Map</span><span class="o">&lt;</span><span class="n">T</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">getCounts</span><span class="o">();</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here is a usage example:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Using SlotBasedCounter  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// we want to count Object&#39;s using five slots</span>
</span><span class="line"><span class="n">SlotBasedCounter</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;</span> <span class="n">counter</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SlotBasedCounter</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;(</span><span class="mi">5</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="c1">// counting</span>
</span><span class="line"><span class="n">Object</span> <span class="n">trackMe</span> <span class="o">=</span> <span class="o">...;</span>
</span><span class="line"><span class="kt">int</span> <span class="n">currentSlot</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line"><span class="n">counter</span><span class="o">.</span><span class="na">incrementCount</span><span class="o">(</span><span class="n">trackMe</span><span class="o">,</span> <span class="n">currentSlot</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="c1">// the count of an object for a given slot</span>
</span><span class="line"><span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">counter</span><span class="o">.</span><span class="na">getCount</span><span class="o">(</span><span class="n">trackMe</span><span class="o">,</span> <span class="n">currentSlot</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="c1">// the total counts (across all slots) of all objects</span>
</span><span class="line"><span class="n">Map</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">counter</span><span class="o">.</span><span class="na">getCounts</span><span class="o">();</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Internally <code>SlotBasedCounter</code> is backed by a <code>Map&lt;T, long[]&gt;</code> for the actual count state.  You might be surprised
to see the low-level <code>long[]</code> array here – wouldn’t it be better OO style to introduce a new, separate class that is
just used for the counting of a single slot, and then we use a couple of these single-slot counters to form the
SlotBasedCounter?  Well, yes we could.  But for performance reasons and for not deviating too far from the old code I
decided not to go down this route.  Apart from updating the counter – which is a WRITE operation – the most common
operation in our use case is a READ operation to get the <em>total</em> counts of tracked objects.  Here, we must calculate
the sum of an object’s counts <em>across all slots</em>.  And for this it is preferable to have the individual data points for
an object close to each other (a kind of data locality), which the <code>long[]</code> array allows us to do.  Your mileage may
vary though.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_SlotBasedCounter.png" title="The SlotBasedCounter class" /></p>

<div class="caption">
Figure 2: The <tt>SlotBasedCounter</tt> class keeps track of multiple counts of a given object.  In the example above,
the SlotBasedCounter has five logical slots which allows you to track up to five counts per object.
</div>
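<p>To illustrate the <code>Map&lt;T, long[]&gt;</code> idea, here is a simplified sketch of a slot-based counter.  It is
not the storm-starter source; the class name <code>SlotCounterSketch</code> and its field names are assumptions chosen
for this example, and the linked <code>SlotBasedCounter</code> class remains the reference.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; see the linked SlotBasedCounter source for the actual implementation.
final class SlotCounterSketch<T> {

    // One long[] per tracked object keeps all of its slot counts next to each other.
    private final Map<T, long[]> countsPerObject = new HashMap<>();
    private final int numSlots;

    SlotCounterSketch(int numSlots) {
        if (numSlots <= 0) {
            throw new IllegalArgumentException("number of slots must be positive");
        }
        this.numSlots = numSlots;
    }

    void incrementCount(T obj, int slot) {
        long[] counts = countsPerObject.get(obj);
        if (counts == null) {
            counts = new long[numSlots];
            countsPerObject.put(obj, counts);
        }
        counts[slot]++;
    }

    long getCount(T obj, int slot) {
        long[] counts = countsPerObject.get(obj);
        return counts == null ? 0L : counts[slot];
    }

    // Total count per object, i.e. the sum across all slots of that object.
    Map<T, Long> getCounts() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> entry : countsPerObject.entrySet()) {
            long total = 0;
            for (long count : entry.getValue()) {
                total += count;
            }
            totals.put(entry.getKey(), total);
        }
        return totals;
    }

    // Resets the given slot to zero for every tracked object.
    void wipeSlot(int slot) {
        for (long[] counts : countsPerObject.values()) {
            counts[slot] = 0;
        }
    }

    public static void main(String[] args) {
        SlotCounterSketch<String> counter = new SlotCounterSketch<>(5);
        counter.incrementCount("storm", 0);
        counter.incrementCount("storm", 0);
        counter.incrementCount("storm", 3);
        System.out.println(counter.getCounts()); // prints {storm=3}
    }
}
```

<p>Computing a total only requires iterating over one small array per object, which is the data locality argument made
above.</p>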

<p>The <code>SlotBasedCounter</code> is a primitive class that can be used, for instance, as a building block for implementing
sliding window counting of objects.  And this is exactly what I will describe in the next section.</p>

<h2 id="slidingwindowcounter">SlidingWindowCounter</h2>

<p>The
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/SlidingWindowCounter.java">SlidingWindowCounter</a>
class provides <em>rolling</em> counts of the occurrences of “things”, i.e. a sliding window count for each tracked object.
Its counting functionality is based on the previously described <code>SlotBasedCounter</code>.  The size of the sliding window
is equivalent to the (fixed) number of slots of a given <code>SlidingWindowCounter</code> instance.  It is used by
<code>RollingCountBolt</code> for counting incoming data tuples.</p>

<p>The class provides two public methods:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>SlidingWindowCounter API  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">incrementCount</span><span class="o">(</span><span class="n">T</span> <span class="n">obj</span><span class="o">);</span>
</span><span class="line"><span class="kd">public</span> <span class="n">Map</span><span class="o">&lt;</span><span class="n">T</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">getCountsThenAdvanceWindow</span><span class="o">();</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>What might be surprising to some readers is that this class does not have any notion of time even though “sliding
window” normally means a time-based window of some kind.  In our case however the window does not advance with time but
whenever (and only when) the method <code>getCountsThenAdvanceWindow()</code> is called.  This means <code>SlidingWindowCounter</code>
behaves just like a normal ring buffer in terms of advancing from one window to the next.</p>
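<p>To make the "advance only when asked" behavior concrete, here is a minimal sketch of such a counter.  It is not the
storm-starter implementation: the class name <code>SimpleSlidingWindowCounter</code> and its internals are assumptions made
for illustration, and it keeps one <code>long[]</code> per object instead of delegating to <code>SlotBasedCounter</code>.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for SlidingWindowCounter (illustration only).
class SimpleSlidingWindowCounter<T> {
    private final Map<T, long[]> slotsByObject = new HashMap<>();
    private final int numSlots;
    private int headSlot = 0; // the slot that currently receives increments

    SimpleSlidingWindowCounter(int numSlots) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("numSlots must be at least 1");
        }
        this.numSlots = numSlots;
    }

    void incrementCount(T obj) {
        slotsByObject.computeIfAbsent(obj, k -> new long[numSlots])[headSlot]++;
    }

    // Returns the per-object totals over all slots, then moves the window by one slot.
    // No clock is involved: the window moves only when this method is called.
    Map<T, Long> getCountsThenAdvanceWindow() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> entry : slotsByObject.entrySet()) {
            long sum = 0;
            for (long count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        // The next slot holds the oldest data; wipe it so it can take new increments.
        headSlot = (headSlot + 1) % numSlots;
        for (long[] slots : slotsByObject.values()) {
            slots[headSlot] = 0;
        }
        return totals;
    }
}
```

<p>With a window of two slots, counting an object twice, reading, counting it once more and then reading three more
times yields the totals 2, 3, 1 and 0: each count stays visible for exactly two reads before it slides out of the
window.</p>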

<div class="note">
Note: While working on the code I realized that parts of my redesign decisions &#8211; teasing apart the concerns &#8211; were
close in spirit to those of the <a href="http://lmax-exchange.github.com/disruptor/">LMAX Disruptor</a> concurrent ring
buffer, albeit much simpler of course.  The first goal was to <em>limit concurrent access</em> to the relevant data structures
(here: mostly what SlidingWindowCounter is being used for).  In my case I followed the
<a href="http://en.wikipedia.org/wiki/Single_responsibility_principle">SRP</a> and split the concerns into new data
structures in a way that actually allowed me to eliminate the need for ANY concurrent access.  The second goal was to put a
<em>strict sequencing concept</em> in place (the way <tt>incrementCount(T obj)</tt> and
<tt>getCountsThenAdvanceWindow()</tt> interact) that would prevent dirty reads or dirty writes from happening, as was
unfortunately possible in the old, system time based code.
<br /><br />
If you have not heard about LMAX Disruptor before, make sure to read their
<a href="http://disruptor.googlecode.com/files/Disruptor-1.0.pdf">LMAX technical paper (PDF)</a> on the
<a href="http://lmax-exchange.github.com/disruptor/">LMAX homepage</a> for inspiration.  It&#8217;s worth the time!</div>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_SlidingWindowCounter.png" title="The SlidingWindowCounter class" /></p>

<div class="caption">
Figure 3: The <tt>SlidingWindowCounter</tt> class keeps track of multiple <em>rolling</em> counts of objects, i.e. a
sliding window count for each tracked object.  Please note that the example of an 8-slot sliding window counter above is
simplified as it only shows a single count per slot.  In reality <tt>SlidingWindowCounter</tt> tracks multiple counts
for multiple objects.
</div>

<p>Here is an illustration showing the behavior of <code>SlidingWindowCounter</code> over multiple iterations:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_SlidingWindowCounter-example.png" title="Example of SlidingWindowCounter behavior" /></p>

<div class="caption">
Figure 4: Example of <tt>SlidingWindowCounter</tt> behavior for a counter of size 4.  Again, the example is simplified
as it only shows a single count per slot.
</div>

<h2 id="rankings-and-rankable">Rankings and Rankable</h2>

<p>The <a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankings.java">Rankings</a> class
represents fixed-size rankings of objects, for instance to implement “Top 10” rankings.  It ranks its objects
in descending order according to their <strong>natural order</strong>, i.e. from largest to smallest.  This class is used by
<code>AbstractRankerBolt</code> and its derived bolts to track the current rankings of incoming objects over time.</p>

<div class="note">
Note: The <tt>Rankings</tt> class itself is completely unaware of the bolts&#8217; time-based behavior.
</div>

<p>The class provides five public methods:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Rankings API  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">updateWith</span><span class="o">(</span><span class="n">Rankable</span> <span class="n">r</span><span class="o">);</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">updateWith</span><span class="o">(</span><span class="n">Rankings</span> <span class="n">other</span><span class="o">);</span>
</span><span class="line"><span class="kd">public</span> <span class="n">List</span><span class="o">&lt;</span><span class="n">Rankable</span><span class="o">&gt;</span> <span class="nf">getRankings</span><span class="o">();</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">int</span> <span class="nf">maxSize</span><span class="o">();</span> <span class="c1">// as supplied to constructor</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">int</span> <span class="nf">size</span><span class="o">();</span> <span class="c1">// current size, might be less than maximum size</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Whenever you update <code>Rankings</code> with new data, it will discard any elements that are smaller than the updated top
<code>N</code>, where <code>N</code> is the maximum size of the <code>Rankings</code> instance (e.g. <code>10</code> for a top 10 ranking).</p>

<p>Now the sorting aspect of the ranking is driven by the <em>natural order</em> of the ranked objects.  In my specific case, I
created a <a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankable.java">Rankable</a>
interface that in turn extends the <a href="http://docs.oracle.com/javase/6/docs/api/java/lang/Comparable.html">Comparable</a>
interface.  In practice, you simply pass a <code>Rankable</code> object to the <code>Rankings</code> class, and the latter will update
its rankings accordingly.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Using the Rankings class  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">Rankings</span> <span class="n">topTen</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Rankings</span><span class="o">(</span><span class="mi">10</span><span class="o">);</span>
</span><span class="line"><span class="n">Rankable</span> <span class="n">r</span> <span class="o">=</span> <span class="o">...;</span>
</span><span class="line"><span class="n">topTen</span><span class="o">.</span><span class="na">updateWith</span><span class="o">(</span><span class="n">r</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="n">List</span><span class="o">&lt;</span><span class="n">Rankable</span><span class="o">&gt;</span> <span class="n">rankings</span> <span class="o">=</span> <span class="n">topTen</span><span class="o">.</span><span class="na">getRankings</span><span class="o">();</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>As you can see, it is really straightforward and intuitive to use.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_Rankings.png" title="The Rankings class" /></p>

<div class="caption">
Figure 5: The <tt>Rankings</tt> class ranks <tt>Rankable</tt> objects in descending order according to their <strong>natural
order</strong>, i.e. from largest to smallest.  The example above shows a Rankings instance with a maximum size of 10
and a current size of 8.
</div>
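<p>The "keep only the top <code>N</code>" behavior can be sketched in a few lines.  This is an illustration under
simplifying assumptions, not the storm-starter <code>Rankings</code> class: the name <code>TopN</code> is made up, it works
on any <code>Comparable</code> type, and it does not attempt to replace an existing entry for the same object.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a fixed-size ranking (illustration only).
class TopN<E extends Comparable<E>> {
    private final int maxSize;
    private final List<E> ranked = new ArrayList<>();

    TopN(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
    }

    void updateWith(E element) {
        ranked.add(element);
        // Sort by natural order, reversed, so that the largest element comes first.
        ranked.sort(Collections.reverseOrder());
        // Drop the smallest element once the ranking exceeds its maximum size.
        if (ranked.size() > maxSize) {
            ranked.remove(ranked.size() - 1);
        }
    }

    List<E> getRankings() {
        return new ArrayList<>(ranked); // defensive copy
    }

    int size() {
        return ranked.size();
    }

    int maxSize() {
        return maxSize;
    }
}
```

<p>Re-sorting on every update is fine for small <code>N</code> such as a top 10; for large rankings a heap or a sorted
insert would be the more economical choice.</p>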

<p>The concrete class implementing <code>Rankable</code> is
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/RankableObjectWithFields.java">RankableObjectWithFields</a>.
The bolt <code>IntermediateRankingsBolt</code>, for instance, creates <code>Rankables</code> from incoming data tuples via a factory
method of this class:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>IntermediateRankingsBolt.java  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line">    <span class="nd">@Override</span>
</span><span class="line">    <span class="kt">void</span> <span class="nf">updateRankingsWithTuple</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">        <span class="n">Rankable</span> <span class="n">rankable</span> <span class="o">=</span> <span class="n">RankableObjectWithFields</span><span class="o">.</span><span class="na">from</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line">        <span class="kd">super</span><span class="o">.</span><span class="na">getRankings</span><span class="o">().</span><span class="na">updateWith</span><span class="o">(</span><span class="n">rankable</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Have a look at
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankings.java">Rankings</a>,
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/Rankable.java">Rankable</a> and
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/tools/RankableObjectWithFields.java">RankableObjectWithFields</a>
for details.  If you run into a situation where you have to implement classes like these yourself, make sure you follow
<a href="http://www.amazon.com/Effective-Java-2nd-Joshua-Bloch/dp/0321356683">good engineering practice</a> and add standard
methods such as <code>equals()</code> and <code>hashCode()</code> as well to your data structures.</p>
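<p>As a reminder of what that looks like in practice, here is a small sketch of a rankable value type.  The class
<code>WordCount</code> and its fields are hypothetical and not part of storm-starter; the point is only that
<code>compareTo()</code>, <code>equals()</code> and <code>hashCode()</code> are defined together and agree with each other.</p>

```java
import java.util.Objects;

// Hypothetical rankable value type (illustration only).
final class WordCount implements Comparable<WordCount> {
    private final String word;
    private final long count;

    WordCount(String word, long count) {
        this.word = Objects.requireNonNull(word, "word");
        this.count = count;
    }

    // Natural order: by count, with the word as tie-breaker so that
    // compareTo() returns 0 exactly when equals() returns true.
    @Override
    public int compareTo(WordCount other) {
        int byCount = Long.compare(count, other.count);
        return byCount != 0 ? byCount : word.compareTo(other.word);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof WordCount)) {
            return false;
        }
        WordCount other = (WordCount) o;
        return count == other.count && word.equals(other.word);
    }

    // Equal objects must produce equal hash codes, so hash the same fields equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(word, count);
    }
}
```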

<h1 id="implementing-the-rolling-top-words-topology">Implementing the Rolling Top Words Topology</h1>

<p>So where are we?  In the sections above we have already discussed a number of Java classes but not even a single
one of them has been directly related to Storm.  It’s about time that we start writing some Storm code!</p>

<p>In the following sections I will describe the Storm components that make up the Rolling Top Words topology.  When
reading the sections keep in mind that the “words” in this topology represent the topics that are currently being
mentioned by the users in our imaginary system.</p>

<h2 id="overview-of-the-topology">Overview of the Topology</h2>

<p>The high-level view of the Rolling Top Words topology is shown in the figure below.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_rolling-count_topology.png" title="Overview of the Rolling Top Words topology" /></p>

<div class="caption">
Figure 6: The Rolling Top Words topology consists of instances of <tt>TestWordSpout</tt>, <tt>RollingCountBolt</tt>,
<tt>IntermediateRankingsBolt</tt> and <tt>TotalRankingsBolt</tt>.  The length of the sliding window (in secs) as well
as the various emit frequencies (in secs) are just example values &#8211; depending on your use case you would, for
instance, prefer to have a sliding window of five minutes and emit the latest rolling counts every minute.
</div>

<p>The main responsibilities are split as follows:</p>

<ol>
  <li>In the first layer the topology runs many <code>TestWordSpout</code> instances in parallel to simulate the load of incoming
data – in our case this would be the names of the topics (represented as words) that are currently being mentioned
by our users.</li>
  <li>The second layer comprises multiple instances of <code>RollingCountBolt</code>, which perform a rolling count of incoming
words/topics.</li>
  <li>The third layer uses multiple instances of <code>IntermediateRankingsBolt</code> (“I.R. Bolt” in the figure) to distribute
the load of pre-aggregating the various incoming rolling counts into intermediate rankings.  Hadoop users will see
a strong similarity here to the functionality of a <em>combiner</em> in Hadoop.</li>
  <li>Lastly, there is the final step in the topology.  Here, a single instance of <code>TotalRankingsBolt</code> aggregates the
incoming intermediate rankings into a global, consolidated total ranking.  The output of this bolt is the list of
currently trending topics in the system.  These trending topics can then be used by downstream data consumers to
provide all the cool user-facing and backend features you want to have in your platform.</li>
</ol>

<p>In code the topology wiring looks as follows in
<a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/RollingTopWords.java">RollingTopWords</a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>RollingTopWords.java  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">builder</span><span class="o">.</span><span class="na">setSpout</span><span class="o">(</span><span class="n">spoutId</span><span class="o">,</span> <span class="k">new</span> <span class="n">TestWordSpout</span><span class="o">(),</span> <span class="mi">2</span><span class="o">);</span>
</span><span class="line"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="n">counterId</span><span class="o">,</span> <span class="k">new</span> <span class="n">RollingCountBolt</span><span class="o">(</span><span class="mi">9</span><span class="o">,</span> <span class="mi">3</span><span class="o">),</span> <span class="mi">3</span><span class="o">)</span>
</span><span class="line">            <span class="o">.</span><span class="na">fieldsGrouping</span><span class="o">(</span><span class="n">spoutId</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">&quot;word&quot;</span><span class="o">));</span>
</span><span class="line"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="n">intermediateRankerId</span><span class="o">,</span> <span class="k">new</span> <span class="n">IntermediateRankingsBolt</span><span class="o">(</span><span class="n">TOP_N</span><span class="o">),</span> <span class="mi">2</span><span class="o">)</span>
</span><span class="line">            <span class="o">.</span><span class="na">fieldsGrouping</span><span class="o">(</span><span class="n">counterId</span><span class="o">,</span> <span class="k">new</span> <span class="n">Fields</span><span class="o">(</span><span class="s">&quot;obj&quot;</span><span class="o">));</span>
</span><span class="line"><span class="n">builder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="n">totalRankerId</span><span class="o">,</span> <span class="k">new</span> <span class="n">TotalRankingsBolt</span><span class="o">(</span><span class="n">TOP_N</span><span class="o">))</span>
</span><span class="line">            <span class="o">.</span><span class="na">globalGrouping</span><span class="o">(</span><span class="n">intermediateRankerId</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Note: The integer parameters of the <tt>setSpout()</tt> and <tt>setBolt()</tt> methods (do not confuse them with the integer parameters of the bolt constructors) configure the <a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">parallelism</a> of the Storm components.  See my article <a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a> for details.
</div>

<h2 id="testwordspout">TestWordSpout</h2>

<p>The only spout we will be using is the
<a href="https://github.com/nathanmarz/storm/blob/master/src/jvm/backtype/storm/testing/TestWordSpout.java">TestWordSpout</a> that
is part of the <code>backtype.storm.testing</code> package of Storm itself.  I will not cover the spout in detail because it is
a trivial class.  The only thing it does is to select a random word from a fixed list of five words (“nathan”, “mike”,
“jackson”, “golda”, “bertels”) and emit that word to the downstream topology every 100ms.  For the sake of this article,
we consider these words to be our “topics”, of which we want to identify the trending ones.</p>

<div class="note">
Note: Because <tt>TestWordSpout</tt> selects its output words at random (with each word having the same probability of
being selected), in most cases the counts of the various words are pretty close to each other.  This is ok for example
code such as ours.  In a production setting though you most likely want to generate &#8220;better&#8221; simulation data.
</div>

<p>The spout’s output can be visualized as follows.  Note that the <code>@XXXms</code> milliseconds timeline is not part of the
actual output.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class=""><span class="line">@100ms: nathan
</span><span class="line">@200ms: golda
</span><span class="line">@300ms: golda
</span><span class="line">@400ms: jackson
</span><span class="line">@500ms: mike
</span><span class="line">@600ms: nathan
</span><span class="line">@700ms: bertels
</span><span class="line">...</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="excursus-tick-tuples-in-storm-08">Excursus: Tick Tuples in Storm 0.8+</h2>

<p>A new and very helpful (read: awesome) feature of Storm 0.8 is the so-called <em>tick tuple</em>.  Whenever you want a spout
or bolt to execute a task at periodic intervals – in other words, you want to trigger an event or activity – using a tick
tuple is normally the best practice.</p>

<p>Nathan Marz described tick tuples in the Storm 0.8 announcement as follows:</p>

<blockquote><p>Tick tuples: It&#8217;s common to require a bolt to &#8220;do something&#8221; at a fixed interval, like flush writes to a database.  Many people have been using variants of a ClockSpout to send these ticks. The problem with a ClockSpout is that you can&#8217;t internalize the need for ticks within your bolt, so if you forget to set up your bolt correctly within your topology it won&#8217;t work correctly. 0.8.0 introduces a new &#8220;tick tuple&#8221; config that lets you specify the frequency at which you want to receive tick tuples via the &#8220;topology.tick.tuple.freq.secs&#8221; component-specific config, and then your bolt will receive a tuple from the <tt>__system</tt> component and <tt>__tick</tt> stream at that frequency.</p><footer><strong>Nathan Marz on the Storm mailing list</strong> <cite><a href="https://groups.google.com/forum/#!msg/storm-user/8addaQm3OT4/0OQfSgQkRwEJ">groups.google.com/forum/#!msg/&hellip;</a></cite></footer></blockquote>

<p>Here is how you configure a bolt/spout to receive tick tuples every 10 seconds:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Configuring a bolt/spout to receive tick tuples every 10 seconds  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="java"><span class="line">    <span class="nd">@Override</span>
</span><span class="line">    <span class="kd">public</span> <span class="n">Map</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Object</span><span class="o">&gt;</span> <span class="n">getComponentConfiguration</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">        <span class="n">Config</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Config</span><span class="o">();</span>
</span><span class="line">        <span class="kt">int</span> <span class="n">tickFrequencyInSeconds</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span>
</span><span class="line">        <span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TICK_TUPLE_FREQ_SECS</span><span class="o">,</span> <span class="n">tickFrequencyInSeconds</span><span class="o">);</span>
</span><span class="line">        <span class="k">return</span> <span class="n">conf</span><span class="o">;</span>
</span><span class="line">    <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Usually you will want to add a conditional switch to the component’s <code>execute</code> method to tell tick tuples and
“normal” tuples apart:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Telling tick tuples and normal tuples apart  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="java"><span class="line">    <span class="nd">@Override</span>
</span><span class="line">    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">        <span class="k">if</span> <span class="o">(</span><span class="n">isTickTuple</span><span class="o">(</span><span class="n">tuple</span><span class="o">))</span> <span class="o">{</span>
</span><span class="line">            <span class="c1">// now you can trigger e.g. a periodic activity</span>
</span><span class="line">        <span class="o">}</span>
</span><span class="line">        <span class="k">else</span> <span class="o">{</span>
</span><span class="line">            <span class="c1">// do something with the normal tuple</span>
</span><span class="line">        <span class="o">}</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">
</span><span class="line">    <span class="kd">private</span> <span class="kd">static</span> <span class="kt">boolean</span> <span class="nf">isTickTuple</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">        <span class="k">return</span> <span class="n">tuple</span><span class="o">.</span><span class="na">getSourceComponent</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">Constants</span><span class="o">.</span><span class="na">SYSTEM_COMPONENT_ID</span><span class="o">)</span>
</span><span class="line">            <span class="o">&amp;&amp;</span> <span class="n">tuple</span><span class="o">.</span><span class="na">getSourceStreamId</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">Constants</span><span class="o">.</span><span class="na">SYSTEM_TICK_STREAM_ID</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>I hope that, like me, you can appreciate the elegance of solely using Storm’s existing primitives to implement the new
tick tuple feature. :-)</p>

<h2 id="rollingcountbolt">RollingCountBolt</h2>

<p>This bolt performs rolling counts of incoming objects, i.e. sliding window based counting.  Accordingly it uses the
<code>SlidingWindowCounter</code> class described above to achieve this.  In contrast to the old implementation, only this bolt
(more correctly: the instances of this bolt that run as Storm tasks) is interacting with the <code>SlidingWindowCounter</code>
data structure.  Each instance of the bolt has its own private <code>SlidingWindowCounter</code> field, which eliminates the need
for any custom inter-thread communication and synchronization.</p>

<p>The bolt combines the previously described tick tuples (that trigger at fixed intervals in time) with the time-agnostic
behavior of <code>SlidingWindowCounter</code> to achieve time-based sliding window counting.  Whenever the bolt receives a tick
tuple, it will advance the window of its private <code>SlidingWindowCounter</code> instance and emit its latest rolling counts.
In the case of normal tuples it will simply count the object and ack the tuple.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>RollingCountBolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="nd">@Override</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">TupleHelpers</span><span class="o">.</span><span class="na">isTickTuple</span><span class="o">(</span><span class="n">tuple</span><span class="o">))</span> <span class="o">{</span>
</span><span class="line">        <span class="n">LOG</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">&quot;Received tick tuple, triggering emit of current window counts&quot;</span><span class="o">);</span>
</span><span class="line">        <span class="n">emitCurrentWindowCounts</span><span class="o">();</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">else</span> <span class="o">{</span>
</span><span class="line">        <span class="n">countObjAndAck</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="kd">private</span> <span class="kt">void</span> <span class="nf">emitCurrentWindowCounts</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Map</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">counter</span><span class="o">.</span><span class="na">getCountsThenAdvanceWindow</span><span class="o">();</span>
</span><span class="line">    <span class="o">...</span>
</span><span class="line">    <span class="n">emit</span><span class="o">(</span><span class="n">counts</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="kd">private</span> <span class="kt">void</span> <span class="nf">emit</span><span class="o">(</span><span class="n">Map</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">counts</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">for</span> <span class="o">(</span><span class="n">Entry</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">entry</span> <span class="o">:</span> <span class="n">counts</span><span class="o">.</span><span class="na">entrySet</span><span class="o">())</span> <span class="o">{</span>
</span><span class="line">        <span class="n">Object</span> <span class="n">obj</span> <span class="o">=</span> <span class="n">entry</span><span class="o">.</span><span class="na">getKey</span><span class="o">();</span>
</span><span class="line">        <span class="n">Long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">entry</span><span class="o">.</span><span class="na">getValue</span><span class="o">();</span>
</span><span class="line">        <span class="n">collector</span><span class="o">.</span><span class="na">emit</span><span class="o">(</span><span class="k">new</span> <span class="n">Values</span><span class="o">(</span><span class="n">obj</span><span class="o">,</span> <span class="n">count</span><span class="o">));</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="kd">private</span> <span class="kt">void</span> <span class="nf">countObjAndAck</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Object</span> <span class="n">obj</span> <span class="o">=</span> <span class="n">tuple</span><span class="o">.</span><span class="na">getValue</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
</span><span class="line">    <span class="n">counter</span><span class="o">.</span><span class="na">incrementCount</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
</span><span class="line">    <span class="n">collector</span><span class="o">.</span><span class="na">ack</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That’s all there is to it!  The tick tuples introduced in Storm 0.8 and the cleaned-up code of the bolt and its
collaborators also make the code much more testable (the new code of this bolt has 98% test coverage).  Compare the code
above with the old implementation of the bolt and decide for yourself which one you would rather adapt or maintain:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>RollingCountObjects BEFORE Storm tick tuples and refactoring  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">prepare</span><span class="o">(</span><span class="n">Map</span> <span class="n">stormConf</span><span class="o">,</span> <span class="n">TopologyContext</span> <span class="n">context</span><span class="o">,</span> <span class="n">OutputCollector</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">_collector</span> <span class="o">=</span> <span class="n">collector</span><span class="o">;</span>
</span><span class="line">    <span class="n">cleaner</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Thread</span><span class="o">(</span><span class="k">new</span> <span class="n">Runnable</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">        <span class="kd">public</span> <span class="kt">void</span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">            <span class="n">Integer</span> <span class="n">lastBucket</span> <span class="o">=</span> <span class="n">currentBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">);</span>
</span><span class="line">
</span><span class="line">            <span class="k">while</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">              <span class="kt">int</span> <span class="n">currBucket</span> <span class="o">=</span> <span class="n">currentBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">);</span>
</span><span class="line">              <span class="k">if</span><span class="o">(</span><span class="n">currBucket</span><span class="o">!=</span><span class="n">lastBucket</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">                  <span class="kt">int</span> <span class="n">bucketToWipe</span> <span class="o">=</span> <span class="o">(</span><span class="n">currBucket</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="o">%</span> <span class="n">_numBuckets</span><span class="o">;</span>
</span><span class="line">                  <span class="kd">synchronized</span><span class="o">(</span><span class="n">_objectCounts</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">                      <span class="n">Set</span> <span class="n">objs</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HashSet</span><span class="o">(</span><span class="n">_objectCounts</span><span class="o">.</span><span class="na">keySet</span><span class="o">());</span>
</span><span class="line">                      <span class="k">for</span> <span class="o">(</span><span class="n">Object</span> <span class="nl">obj:</span> <span class="n">objs</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">                        <span class="kt">long</span><span class="o">[]</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">_objectCounts</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
</span><span class="line">                        <span class="kt">long</span> <span class="n">currBucketVal</span> <span class="o">=</span> <span class="n">counts</span><span class="o">[</span><span class="n">bucketToWipe</span><span class="o">];</span>
</span><span class="line">                        <span class="n">counts</span><span class="o">[</span><span class="n">bucketToWipe</span><span class="o">]</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
</span><span class="line">                        <span class="kt">long</span> <span class="n">total</span> <span class="o">=</span> <span class="n">totalObjects</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
</span><span class="line">                        <span class="k">if</span><span class="o">(</span><span class="n">currBucketVal</span><span class="o">!=</span><span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">                            <span class="n">_collector</span><span class="o">.</span><span class="na">emit</span><span class="o">(</span><span class="k">new</span> <span class="n">Values</span><span class="o">(</span><span class="n">obj</span><span class="o">,</span> <span class="n">total</span><span class="o">));</span>
</span><span class="line">                        <span class="o">}</span>
</span><span class="line">                        <span class="k">if</span><span class="o">(</span><span class="n">total</span><span class="o">==</span><span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">                            <span class="n">_objectCounts</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
</span><span class="line">                        <span class="o">}</span>
</span><span class="line">                      <span class="o">}</span>
</span><span class="line">                  <span class="o">}</span>
</span><span class="line">                  <span class="n">lastBucket</span> <span class="o">=</span> <span class="n">currBucket</span><span class="o">;</span>
</span><span class="line">              <span class="o">}</span>
</span><span class="line">              <span class="kt">long</span> <span class="n">delta</span> <span class="o">=</span> <span class="n">millisPerBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">)</span> <span class="o">-</span> <span class="o">(</span><span class="n">System</span><span class="o">.</span><span class="na">currentTimeMillis</span><span class="o">()</span> <span class="o">%</span> <span class="n">millisPerBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">));</span>
</span><span class="line">              <span class="n">Utils</span><span class="o">.</span><span class="na">sleep</span><span class="o">(</span><span class="n">delta</span><span class="o">);</span>
</span><span class="line">            <span class="o">}</span>
</span><span class="line">        <span class="o">}</span>
</span><span class="line">    <span class="o">});</span>
</span><span class="line">    <span class="n">cleaner</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Object</span> <span class="n">obj</span> <span class="o">=</span> <span class="n">tuple</span><span class="o">.</span><span class="na">getValue</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
</span><span class="line">    <span class="kt">int</span> <span class="n">bucket</span> <span class="o">=</span> <span class="n">currentBucket</span><span class="o">(</span><span class="n">_numBuckets</span><span class="o">);</span>
</span><span class="line">    <span class="kd">synchronized</span><span class="o">(</span><span class="n">_objectCounts</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">        <span class="kt">long</span><span class="o">[]</span> <span class="n">curr</span> <span class="o">=</span> <span class="n">_objectCounts</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">obj</span><span class="o">);</span>
</span><span class="line">        <span class="k">if</span><span class="o">(</span><span class="n">curr</span><span class="o">==</span><span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">            <span class="n">curr</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">long</span><span class="o">[</span><span class="n">_numBuckets</span><span class="o">];</span>
</span><span class="line">            <span class="n">_objectCounts</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">obj</span><span class="o">,</span> <span class="n">curr</span><span class="o">);</span>
</span><span class="line">        <span class="o">}</span>
</span><span class="line">        <span class="n">curr</span><span class="o">[</span><span class="n">bucket</span><span class="o">]++;</span>
</span><span class="line">        <span class="n">_collector</span><span class="o">.</span><span class="na">emit</span><span class="o">(</span><span class="k">new</span> <span class="n">Values</span><span class="o">(</span><span class="n">obj</span><span class="o">,</span> <span class="n">totalObjects</span><span class="o">(</span><span class="n">obj</span><span class="o">)));</span>
</span><span class="line">        <span class="n">_collector</span><span class="o">.</span><span class="na">ack</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h3 id="unit-test-example">Unit Test Example</h3>

<p>Since I mentioned unit testing a couple of times in the previous section, let me briefly discuss this point in further
detail.  I implemented the unit tests with <a href="http://testng.org/">TestNG</a>, <a href="http://code.google.com/p/mockito/">Mockito</a>
and <a href="https://github.com/alexruiz/fest-assert-2.x">FEST-Assert</a>.  Here is an example unit test for <code>RollingCountBolt</code>,
taken from
<a href="https://github.com/nathanmarz/storm-starter/blob/master/test/jvm/storm/starter/bolt/RollingCountBoltTest.java">RollingCountBoltTest</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Example unit test  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="nd">@Test</span>
</span><span class="line"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">shouldEmitNothingIfNoObjectHasBeenCountedYetAndTickTupleIsReceived</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="c1">// given</span>
</span><span class="line">    <span class="n">Tuple</span> <span class="n">tickTuple</span> <span class="o">=</span> <span class="n">MockTupleHelpers</span><span class="o">.</span><span class="na">mockTickTuple</span><span class="o">();</span>
</span><span class="line">    <span class="n">RollingCountBolt</span> <span class="n">bolt</span> <span class="o">=</span> <span class="k">new</span> <span class="n">RollingCountBolt</span><span class="o">();</span>
</span><span class="line">    <span class="n">Map</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">mock</span><span class="o">(</span><span class="n">Map</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
</span><span class="line">    <span class="n">TopologyContext</span> <span class="n">context</span> <span class="o">=</span> <span class="n">mock</span><span class="o">(</span><span class="n">TopologyContext</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
</span><span class="line">    <span class="n">OutputCollector</span> <span class="n">collector</span> <span class="o">=</span> <span class="n">mock</span><span class="o">(</span><span class="n">OutputCollector</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
</span><span class="line">    <span class="n">bolt</span><span class="o">.</span><span class="na">prepare</span><span class="o">(</span><span class="n">conf</span><span class="o">,</span> <span class="n">context</span><span class="o">,</span> <span class="n">collector</span><span class="o">);</span>
</span><span class="line">
</span><span class="line">    <span class="c1">// when</span>
</span><span class="line">    <span class="n">bolt</span><span class="o">.</span><span class="na">execute</span><span class="o">(</span><span class="n">tickTuple</span><span class="o">);</span>
</span><span class="line">
</span><span class="line">    <span class="c1">// then</span>
</span><span class="line">    <span class="n">verifyZeroInteractions</span><span class="o">(</span><span class="n">collector</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
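<p>If you want to see the same given/when/then structure without pulling in Storm or Mockito, here is a minimal,
self-contained sketch.  Note that <code>RecordingCollector</code> and <code>SimpleCountBolt</code> are hypothetical stand-ins that I
made up for illustration; they are not part of <code>storm-starter</code>, and the sketch ignores the sliding window entirely.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Storm's OutputCollector: it only records what was emitted.
class RecordingCollector {
    final List<String> emitted = new ArrayList<>();

    void emit(String obj, long count) {
        emitted.add(obj + "=" + count);
    }
}

// Hypothetical, heavily simplified counting bolt (no sliding window, no Storm types).
class SimpleCountBolt {
    private final Map<String, Long> counts = new HashMap<>();
    private final RecordingCollector collector;

    SimpleCountBolt(RecordingCollector collector) {
        this.collector = collector;
    }

    // A tick triggers an emit of all current counts; any other input is counted.
    void execute(String obj, boolean isTick) {
        if (isTick) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                collector.emit(e.getKey(), e.getValue());
            }
        } else {
            counts.merge(obj, 1L, Long::sum);
        }
    }
}

class SimpleCountBoltDemo {
    public static void main(String[] args) {
        // given
        RecordingCollector collector = new RecordingCollector();
        SimpleCountBolt bolt = new SimpleCountBolt(collector);

        // when: a tick arrives before anything has been counted
        bolt.execute(null, true);

        // then: nothing was emitted
        System.out.println(collector.emitted.isEmpty());
    }
}
```

<p>A hand-written fake like this is fine for a toy example, but for the real bolt a mocking library such as Mockito saves
you from writing stand-ins for <code>Tuple</code>, <code>TopologyContext</code> and <code>OutputCollector</code> yourself.</p>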

<h2 id="abstractrankerbolt">AbstractRankerBolt</h2>

<p>This abstract bolt provides the basic behavior of bolts that rank objects according to their natural order.  It uses the
<a href="http://en.wikipedia.org/wiki/Template_method_pattern">template method design pattern</a> for its <code>execute()</code> method to
allow actual bolt implementations to specify how incoming tuples are processed, i.e. how the objects embedded within
those tuples are retrieved and counted.</p>

<p>This bolt has a private <code>Rankings</code> field to rank incoming tuples (those must contain <code>Rankable</code> objects, of course)
according to their natural order.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>AbstractRankerBolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="cm">/**</span>
</span><span class="line"><span class="cm"> * This method functions as a template method (design pattern).</span>
</span><span class="line"><span class="cm"> */</span>
</span><span class="line"><span class="nd">@Override</span>
</span><span class="line"><span class="kd">public</span> <span class="kd">final</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">,</span> <span class="n">BasicOutputCollector</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">TupleHelpers</span><span class="o">.</span><span class="na">isTickTuple</span><span class="o">(</span><span class="n">tuple</span><span class="o">))</span> <span class="o">{</span>
</span><span class="line">        <span class="n">getLogger</span><span class="o">().</span><span class="na">info</span><span class="o">(</span><span class="s">&quot;Received tick tuple, triggering emit of current rankings&quot;</span><span class="o">);</span>
</span><span class="line">        <span class="n">emitRankings</span><span class="o">(</span><span class="n">collector</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">else</span> <span class="o">{</span>
</span><span class="line">        <span class="n">updateRankingsWithTuple</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// implemented by the concrete ranking bolts</span>
</span><span class="line"><span class="kd">abstract</span> <span class="kt">void</span> <span class="nf">updateRankingsWithTuple</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The two actual implementations used in the Rolling Top Words topology, <code>IntermediateRankingsBolt</code> and
<code>TotalRankingsBolt</code>, only need to implement the <code>updateRankingsWithTuple()</code> method.</p>
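<p>In case the template method pattern is new to you, the following sketch shows the idea in isolation.  It is not the
<code>storm-starter</code> code: the class names, the <code>TICK</code> marker and the use of plain strings instead of Storm tuples
are assumptions I made to keep the example self-contained.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for AbstractRankerBolt; inputs are plain strings, not Storm tuples.
abstract class AbstractRankerSketch {
    static final String TICK = "__tick";   // assumed marker for a tick input
    protected final List<String> rankings = new ArrayList<>();
    final List<String> emitted = new ArrayList<>();

    // Template method: the control flow is fixed here and cannot be overridden.
    public final void execute(String input) {
        if (TICK.equals(input)) {
            emitted.add(rankings.toString());
        } else {
            updateRankingsWithInput(input);
        }
    }

    // The single step that concrete rankers must supply.
    abstract void updateRankingsWithInput(String input);
}

// Example subclass: keeps the inputs sorted in their natural order.
class SortedRankerSketch extends AbstractRankerSketch {
    @Override
    void updateRankingsWithInput(String input) {
        rankings.add(input);
        Collections.sort(rankings);
    }
}

class TemplateMethodDemo {
    public static void main(String[] args) {
        AbstractRankerSketch ranker = new SortedRankerSketch();
        ranker.execute("storm");
        ranker.execute("hadoop");
        ranker.execute(AbstractRankerSketch.TICK);
        System.out.println(ranker.emitted);   // prints [[hadoop, storm]]
    }
}
```

<p>Because <code>execute()</code> is <code>final</code>, a subclass cannot accidentally break the tick handling; it can only decide
how a regular input changes the rankings.</p>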

<h2 id="intermediaterankingsbolt">IntermediateRankingsBolt</h2>

<p>This bolt extends <code>AbstractRankerBolt</code> and ranks incoming objects by their count in order to produce intermediate
rankings.  This type of aggregation is similar to the functionality of a <em>combiner</em> in Hadoop.  The topology runs many
such intermediate ranking bolts in parallel to distribute the load of processing the incoming rolling counts from
the <code>RollingCountBolt</code> instances.</p>

<p>This bolt only needs to override <code>updateRankingsWithTuple()</code> of <code>AbstractRankerBolt</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>IntermediateRankingsBolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="nd">@Override</span>
</span><span class="line"><span class="kt">void</span> <span class="nf">updateRankingsWithTuple</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Rankable</span> <span class="n">rankable</span> <span class="o">=</span> <span class="n">RankableObjectWithFields</span><span class="o">.</span><span class="na">from</span><span class="o">(</span><span class="n">tuple</span><span class="o">);</span>
</span><span class="line">    <span class="kd">super</span><span class="o">.</span><span class="na">getRankings</span><span class="o">().</span><span class="na">updateWith</span><span class="o">(</span><span class="n">rankable</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
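<p>To give you a feeling for what "ranking by count" boils down to, here is a rough sketch of a top-N list that is updated
one object at a time.  <code>RankedItem</code> and <code>TopNRankings</code> are simplified, hypothetical stand-ins for
<code>Rankable</code> and <code>Rankings</code>; the real classes in <code>storm-starter</code> may well differ in their details.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified stand-in for a Rankable: an object plus its count.
class RankedItem {
    final String obj;
    final long count;

    RankedItem(String obj, long count) {
        this.obj = obj;
        this.count = count;
    }

    @Override
    public String toString() {
        return obj + "=" + count;
    }
}

// Hypothetical, simplified stand-in for Rankings: keeps only the top N items by count.
class TopNRankings {
    private final int topN;
    private final List<RankedItem> items = new ArrayList<>();

    TopNRankings(int topN) {
        this.topN = topN;
    }

    // Replace any older entry for the same object, re-sort, and cut off after N entries.
    void updateWith(RankedItem item) {
        items.removeIf(existing -> existing.obj.equals(item.obj));
        items.add(item);
        items.sort(Comparator.comparingLong((RankedItem r) -> r.count).reversed());
        while (items.size() > topN) {
            items.remove(items.size() - 1);
        }
    }

    List<RankedItem> getRankings() {
        return new ArrayList<>(items);
    }
}

class TopNRankingsDemo {
    public static void main(String[] args) {
        TopNRankings rankings = new TopNRankings(2);
        rankings.updateWith(new RankedItem("storm", 3));
        rankings.updateWith(new RankedItem("hadoop", 5));
        rankings.updateWith(new RankedItem("avro", 1));
        rankings.updateWith(new RankedItem("storm", 7));
        System.out.println(rankings.getRankings());   // prints [storm=7, hadoop=5]
    }
}
```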

<h2 id="totalrankingsbolt">TotalRankingsBolt</h2>

<p>This bolt extends <code>AbstractRankerBolt</code> and merges incoming intermediate <code>Rankings</code> emitted by the
<code>IntermediateRankingsBolt</code> instances.</p>

<p>Like <code>IntermediateRankingsBolt</code>, this bolt only needs to override the <code>updateRankingsWithTuple()</code> method:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>TotalRankingsBolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="nd">@Override</span>
</span><span class="line"><span class="kt">void</span> <span class="nf">updateRankingsWithTuple</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Rankings</span> <span class="n">rankingsToBeMerged</span> <span class="o">=</span> <span class="o">(</span><span class="n">Rankings</span><span class="o">)</span> <span class="n">tuple</span><span class="o">.</span><span class="na">getValue</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
</span><span class="line">    <span class="kd">super</span><span class="o">.</span><span class="na">getRankings</span><span class="o">().</span><span class="na">updateWith</span><span class="o">(</span><span class="n">rankingsToBeMerged</span><span class="o">);</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
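<p>Conceptually, the merge step looks like the sketch below.  I use plain maps from object to count in place of the
<code>Rankings</code> class, and <code>mergeTopN()</code> is a hypothetical helper invented for this illustration.  The sketch
assumes that a newer count for an object simply replaces an older one.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankingsMergeSketch {
    // Merge partial rankings (object -> count) and return the N objects with the highest counts.
    static List<String> mergeTopN(List<Map<String, Long>> partials, int topN) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            merged.putAll(partial);   // a newer count for the same object replaces the older one
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(merged.entrySet());
        // Sort by count descending; break ties by name so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(topN, entries.size()))) {
            result.add(e.getKey() + "=" + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> fromRankerA = new HashMap<>();
        fromRankerA.put("storm", 7L);
        fromRankerA.put("avro", 1L);
        Map<String, Long> fromRankerB = new HashMap<>();
        fromRankerB.put("hadoop", 5L);
        fromRankerB.put("pig", 2L);

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(fromRankerA);
        partials.add(fromRankerB);
        System.out.println(mergeTopN(partials, 3));   // prints [storm=7, hadoop=5, pig=2]
    }
}
```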

<p>Since this bolt is responsible for creating a global, consolidated ranking
of currently trending topics, the topology must run only a single instance of <code>TotalRankingsBolt</code>.  In other words,
it must be a singleton in the topology.</p>

<p>The bolt’s current code in <code>storm-starter</code> does not enforce this behavior though – instead it relies on the
<code>RollingTopWords</code> class to configure the bolt’s parallelism correctly (if you are wondering why it doesn’t: that was
simply an oversight on my part, oops).  If you want to improve that, you can provide a so-called <em>per-component</em> Storm
configuration for this bolt that sets its maximum task parallelism to 1:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>TotalRankingsBolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="nd">@Override</span>
</span><span class="line"><span class="kd">public</span> <span class="n">Map</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Object</span><span class="o">&gt;</span> <span class="n">getComponentConfiguration</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="n">Config</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Config</span><span class="o">();</span>
</span><span class="line">    <span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TICK_TUPLE_FREQ_SECS</span><span class="o">,</span> <span class="n">emitFrequencyInSeconds</span><span class="o">);</span>
</span><span class="line">    <span class="c1">// run only a single instance of this bolt in the Storm topology</span>
</span><span class="line">    <span class="n">conf</span><span class="o">.</span><span class="na">setMaxTaskParallelism</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span>
</span><span class="line">    <span class="k">return</span> <span class="n">conf</span><span class="o">;</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="rollingtopwords">RollingTopWords</h2>

<p>The class <a href="https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/RollingTopWords.java">RollingTopWords</a>
ties all the previously discussed code pieces together.  It implements the actual Storm topology, configures spouts
and bolts, wires them together and launches the topology in local mode (Storm’s local mode is similar to a
<a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/">pseudo-distributed, single-node Hadoop cluster</a>).</p>

<p>By default, it will produce the top 5 rolling words (our trending topics) and run for one minute before terminating.
If you want to tweak the topology’s configuration settings, here are the most important ones:</p>

<ul>
  <li>Configure the number of generated trending topics by setting the <code>TOP_N</code> constant in <code>RollingTopWords</code>.</li>
  <li>Configure the window length and the emit frequency (both in seconds) of the sliding window counting via the constructor of
<code>RollingCountBolt</code> in <code>RollingTopWords#wireTopology()</code>.</li>
  <li>Similarly, configure the emit frequencies (in seconds) of the ranking bolts by using their corresponding constructors.</li>
  <li>Configure the
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">parallelism of the topology</a> by setting the
<code>parallelism_hint</code> parameter of each bolt and spout accordingly.</li>
</ul>

<p>Apart from this there is nothing special about this class. And because we have already seen the most important code
snippet from this class in the section <em>Overview of the Topology</em> I will not describe it any further here.</p>

<h2 id="running-the-rolling-top-words-topology">Running the Rolling Top Words topology</h2>

<p>Now that you know how the trending topics Storm code works, it is about time we actually launch the topology!  The
topology is configured to run in local mode, which means you can just grab the code to your development box and launch
it right away.  You do not need any special Storm cluster installation or similar setup.</p>

<p>First you must check out the latest code of the <a href="https://github.com/nathanmarz/storm-starter/">storm-starter</a> project
from GitHub:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone https://github.com/nathanmarz/storm-starter.git
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then you compile and run the Rolling Top Words topology:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># make sure you are in the top-level directory of the storm-starter repo</span>
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>storm-starter
</span><span class="line">
</span><span class="line"><span class="nv">$ </span>mvn -f m2-pom.xml compile <span class="nb">exec</span>:java -Dexec.classpathScope<span class="o">=</span>compile -Dexec.mainClass<span class="o">=</span>storm.starter.RollingTopWords
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>By default the topology will run for one minute and then terminate automatically.</p>

<div class="note">
More information about running and packaging the code for use in a Storm cluster is available in the repo&#8217;s
<a href="https://github.com/nathanmarz/storm-starter/blob/master/README.markdown">README</a> file.
</div>

<h3 id="example-logging-output">Example Logging Output</h3>

<p>Here is some example logging output of the topology.  The first column is the current time in milliseconds since the
topology was started (i.e. it is <code>0</code> at the very beginning).  The second column is the ID of the thread that logged
the message.  I deliberately removed some entries in the log flow to make the output easier to read.  For this reason
please take a close look at the timestamps (first column) when you want to compare the various example outputs below.</p>

<p>Also, the Rolling Top Words topology has debugging output enabled.  This means that Storm itself will by default log
information such as what data a bolt/spout has emitted.  For that reason you will see seemingly duplicate lines in the
logs below.</p>

<p>Lastly, to make the logging output easier to read here is some information about the various thread IDs in this example
run:</p>

<table>
  <thead>
    <tr>
      <th>Thread ID</th>
      <th>Java Class</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Thread-37</td>
      <td>TestWordSpout</td>
    </tr>
    <tr>
      <td>Thread-39</td>
      <td>TestWordSpout</td>
    </tr>
    <tr>
      <td>Thread-19</td>
      <td>RollingCountBolt</td>
    </tr>
    <tr>
      <td>Thread-21</td>
      <td>RollingCountBolt</td>
    </tr>
    <tr>
      <td>Thread-25</td>
      <td>RollingCountBolt</td>
    </tr>
    <tr>
      <td>Thread-31</td>
      <td>IntermediateRankingsBolt</td>
    </tr>
    <tr>
      <td>Thread-33</td>
      <td>IntermediateRankingsBolt</td>
    </tr>
    <tr>
      <td>Thread-27</td>
      <td>TotalRankingsBolt</td>
    </tr>
  </tbody>
</table>

<div class="note">
Note: The Rolling Top Words code in the <tt>storm-starter</tt> repository runs more instances of the various spouts and
bolts than the code used in this article.  I downscaled the settings only to make the figures etc. easier to read.
This means your own logging output will look slightly different.
</div>
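<p>If you want to filter or compare such log lines programmatically (for instance by thread ID), the column layout
described above can be parsed with a regular expression.  The following is only a minimal sketch: the class
<code>LogLineSketch</code> and its field names are my own hypothetical helpers and not part of Storm or
<code>storm-starter</code>, and the pattern assumes the exact log layout shown in this article.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Storm or storm-starter): splits one log line into its columns.
final class LogLineSketch {
    // Assumed layout: elapsed millis, [thread ID], log level, logger name, " - ", message.
    private static final Pattern LINE =
        Pattern.compile("^(\\d+) \\[([^\\]]+)\\]\\s+(\\w+)\\s+(\\S+)\\s+- (.*)$");

    final long elapsedMillis;
    final String threadId;
    final String level;
    final String logger;
    final String message;

    private LogLineSketch(Matcher m) {
        this.elapsedMillis = Long.parseLong(m.group(1));
        this.threadId = m.group(2);
        this.level = m.group(3);
        this.logger = m.group(4);
        this.message = m.group(5);
    }

    // Parses a single line; rejects lines that do not follow the assumed layout.
    static LogLineSketch parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized log line: " + line);
        }
        return new LogLineSketch(m);
    }

    public static void main(String[] args) {
        LogLineSketch parsed = parse(
            "2056 [Thread-37] INFO  backtype.storm.daemon.task  - Emitting: wordGenerator default [golda]");
        System.out.println(parsed.elapsedMillis + " " + parsed.threadId + " " + parsed.message);
    }
}
```

<p>With the thread ID extracted you can look up the corresponding Java class in the table above, or sort entries from
different threads by their timestamp.</p>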

<p>The topology has just started to run.  The spouts generate their first output messages:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class=""><span class="line">2056 [Thread-37] INFO  backtype.storm.daemon.task  - Emitting: wordGenerator default [golda]
</span><span class="line">2057 [Thread-19] INFO  backtype.storm.daemon.executor  - Processing received message source: wordGenerator:11, stream: default, id: {}, [golda]
</span><span class="line">2063 [Thread-39] INFO  backtype.storm.daemon.task  - Emitting: wordGenerator default [nathan]
</span><span class="line">2064 [Thread-25] INFO  backtype.storm.daemon.executor  - Processing received message source: wordGenerator:12, stream: default, id: {}, [nathan]
</span><span class="line">2069 [Thread-37] INFO  backtype.storm.daemon.task  - Emitting: wordGenerator default [mike]
</span><span class="line">2069 [Thread-21] INFO  backtype.storm.daemon.executor  - Processing received message source: wordGenerator:13, stream: default, id: {}, [mike]</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The three RollingCountBolt instances start to emit their first sliding window counts:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class=""><span class="line">4765 [Thread-19] INFO  backtype.storm.daemon.executor  - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
</span><span class="line">4765 [Thread-19] INFO  storm.starter.bolt.RollingCountBolt  - Received tick tuple, triggering emit of current window counts
</span><span class="line">4765 [Thread-25] INFO  backtype.storm.daemon.executor  - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
</span><span class="line">4765 [Thread-25] INFO  storm.starter.bolt.RollingCountBolt  - Received tick tuple, triggering emit of current window counts
</span><span class="line">4766 [Thread-21] INFO  backtype.storm.daemon.executor  - Processing received message source: __system:-1, stream: __tick, id: {}, [3]
</span><span class="line">4766 [Thread-21] INFO  storm.starter.bolt.RollingCountBolt  - Received tick tuple, triggering emit of current window counts
</span><span class="line">4766 [Thread-19] INFO  backtype.storm.daemon.task  - Emitting: counter default [golda, 24, 2]
</span><span class="line">4766 [Thread-25] INFO  backtype.storm.daemon.task  - Emitting: counter default [nathan, 33, 2]
</span><span class="line">4766 [Thread-21] INFO  backtype.storm.daemon.task  - Emitting: counter default [mike, 27, 2]</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The two <code>IntermediateRankingsBolt</code> instances emit their intermediate rankings:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line">5774 [Thread-31] INFO  backtype.storm.daemon.task  - Emitting: intermediateRanker default [[[mike|27|2], [golda|24|2]]]
</span><span class="line">5774 [Thread-33] INFO  backtype.storm.daemon.task  - Emitting: intermediateRanker default [[[bertels|31|2], [jackson|19|2]]]
</span><span class="line">5774 [Thread-31] INFO  storm.starter.bolt.IntermediateRankingsBolt  - Rankings: [[mike|27|2], [golda|24|2]]
</span><span class="line">5774 [Thread-33] INFO  storm.starter.bolt.IntermediateRankingsBolt  - Rankings: [[bertels|31|2], [jackson|19|2]]</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The single <code>TotalRankingsBolt</code> instance emits its global rankings:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class=""><span class="line">3765 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: []
</span><span class="line">5767 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: []
</span><span class="line">7768 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: [[nathan|33|2], [bertels|31|2], [mike|27|2], [golda|24|2], [jackson|19|2]]
</span><span class="line">9770 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: [[bertels|76|5], [nathan|58|5], [mike|49|5], [golda|24|2], [jackson|19|2]]
</span><span class="line">11771 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: [[bertels|76|5], [nathan|58|5], [jackson|52|5], [mike|49|5], [golda|49|5]]
</span><span class="line">13772 [Thread-27] INFO  storm.starter.bolt.TotalRankingsBolt  - Rankings: [[bertels|110|8], [nathan|85|8], [golda|85|8], [jackson|83|8], [mike|71|8]]</span></code></pre></td></tr></table></div></figure></notextile></div>
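<p>Conceptually, the global ranking is a merge of the intermediate rankings followed by a cut-off at the top N entries.
The sketch below illustrates that idea in plain Java.  It is not the <code>storm-starter</code> implementation: the
class <code>RankingsMergeSketch</code> and the method <code>mergeTopN</code> are hypothetical names, and I assume that
each word is reported by exactly one upstream bolt so that a later value may simply replace an earlier one.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of merging partial rankings into a global top-N list.
final class RankingsMergeSketch {

    // Merges word -> count maps and returns the topN words, highest count first.
    static List<String> mergeTopN(int topN, List<Map<String, Long>> partialRankings) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map<String, Long> partial : partialRankings) {
            counts.putAll(partial); // assumption: each word comes from exactly one upstream bolt
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries.subList(0, Math.min(topN, entries.size()))) {
            words.add(entry.getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        // Counts taken from the intermediate rankings in the example log output above.
        Map<String, Long> first = new LinkedHashMap<>();
        first.put("mike", 27L);
        first.put("golda", 24L);
        Map<String, Long> second = new LinkedHashMap<>();
        second.put("bertels", 31L);
        second.put("jackson", 19L);

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(first);
        partials.add(second);
        System.out.println(mergeTopN(3, partials)); // prints [bertels, mike, golda]
    }
}
```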

<div class="note">
Note: During the first few seconds after startup you will observe that <tt>IntermediateRankingsBolt</tt> and
<tt>TotalRankingsBolt</tt> instances will emit empty rankings.  This is normal and the expected behavior &#8211; during the
first seconds the <tt>RollingCountBolt</tt> instances will collect incoming words/topics and fill their sliding windows
before emitting the first rolling counts to the <tt>IntermediateRankingsBolt</tt> instances.  The same kind of thing
happens for the combination of <tt>IntermediateRankingsBolt</tt> instances and the <tt>TotalRankingsBolt</tt> instance.
This is an important behavior of the code that must be understood by downstream data consumers of the trending topics
emitted by the topology.
</div>

<h1 id="what-i-did-not-cover">What I Did Not Cover</h1>

<p>I introduced a new feature to the Rolling Top Words code that I contributed back to <code>storm-starter</code>.  This feature
is a metric that tracks the difference between the configured length of the sliding window (in seconds) and the actual
window length as seen in the emitted output data.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">4763 [Thread-25] WARN  storm.starter.bolt.RollingCountBolt  - Actual window length is 2 seconds when it should be 9 seconds (you can safely ignore this warning during the startup phase)</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This metric provides downstream data consumers with additional meta data, namely the time range that a data tuple
actually covers.  It is a nifty addition that will make the life of your fellow data scientists easier.  Typically, you
will see a difference between configured and actual window length a) during startup for the reasons mentioned above and
b) when your machines are under high load and therefore do not respond perfectly in time.  I omitted the discussion of
this new feature to prevent this article from getting too long.</p>
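<p>To give a rough idea of how such a metric can be computed, here is a minimal sketch in plain Java.  It is not the
code I contributed to <code>storm-starter</code>; the class <code>WindowLengthSketch</code> and its method are
hypothetical, and I assume that the window is divided into a fixed number of slots and advances by one slot whenever
a tick tuple arrives.  The sketch remembers when the last few advances happened and reports the time span between
the oldest remembered advance and now.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: tracks how many seconds a sliding window really covers.
final class WindowLengthSketch {
    private final int numSlots;
    private final Deque<Long> lastAdvancesMs = new ArrayDeque<>();

    // numSlots = configured window length / emit frequency, e.g. 9 s / 3 s = 3 slots.
    WindowLengthSketch(int numSlots, long startMs) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("numSlots must be at least 1");
        }
        this.numSlots = numSlots;
        lastAdvancesMs.addLast(startMs);
    }

    // Call whenever the window advances by one slot; returns the seconds actually covered.
    long advanceAndGetActualSeconds(long nowMs) {
        long oldestMs = lastAdvancesMs.peekFirst();
        long actualSeconds = (nowMs - oldestMs) / 1000;
        lastAdvancesMs.addLast(nowMs);
        if (lastAdvancesMs.size() > numSlots) {
            lastAdvancesMs.removeFirst(); // keep only the advances that bound the current window
        }
        return actualSeconds;
    }

    public static void main(String[] args) {
        WindowLengthSketch tracker = new WindowLengthSketch(3, 0L);
        long[] tickTimesMs = {3000L, 6000L, 9000L, 12000L, 16000L}; // the last tick arrives one second late
        for (long tickMs : tickTimesMs) {
            System.out.println(tickMs + " ms -> " + tracker.advanceAndGetActualSeconds(tickMs) + " s");
        }
    }
}
```

<p>In this sketch the first two ticks report 3 and 6 seconds (startup phase), the next two report the configured
9 seconds, and the delayed last tick reports 10 seconds.</p>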

<p>Also, there are some minor changes in my own code that I did not contribute back to <code>storm-starter</code> because I did not
want to introduce too many changes at once (such as a refactored <code>TestWordSpout</code> class).</p>

<h1 id="summary">Summary</h1>

<p>In this article I described how to implement a distributed, real-time trending topics algorithm in Storm.  It uses
the latest features available in Storm 0.8 (namely tick tuples) and should be a good starting point for anyone trying
to implement such an algorithm for their own application.  The new code is now available in the official
<a href="https://github.com/nathanmarz/storm-starter">storm-starter</a> repository, so feel free to take a deeper look.</p>

<p><span class="pullquote-right" data-pullquote="The sliding window analysis described here applies to a broader range of problems than computing trending topics.">
You might ask whether a distributed sliding window analysis is useful beyond the use case I presented in this
article.  It certainly is.  The sliding window analysis described here applies to a broader range of problems
than computing trending topics.  Another typical area of application is real-time infrastructure monitoring, for
instance to identify broken servers by detecting a surge of errors originating from problematic machines.  A similar
use case is identifying attacks against your technical infrastructure, notably flood-type DDoS attacks.  All of these
scenarios can benefit from sliding window analyses of incoming real-time data through tools such as Storm.
</span></p>

<p>If you think the starter code can be improved further, please contribute your changes back to the
<a href="https://github.com/nathanmarz/storm-starter">storm-starter</a> project on GitHub by forking the official repository,
making your changes and sending a pull request.</p>

<h1 id="related-links">Related Links</h1>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a></li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Understanding the Parallelism of a Storm Topology]]></title>
    <link href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/"/>
    <updated>2012-10-16T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology</id>
    <content type="html"><![CDATA[<p>In the past few days I have been test-driving Twitter’s <a href="http://storm-project.net/">Storm</a> project, which is a
distributed real-time data processing platform.  One of my findings so far has been that the quality of Storm’s
documentation and example code is pretty good – it is very easy to get up and running with Storm.  Big props to the
Storm developers!  At the same time, I found the sections on how a Storm topology runs in a cluster not perfectly
clear, and learned that the recent releases of Storm changed some of its behavior in a way that is not yet fully
reflected in the Storm wiki and in the API docs.</p>

<p>In this article I want to share my own understanding of the parallelism of a Storm topology after reading the
documentation and writing some first prototype code.  More specifically, I describe the relationships of worker
processes, executors (threads) and tasks, and how you can configure them according to your needs.  This article is
based on Storm release 0.8.1, the latest version as of October 2012.</p>

<!--more-->

<div class="pointer">
<strong>Update 2012-11-05</strong>: This blog post has been merged into
<a href="https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology">Storm&#8217;s documentation</a>.
</div>

<h1 id="what-is-storm">What is Storm?</h1>

<p>For those readers unfamiliar with Storm here is a <a href="http://storm-project.net/">brief description taken from its homepage</a>:</p>

<blockquote><p>Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!</p><p>Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.</p></blockquote>

<h1 id="what-makes-a-running-topology-worker-processes-executors-and-tasks">What makes a running topology: worker processes, executors and tasks</h1>

<p>Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm
cluster:</p>

<ul>
  <li>Worker processes</li>
  <li>Executors (threads)</li>
  <li>Tasks</li>
</ul>

<p>Here is a simple illustration of their relationships:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_worker-processes_executors_tasks.png" width="563" height="341" title="Relationships of worker processes, executors (threads) and tasks in Storm" /></p>

<div class="caption">
Figure 1: The relationships of worker processes, executors (threads) and tasks in Storm
</div>

<p>A <em>worker process</em> executes a subset of a topology.  A worker process belongs to a specific topology and may run one or
more executors for one or more components (spouts or bolts) of this topology.  A running topology consists of many such
processes running on many machines within a Storm cluster.</p>

<p>An <em>executor</em> is a thread that is spawned by a worker process.  It may run one or more tasks for the same component
(spout or bolt).</p>

<p>A <em>task</em> performs the actual data processing – each spout or bolt that you implement in your code executes as many
tasks across the cluster.  The number of tasks for a component is always the same throughout the lifetime of a
topology, but the number of executors (threads) for a component can change over time.  This means that the following
condition holds true: <code>#threads &lt;= #tasks</code>.  By default, the number of tasks is set to be the same as the number
of executors, i.e. Storm will run one task per thread.</p>

<h1 id="configuring-the-parallelism-of-a-topology">Configuring the parallelism of a topology</h1>

<p>Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called <em>parallelism hint</em>, which
means the initial number of executors (threads) of a component.  In this article though I use the term “parallelism”
in a more general sense to describe how you can configure not only the number of executors but also the number of worker
processes and the number of tasks of a Storm topology.  I will specifically call out when “parallelism” is used in the
narrow definition of Storm.</p>

<p>The following table gives an overview of the various configuration options and how to set them in your code.  There is
more than one way of setting these options though, and the table lists only some of them.  Storm currently has the
following <a href="https://github.com/nathanmarz/storm/wiki/Configuration">order of precedence for configuration settings</a>, from lowest to highest:
<code>defaults.yaml</code> &lt; <code>storm.yaml</code> &lt; topology-specific configuration &lt; internal component-specific configuration &lt;
external component-specific configuration.  Please take a look at the Storm documentation for more details.</p>
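<p>One way to picture this order of precedence is as a series of maps that are merged one after another, where a
source with higher precedence overrides the values of the sources before it.  The following plain Java sketch only
models that idea and is not how Storm loads its configuration internally.  The class
<code>ConfigPrecedenceSketch</code> and the example values are hypothetical; <code>topology.workers</code> is the
configuration key behind <code>TOPOLOGY_WORKERS</code>.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of configuration precedence: later sources override earlier ones.
final class ConfigPrecedenceSketch {

    // Merges the given sources, which must be ordered from lowest to highest precedence.
    @SafeVarargs
    static Map<String, Object> merge(Map<String, Object>... sourcesLowestFirst) {
        Map<String, Object> effective = new HashMap<>();
        for (Map<String, Object> source : sourcesLowestFirst) {
            effective.putAll(source);
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, Object> defaultsYaml = new HashMap<>();
        defaultsYaml.put("topology.workers", 1);   // example value only

        Map<String, Object> stormYaml = new HashMap<>();
        stormYaml.put("topology.workers", 4);      // example value only

        Map<String, Object> topologyConf = new HashMap<>();
        topologyConf.put("topology.workers", 2);   // e.g. set via Config#setNumWorkers(2)

        Map<String, Object> effective = merge(defaultsYaml, stormYaml, topologyConf);
        System.out.println(effective.get("topology.workers")); // prints 2
    }
}
```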

<table>
<tr>
<th>What</th>
<th>Description</th>
<th>Configuration option</th>
<th>How to set in your code (examples)</th>
</tr>
<tr>
<td>#worker processes</td>
<td>How many worker processes to create <em>for the topology</em> across machines in the cluster.</td>
<td><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html#TOPOLOGY_WORKERS">TOPOLOGY_WORKERS</a></td>
<td><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html#setNumWorkers(int)">Config#setNumWorkers</a></td>
</tr>
<tr>
<td>#executors&nbsp;(threads)</td>
<td>How many executors to spawn <em>per component</em>.</td>
<td>?</td>
<td><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/topology/TopologyBuilder.html#setSpout(java.lang.String, backtype.storm.topology.IRichSpout, java.lang.Number)">TopologyBuilder#setSpout()</a> and <a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/topology/TopologyBuilder.html#setBolt(java.lang.String, backtype.storm.topology.IRichBolt, java.lang.Number)">TopologyBuilder#setBolt()</a><br /><br /><br />Note that as of Storm 0.8 the <em>parallelism_hint</em> parameter now specifies the initial number of executors (not tasks!) for that bolt.</td>
</tr>
<tr>
<td>#tasks</td>
<td>How many tasks to create <em>per component</em>.</td>
<td><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html#TOPOLOGY_TASKS">TOPOLOGY_TASKS</a></td>
<td><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/topology/ComponentConfigurationDeclarer.html#setNumTasks(java.lang.Number)">ComponentConfigurationDeclarer#setNumTasks()</a></td>
</tr>
</table>

<p>Here is an example code snippet to show these settings in practice:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Configuring the parallelism of a Storm bolt  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">topologyBuilder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">&quot;green-bolt&quot;</span><span class="o">,</span> <span class="k">new</span> <span class="n">GreenBolt</span><span class="o">(),</span> <span class="mi">2</span><span class="o">)</span>
</span><span class="line">               <span class="o">.</span><span class="na">setNumTasks</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line">               <span class="o">.</span><span class="na">shuffleGrouping</span><span class="o">(</span><span class="s">&quot;blue-spout&quot;</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In the above code we configured Storm to run the bolt <code>GreenBolt</code> with an initial number of two executors and four
associated tasks.  Storm will run two tasks per executor (thread).  If you do not explicitly configure the number of
tasks, Storm will run by default one task per executor.</p>
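<p>The arithmetic behind "two tasks per executor" can be sketched as follows.  This is a simplified model under the
assumption that tasks are split as evenly as possible into contiguous groups; the class
<code>TaskDistributionSketch</code> is hypothetical, and the task numbers are local indices for illustration rather
than the task IDs that Storm assigns in a real cluster.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: spreads the tasks of one component across its executors (threads).
final class TaskDistributionSketch {

    // Returns, per executor, the list of task indices it would run.
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        int base = numTasks / numExecutors;   // every executor gets at least this many tasks
        int extra = numTasks % numExecutors;  // the first 'extra' executors get one more
        List<List<Integer>> assignment = new ArrayList<>();
        int nextTask = 0;
        for (int executor = 0; executor < numExecutors; executor++) {
            int size = base + (executor < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(nextTask++);
            }
            assignment.add(tasks);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // GreenBolt from the snippet above: 4 tasks on 2 executors.
        System.out.println(assign(4, 2)); // prints [[0, 1], [2, 3]]
    }
}
```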

<h1 id="example-of-a-running-topology">Example of a running topology</h1>

<p>The following illustration shows what a simple topology would look like in operation.  The topology consists of three
components: one spout called <code>BlueSpout</code> and two bolts called <code>GreenBolt</code> and <code>YellowBolt</code>.  The components are
linked such that <code>BlueSpout</code> sends its output to <code>GreenBolt</code>, which in turn sends its own output to
<code>YellowBolt</code>.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Storm_example_of_a_running_topology.png" width="565" height="514" title="Storm: Example of a running topology" /></p>

<div class="caption">
Figure 2: Example of a running topology
</div>

<p>The <code>GreenBolt</code> was configured as per the code snippet above whereas <code>BlueSpout</code> and <code>YellowBolt</code> only set the parallelism hint
(number of executors).  Here is the relevant code:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>Configuring the parallelism of a simple Storm topology  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">Config</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Config</span><span class="o">();</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">setNumWorkers</span><span class="o">(</span><span class="mi">2</span><span class="o">);</span> <span class="c1">// use two worker processes</span>
</span><span class="line">
</span><span class="line"><span class="n">topologyBuilder</span><span class="o">.</span><span class="na">setSpout</span><span class="o">(</span><span class="s">&quot;blue-spout&quot;</span><span class="o">,</span> <span class="k">new</span> <span class="n">BlueSpout</span><span class="o">(),</span> <span class="mi">2</span><span class="o">);</span> <span class="c1">// parallelism hint</span>
</span><span class="line">
</span><span class="line"><span class="n">topologyBuilder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">&quot;green-bolt&quot;</span><span class="o">,</span> <span class="k">new</span> <span class="n">GreenBolt</span><span class="o">(),</span> <span class="mi">2</span><span class="o">)</span>
</span><span class="line">               <span class="o">.</span><span class="na">setNumTasks</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line">               <span class="o">.</span><span class="na">shuffleGrouping</span><span class="o">(</span><span class="s">&quot;blue-spout&quot;</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="n">topologyBuilder</span><span class="o">.</span><span class="na">setBolt</span><span class="o">(</span><span class="s">&quot;yellow-bolt&quot;</span><span class="o">,</span> <span class="k">new</span> <span class="n">YellowBolt</span><span class="o">(),</span> <span class="mi">6</span><span class="o">)</span>
</span><span class="line">               <span class="o">.</span><span class="na">shuffleGrouping</span><span class="o">(</span><span class="s">&quot;green-bolt&quot;</span><span class="o">);</span>
</span><span class="line">
</span><span class="line"><span class="n">StormSubmitter</span><span class="o">.</span><span class="na">submitTopology</span><span class="o">(</span>
</span><span class="line">        <span class="s">&quot;mytopology&quot;</span><span class="o">,</span>
</span><span class="line">        <span class="n">conf</span><span class="o">,</span>
</span><span class="line">        <span class="n">topologyBuilder</span><span class="o">.</span><span class="na">createTopology</span><span class="o">()</span>
</span><span class="line">    <span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:</p>

<ul>
  <li><a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html#TOPOLOGY_MAX_TASK_PARALLELISM">TOPOLOGY_MAX_TASK_PARALLELISM</a>:
This setting puts a ceiling on the number of executors that can be spawned for a single component.  It is typically
used during testing to limit the number of threads spawned when running a topology in local mode.  You can set this
option via e.g. <a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html#setMaxTaskParallelism(int)">Config#setMaxTaskParallelism()</a>.</li>
</ul>
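<p>The effect of such a ceiling can be modeled in a few lines of plain Java.  This is a simplified sketch of the
behavior described above and not Storm's internal logic; the class <code>MaxParallelismSketch</code> and its method
are hypothetical, and <code>null</code> stands for a ceiling that has not been configured.</p>

```java
// Hypothetical model of a per-component ceiling on the number of executors.
final class MaxParallelismSketch {

    // Returns the number of executors after applying the optional ceiling.
    static int effectiveExecutors(int parallelismHint, Integer maxTaskParallelism) {
        if (parallelismHint < 1) {
            throw new IllegalArgumentException("parallelism hint must be at least 1");
        }
        if (maxTaskParallelism == null) {
            return parallelismHint; // no ceiling configured
        }
        return Math.min(parallelismHint, maxTaskParallelism);
    }

    public static void main(String[] args) {
        // A parallelism hint of 6 (as used for YellowBolt above) with a ceiling of 3.
        System.out.println(effectiveExecutors(6, 3)); // prints 3
    }
}
```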

<p><em>Update Oct 18: Nathan Marz informed me that <code>TOPOLOGY_OPTIMIZE</code> will be removed in a future release.  I have
therefore removed its entry from the configuration list above.</em></p>

<h1 id="how-to-change-the-parallelism-of-a-running-topology">How to change the parallelism of a running topology</h1>

<p>A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without
being required to restart the cluster or the topology.  The act of doing so is called <em>rebalancing</em>.</p>

<p>You have two options to rebalance a topology:</p>

<ol>
  <li>Use the Storm web UI to rebalance the topology.</li>
  <li>Use the CLI tool <tt>storm rebalance</tt> as described below.</li>
</ol>

<p>Here is an example of using the CLI tool:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class=""><span class="line"> # Reconfigure the topology "mytopology" to use 5 worker processes,
</span><span class="line"> # the spout "blue-spout" to use 3 executors and
</span><span class="line"> # the bolt "yellow-bolt" to use 10 executors.
</span><span class="line">
</span><span class="line"> $ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="references-for-this-article">References for this article</h1>

<p>To compile this article (and to write my related test code) I used information primarily from the following sources:</p>

<ul>
  <li>The <a href="https://github.com/nathanmarz/storm/wiki/">Storm wiki</a>, most notably the pages
<a href="https://github.com/nathanmarz/storm/wiki/Concepts">Concepts</a>,
<a href="https://github.com/nathanmarz/storm/wiki/Running-topologies-on-a-production-cluster">Running topologies on a production cluster</a>,
<a href="https://github.com/nathanmarz/storm/wiki/Local-mode">Local mode</a>,
<a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Tutorial</a></li>
  <li><a href="http://nathanmarz.github.com/storm/doc-0.8.1/">Storm 0.8.1 API documentation</a>, most notably the class
<a href="http://nathanmarz.github.com/storm/doc-0.8.1/backtype/storm/Config.html">Config</a></li>
  <li>The <a href="https://groups.google.com/d/msg/storm-user/8addaQm3OT4/0OQfSgQkRwEJ">announcement of Storm 0.8.0 release</a>
on the storm-user mailing list.</li>
</ul>

<h1 id="summary">Summary</h1>

<p>My personal impression is that Storm is a very promising tool.  On the one hand I like its clean and elegant design,
and on the other hand I was delighted to find that a young open source tool can still have excellent documentation.
In this article I tried to summarize my own understanding of the parallelism of topologies, which may or may not be
100% correct – feel free to let me know if there are any mistakes in the description above!</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Understanding HDFS quotas and Hadoop fs and fsck tools]]></title>
    <link href="http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/"/>
    <updated>2011-10-20T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools</id>
    <content type="html"><![CDATA[<p>In my experience Hadoop users often confuse the file size numbers reported by commands such as <code>hadoop fsck</code>,
<code>hadoop fs -dus</code> and <code>hadoop fs -count -q</code> when it comes to reasoning about HDFS space quotas.  Here is a short
summary of how the various filesystem tools in Hadoop work in unison.</p>

<!--more-->

<p>In this blog post we will look at three commands:</p>

<ul>
  <li><code>hadoop fsck</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/commands_manual.html#fsck">docs</a>)</li>
  <li><code>hadoop fs -dus</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/file_system_shell.html#dus">docs</a>)</li>
  <li><code>hadoop fs -count -q</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/file_system_shell.html#count">docs</a> and
<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_quota_admin_guide.html#Reporting+Command">more docs</a>)</li>
</ul>

<h1 id="hadoop-fsck-and-hadoop-fs--dus">hadoop fsck and hadoop fs -dus</h1>

<p>First, let’s start with <code>hadoop fsck</code> and <code>hadoop fs -dus</code> because they will report identical numbers.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop fsck /path/to/directory
</span><span class="line"> Total size:    16565944775310 B    &lt;=== see here
</span><span class="line"> Total dirs:    3922
</span><span class="line"> Total files:   418464
</span><span class="line"> Total blocks (validated):      502705 (avg. block size 32953610 B)
</span><span class="line"> Minimally replicated blocks:   502705 (100.0 %)
</span><span class="line"> Over-replicated blocks:        0 (0.0 %)
</span><span class="line"> Under-replicated blocks:       0 (0.0 %)
</span><span class="line"> Mis-replicated blocks:         0 (0.0 %)
</span><span class="line"> Default replication factor:    3
</span><span class="line"> Average block replication:     3.0
</span><span class="line"> Corrupt blocks:                0
</span><span class="line"> Missing replicas:              0 (0.0 %)
</span><span class="line"> Number of data-nodes:          18
</span><span class="line"> Number of racks:               1
</span><span class="line">FSCK ended at Thu Oct 20 20:49:59 CET 2011 in 7516 milliseconds
</span><span class="line">
</span><span class="line">The filesystem under path '/path/to/directory' is HEALTHY
</span><span class="line">
</span><span class="line">$ hadoop fs -dus /path/to/directory
</span><span class="line">hdfs://master:54310/path/to/directory        16565944775310    &lt;=== see here</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>As you can see, <code>hadoop fsck</code> and <code>hadoop fs -dus</code> report the effective HDFS storage space used, i.e. they show the
“normal” file size (as you would see on a local filesystem) and do not account for replication in HDFS.  In this case,
the directory <code>/path/to/directory</code> stores data with a size of <code>16565944775310</code> bytes (15.1 TB).  Now fsck tells
us that the average replication factor for all files in <code>/path/to/directory</code> is exactly <code>3.0</code>.  This means that the
total <em>raw</em> HDFS storage space used by these files – i.e. factoring in replication – is actually:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">3.0 x 16565944775310 (15.1 TB) = 49697834325930 Bytes (45.2 TB)</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This is how much HDFS storage is consumed by files in <code>/path/to/directory</code>.</p>
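<p>The same multiplication can be scripted.  The following is a minimal bash sketch; <code>raw_hdfs_bytes</code> is a
hypothetical helper name (not a Hadoop command), and it assumes an integer replication factor that applies to all files
in the directory:</p>

<pre><code>
# raw_hdfs_bytes EFFECTIVE_BYTES REPLICATION
# Hypothetical helper: raw HDFS bytes = effective bytes x replication factor.
# Assumes an integer replication factor that is the same for all files.
raw_hdfs_bytes() {
  local effective_bytes="$1"
  local replication="$2"
  echo $(( effective_bytes * replication ))
}

# Numbers taken from the fsck output above.
raw_hdfs_bytes 16565944775310 3   # prints 49697834325930
</code></pre>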

<h1 id="hadoop-fs--count--q">hadoop fs -count -q</h1>

<p>Now, let us inspect the HDFS quota set for <code>/path/to/directory</code>.  If we run <code>hadoop fs -count -q</code> we get this result:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop fs -count -q /path/to/directory
</span><span class="line">  QUOTA  REMAINING_QUOTA     SPACE_QUOTA  REMAINING_SPACE_QUOTA    DIR_COUNT  FILE_COUNT      CONTENT_SIZE FILE_NAME
</span><span class="line">   none              inf  54975581388800          5277747062870        3922       418464    16565944775310 hdfs://master:54310/path/to/directory</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>(I manually added the column headers like <code>QUOTA</code> to the output to make it easier to read.)</p>

<p>The seventh column <code>CONTENT_SIZE</code> is, again, the effective HDFS storage space used: <code>16565944775310</code> Bytes (15.1
TB).</p>

<p>The third column <code>SPACE_QUOTA</code>, however, with <code>54975581388800</code> is the <em>raw</em> HDFS space quota in bytes.  The fourth
column <code>REMAINING_SPACE_QUOTA</code> with <code>5277747062870</code> is the remaining <em>raw</em> HDFS space quota in bytes.  Note
that whereas <code>hadoop fsck</code> and <code>hadoop fs -dus</code> report the effective data size (= the same numbers you see on a
local filesystem), the third and fourth columns of <code>hadoop fs -count -q</code> indirectly tell you how many bytes this data
actually consumes across the hard disks of the distributed cluster nodes – and for this each of the
<code>3.0</code> replicas of an HDFS block is counted individually (here, the value <code>3.0</code> has been taken from the <code>hadoop fsck</code> output
above and matches the default replication count).  So if we subtract the remaining space quota from the space
quota we get back the number from above:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">54975581388800 (50 TB) - 5277747062870 (4.8 TB) = 49697834325930 (45.2 TB)</span></code></pre></td></tr></table></div></figure></notextile></div>
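<p>If you want to do this subtraction in a script, a small bash sketch could look like the one below.  The function name
<code>consumed_raw_bytes</code> is made up for this example.  It assumes the column order shown above and only makes sense when a
space quota is actually set (i.e. when the third and fourth columns are numbers rather than <code>none</code> and <code>inf</code>):</p>

<pre><code>
# consumed_raw_bytes: hypothetical helper that reads ONE data line of
# `hadoop fs -count -q` output on stdin and prints
# SPACE_QUOTA - REMAINING_SPACE_QUOTA (columns 3 and 4).
# Assumption: a space quota is set, so both columns are plain integers.
consumed_raw_bytes() {
  local quota remaining_quota space_quota remaining_space_quota rest
  read -r quota remaining_quota space_quota remaining_space_quota rest
  echo $(( space_quota - remaining_space_quota ))
}

# Data line copied from the output above.
echo "none inf 54975581388800 5277747062870 3922 418464 16565944775310 hdfs://master:54310/path/to/directory" | consumed_raw_bytes
# prints 49697834325930
</code></pre>

<p>On a live cluster you would pipe the output of <code>hadoop fs -count -q /path/to/directory</code> into the function instead of
the <code>echo</code> line.</p>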

<p>Now keep in mind that the Hadoop space quota always counts against the raw HDFS disk space consumed.  So if you have a
quota of 10MB, you can store only a single 1MB file if you set its replication to 10.  Or you can store up to three 1MB
files if their replication is set to 3.  Hadoop’s quotas work like that because the replication count
of an HDFS file is a <em>user-configurable</em> setting.  Though Hadoop ships with a default value of <code>3</code>, it is up to
the users to decide whether they want to keep this value or change it.  And because Hadoop can’t anticipate how users
might change the replication setting for their files, it was decided that Hadoop quotas always
operate on the raw HDFS disk space consumed.</p>
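<p>The 10MB example can be expressed as a back-of-the-envelope calculation.  This is only a sketch:
<code>max_files_in_quota</code> is a hypothetical helper, all files are assumed to have the same size and replication, and any
block-level accounting that HDFS may apply while a file is being written is ignored:</p>

<pre><code>
# max_files_in_quota QUOTA_BYTES FILE_BYTES REPLICATION
# Hypothetical helper: how many equally sized files fit into a raw space quota.
# Simplification: ignores block-level accounting during writes.
max_files_in_quota() {
  local quota_bytes="$1"
  local file_bytes="$2"
  local replication="$3"
  echo $(( quota_bytes / (file_bytes * replication) ))
}

MB=$(( 1024 * 1024 ))
max_files_in_quota $(( 10 * MB )) "$MB" 10   # prints 1
max_files_in_quota $(( 10 * MB )) "$MB" 3    # prints 3
</code></pre>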

<h1 id="tldr-and-summary">TL;DR and Summary</h1>

<p>If you never change the default value of 3 for the HDFS replication count of any files you store in your
Hadoop cluster, this means in a nutshell that you should always multiply the numbers reported by <code>hadoop fsck</code> or
<code>hadoop fs -dus</code> by 3 when you want to reason about HDFS space quotas.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Reported file size</th>
      <th style="text-align: right">Local filesystem</th>
      <th style="text-align: right"><code>hadoop fsck</code> and <br /> <code>hadoop fs -dus</code></th>
      <th style="text-align: right"><code>hadoop fs -count -q</code><br />(if replication factor == 3)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">If a file was of size…</td>
      <td style="text-align: right">1GB</td>
      <td style="text-align: right">1GB</td>
      <td style="text-align: right">3GB</td>
    </tr>
  </tbody>
</table>

<p>I hope this clears things up a bit!</p>

<h1 id="related-articles">Related Articles</h1>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2011/03/28/hadoop-space-quotas-hdfs-block-size-replication-and-small-files/">Hadoop space quotas, HDFS block size, replication and small files</a> (earlier blog post)</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Performing an HDFS Upgrade of an Hadoop Cluster]]></title>
    <link href="http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/"/>
    <updated>2011-08-23T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster</id>
    <content type="html"><![CDATA[<p>In this guide I will describe how to upgrade the Distributed Filesystem (HDFS) of a Hadoop cluster.  An HDFS upgrade
might be required when you update the Hadoop software itself.</p>

<!--more-->

<h1 id="before-we-start">Before We Start</h1>

<p>This guide should cover the majority of tasks and aspects of upgrading HDFS but it may not necessarily be complete.
Feel free to provide your feedback and suggestions.</p>

<p>Apart from my own input, this article uses information from the
<a href="http://wiki.apache.org/hadoop/Hadoop_Upgrade">Hadoop Upgrade</a> guide, the
<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS User Guide</a> and
Tom White’s book <a href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</a> (see References section at
the bottom).</p>

<h1 id="background-information-on-hdfs-upgrades">Background information on HDFS upgrades</h1>

<p>An HDFS upgrade may be required when you upgrade from an older version of Hadoop to a newer version.  For instance, an
HDFS upgrade is required when you upgrade from Hadoop 0.20.2 to Hadoop 0.20.203.0.</p>

<blockquote><p>An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the datanodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the filesystem, should you need to. You should understand that any changes made to the data on the upgraded system will be lost after the rollback completes.</p><p>You can keep only the previous version of the filesystem: you can’t roll back several versions.  Therefore, to carry out another upgrade to HDFS data and metadata, you will need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.</p><footer><strong>Tom White</strong> <cite>Hadoop: The Definitive Guide (2nd Ed.) P. 317</cite></footer></blockquote>

<h1 id="how-to-check-whether-an-hdfs-upgrade-is-required">How to check whether an HDFS upgrade is required</h1>

<p>Unfortunately, the most reliable way of finding out whether you need to upgrade the HDFS filesystem is by performing a
trial on a test cluster (doh!).</p>

<p>If you have installed a new version of Hadoop and it expects a different HDFS layout version, the NameNode will refuse
to run. A message like the following will appear in the NameNode logs:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class=""><span class="line">File system image contains an old layout version -18.
</span><span class="line">An upgrade to version -31 is required.
</span><span class="line">Please restart NameNode with -upgrade option.</span></code></pre></td></tr></table></div></figure></notextile></div>
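<p>If you want to look for this message automatically, for instance on a test cluster, a small bash filter is enough.
The sketch below is not part of Hadoop: <code>required_layout_version</code> is a name made up for this example, the log path
in the usage comment is a placeholder, and the pattern assumes the message wording shown above:</p>

<pre><code>
# required_layout_version: hypothetical helper, not part of Hadoop.
# Reads NameNode log text on stdin and prints the layout version that the
# new software asks for, or nothing if no such message is present.
# Assumes the message wording shown above.
required_layout_version() {
  sed -n 's/.*An upgrade to version \(-\{0,1\}[0-9][0-9]*\) is required.*/\1/p'
}

# Usage (the log path is a placeholder; adjust it to your installation):
#   cat /path/to/namenode.log | required_layout_version
</code></pre>

<p>An empty result only means the message was not found in the log you passed in; it does not prove that no upgrade is
needed.</p>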

<h1 id="hdfs-upgrade-instructions">HDFS upgrade instructions</h1>

<p>There are several steps you have to perform for upgrading HDFS.  First, you have to perform some preparation steps
before installing the new (updated) Hadoop software.  Then, after having updated Hadoop (e.g. from version 0.20.2 to
0.20.203.0), you will follow up with another set of steps to actually start the HDFS upgrade and, after successful
testing, finalize it.</p>

<h2 id="before-you-install-the-new-hadoop-version">Before you install the new Hadoop version</h2>

<ol>
  <li>
    <p>Make sure that any previous upgrade is finalized before proceeding with another upgrade.  To find out whether the
cluster needs to be finalized run the command:</p>

    <pre><code>$ hadoop dfsadmin -upgradeProgress status
</code></pre>
  </li>
  <li>Stop all client applications running on the MapReduce cluster.</li>
  <li>
    <p>Stop the MapReduce cluster with</p>

    <pre><code>$ stop-mapred.sh
</code></pre>

    <p>and kill any orphaned task process on the TaskTrackers.</p>
  </li>
  <li>Stop all client applications running on the HDFS cluster.</li>
  <li>Perform some sanity checks on HDFS.
    <ul>
      <li>
        <p>Perform a filesystem check:</p>

        <pre><code>  $ hadoop fsck / -files -blocks -locations &gt; dfs-v-old-fsck-1.log
</code></pre>
      </li>
      <li>Fix HDFS to the point where there are no errors. The resulting file will contain a complete block map of the file
system.
<em>Note: Redirecting the <code>fsck</code> output is recommended for large clusters in order to avoid time-consuming output to
<code>STDOUT</code>.</em></li>
      <li>
        <p>Save a complete listing of the HDFS namespace to a local file:</p>

        <pre><code>  $ hadoop dfs -lsr / &gt; dfs-v-old-lsr-1.log
</code></pre>
      </li>
      <li>
        <p>Create a list of DataNodes participating in the cluster:</p>

        <pre><code>  $ hadoop dfsadmin -report &gt; dfs-v-old-report-1.log
</code></pre>
      </li>
    </ul>
  </li>
  <li>Optionally, copy all data stored in HDFS, or only the data you could not recover otherwise, to a local file system or a backup instance of HDFS.</li>
  <li>
    <p>Optionally, stop and restart the HDFS cluster in order to create an up-to-date namespace checkpoint of the old version:</p>

    <pre><code>$ stop-dfs.sh
$ start-dfs.sh
</code></pre>
  </li>
  <li>Optionally, repeat steps 5, 6 and 7 and compare the results with the previous run to ensure the state of the file system
remained unchanged.</li>
  <li>Create a backup copy of the <code>dfs.name.dir</code> directory on the NameNode (if you followed my
<a href="http://www.michael-noll.com/hadoop/">Hadoop tutorials</a>: <code>/app/hadoop/tmp/dfs/name</code>).  Among other important files, the <code>dfs.name.dir</code>
directory includes the checkpoint files <code>edits</code> and <code>fsimage</code>.</li>
  <li>
    <p>Stop the HDFS cluster.</p>

    <pre><code>$ stop-dfs.sh
</code></pre>

    <p>Verify that HDFS has really stopped, and kill any orphaned DataNode processes on the DataNodes.</p>
  </li>
</ol>
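<p>The backup of <code>dfs.name.dir</code> from step 9 could be scripted roughly as follows.  This is only a sketch:
<code>backup_name_dir</code> and the destination path in the usage comment are assumptions made for this example, and you may
prefer a plain recursive copy instead of a tarball:</p>

<pre><code>
# backup_name_dir NAME_DIR BACKUP_FILE
# Hypothetical helper: packs the dfs.name.dir directory into a gzipped tarball.
#   NAME_DIR    - value of dfs.name.dir, e.g. /app/hadoop/tmp/dfs/name
#   BACKUP_FILE - destination tarball (the path is up to you)
backup_name_dir() {
  local name_dir="$1"
  local backup_file="$2"
  # -C switches to the parent directory so that the archive contains the
  # relative path "name/..." instead of an absolute path.
  tar -czf "$backup_file" -C "$(dirname "$name_dir")" "$(basename "$name_dir")"
}

# Usage (the destination path is a placeholder):
#   backup_name_dir /app/hadoop/tmp/dfs/name /path/to/backup/dfs-name-before-upgrade.tar.gz
</code></pre>

<p>Listing the archive afterwards with <code>tar -tzf</code> is a quick way to confirm that <code>fsimage</code> and <code>edits</code> made it
into the backup.</p>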

<h2 id="install-the-new-hadoop-version">Install the new Hadoop version</h2>

<p>Now you can install the new version of the Hadoop software.</p>

<p><strong>Note:</strong> Make sure to update any symlinks you are using; e.g. if you have symlinks such as <code>/usr/local/hadoop</code> →
<code>/usr/local/hadoop-0.20.203.0</code>.</p>

<h2 id="after-you-have-installed-the-new-hadoop-version">After you have installed the new Hadoop version</h2>

<ol>
  <li>Optionally, update the <code>conf/slaves</code> file before starting to reflect the current set of active nodes.</li>
  <li>Optionally, change the configuration of the NameNode’s and the JobTracker’s port numbers (i.e. the
<code>fs.default.name</code> property in <code>conf/core-site.xml</code> plus <code>conf/hdfs-site.xml</code> and the <code>mapred.job.tracker</code> property in <code>conf/mapred-site.xml</code> respectively) so that nodes still running the old version (with the old port numbers) can no longer connect and disrupt system operation.</li>
  <li>Start the actual HDFS upgrade process.
    <ul>
      <li>
        <p>Upgrade the NameNode by converting the checkpoint to the new version format:</p>

        <pre><code>  $ hadoop-daemon.sh start namenode -upgrade
</code></pre>

        <p><strong>Note:</strong> You need to add the <code>-upgrade</code> switch only once for the actual upgrade process.  Once it has successfully
completed, you can start the NameNode via <code>hadoop-daemon.sh</code>, <code>start-dfs.sh</code> or <code>start-all.sh</code> like you’d
normally do.  The un-finalized upgrade will be in effect until you either finalize the upgrade to make it
permanent or until you perform a rollback of the upgrade (see below).</p>

        <p>The NameNode log will show messages like the following:</p>

        <pre><code>  # NameNode log file
  2011-06-21 14:40:32,579 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading image directory /path/to/nn_namespace_dir.
     old LV = -18; old CTime = 0.
     new LV = -31; new CTime = 1308660032579
  2011-06-21 14:40:32,581 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 8447 saved in 0 seconds.
  2011-06-21 14:40:32,639 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of /path/to/nn_namespace_dir is complete.
  2011-06-21 14:40:32,639 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading image directory /path/to/nn_namespace_dir_bk.
     old LV = -18; old CTime = 0.
     new LV = -31; new CTime = 1308660032579
  2011-06-21 14:40:32,644 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 8447 saved in 0 seconds.
  2011-06-21 14:40:32,650 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of /path/to/nn_namespace_dir_bk is complete.
  2011-06-21 14:40:32,651 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
  2011-06-21 14:40:32,652 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 441 msecs
  2011-06-21 14:40:32,660 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode ON.
</code></pre>
      </li>
      <li>
        <p>The NameNode upgrade process can take a while depending on how many files you have. You can follow the process by
inspecting the NameNode logs, by running <code>hadoop dfsadmin -upgradeProgress status</code> and/or by accessing the
NameNode Web UI.  Once the upgrade process has completed, the NameNode Web UI will show a message similar to
<code>Upgrade for version -31 has been completed. Upgrade is not finalized</code>.  You will finalize the upgrade in a later
step. Right now, the <strong>NameNode is in Safe Mode</strong> waiting for the DataNodes to connect.</p>

        <p><img src="http://www.michael-noll.com/blog/uploads/HDFS_upgrade_NN-upgrade-600x439.png" alt="The NameNode is in Safe Mode waiting for the DataNodes to connect." title="The NameNode is in Safe Mode waiting for the DataNodes to connect." width="600" height="439" class="size-large wp-image-1046" /></p>

        <p>An example status output of the <code>upgradeProgress</code> command at this stage:</p>

        <pre><code>  $ hadoop dfsadmin -upgradeProgress status
  Upgrade for version -31 has been completed.
  Upgrade is not finalized.
</code></pre>
      </li>
      <li>
        <p>Optionally, save a complete listing of the new HDFS namespace to a local file:</p>

        <pre><code>  $ hadoop dfs -lsr / &gt; dfs-v-new-lsr-0.log
</code></pre>

        <p>and compare it with <code>dfs-v-old-lsr-1.log</code> you created previously.</p>
      </li>
      <li>
        <p>Start the HDFS cluster. Since the NameNode is already running, only the DataNodes and the SecondaryNameNode will actually be started.</p>

        <pre><code>  $ start-dfs.sh
</code></pre>

        <p><em>Note: You do not need to add the <code>-upgrade</code> switch here because it is only passed to the NameNode anyway, and
the NameNode has already been instructed to perform an upgrade.</em></p>

        <p>After your DataNodes have completed the upgrade process, you should see a message <code>Safe mode will be turned off
automatically in X seconds.</code> in the NameNode Web UI.  The NameNode should then automatically exit Safe Mode and
HDFS will be in full operation. You can monitor the process via the NameNode Web UI and the NameNode/DataNode
logs.</p>

        <pre><code>  # DataNode log file
  2011-06-21 14:50:56,103 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading storage directory /app/hadoop/tmp/dfs/data.
     old LV = -18; old CTime = 0.
     new LV = -31; new CTime = 1308660032579
  2011-06-21 14:50:56,196 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 1 Directories, including 0 Empty Directories, 0 single Link operations, 1 multi-Link operations, linking 80 files, total 80 linkable files.  Also physically copied 1 other files.
  2011-06-21 14:50:56,196 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of /app/hadoop/tmp/dfs/data is complete.
</code></pre>

        <p>You can check in the NameNode Web UI whether the NameNode has already exited Safe Mode (see screenshot below).
Alternatively, you can run <code>hadoop dfsadmin -safemode get</code>.</p>

        <p><img src="http://www.michael-noll.com/blog/uploads/HDFS_upgrade_NN-upgrade-with-DNs-600x397.png" alt="The NameNode has exited Safe Mode, and DataNodes have started to connect to it." title="The NameNode has exited Safe Mode, and DataNodes have started to connect to it." width="600" height="397" class="size-large wp-image-1049" /></p>

        <p>Note that the status output of the <code>upgradeProgress</code> command should not have changed at this point:</p>

        <pre><code>  $ hadoop dfsadmin -upgradeProgress status
  Upgrade for version -31 has been completed.
  Upgrade is not finalized.
</code></pre>
      </li>
    </ul>
  </li>
  <li>Perform some sanity checks on the new HDFS.
    <ul>
      <li>
        <p>Create a list of DataNodes participating in the updated cluster:</p>

        <pre><code>  $ hadoop dfsadmin -report &gt; dfs-v-new-report-1.log
</code></pre>

        <p>and compare it with <code>dfs-v-old-report-1.log</code> to ensure all DataNodes previously belonging to the cluster are up
and running.</p>
      </li>
      <li>
        <p>Save a complete listing of the new HDFS namespace to a local file:</p>

        <pre><code>  $ hadoop dfs -lsr / &gt; dfs-v-new-lsr-1.log
</code></pre>

        <p>and compare it with <code>dfs-v-old-lsr-1.log</code>.  These files should be identical unless the format of
<code>hadoop dfs -lsr</code> reporting or the data structures have changed in the new version.</p>
      </li>
      <li>
        <p>Perform a filesystem check:</p>

        <pre><code>  $ hadoop fsck / -files -blocks -locations &gt; dfs-v-new-fsck-1.log
</code></pre>

        <p>and compare with <code>dfs-v-old-fsck-1.log</code>.  These files should be identical, unless the <code>hadoop fsck</code> reporting
format has changed in the new version.</p>
      </li>
    </ul>
  </li>
  <li>
    <p>Start the MapReduce cluster:</p>

    <pre><code>$ start-mapred.sh
</code></pre>
  </li>
  <li>Let internal customers perform their own testing on the new HDFS filesystem version.</li>
  <li>Roll back or finalize the upgrade (optional).</li>
</ol>
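<p>The before/after comparisons from step 4 can be automated with a few lines of bash.  The sketch below uses two
hypothetical helpers, <code>snapshot_checksum</code> and <code>compare_snapshots</code>.  Because <code>hadoop fsck</code> output contains a
timestamped line such as <code>FSCK ended at ...</code>, the sketch ignores all lines starting with <code>FSCK </code> before comparing;
treating exactly those lines as the only run-specific ones is an assumption you should verify against your own logs:</p>

<pre><code>
# snapshot_checksum FILE
# Hypothetical helper: checksum of a log file, ignoring lines that start
# with "FSCK " (assumed to be the only lines carrying timestamps).
snapshot_checksum() {
  grep -v '^FSCK ' "$1" | cksum
}

# compare_snapshots OLD_LOG NEW_LOG
# Hypothetical helper: prints whether two snapshot files match and
# returns a non-zero exit status if they differ.
compare_snapshots() {
  local old_log="$1"
  local new_log="$2"
  if [ "$(snapshot_checksum "$old_log")" = "$(snapshot_checksum "$new_log")" ]; then
    echo "identical: $old_log $new_log"
  else
    echo "DIFFERENT: $old_log $new_log"
    return 1
  fi
}

# Usage with the file names from this guide:
#   compare_snapshots dfs-v-old-lsr-1.log  dfs-v-new-lsr-1.log
#   compare_snapshots dfs-v-old-fsck-1.log dfs-v-new-fsck-1.log
</code></pre>

<p>When the function reports a difference, run <code>diff</code> on the two files to see what exactly has changed.</p>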

<h1 id="finishing-the-hdfs-upgrade-process">Finishing the HDFS upgrade process</h1>

<p>After you have initiated the HDFS upgrade in the sections above, you will eventually have to decide – hopefully after
thorough testing – whether you want to stick to the upgraded HDFS (= your tests were successful) or to revert the HDFS
upgrade (= your tests failed).  The following two sections describe how to finalize (i.e. stick to) an HDFS upgrade and
how to perform a rollback of an HDFS upgrade.</p>

<h2 id="how-to-finalize-an-hdfs-upgrade">How to finalize an HDFS upgrade</h2>

<p>If the HDFS upgrade worked out fine and your subsequent testing was successful, you may want to finalize the HDFS
upgrade.</p>

<div class="warning">
<strong>Warning:</strong> Finalizing an HDFS upgrade is a point of no return.  You cannot perform a rollback once the
cluster is finalized!
</div>

<p>You can check with the following command whether the cluster needs to be finalized:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -upgradeProgress status</span></code></pre></td></tr></table></div></figure></notextile></div>
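<p>In a script, the output of this command can be mapped to a simple state.  The following bash sketch only knows the two
messages shown in this article and reports everything else as <code>unknown</code>; <code>upgrade_state</code> and the state names are
made up for this example:</p>

<pre><code>
# upgrade_state: hypothetical helper, not part of Hadoop.
# Reads the output of `hadoop dfsadmin -upgradeProgress status` on stdin
# and prints one of: pending-finalization, no-upgrade, unknown.
upgrade_state() {
  local status
  status="$(cat)"
  case "$status" in
    *"Upgrade is not finalized"*)          echo "pending-finalization" ;;
    *"There are no upgrades in progress"*) echo "no-upgrade" ;;
    *)                                     echo "unknown" ;;
  esac
}

# Usage:
#   hadoop dfsadmin -upgradeProgress status | upgrade_state
</code></pre>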

<p>Run the actual finalize command to make the HDFS upgrade permanent:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -finalizeUpgrade</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The <code>-finalizeUpgrade</code> command removes the previous version of the NameNode’s and DataNodes’ storage directories.</p>

<p>In the NameNode Web UI you should see a message <code>"Upgrades: There are no upgrades in progress"</code>.</p>

<p>The NameNode logs should contain entries similar to the following:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># NameNode log file
</span><span class="line">2011-06-22 20:21:22,333 INFO org.apache.hadoop.hdfs.server.common.Storage: Finalizing upgrade for storage directory /app/hadoop/tmp/dfs/name.
</span><span class="line">   cur LV = -31; cur CTime = 1308774082
</span><span class="line">2011-06-22 20:21:22,336 INFO org.apache.hadoop.hdfs.server.common.Storage: Finalize upgrade for /app/hadoop/tmp/dfs/name is complete.</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Also, running <code>hadoop dfsadmin -upgradeProgress status</code> will now report that there are no pending upgrades:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -upgradeProgress status
</span><span class="line">There are no upgrades in progress.</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><strong>Note:</strong> The finalize upgrade procedure can run in the background without disrupting the performance of the Hadoop
cluster.</p>

<p>It is worth mentioning that deleting files that existed before the upgrade does not free up real disk space on the
DataNodes until the HDFS cluster is finalized.</p>

<p>The official
<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS User Guide</a> in the
Hadoop documentation provides the full instructions for
<a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS Upgrade and Rollback</a>.</p>

<h2 id="how-to-roll-back-an-hdfs-upgrade">How to roll back an HDFS upgrade</h2>

<p>If the HDFS upgrade failed and/or your testing was not successful, you may want to revert the HDFS upgrade.</p>

<div class="warning">
<strong>Warning:</strong> When you perform a rollback on an upgraded HDFS, you will lose all the data that has been created in the time window between the start of the upgrade (i.e. when you ran <code>-upgrade</code>) and the rollback (i.e. when you run <code>-rollback</code>).  The rollback procedure will revert the state of the HDFS filesystem &#8211; and its version &#8211; to how it was before the upgrade was started. In other words, it rolls back to the previous state of the filesystem, rather than downgrading/converting the current state of the filesystem to a former one.
</div>

<p>The official <a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS User Guide</a> in the Hadoop documentation provides the full instructions for <a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS Upgrade and Rollback</a>.  This section only highlights the most important parts.</p>

<p>If there is a need to move back to the old version, you have to:</p>

<ol>
  <li>Stop all client applications on the cluster.</li>
  <li>Stop the full cluster (MapReduce + HDFS).</li>
  <li>Distribute the previous version of Hadoop, i.e. the Hadoop version used before the attempted upgrade.</li>
  <li>Start the HDFS cluster with the rollback option: <code>start-dfs.sh -rollback</code></li>
</ol>

<p>The NameNode logs should contain entries similar to the following:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># NameNode log file
</span><span class="line">2011-06-22 19:31:54,039 INFO org.apache.hadoop.hdfs.server.common.Storage: Rolling back storage directory /app/hadoop/tmp/dfs/name.
</span><span class="line">   new LV = -18; new CTime = 0
</span><span class="line">2011-06-22 19:31:54,041 INFO org.apache.hadoop.hdfs.server.common.Storage: Rollback of /app/hadoop/tmp/dfs/name is complete.
</span><span class="line">2011-06-22 19:31:54,042 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1117
</span><span class="line">2011-06-22 19:31:54,191 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The DataNode logs should contain entries similar to the following:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># DataNode log file
</span><span class="line">2011-06-22 19:31:55,816 INFO org.apache.hadoop.hdfs.server.common.Storage: Rolling back storage directory /app/hadoop/tmp/dfs/data.
</span><span class="line">   target LV = -18; target CTime = 0
</span><span class="line">2011-06-22 19:32:02,363 INFO org.apache.hadoop.hdfs.server.common.Storage: Rollback of /app/hadoop/tmp/dfs/data is complete.</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The rollback will take some time to complete.  The NameNode will stay in Safe Mode until all DataNodes have finished
their rollback work and have registered back at the NameNode.</p>

<p>Running <code>hadoop dfsadmin -upgradeProgress status</code> will now also report that there are no pending upgrades:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -upgradeProgress status
</span><span class="line">There are no upgrades in progress.</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>HDFS has now been reverted to its previous state, i.e. the state before you ran <code>-upgrade</code>.</p>

<h1 id="references">References</h1>

<ul>
  <li><a href="http://wiki.apache.org/hadoop/Hadoop_Upgrade">Hadoop Upgrade</a> on the Hadoop Wiki</li>
  <li><a href="http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html#Upgrade+and+Rollback">HDFS User Guide</a> for Hadoop 0.20.203.0</li>
  <li><a href="http://hadoop.apache.org/common/docs/r0.20.203.0/file_system_shell.html">File System Shell Guide</a> for Hadoop 0.20.203.0</li>
  <li><a href="http://hadoop.apache.org/common/docs/r0.20.203.0/commands_manual.html">Commands Guide</a> for Hadoop 0.20.203.0</li>
  <li>Tom White, <a href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide (2nd ed.)</a>, O’Reilly, pp. 316-319 (plus some other pages)</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Building an Hadoop 0.20.x version for HBase 0.90.2]]></title>
    <link href="http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/"/>
    <updated>2011-04-14T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2</id>
    <content type="html"><![CDATA[<p>As of today, Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked as ready for production
(neither 0.21 nor 0.22 is).  Unfortunately, the Hadoop 0.20.2 release <em>is not</em> compatible with the latest stable version of
HBase: if you run HBase on top of Hadoop 0.20.2, you risk losing data! Hence HBase users are required to build their own
Hadoop 0.20.x version if they want to run HBase on a production cluster of Hadoop.  In this article, I describe how to
build such a production-ready version of Hadoop 0.20.x that is compatible with HBase 0.90.2.</p>

<!--more-->

<div class="note">
<strong>Update October 17, 2011</strong>: As of <a href="http://hadoop.apache.org/common/docs/r0.20.205.0/">version 0.20.205.0</a> (marked as beta release), Hadoop now supports HDFS append/hsync/hflush out of the box and is thus compatible with HBase 0.90.x.  You can still follow the instructions described in this article to build your own version of Hadoop.
</div>

<h1 id="before-we-start">Before we start</h1>

<h2 id="the-examples-below-use-git-not-svn">The examples below use git (not svn)</h2>

<p>In the following sections, I will use <a href="http://git-scm.com/">git</a> as the version control system to work on the Hadoop source
code.  Why? Because I am much more comfortable with git than svn, so please bear with me.</p>

<p>If you are using Subversion, feel free to adapt the git commands described below.  You are invited to write a comment to
this article about your SVN experience so that other SVN users can benefit, too!</p>

<h2 id="hadoop-0202-versus-0202030">Hadoop 0.20.2 versus 0.20.203.0</h2>

<div class="note">
<strong>Update June 11, 2011</strong>: Hadoop 0.20.203.0 and HBase 0.90.3 were released a few weeks after this article was published. While the article talks mostly about Hadoop 0.20.2, the build instructions should also work for Hadoop 0.20.203.0 but I haven&#8217;t had the time to test it yet myself. Feel free to leave a comment at the end of the article if you have run into any issues!
</div>

<h2 id="hadoop-is-covered-what-about-hbase-then">Hadoop is covered. What about HBase then?</h2>

<p>In this article I focus solely on building a Hadoop 0.20.x version (see the Background section below) that is compatible
with HBase 0.90.2.  In a future article, I may describe how to actually install and set up HBase 0.90.2 on the Hadoop
0.20.x version that we created here.</p>

<h2 id="version-of-hadoop-020-append-used-in-this-article">Version of Hadoop 0.20-append used in this article</h2>

<p>The instructions below use the latest version of <code>branch-0.20-append</code>.  As of this writing, the latest commit to the
append branch is git commit <code>df0d79cc</code> aka Subversion <code>rev 1057313</code>.  For reference, the corresponding commit
message is “HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.” from January 10, 2011.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">commit df0d79cc2b09438c079fdf10b913936492117917
</span><span class="line">Author: Hairong Kuang &lt;hairong@apache.org&gt;
</span><span class="line">Date:   Mon Jan 10 19:01:36 2011 +0000
</span><span class="line">
</span><span class="line">    HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
</span><span class="line">
</span><span class="line">    git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append@1057313 13f79535-47bb-0310-9956-ffa450edef68</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That said, the steps should also work for newer versions of <code>branch-0.20-append</code>.</p>

<h1 id="background">Background</h1>

<h2 id="hadoop-and-hbase-which-versions-to-pick-for-production-clusters">Hadoop and HBase: Which versions to pick for production clusters?</h2>

<p>Hadoop 0.20.2 is the latest stable release of <a href="http://hadoop.apache.org/">Apache Hadoop</a> that is marked ready for
production.  Unfortunately, the latest stable release of <a href="http://hbase.apache.org/">Apache HBase</a>, i.e. HBase 0.90.2, is
<em>not</em> compatible with Hadoop 0.20.2: <em>If you try to run HBase 0.90.2 on an unmodified version of Hadoop 0.20.2 release,
you might lose data!</em></p>

<p>The following lines are taken slightly modified from the
<a href="http://hbase.apache.org/book/notsoquick.html#ftn.d0e294">HBase documentation</a>:</p>

<blockquote><p>This version of HBase [0.90.2] will only run on Hadoop 0.20.x. It will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch.</p></blockquote>

<p>Here is a quick overview:</p>

<table>
<tr>
<th>Hadoop version</th><th>HBase version</th><th>Compatible?</th>
</tr><tr>
<td>0.20.2 release</td><td>0.90.2</td><td>NO</td>
</tr><tr>
<td>0.20-append</td><td>0.90.2</td><td><strong>YES</strong></td>
</tr><tr>
<td>0.21.0 release</td><td>0.90.2</td><td>NO</td>
</tr><tr>
<td>0.22.x (in development)</td><td>0.90.2</td><td>NO</td>
</tr>
</table>

<p>To be honest, it took me quite some time to get up to speed with the various requirements, dependencies, project
statuses, etc. for marrying Hadoop 0.20.x and HBase 0.90.2.  Hence I want to contribute back to the Hadoop and HBase
communities by writing this article.</p>

<h2 id="alternatives-to-what-we-are-doing-here">Alternatives to what we are doing here</h2>

<p>Another option you have to get HBase up and running on Hadoop – rather than build Hadoop 0.20-append yourself – is
using <a href="http://www.cloudera.com/">Cloudera’s CDH3 distribution</a>.  CDH3 has the Hadoop 0.20-append patches needed to add
a durable sync, i.e. to make Hadoop 0.20.x compatible with HBase 0.90.2.</p>

<h1 id="a-word-of-caution-and-a-thank-you">A word of caution and a Thank You</h1>

<p>First, a warning: while I have taken great care to compile and describe the steps in the following sections, I still
cannot give you any guarantees.  If in doubt, join our discussions on the HBase mailing list.</p>

<p>Second, I am only stitching together the pieces of the puzzle here.  The heavy lifting has been done by others.  Hence I
would like to thank <a href="http://osdir.com/ml/general/2011-04/msg07730.html">Michael Stack for his great feedback</a> while
I was preparing the information for this article, and both him and the rest of the HBase developers for their help on the
HBase mailing list.  It’s much appreciated!</p>

<h1 id="building-hadoop-020-append-from-branch-020-append">Building Hadoop 0.20-append from branch-0.20-append</h1>

<h2 id="retrieve-the-hadoop-020-append-sources">Retrieve the Hadoop 0.20-append sources</h2>

<p>Hadoop as of version 0.20.x is not separated into the Common, HDFS and MapReduce components as the versions &gt;= 0.21.0
are.  Hence you find all the required code in the Hadoop Common repository.</p>

<p>So the first step is to check out the Hadoop Common repository.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone http://git.apache.org/hadoop-common.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>hadoop-common
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>However, the previous <code>git</code> command only checked out the <em>latest</em> version of Hadoop Common, i.e. the tip aka <code>HEAD</code>
of the development for Hadoop Common.  We however are only interested in the code tree for Hadoop 0.20-append, i.e.
the branch <code>branch-0.20-append</code>.  Because <code>git clone</code> by default creates a local branch only for the repository's
default branch (all other branches are merely available as remote-tracking branches such as
<code>origin/branch-0.20-append</code>), we must explicitly create a local branch that tracks it:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Retrieve the (remote) Hadoop 0.20-append branch as git normally checks out</span>
</span><span class="line"><span class="c"># only the ``master`` tree (``trunk`` in Subversion language).</span>
</span><span class="line"><span class="nv">$ </span>git checkout -t origin/branch-0.20-append
</span><span class="line">Branch branch-0.20-append <span class="nb">set </span>up to track remote branch branch-0.20-append from origin.
</span><span class="line">Switched to a new branch <span class="s1">&#39;branch-0.20-append&#39;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
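
<p>If you want to try the checkout step without touching the real repository first, the following sketch replays it in a
throwaway sandbox.  The <code>upstream</code> repository, the temporary directory and the variable names are made up for this
illustration; only the branch name is taken from the article.  <code>git branch -r</code> lists the remote branches that are
available for checkout.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash">#!/bin/sh
# Sandbox demo (hypothetical repositories, nothing here touches Hadoop).
set -eu
work=$(mktemp -d)

# Stand-in for the upstream repository, with one extra branch.
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial commit"
git -C "$work/upstream" branch branch-0.20-append

# Clone it, just like hadoop-common was cloned above.
git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"

# List the remote branches known to the clone.
git branch -r

# Create a local branch that tracks the remote append branch.
git checkout -q -t origin/branch-0.20-append
current_branch=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $current_branch"
</code></pre></td></tr></table></div></figure></notextile></div>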

<h2 id="hadoop-0202-release-vs-hadoop-020-append">Hadoop 0.20.2 release vs. Hadoop 0.20-append</h2>

<p>Up to now, you might have asked yourself what the difference between the 0.20.2 release of Hadoop and its append branch actually is. Here’s the answer: The Hadoop 0.20-append branch is effectively a superset of Hadoop 0.20.2 release. In other words, there is not a single “real” commit in Hadoop 0.20.2 release that is not also in Hadoop 0.20-append. This means that Hadoop 0.20-append brings all the goodies that Hadoop 0.20.2 release has, great!</p>

<p>Run the following <code>git</code> command to verify this:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git show-branch release-0.20.2 branch-0.20-append
</span><span class="line">! <span class="o">[</span>release-0.20.2<span class="o">]</span> Hadoop 0.20.2 release
</span><span class="line"> * <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line">--
</span><span class="line"> * <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line"> * <span class="o">[</span>branch-0.20-append^<span class="o">]</span> HDFS-1555. Disallow pipelien recovery <span class="k">if </span>a file is already being lease recovered. Contributed by Hairong Kuang.
</span><span class="line"> * <span class="o">[</span>branch-0.20-append~2<span class="o">]</span> Revert the change made to HDFS-1555: merge -c -1056483 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append
</span><span class="line"> * <span class="o">[</span>branch-0.20-append~3<span class="o">]</span> HDFS-1555. Disallow pipeline recovery <span class="k">if </span>a file is already being lease recovered. Contributed by Hairong Kuang.
</span><span class="line"><span class="o">[</span>...<span class="o">]</span>
</span><span class="line"> * <span class="o">[</span>branch-0.20-append~50<span class="o">]</span> JDiff output <span class="k">for </span>release 0.20.2
</span><span class="line"> * <span class="o">[</span>branch-0.20-append~51<span class="o">]</span> HADOOP-1849. Merge -r 916528:916529 from trunk to branch-0.20.
</span><span class="line">+  <span class="o">[</span>release-0.20.2<span class="o">]</span> Hadoop 0.20.2 release
</span><span class="line">+  <span class="o">[</span>release-0.20.2^<span class="o">]</span> Hadoop 0.20.2-rc4
</span><span class="line">+* <span class="o">[</span>branch-0.20-append~52<span class="o">]</span> Prepare <span class="k">for </span>0.20.2-rc4
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>As you can see, there are only two commits in 0.20.2 release that are not in <code>branch-0.20-append</code>, namely the
commits “Hadoop 0.20.2 release” and “Hadoop 0.20.2-rc4”.  Both of these commits are simple tagging commits, i.e. they
are just used for release management but do not introduce any changes to the content of the Hadoop source code.</p>
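
<p>A more compact way to ask the same question is git's <code>A..B</code> range syntax, which selects the commits reachable from
<code>B</code> but not from <code>A</code>.  The two helper functions below are only a sketch; their names are invented for this
article and are not part of git or Hadoop.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash"># Hypothetical helpers (illustrative names).

# Print the commits that are reachable from $1 but not from $2.
commits_only_in() {
  git log --oneline "$2..$1"
}

# Print just the number of such commits.
count_commits_only_in() {
  git rev-list --count "$2..$1"
}
</code></pre></td></tr></table></div></figure></notextile></div>

<p>Run inside the <code>hadoop-common</code> clone, <code>commits_only_in release-0.20.2 branch-0.20-append</code> should list
the two tagging commits discussed above, assuming your checkout matches the state described in this article.</p>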

<h2 id="run-the-build-process">Run the build process</h2>

<h3 id="build-commands">Build commands</h3>

<p>First, we have to create the <code>build.properties</code> file
(<a href="http://wiki.apache.org/hadoop/GitAndHadoop">see full instructions</a>).</p>

<p>Here are the contents of mine:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c">#this is essential</span>
</span><span class="line"><span class="nv">resolvers</span><span class="o">=</span>internal
</span><span class="line"><span class="c">#you can increment this number as you see fit</span>
</span><span class="line"><span class="nv">version</span><span class="o">=</span>0.20-append-for-hbase
</span><span class="line">project.version<span class="o">=</span><span class="k">${</span><span class="nv">version</span><span class="k">}</span>
</span><span class="line">hadoop.version<span class="o">=</span><span class="k">${</span><span class="nv">version</span><span class="k">}</span>
</span><span class="line">hadoop-core.version<span class="o">=</span><span class="k">${</span><span class="nv">version</span><span class="k">}</span>
</span><span class="line">hadoop-hdfs.version<span class="o">=</span><span class="k">${</span><span class="nv">version</span><span class="k">}</span>
</span><span class="line">hadoop-mapred.version<span class="o">=</span><span class="k">${</span><span class="nv">version</span><span class="k">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Note: The &#8220;version&#8221; key in <tt>build.properties</tt> will also determine the names of the generated Hadoop JAR files. If, for instance, you set &#8220;version&#8221; to &#8220;0.20-append-for-hbase&#8221;, the build process will generate files named <tt>hadoop-core-0.20-append-for-hbase.jar</tt> etc. Basically, you can use any version identifier that you like (though it would help if it makes sense).
</div>

<p>The <code>build.properties</code> file should be placed (or available) in the <code>hadoop-common</code> top directory, i.e.
<code>hadoop-common/build.properties</code>.  You can either place the file there directly or you can follow the
<a href="http://wiki.apache.org/hadoop/GitAndHadoop">recommended approach</a>, where you place the file in a parent directory
and create a symlink to it.  The latter approach is convenient if you also have checked out the repositories of the
Hadoop sub-projects <code>hadoop-hdfs</code> and <code>hadoop-mapreduce</code> and thus want to use the same <code>build.properties</code> file
for all three sub-projects.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span><span class="nb">pwd</span>
</span><span class="line">/your/path/to/hadoop-common
</span><span class="line">
</span><span class="line"><span class="c"># Create/edit the build.properties file</span>
</span><span class="line"><span class="nv">$ </span>vi ../build.properties
</span><span class="line">
</span><span class="line"><span class="c"># Create a symlink to it</span>
</span><span class="line"><span class="nv">$ </span>ln -s ../build.properties build.properties
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Now we are ready to compile Hadoop from source with <code>ant</code>.  I used the command <code>ant mvn-install</code> as described on
<a href="http://wiki.apache.org/hadoop/GitAndHadoop">Git and Hadoop</a>.  The build itself should only take a few minutes.  Be
sure to run <code>ant test</code> as well (or only <code>ant test-core</code> if you’re lazy) but be aware that the tests take much
longer than the build (two hours on my 3-year-old MacBook Pro, for instance).</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Make sure we are using the branch-0.20-append sources</span>
</span><span class="line"><span class="nv">$ </span>git checkout branch-0.20-append
</span><span class="line">
</span><span class="line"><span class="c"># Run the build process</span>
</span><span class="line"><span class="nv">$ </span>ant mvn-install
</span><span class="line">
</span><span class="line"><span class="c"># Optional: run the full test suite or just the core test suite</span>
</span><span class="line"><span class="nv">$ </span>ant <span class="nb">test</span>
</span><span class="line"><span class="nv">$ </span>ant <span class="nb">test</span>-core
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
If you want to re-run builds or build tests: By default, &#8220;ant mvn-install&#8221; places the build output into <tt>$HOME/.m2/repository</tt>.  In case you re-run the compile you might want to remove the previous build output from <tt>$HOME/.m2/repository</tt>, e.g. via &#8220;rm -rf $HOME/.m2/repository&#8221;.  You might also want to run &#8220;ant clean-cache&#8221;.  For details, see <a href="http://wiki.apache.org/hadoop/GitAndHadoop">Git and Hadoop</a>.
</div>
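
<p>Keep in mind that removing <code>$HOME/.m2/repository</code> as a whole also throws away every other artifact Maven has
cached there.  Since the Hadoop JAR files of this build end up below <code>org/apache/hadoop</code> in that repository, a
narrower cleanup may be enough.  The following is a minimal sketch under that assumption; the function name is made up.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash"># Hypothetical helper: remove only the Hadoop artifacts from a local Maven
# repository instead of deleting the whole repository.
#   $1 = repository root (optional, defaults to $HOME/.m2/repository)
# Assumption: everything this build installs lives below org/apache/hadoop.
clean_hadoop_artifacts() {
  repo=${1:-"$HOME/.m2/repository"}
  rm -rf "$repo/org/apache/hadoop"
}
</code></pre></td></tr></table></div></figure></notextile></div>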

<h3 id="the-build-test-fails-now-what">The build test fails, now what?</h3>

<p>Now comes the more delicate part: If you run the build tests via <code>ant test</code>, you will notice that the build test
process always fails!  One consistent test error is reported by <code>TestFileAppend4</code> and logged to the file
<code>build/test/TEST-org.apache.hadoop.hdfs.TestFileAppend4.txt</code>.  Here is a short excerpt of the test’s output:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class=""><span class="line">2011-04-06 09:40:28,666 INFO  ipc.Server (Server.java:run(970)) - IPC Server handler 5 on 47574, call append(/bbw.test, DFSClient_1066000827) from 127.0.0.1:45323: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /bbw.test for DFSClient_1066000827 on client 127.0.0.1, because this file is already being created by DFSClient_-95621936 on 127.0.0.1
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1054)
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1221)
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
</span><span class="line">        [...]
</span><span class="line">        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
</span><span class="line">2011-04-06 09:40:28,667 INFO  hdfs.TestFileAppend4 (TestFileAppend4.java:recoverFile(161)) - Failed open for append, waiting on lease recovery
</span><span class="line">
</span><span class="line">[...]
</span><span class="line">Testcase: testRecoverFinalizedBlock took 5.555 sec
</span><span class="line">    Caused an ERROR
</span><span class="line">No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
</span><span class="line">org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
</span><span class="line">    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1439)
</span><span class="line">        [...]
</span><span class="line">    at org.apache.hadoop.hdfs.TestFileAppend4$1.run(TestFileAppend4.java:636) </span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Fortunately, this error
<a href="http://osdir.com/ml/general/2011-04/msg07730.html">does not mean that the build is not working</a>.  From what we know
this is a problem of the unit tests in <code>branch-0.20-append</code> themselves (see also
<a href="https://issues.apache.org/jira/browse/HBASE-3285?focusedCommentId=13003510&amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13003510">Michael Stack’s comment on HBASE-3285</a>).</p>

<p>Occasionally, you might run into other build failures and/or build errors.  On my machine, for instance, I have also
seen the following tests fail:</p>

<ul>
  <li><code>org.apache.hadoop.hdfs.server.namenode.TestEditLogRace</code> (<a href="https://gist.github.com/4549045">see error</a>)</li>
  <li><code>org.apache.hadoop.hdfs.TestMultiThreadedSync</code> (<a href="https://gist.github.com/4549064">see error</a>)</li>
</ul>

<p>I do not know what might cause these occasional errors – maybe it is a problem of the machine I am running the tests
on.  Still working on this.</p>

<p>Frankly, what I wrote above may sound discomforting to you.  At least it does to me.  Still, the feedback I have
received on the HBase mailing list indicates that the Hadoop 0.20-append build as done above <em>is indeed correct</em>.</p>
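
<p>To get a quick overview of which test reports recorded an error, you can search the report files for the marker
<code>Caused an ERROR</code> that appears in the excerpt above.  This is only a sketch: the function name is made up, and it
does not catch reports that use a different marker (for example for assertion failures).</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash"># Hypothetical helper: print the names of all test report files that contain
# the marker "Caused an ERROR".
#   $1 = directory with the TEST-*.txt reports (optional, defaults to build/test)
reports_with_errors() {
  dir=${1:-build/test}
  grep -l "Caused an ERROR" "$dir"/TEST-*.txt
}
</code></pre></td></tr></table></div></figure></notextile></div>

<p>Run <code>reports_with_errors</code> from the <code>hadoop-common</code> top directory after <code>ant test</code> has finished.</p>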

<h3 id="locate-the-build-output-hadoop-jar-files">Locate the build output (Hadoop JAR files)</h3>

<p>By default, the build run via <code>ant mvn-install</code> places the generated Hadoop JAR files in <code>$HOME/.m2/repository</code>.
You can find the actual JAR files with the following command.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>find <span class="nv">$HOME</span>/.m2/repository -name <span class="s2">&quot;hadoop-*.jar&quot;</span>
</span><span class="line">
</span><span class="line">.../repository/org/apache/hadoop/hadoop-examples/0.20-append-for-hbase/hadoop-examples-0.20-append-for-hbase.jar
</span><span class="line">.../repository/org/apache/hadoop/hadoop-test/0.20-append-for-hbase/hadoop-test-0.20-append-for-hbase.jar
</span><span class="line">.../repository/org/apache/hadoop/hadoop-tools/0.20-append-for-hbase/hadoop-tools-0.20-append-for-hbase.jar
</span><span class="line">.../repository/org/apache/hadoop/hadoop-streaming/0.20-append-for-hbase/hadoop-streaming-0.20-append-for-hbase.jar
</span><span class="line">.../repository/org/apache/hadoop/hadoop-core/0.20-append-for-hbase/hadoop-core-0.20-append-for-hbase.jar
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="install-your-hadoop-020-append-build-in-your-hadoop-cluster">Install your Hadoop 0.20-append build in your Hadoop cluster</h2>

<p>The only thing left to do now is to install the Hadoop 0.20-append build in your cluster.  This step is easy: simply
<em>replace</em> the Hadoop JAR files of your existing installation of Hadoop 0.20.2 release with the ones you just created
above.  You will also have to <em>replace</em> the Hadoop core JAR file in your HBase 0.90.2 installation
(<code>$HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar</code>) with the Hadoop core JAR file you created above
(<code>hadoop-core-0.20-append-for-hbase.jar</code> if you followed the instructions above).</p>

<div class="warning">
Important: This step is so important that I will repeat it: The Hadoop JAR files used by Hadoop itself and by HBase <strong>must match</strong>!
</div>
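
<p>The HBase part of this step can be sketched as a small shell function.  The function name and the backup directory
are assumptions made for this illustration; adapt the paths to your own installation and stop HBase before replacing
any files.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash"># Hypothetical helper: swap the Hadoop core JAR in an HBase lib directory.
#   $1 = freshly built Hadoop core JAR
#   $2 = HBase lib directory, e.g. $HBASE_HOME/lib
replace_hbase_hadoop_jar() {
  new_jar=$1
  hbase_lib=$2
  # Assumption: old JARs are kept in a backup directory next to lib/.
  backup_dir="$hbase_lib/../hadoop-jar-backup"
  mkdir -p "$backup_dir"
  # Move every bundled hadoop-core JAR out of the way ...
  for old in "$hbase_lib"/hadoop-core-*.jar; do
    [ -e "$old" ] || continue
    mv "$old" "$backup_dir/"
  done
  # ... then copy the new one in.
  cp "$new_jar" "$hbase_lib/"
  # Succeed only if the copy is byte-for-byte identical to the source JAR.
  cmp -s "$new_jar" "$hbase_lib/$(basename "$new_jar")"
}
</code></pre></td></tr></table></div></figure></notextile></div>

<p>A call could look like
<code>replace_hbase_hadoop_jar "$HOME/.m2/repository/org/apache/hadoop/hadoop-core/0.20-append-for-hbase/hadoop-core-0.20-append-for-hbase.jar" "$HBASE_HOME/lib"</code>.
The final <code>cmp</code> can be repeated against the core JAR of your Hadoop installation to confirm that both sides really use the same file.</p>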

<h3 id="rename-the-build-jar-files-if-you-run-hadoop-0202">Rename the build JAR files if you run Hadoop 0.20.2</h3>

<div class="note">
<strong>Update June 11, 2011</strong>: The renaming instructions of this section are NOT required if you are using the latest stable release Hadoop 0.20.203.0.
</div>

<p>Hadoop 0.20.2 release names its JAR files in the form of <code>hadoop-VERSION-PACKAGE.jar</code>, e.g.
<code>hadoop-0.20.2-examples.jar</code>.  The build process above uses the different scheme <code>hadoop-PACKAGE-VERSION.jar</code>,
e.g. <code>hadoop-examples-0.20-append-for-hbase.jar</code>.  You might therefore want to rename the JAR files you created in
the previous section so that they match the naming scheme of Hadoop 0.20.2 release (otherwise the <code>bin/hadoop</code> script
will not be able to add the Hadoop core JAR file to its <code>CLASSPATH</code>, and also command examples such as
<code>hadoop jar hadoop-*-examples.jar pi 50 1000</code> in the Hadoop docs will not work as is).</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># When you replace the Hadoop JAR files *in your Hadoop installation*,
</span><span class="line"># you might want to rename your Hadoop 0.20-append JAR files like so.
</span><span class="line">hadoop-examples-0.20-append-for-hbase.jar  --&gt; hadoop-0.20-append-for-hbase-examples.jar
</span><span class="line">hadoop-test-0.20-append-for-hbase.jar      --&gt; hadoop-0.20-append-for-hbase-test.jar
</span><span class="line">hadoop-tools-0.20-append-for-hbase.jar     --&gt; hadoop-0.20-append-for-hbase-tools.jar
</span><span class="line">hadoop-streaming-0.20-append-for-hbase.jar --&gt; hadoop-0.20-append-for-hbase-streaming.jar
</span><span class="line">hadoop-core-0.20-append-for-hbase.jar      --&gt; hadoop-0.20-append-for-hbase-core.jar</span></code></pre></td></tr></table></div></figure></notextile></div>
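
<p>If you prefer not to rename the files by hand, the mapping above can be sketched as a small shell function.  The
function name is made up for this example; the version string must be the one you set in <code>build.properties</code>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="code"><pre><code class="bash"># Hypothetical helper: rename hadoop-PACKAGE-VERSION.jar to
# hadoop-VERSION-PACKAGE.jar for every matching file in a directory.
#   $1 = directory that contains the JAR files
#   $2 = version string, e.g. 0.20-append-for-hbase
rename_hadoop_jars() {
  dir=$1
  version=$2
  for jar in "$dir"/hadoop-*-"$version".jar; do
    [ -e "$jar" ] || continue
    name=$(basename "$jar" .jar)      # e.g. hadoop-core-0.20-append-for-hbase
    package=${name#hadoop-}           # strip the leading "hadoop-"
    package=${package%-"$version"}    # strip the trailing "-VERSION", e.g. leaving "core"
    mv "$jar" "$dir/hadoop-$version-$package.jar"
  done
}
</code></pre></td></tr></table></div></figure></notextile></div>

<p>For instance, <code>rename_hadoop_jars /path/to/your/jars 0.20-append-for-hbase</code> turns
<code>hadoop-core-0.20-append-for-hbase.jar</code> into <code>hadoop-0.20-append-for-hbase-core.jar</code>.  Work on a copy of
the JAR files rather than on the files in <code>$HOME/.m2/repository</code>, because HBase expects the original names.</p>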

<p>In contrast, HBase uses the <code>hadoop-PACKAGE-VERSION.jar</code> scheme.  So when you replace the Hadoop core JAR file
shipped with HBase 0.90.2 in <code>$HBASE_HOME/lib</code>, you can leave the name of the newly built Hadoop core
JAR file as is.</p>

<div class="note">
Note for users running HBase 0.90.0 or 0.90.1: The Hadoop 0.20-append JAR files we created above are based on the tip of &#8220;branch-0.20-append&#8221; and thus use an RPC version of 43. This is ok for HBase 0.90.2 but it will cause problems for HBase 0.90.0 and 0.90.1. See <a href="https://issues.apache.org/jira/browse/HBASE-3520">HBASE-3520</a> or <a href="http://osdir.com/ml/general/2011-04/msg07730.html">Michael Stack&#8217;s comment</a> for more information.
</div>

<h1 id="maintaining-your-own-version-of-hadoop-020-append">Maintaining your own version of Hadoop 0.20-append</h1>

<p>If you must integrate some additional patches into Hadoop 0.20.2 and/or Hadoop 0.20-append (normally in the form of
backports of patches for Hadoop 0.21 or 0.22), you can create a local branch based on the Hadoop version you are
interested in.  Yes, this creates some effort on your part so you should be sure to weigh the pros and cons of doing
so.</p>

<p>Imagine that, for instance, you use Hadoop 0.20-append based on <code>branch-0.20-append</code> because you also want to run the
latest stable release of HBase on your Hadoop cluster.  While doing your
<a href="http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/">benchmarking and stress testing of your cluster</a>, you have unfortunately discovered a problem that you could track down to
<a href="https://issues.apache.org/jira/browse/HDFS-611">HDFS-611</a>.  Now a patch is actually available (you might have to do
some tinkering to backport it properly) but it is not in the version of Hadoop you are running, i.e. it is not in the
vanilla <code>branch-0.20-append</code>.</p>

<p>What you can do is to create a local git branch based on your Hadoop version (here: <code>branch-0.20-append</code>) where you
can integrate and test any relevant patches you need.  Please understand that I will only describe the basic approach
here – I do not go into details on how you can make sure to stay current with any changes to the Hadoop version you
are tracking after you followed the steps below.  There are a lot of splendid git introductions such as the
<a href="http://book.git-scm.com/">Git Community Book</a> that can explain this much better and more thoroughly than I am able to.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Make sure we are in branch-0.20-append before running the next command</span>
</span><span class="line"><span class="nv">$ </span>git checkout branch-0.20-append
</span><span class="line">
</span><span class="line"><span class="c"># Create your own local branch based on the latest version (HEAD) of the official branch-0.20-append</span>
</span><span class="line"><span class="nv">$ </span>git checkout -b branch-0.20-append-yourbranch
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Verify that the two append branches are identical up to now.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Verify that your local branch and the &quot;official&quot; (remote) branch are identical</span>
</span><span class="line"><span class="nv">$ </span>git show-branch branch-0.20-append branch-0.20-append-yourbranch
</span><span class="line">! <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line"> * <span class="o">[</span>branch-0.20-append-yourbranch<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line">--
</span><span class="line">+* <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line">
</span><span class="line"><span class="c"># Yep, they are.</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Apply the relevant patch to your branch.  In the example below, I apply a backport of the patch for
<a href="https://issues.apache.org/jira/browse/HDFS-611">HDFS-611</a> for <code>branch-0.20-append</code> via the file
<code>HDFS-611.branch-0.20-append.v1.patch</code>.  Note that this backport is not available on the HDFS-611 page – I created
the backport myself based on the HDFS-611 patch for Hadoop 0.20.2 release
(<a href="https://issues.apache.org/jira/secure/attachment/12424512/HDFS-611.branch-20.v6.patch">HDFS-611.branch-20.v6.patch</a>).</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Apply the patch to your branch</span>
</span><span class="line"><span class="nv">$ </span>patch -p1 &lt; HDFS-611.branch-0.20-append.v1.patch
</span><span class="line">
</span><span class="line"><span class="c"># Add any modified or newly created files from the patch to git&#39;s index</span>
</span><span class="line"><span class="nv">$ </span>git add src/hdfs/org/apache/hadoop/hdfs/protocol/FSConstants.java <span class="se">\</span>
</span><span class="line">         src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDataset.java <span class="se">\</span>
</span><span class="line">         src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDatasetAsyncDiskService.java <span class="se">\</span>
</span><span class="line">         src/test/org/apache/hadoop/hdfs/TestDFSRemove.java
</span><span class="line">
</span><span class="line"><span class="c"># Commit the changes from the index to the repository</span>
</span><span class="line"><span class="nv">$ </span>git commit -m <span class="s2">&quot;HDFS-611: Backport of HDFS-611 patch for Hadoop 0.20.2 release&quot;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Verify that your patched branch is one commit ahead of the original (remote) append branch.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Compare the commit histories of your local append branch and the original (remote) branch</span>
</span><span class="line"><span class="nv">$ </span>git show-branch branch-0.20-append branch-0.20-append-yourbranch
</span><span class="line">! <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line"> * <span class="o">[</span>branch-0.20-append-yourbranch<span class="o">]</span> HDFS-611: Backport of HDFS-611 patch <span class="k">for </span>Hadoop 0.20.2 release
</span><span class="line">--
</span><span class="line"> * <span class="o">[</span>branch-0.20-append-yourbranch<span class="o">]</span> HDFS-611: Backport of HDFS-611 patch <span class="k">for </span>Hadoop 0.20.2 release
</span><span class="line">+* <span class="o">[</span>branch-0.20-append<span class="o">]</span> HDFS-1554. New semantics <span class="k">for </span>recoverLease. Contributed by Hairong Kuang.
</span><span class="line">
</span><span class="line"><span class="c"># Yep, it is exactly one commit ahead.</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
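
<p>If you prefer a number over reading the <code>git show-branch</code> output, the check above can also be sketched with <code>git rev-list --count</code>.  The function name <code>commits_ahead</code> is my own invention for this example and is not part of git or Hadoop.</p>

```shell
# Hypothetical helper (not part of git or Hadoop): prints how many commits
# the branch given as $2 has that the branch given as $1 does not have.
commits_ahead() {
    # "base..topic" selects the commits reachable from topic but not from base.
    git rev-list --count "$1..$2"
}
```

<p>Run inside your Hadoop git clone, <code>commits_ahead branch-0.20-append branch-0.20-append-yourbranch</code> should print <code>0</code> right after creating the branch and <code>1</code> after committing the backported patch.</p>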

<p>Voilà!</p>

<p>And by the way, if you want to see the commit differences between Hadoop 0.20.2 release, the official
<code>branch-0.20-append</code> and your own, patched <code>branch-0.20-append-yourbranch</code>, run the following git command:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git show-branch release-0.20.2 branch-0.20-append branch-0.20-append-yourbranch
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="conclusion">Conclusion</h1>

<p>I hope this article helps you to build a Hadoop 0.20.x version for running HBase 0.90.2 in a production
environment!  Your feedback and comments are as always appreciated.</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Benchmarking and Stress Testing an Hadoop Cluster with TeraSort, TestDFSIO & Co.]]></title>
    <link href="http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/"/>
    <updated>2011-04-09T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench</id>
    <content type="html"><![CDATA[<p>In this article I introduce some of the benchmarking and testing tools that are included in the Apache Hadoop
distribution.  Namely, we look at the benchmarks <em>TestDFSIO</em>, <em>TeraSort</em>, <em>NNBench</em> and <em>MRBench</em>.  These are popular
choices to benchmark and stress test an Hadoop cluster.  Hence knowing how to run these tools will help you to shake
out your cluster in terms of architecture, hardware and software, to measure its performance and also to share and
compare your results with other people.</p>

<!--more-->

<h1 id="before-we-start">Before we start</h1>

<p>Let me first talk about a few things that you should be aware of while reading through this article.</p>

<h2 id="what-we-not-want-to-do">What we (do not) want to do</h2>

<p>The purpose of this article is to give you a quick overview and walkthrough of some of Hadoop’s most popular benchmarking and testing tools. All the tools described in the following sections are part of the <a href="http://hadoop.apache.org/">Apache Hadoop distribution</a>, so they should already be available in your cluster and waiting to be (ab)used! I do not focus on the general concepts and best practices of benchmarking an Hadoop cluster in this article but rather want to get you up to speed with actually using these tools. My motivation comes from my own – sometimes frustrating – experience with using these tools, as they often lack clear documentation.</p>

<p>That said, at any time feel free to <a href="http://wiki.apache.org/hadoop/HowToContribute">contribute back to the Hadoop project</a> by helping to improve the status of the documentation. I know I’ll try to do my fair share.</p>

<h2 id="prerequisites">Prerequisites</h2>

<p>If you want to follow along with the examples in this article, you obviously need access to a running cluster. If you are still trying to install and configure your cluster, my following tutorials might help you to build one. The tutorials are tailored to Ubuntu Linux but the information also applies to other Linux/Unix variants.</p>

<ul>
    <li><a title="Running Hadoop On Ubuntu Linux (Single-Node Cluster)" href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/">Running Hadoop On Ubuntu Linux (Single-Node Cluster)</a>
How to set up a <em>single-node</em> Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux</li>
</ul>
<ul>
    <li><a title="Running Hadoop On Ubuntu Linux (Multi-Node Cluster)" href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/">Running Hadoop On Ubuntu Linux (Multi-Node Cluster)</a>
How to set up a <em>multi-node</em> Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux</li>
</ul>

<h2 id="version-focus-hadoop-0202">Version focus: Hadoop 0.20.2</h2>

<p>I put the focus on the benchmark and testing tools shipped with Hadoop version 0.20.2. At this moment this is the latest production-ready release of Hadoop (0.21 is not!). Until Hadoop 0.22 is out, a lot of Hadoop users (including me) are therefore sticking to the tried and true Hadoop 0.20.2 release.</p>

<h2 id="notreplicatedyetexception-and-alreadybeingcreatedexception-errors">NotReplicatedYetException and AlreadyBeingCreatedException errors</h2>

<p>If your cluster is running the stock version of the Hadoop 0.20.2 release, you might run into <code>NotReplicatedYetException</code> and/or <code>AlreadyBeingCreatedException</code> “errors” while test-driving your cluster (as you can see in the output below, they are actually logged as <code>INFO</code>):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class=""><span class="line">2011-02-11 03:48:05,367 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310, call addBlock(/user/hduser/terasort-input/_temporary/_attempt_201102110201_0008_m_000020_0/part-00020, DFSClient_attempt_201102110201_0008_m_000020_0) from 192.168.0.2:45133: error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/hduser/terasort-input/_temporary/_attempt_201102110201_0008_m_000020_0/part-00020
</span><span class="line">org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/hduser/terasort-input/_temporary/_attempt_201102110201_0008_m_000020_0/part-00020
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1257)
</span><span class="line">        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
</span><span class="line">        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
</span><span class="line">        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
</span><span class="line">        at java.lang.reflect.Method.invoke(Method.java:597)
</span><span class="line">        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
</span><span class="line">        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
</span><span class="line">        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
</span><span class="line">        at java.security.AccessController.doPrivileged(Native Method)
</span><span class="line">        at javax.security.auth.Subject.doAs(Subject.java:396)
</span><span class="line">        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This often happens when you delete a large amount of HDFS data on the cluster – and this is something that you might want to do in between test runs so that your cluster does not fill up. This is a <a href="http://issues.apache.org/jira/browse/HDFS-611">known problem</a> in 0.20.2 and fixed in 0.21.0 (see <a href="http://issues.apache.org/jira/browse/HDFS-611">HDFS-611</a>). Fortunately, there is a patch available: <code>HDFS-611.branch-20.v6.patch</code> (<a href="https://issues.apache.org/jira/secure/attachment/12424512/HDFS-611.branch-20.v6.patch">link</a>) is the file you need for Hadoop 0.20.2.</p>

<p>Describing how to patch Hadoop and rebuild it from source is unfortunately beyond the scope of this article. Sorry! If you actually happen to run into this problem, please have a look at the Hadoop documentation (e.g. the <a href="http://wiki.apache.org/hadoop/FAQ#Platform_Specific">FAQ</a>, <a href="http://wiki.apache.org/hadoop/HowToContribute">How to Contribute</a>, <a href="http://wiki.apache.org/hadoop/EclipseEnvironment">Working with Hadoop under Eclipse</a> or <a href="http://wiki.apache.org/hadoop/GitAndHadoop">Git and Hadoop</a> are a very good start) and/or consult the <a href="http://hadoop.apache.org/mailing_lists.html">Hadoop mailing lists</a>. You can also check out my article <a href="http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/">Building an Hadoop 0.20.x version for HBase 0.90.2</a>, which should jumpstart you with creating your own custom Hadoop build.</p>

<p>Another workaround is to simply wait some minutes after deleting a large amount of data before starting another test run. You might want to monitor the NameNode’s log file to check when the asynchronous background delete processes have finished their work.</p>

<h1 id="overview-of-benchmarks-and-testing-tools">Overview of Benchmarks and Testing Tools</h1>

<p>The Hadoop distribution comes with a number of benchmarks, which are bundled in <code>hadoop-*test*.jar</code> and <code>hadoop-*examples*.jar</code>. The four benchmarks we will be looking at in more detail are <code>TestDFSIO</code>, <code>nnbench</code>, <code>mrbench</code> (in <code>hadoop-*test*.jar</code>) and <code>TeraGen</code> / <code>TeraSort</code> / <code>TeraValidate</code> (in <code>hadoop-*examples*.jar</code>).</p>

<p>Here is the full list of available options in <code>hadoop-*test*.jar</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># change to your Hadoop installation directory;
</span><span class="line"># if you followed my Hadoop tutorials, this is /usr/local/hadoop
</span><span class="line">$ cd /usr/local/hadoop
</span><span class="line">
</span><span class="line">$ bin/hadoop jar hadoop-*test*.jar
</span><span class="line">An example program must be given as the first argument.
</span><span class="line">Valid program names are:
</span><span class="line">  DFSCIOTest: Distributed i/o benchmark of libhdfs.
</span><span class="line">  DistributedFSCheck: Distributed checkup of the file system consistency.
</span><span class="line">  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
</span><span class="line">  TestDFSIO: Distributed i/o benchmark.
</span><span class="line">  dfsthroughput: measure hdfs throughput
</span><span class="line">  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
</span><span class="line">  loadgen: Generic map/reduce load generator
</span><span class="line">  mapredtest: A map/reduce test check.
</span><span class="line">  mrbench: A map/reduce benchmark that can create many small jobs
</span><span class="line">  nnbench: A benchmark that stresses the namenode.
</span><span class="line">  testarrayfile: A test for flat files of binary key/value pairs.
</span><span class="line">  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
</span><span class="line">  testfilesystem: A test for FileSystem read/write.
</span><span class="line">  testipc: A test for ipc.
</span><span class="line">  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
</span><span class="line">  testrpc: A test for rpc.
</span><span class="line">  testsequencefile: A test for flat files of binary key value pairs.
</span><span class="line">  testsequencefileinputformat: A test for sequence file input format.
</span><span class="line">  testsetfile: A test for flat files of binary key/value pairs.
</span><span class="line">  testtextinputformat: A test for text input format.
</span><span class="line">  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And here is the full list of available options for <code>hadoop-*examples*.jar</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>bin/hadoop jar hadoop-*examples*.jar
</span><span class="line">An example program must be given as the first argument.
</span><span class="line">Valid program names are:
</span><span class="line">  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
</span><span class="line">  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
</span><span class="line">  dbcount: An example job that count the pageview counts from a database.
</span><span class="line">  grep: A map/reduce program that counts the matches of a regex in the input.
</span><span class="line">  join: A job that effects a join over sorted, equally partitioned datasets
</span><span class="line">  multifilewc: A job that counts words from several files.
</span><span class="line">  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
</span><span class="line">  pi: A map/reduce program that estimates Pi using monte-carlo method.
</span><span class="line">  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
</span><span class="line">  randomwriter: A map/reduce program that writes 10GB of random data per node.
</span><span class="line">  secondarysort: An example defining a secondary sort to the reduce.
</span><span class="line">  sleep: A job that sleeps at each map and reduce task.
</span><span class="line">  sort: A map/reduce program that sorts the data written by the random writer.
</span><span class="line">  sudoku: A sudoku solver.
</span><span class="line">  teragen: Generate data <span class="k">for </span>the terasort
</span><span class="line">  terasort: Run the terasort
</span><span class="line">  teravalidate: Checking results of terasort
</span><span class="line">  wordcount: A map/reduce program that counts the words in the input files.
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The <code>wordcount</code> example, for instance, is a <a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#running-a-mapreduce-job">typical “Hello, World” test</a> that you can run after you have finished <a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#running-a-mapreduce-job">installing a cluster</a>.</p>

<p>And before we start, here’s a nifty trick for your tests: When running the benchmarks described in the following sections, you might want to use the Unix <code>time</code> command to measure the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker web interface to get (almost) the same information. Simply prefix every Hadoop command with <code>time</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span><span class="nb">time </span>hadoop jar hadoop-*examples*.jar ...
</span><span class="line"><span class="o">[</span>...<span class="o">]</span>
</span><span class="line">real    9m15.510s
</span><span class="line">user    0m7.075s
</span><span class="line">sys     0m0.584s
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The relevant metric is the <code>real</code> value in the first row.</p>
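
<p>If you want to compare several runs, plain seconds are easier to work with than the <code>9m15.510s</code> notation.  Here is a minimal sketch of such a conversion.  The function name <code>real_to_seconds</code> is made up for this example, and I assume the <code>XmY.YYYs</code> format shown above with a dot as the decimal separator.</p>

```shell
# Hypothetical helper: converts a "real" value such as 9m15.510s to seconds.
# Assumes the XmY.YYYs format with a dot as the decimal separator.
real_to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

real_to_seconds 9m15.510s   # prints 555.510
```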

<h1 id="testdfsio">TestDFSIO</h1>

<p>The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS, discovering performance bottlenecks in your network, shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes) and getting a first impression of how fast your cluster is in terms of I/O.</p>

<div class="note">
Note: Because this test is run as a MapReduce job, the MapReduce stack of the cluster must be working correctly. In other words, this test cannot be used to benchmark HDFS in isolation from MapReduce.
</div>

<p>Official documentation does not seem to exist in Hadoop 0.20.2, so the only way to understand the test in greater detail is to inspect the source code found in <code>$HADOOP_HOME/src/test/org/apache/hadoop/fs/TestDFSIO.java</code>.</p>

<h2 id="preliminaries">Preliminaries</h2>

<h3 id="the-default-output-directory-is-benchmarkstestdfsio">1. The default output directory is /benchmarks/TestDFSIO</h3>

<p>When a write test is run via <code>-write</code>, the TestDFSIO benchmark writes its files to <code>/benchmarks/TestDFSIO</code> on HDFS. Files from older write runs are overwritten. If you want to preserve the output files of previous runs, you have to copy/move these files manually to a new HDFS location. Benchmark results are saved in a local file called <code>TestDFSIO_results.log</code> in the current local directory (results are appended if the file already exists) and also printed to STDOUT. If you want to use a different filename, set the -resFile parameter appropriately (e.g. <code>-resFile foo.txt</code>).</p>
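
<p>Because results are appended to the same file by default, you might prefer one result file per run.  The following sketch builds a timestamped name that you can pass to <code>-resFile</code>.  The function name <code>testdfsio_results_file</code> and the naming scheme are my own assumptions, not something TestDFSIO provides.</p>

```shell
# Hypothetical helper: builds a result file name such as
# TestDFSIO_write_20110408-101500.log from a label and the current time.
testdfsio_results_file() {
    echo "TestDFSIO_${1}_$(date +%Y%m%d-%H%M%S).log"
}

# Example usage (commented out because it needs a running cluster):
# hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 \
#     -resFile "$(testdfsio_results_file write)"
```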

<div class="note">
Normally, you would be able to set the property &#8220;test.build.data&#8221; to specify a custom output directory for TestDFSIO&#8217;s output files (e.g. &#8220;hadoop jar hadoop-*test*.jar TestDFSIO -D test.build.data=/benchmarks/mytestdfsio &#8230;&#8221;). But due to a known bug in TestDFSIO this will not work (<a href="https://issues.apache.org/jira/browse/HADOOP-6074">HADOOP-6074</a>). The reason is that &#8220;TestDFSIO.java&#8221; sets up the job configuration object via &#8220;Configuration conf = new Configuration();&#8221; but it subsequently forgets to parse and include any user-supplied parameters from the command line. This is also one of the reasons why you cannot assign TestDFSIO jobs (as of Hadoop 0.20.2) to custom pools or queues.
</div>

<h3 id="run-write-tests-before-read-tests">2. Run write tests before read tests</h3>

<p>The read test of TestDFSIO does not generate its own input files. For this reason, it is a convenient practice to first run a write test via <code>-write</code> and then follow up with a read test via <code>-read</code> (while using the same parameters as during the previous <code>-write</code> run).</p>

<h2 id="run-a-write-test-as-input-data-for-the-subsequent-read-test">Run a write test (as input data for the subsequent read test)</h2>

<p>The syntax for running a write test is as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">TestDFSIO.0.0.4
</span><span class="line">Usage: hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>TestDFSIO is designed in such a way that it will use <a href="http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200901.mbox/%3C496EACE2.2090007@yahoo-inc.com%3E">1 map task per file</a>, i.e. it is a 1:1 mapping from files to map tasks. Splits are defined so that each map gets only one filename, which it creates (<code>-write</code>) or reads (<code>-read</code>).</p>

<p>The command to run a write test that generates 10 output files of size 1GB for a total of 10GB is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="run-a-read-test">Run a read test</h2>

<p>The command to run the corresponding read test using 10 input files of size 1GB is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
</span></code></pre></td></tr></table></div></figure></notextile></div>
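
<p>Since the read test should use the same parameters as the preceding write test, the two commands above can be wrapped in a small function so the values are typed only once.  This is just a sketch: the function name and the variables <code>HADOOP_BIN</code> and <code>TEST_JAR</code> are assumptions of mine that you would adjust to your installation.</p>

```shell
# HADOOP_BIN and TEST_JAR are assumptions of this sketch; adjust as needed.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"
TEST_JAR="${TEST_JAR:-hadoop-*test*.jar}"

# Hypothetical helper: runs a TestDFSIO write test followed by a read test
# with identical parameters.  $1 = number of files, $2 = file size in MB
run_testdfsio_write_read() {
    # $TEST_JAR is deliberately unquoted so that the shell expands the wildcard.
    "$HADOOP_BIN" jar $TEST_JAR TestDFSIO -write -nrFiles "$1" -fileSize "$2" || return 1
    "$HADOOP_BIN" jar $TEST_JAR TestDFSIO -read -nrFiles "$1" -fileSize "$2"
}
```

<p>Calling <code>run_testdfsio_write_read 10 1000</code> corresponds to the two commands shown above.  The read test is skipped if the write test fails.</p>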

<h2 id="clean-up-and-remove-test-data">Clean up and remove test data</h2>

<p>The command to remove previous test data is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar TestDFSIO -clean
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The cleaning run will delete the output directory <code>/benchmarks/TestDFSIO</code> on HDFS.</p>

<h2 id="interpreting-testdfsio-results">Interpreting TestDFSIO results</h2>

<p>Let’s have a look at this example result for writing and reading 1TB of data on a cluster of twenty nodes and try to deduce its meaning:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class=""><span class="line">----- TestDFSIO ----- : write
</span><span class="line">           Date &amp; time: Fri Apr 08 2011
</span><span class="line">       Number of files: 1000
</span><span class="line">Total MBytes processed: 1000000
</span><span class="line">     Throughput mb/sec: 4.989
</span><span class="line">Average IO rate mb/sec: 5.185
</span><span class="line"> IO rate std deviation: 0.960
</span><span class="line">    Test exec time sec: 1113.53
</span><span class="line">
</span><span class="line">----- TestDFSIO ----- : read
</span><span class="line">           Date &amp; time: Fri Apr 08 2011
</span><span class="line">       Number of files: 1000
</span><span class="line">Total MBytes processed: 1000000
</span><span class="line">     Throughput mb/sec: 11.349
</span><span class="line">Average IO rate mb/sec: 22.341
</span><span class="line"> IO rate std deviation: 119.231
</span><span class="line">    Test exec time sec: 544.842</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here, the most notable metrics are <em>Throughput mb/sec</em> and <em>Average IO rate mb/sec</em>. Both of them are based on the file size written (or read) by the individual map tasks and the elapsed time to do so (see <a href="http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200901.mbox/%3C496EACE2.2090007@yahoo-inc.com%3E">this discussion thread</a> for more information).</p>

<p><em>Throughput mb/sec</em> for a TestDFSIO job using <code>N</code> map tasks is defined as follows. The index <code>1 &lt;= i &lt;= N</code> denotes the individual map tasks:</p>

<script type="math/tex; mode=display">
Throughput(N) = \frac{\sum_{i=1}^N filesize_i}{\sum_{i=1}^N time_i}
</script>

<p><em>Average IO rate mb/sec</em> is defined as:</p>

<script type="math/tex; mode=display">
Average\ IO\ rate(N) = \frac{\sum_{i=1}^N rate_i}{N} = \frac{\sum_{i=1}^N \frac{filesize_i}{time_i}}{N}
</script>
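<p>To make the difference between the two definitions concrete, here is a minimal Python sketch that evaluates both formulas. The per-task numbers and the function names are made up for illustration and are not taken from a real TestDFSIO run:</p>

```python
# Hypothetical per-map-task measurements: (file size in MB, elapsed time in seconds).
# These numbers are invented for illustration only.
tasks = [(1000.0, 200.0), (1000.0, 250.0), (1000.0, 100.0)]


def throughput(tasks):
    """Total MB divided by total task time (the 'Throughput mb/sec' definition)."""
    total_mb = sum(size for size, _ in tasks)
    total_sec = sum(sec for _, sec in tasks)
    return total_mb / total_sec


def average_io_rate(tasks):
    """Mean of the per-task rates (the 'Average IO rate mb/sec' definition)."""
    rates = [size / sec for size, sec in tasks]
    return sum(rates) / len(rates)


print("Throughput mb/sec:      %.3f" % throughput(tasks))
print("Average IO rate mb/sec: %.3f" % average_io_rate(tasks))
```

<p>A single fast task raises the average IO rate more than it raises the throughput, which is one reason why the two numbers can drift apart when the IO rate standard deviation is large.</p>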

<p>Two derived metrics you might be interested in are estimates of the “concurrent” throughput and average IO rate (for lack of a better term) your cluster is capable of. Imagine you let TestDFSIO create 1,000 files but your cluster has only 200 map slots. This means that it takes about five MapReduce waves (<code>5 * 200 = 1,000</code>) to write the full test data because the cluster can only run 200 map tasks at the same time. In this case, simply take the minimum of the number of files (here: <code>1,000</code>) and the number of available map slots in your cluster (here: <code>200</code>), and multiply the throughput and average IO rate by this minimum. In our example, the concurrent throughput would be estimated at <code>4.989 * 200 = 997.8 MB/s</code> and the concurrent average IO rate at <code>5.185 * 200 = 1,037.0 MB/s</code>.</p>
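<p>The estimate can be written down as a small Python helper. This is only a sketch of the rule of thumb described here; the function name <code>concurrent_estimate</code> is my own and the 200 map slots are the assumption of the example:</p>

```python
def concurrent_estimate(per_task_value, num_files, map_slots):
    """Scale a per-task metric by the number of map tasks that can run at once."""
    return per_task_value * min(num_files, map_slots)


# Values from the write run: 1,000 files, and an assumed 200 map slots.
print(round(concurrent_estimate(4.989, 1000, 200), 1))  # concurrent throughput in MB/s
print(round(concurrent_estimate(5.185, 1000, 200), 1))  # concurrent average IO rate in MB/s
```

<p>Keep in mind that this is a rough estimate: it assumes that every map slot is busy with a TestDFSIO task for the whole duration of the run.</p>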

<p>Another thing to keep in mind when interpreting TestDFSIO results is that the <a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Data+Replication">HDFS replication factor</a> plays an important role. If you compare two otherwise identical TestDFSIO write runs which use an HDFS replication factor of 2 and 3, respectively, you will see higher throughput and higher average IO rate numbers for the run with the lower replication factor. Unfortunately, you cannot simply override your cluster’s default replication value on the command line via</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar TestDFSIO -D dfs.replication<span class="o">=</span>2 -write ...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>because, as I said above, the TestDFSIO shipped with Hadoop 0.20.2 does not parse command line parameters properly. A quick workaround is to create or update the <code>conf/hdfs-site.xml</code> file on the machine you are running the TestDFSIO write test from. TestDFSIO will read this configuration file during startup and use the value specified for the <code>dfs.replication</code> property therein.</p>

<h2 id="more-information-about-testdfsio">More information about TestDFSIO</h2>

<p>The comments in <a href="https://issues.apache.org/jira/browse/HDFS-1338">HDFS-1338</a> include some general remarks on the concept and design of TestDFSIO. And of course you can have a look at the source code in <code>TestDFSIO.java</code>.</p>

<h1 id="terasort-benchmark-suite">TeraSort benchmark suite</h1>

<p>The TeraSort benchmark is probably the best-known Hadoop benchmark. Back in 2008, Yahoo! set a record by <a href="http://developer.yahoo.com/blogs/hadoop/posts/2008/07/apache_hadoop_wins_terabyte_sort_benchmark/">sorting 1 TB of data in 209 seconds</a> – on a Hadoop cluster of 910 nodes, as Owen O’Malley of the Yahoo! Grid Computing Team reports. One year later in 2009, Yahoo! set another record by <a href="http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/">sorting 1 PB (1,000 TB) of data in 16 hours</a> on an even larger Hadoop cluster of 3,800 nodes (it took the same cluster only 62 seconds to sort 1 TB of data, easily beating the previous year’s record!).</p>

<p>Basically, the goal of TeraSort is to sort 1TB of data (or any other amount of data you want) as fast as possible. It is a benchmark that combines testing the HDFS and MapReduce layers of a Hadoop cluster. As such it is not surprising that the TeraSort benchmark suite is often used in practice, which has the added benefit that it allows us – among other things – to compare the results of our own cluster with the clusters of other people. You can use the TeraSort benchmark, for instance, to iron out your Hadoop configuration after your cluster has passed a convincing TestDFSIO benchmark first. Typical uses of TeraSort are to determine whether your map and reduce slot assignments are sound (as they depend on variables such as the number of cores per TaskTracker node and the available RAM), whether other MapReduce-related parameters such as <code>io.sort.mb</code> and <code>mapred.child.java.opts</code> are set to proper values, or whether the FairScheduler configuration you came up with really behaves as expected.</p>

<p>A full TeraSort benchmark run consists of the following three steps:</p>

<ol>
  <li>Generating the input data via <code>TeraGen</code>.</li>
  <li>Running the actual <code>TeraSort</code> on the input data.</li>
  <li>Validating the sorted output data via <code>TeraValidate</code>.</li>
</ol>

<p>You do not need to re-generate input data before every <code>TeraSort</code> run (step 2). So you can skip step 1 (TeraGen) for later TeraSort runs if you are satisfied with the generated data.</p>

<p>Figure 1 shows the basic data flow. We use the included HDFS directory names in the later examples.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/hadoop-benchmarking-terasort-data-flow1-505x600.png" width="505" height="600" title="TeraSort Data Flow" alt="Hadoop Benchmarking and Stress Testing: The basic data flow of the TeraSort benchmark suite." /></p>

<div class="caption">
Figure 1: Hadoop Benchmarking and Stress Testing: The basic data flow of the TeraSort benchmark suite
</div>

<p>The next sections describe each of the three steps in greater detail.</p>

<h2 id="teragen-generate-the-terasort-input-data-if-needed">TeraGen: Generate the TeraSort input data (if needed)</h2>

<p><code>TeraGen</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/examples/terasort/TeraGen.html">source code</a>) generates random data that can be conveniently used as input data for a subsequent TeraSort run.</p>

<p>The syntax for running TeraGen is as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar teragen &lt;number of 100-byte rows&gt; &lt;output dir&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Using the HDFS output directory <code>/user/hduser/terasort-input</code> as an example, the command to run TeraGen in order to generate 1TB of input data (i.e. 1,000,000,000,000 bytes) is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar teragen 10000000000 /user/hduser/terasort-input
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Please note that the first parameter supplied to TeraGen is 10 billion (10,000,000,000), i.e. not 1 trillion = 1 TB (1,000,000,000,000). The reason is that the first parameter specifies the <em>number of rows</em> of input data to generate, each of which has a size of 100 bytes.</p>
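<p>If you want to generate a different amount of data, the row count is simply the target size divided by the row size. A minimal Python sketch (the helper name <code>teragen_rows</code> is my own):</p>

```python
ROW_SIZE_BYTES = 100  # each TeraGen row is 100 bytes


def teragen_rows(total_bytes):
    """Number of rows to pass to teragen for a target data size in bytes."""
    return total_bytes // ROW_SIZE_BYTES


one_tb = 10 ** 12  # 1 TB as used in this article (decimal, not 2**40)
print(teragen_rows(one_tb))  # 10000000000
```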

<p>Here is the actual TeraGen data format per row to clear things up:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">&lt;10 bytes key&gt;&lt;10 bytes rowid&gt;&lt;78 bytes filler&gt;\r\n</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>where</p>

<ol>
  <li>The <code>keys</code> are random characters from the set ‘ ‘ .. ‘~’.</li>
  <li>The <code>rowid</code> is the right-justified row id as an int.</li>
  <li>The <code>filler</code> consists of 7 runs of 10 characters from ‘A’ to ‘Z’.</li>
</ol>
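<p>The row layout can be mimicked in a few lines of Python. Note that this is only a sketch of the byte layout described above; it is not TeraGen’s actual generator, and the filler pattern in particular is a placeholder:</p>

```python
import random
import string


def make_row(row_id, rng):
    """Build one 100-byte row: 10-byte key, 10-byte row id, 78-byte filler, CR LF."""
    printable = [chr(c) for c in range(ord(" "), ord("~") + 1)]
    key = "".join(rng.choice(printable) for _ in range(10))
    rowid = str(row_id).rjust(10)  # right-justified row id
    filler = (string.ascii_uppercase * 3)[:78]  # placeholder, NOT TeraGen's exact pattern
    return (key + rowid + filler + "\r\n").encode("ascii")


row = make_row(42, random.Random(0))
print(len(row))  # 100
```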

<p>If you have a very fast cluster, your map tasks might finish in a few seconds if you use a relatively small default HDFS block size such as 128MB (which is not a bad idea in general, though). This means that the time to start/stop map tasks might be larger than the time to perform the actual task. In other words, the overhead for managing the TaskTrackers might exceed the job’s “payload”. An easy way to work around this is to increase the HDFS block size for files written by TeraGen. Keep in mind that the HDFS block size is a per-file setting, and the value specified by the <code>dfs.block.size</code> property in <code>hdfs-default.xml</code> (or <code>conf/hdfs-site.xml</code> if you use a custom configuration file) is just a default value. So if, for example, you want to use an HDFS block size of 512MB (i.e. 536870912 bytes) for the TeraSort benchmark suite, override <code>dfs.block.size</code> when running TeraGen:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar teragen -D dfs.block.size<span class="o">=</span>536870912 ...
</span></code></pre></td></tr></table></div></figure></notextile></div>
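<p>To get a feeling for how the block size changes the amount of work per task, the following Python sketch converts a block size to bytes and estimates how many HDFS blocks 1TB of input occupies. It is a rough estimate that assumes roughly one TeraSort map task per block and ignores that TeraGen spreads the data over several files; the helper names are my own:</p>

```python
def block_size_bytes(megabytes):
    """Convert a block size in (binary) megabytes to bytes for dfs.block.size."""
    return megabytes * 1024 * 1024


def blocks_needed(total_bytes, block_bytes):
    """Number of HDFS blocks needed to hold total_bytes, rounded up."""
    return -(-total_bytes // block_bytes)


print(block_size_bytes(512))  # 536870912

one_tb = 10 ** 12
print(blocks_needed(one_tb, block_size_bytes(128)))  # blocks at 128MB
print(blocks_needed(one_tb, block_size_bytes(512)))  # blocks at 512MB
```

<p>Quadrupling the block size cuts the number of blocks, and thus roughly the number of map tasks, to about a quarter, so each task has more “payload” relative to its startup overhead.</p>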

<h2 id="terasort-run-the-actual-terasort-benchmark">TeraSort: Run the actual TeraSort benchmark</h2>

<p><code>TeraSort</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/examples/terasort/TeraSort.html">source code</a>) is implemented as a MapReduce sort job with a custom partitioner that uses a sorted list of <em>n-1</em> sampled keys that define the key range for each reduce.</p>
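<p>The idea behind such a partitioner can be sketched in Python with a binary search over the sorted split points. This only illustrates the concept; the actual TeraSort implementation may perform the lookup differently, and the sample keys below are hypothetical:</p>

```python
import bisect


def make_partitioner(split_points):
    """Return a function that maps a key to a reducer index.

    split_points is a sorted list of n-1 sampled keys for n reducers.
    """
    def partition(key):
        # Keys below the first split point go to reducer 0, and so on.
        return bisect.bisect_right(split_points, key)
    return partition


# Hypothetical sampled keys for 4 reducers (3 split points).
partition = make_partitioner(["g", "n", "t"])
print([partition(k) for k in ["apple", "grape", "melon", "zebra"]])  # [0, 1, 1, 3]
```

<p>Because every key sent to reducer <em>i</em> sorts before every key sent to reducer <em>i+1</em>, concatenating the sorted reducer outputs in order yields a globally sorted result.</p>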

<p>The syntax to run the TeraSort benchmark is as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar terasort &lt;input dir&gt; &lt;output dir&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Using the input directory <code>/user/hduser/terasort-input</code> and the output directory <code>/user/hduser/terasort-output</code> as an example, the command to run the TeraSort benchmark is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar terasort /user/hduser/terasort-input /user/hduser/terasort-output
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="teravalidate-validate-the-sorted-output-data-of-terasort">TeraValidate: Validate the sorted output data of TeraSort</h2>

<p><code>TeraValidate</code> (<a href="http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/examples/terasort/TeraValidate.html">source code</a>) ensures that the output data of <code>TeraSort</code> is globally sorted.</p>

<p>TeraValidate creates one map task per file in TeraSort’s output directory. A map task ensures that each key is greater than or equal to the previous one, i.e. that the keys within its file are in sorted order. The map task also generates records with the first and last keys of the file, and the reduce task ensures that the first key of file <em>i</em> is greater than the last key of file <em>i-1</em>. The reduce tasks do not generate any output if everything is properly sorted. If they do detect any problems, they output the keys that are out of order.</p>
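<p>The two-stage check can be sketched in plain Python. This is an illustration of the logic only, not TeraValidate’s code: the function names are my own, each “file” is just a list of keys, and equal keys at a file boundary are accepted:</p>

```python
def check_file(keys):
    """Map-side check: keys within one file must be in non-decreasing order."""
    return keys == sorted(keys)


def check_boundaries(files):
    """Reduce-side check: the first key of a file must not sort before
    the last key of the preceding file."""
    edges = [k for f in files if f for k in (f[0], f[-1])]
    return edges == sorted(edges)


def globally_sorted(files):
    """True if every file is sorted and the file boundaries are in order."""
    return all(check_file(f) for f in files) and check_boundaries(files)


good = [["a", "c"], ["d", "f"]]
bad = [["a", "e"], ["d", "f"]]  # each file is sorted, but the files overlap
print(globally_sorted(good), globally_sorted(bad))  # True False
```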

<p>The syntax to run the TeraValidate test is as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar teravalidate &lt;terasort output dir <span class="o">(=</span> input data<span class="o">)</span>&gt; &lt;teravalidate output dir&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Using the output directory <code>/user/hduser/terasort-output</code> from the previous sections and the report (output) directory <code>/user/hduser/terasort-validate</code> as an example, the command to run the TeraValidate test is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*examples*.jar teravalidate /user/hduser/terasort-output /user/hduser/terasort-validate
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="further-tips-and-tricks">Further tips and tricks</h2>

<p>Hadoop provides a very convenient way to access statistics about a job from the command line:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop job -history all &lt;job output directory&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This command retrieves a job’s history files (two files that are by default stored in <code>&lt;job output directory&gt;/_logs/history</code>) and computes job statistics from them:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop fs -ls /user/hduser/terasort-input/_logs/history
</span><span class="line">Found 2 items
</span><span class="line">-rw-r--r--   3 hadoop supergroup      17877 2011-02-11 11:58 /user/hduser/terasort-input/_logs/history/master_1297410440884_job_201102110201_0014_conf.xml
</span><span class="line">-rw-r--r--   3 hadoop supergroup      36758 2011-02-11 11:58 /user/hduser/terasort-input/_logs/history/master_1297410440884_job_201102110201_0014_hadoop_TeraGen
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Note: Unfortunately, not all benchmarks/tests shipped with Hadoop write such job history log files. The TestDFSIO benchmark, for instance, does not save job history files.
</div>

<p>Here is an exemplary snippet of such job statistics for <code>TeraGen</code> from a small cluster:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop job -history all /user/hduser/terasort-input
</span><span class="line">
</span><span class="line">Hadoop job: <span class="nv">job_201102110201_0014</span>
</span><span class="line"><span class="o">=====================================</span>
</span><span class="line">Job tracker host name: master
</span><span class="line">job tracker start <span class="nb">time</span>: Fri Feb 11 2011
</span><span class="line">User: hadoop
</span><span class="line">JobName: TeraGen
</span><span class="line">JobConf: hdfs://master:54310/app/hadoop/tmp/mapred/system/job_201102110201_0014/job.xml
</span><span class="line">Submitted At: 11-Feb-2011
</span><span class="line">Launched At: 11-Feb-2011 13:58:14 <span class="o">(</span>0sec<span class="o">)</span>
</span><span class="line">Finished At: 11-Feb-2011 15:00:56 <span class="o">(</span>1hrs, 2mins, 41sec<span class="o">)</span>
</span><span class="line">Status: <span class="nv">SUCCESS</span>
</span><span class="line"><span class="o">=====================================</span>
</span><span class="line">
</span><span class="line">Task <span class="nv">Summary</span>
</span><span class="line"><span class="o">============================</span>
</span><span class="line">Kind    Total   Successful      Failed  Killed  StartTime       FinishTime
</span><span class="line">
</span><span class="line">Setup   1       1               0       0       11-Feb-2011 13:58:15    11-Feb-2011 13:58:16 <span class="o">(</span>1sec<span class="o">)</span>
</span><span class="line">Map     24      24              0       0       11-Feb-2011 13:58:18    11-Feb-2011 15:00:47 <span class="o">(</span>1hrs, 2mins, 28sec<span class="o">)</span>
</span><span class="line">Reduce  0       0               0       0
</span><span class="line">Cleanup 1       1               0       0       11-Feb-2011 15:00:50    11-Feb-2011 15:00:53 <span class="o">(</span>2sec<span class="o">)</span>
</span><span class="line"><span class="o">============================</span>
</span><span class="line">
</span><span class="line"><span class="nv">Analysis</span>
</span><span class="line"><span class="o">=========</span>
</span><span class="line">
</span><span class="line">Time taken by best performing map task task_201102110201_0014_m_000006: 59mins, 5sec
</span><span class="line">Average <span class="nb">time </span>taken by map tasks: 1hrs, 1mins, 24sec
</span><span class="line">Worse performing map tasks:
</span><span class="line">TaskId          Timetaken
</span><span class="line">task_201102110201_0014_m_000004 1hrs, 2mins, 24sec
</span><span class="line">task_201102110201_0014_m_000020 1hrs, 2mins, 19sec
</span><span class="line">task_201102110201_0014_m_000013 1hrs, 2mins, 9sec
</span><span class="line"><span class="o">[</span>...<span class="o">]</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="namenode-benchmark-nnbench">NameNode benchmark (nnbench)</h1>

<p><code>NNBench</code> (see <code>src/test/org/apache/hadoop/hdfs/NNBench.java</code>) is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small “payloads” for the sole purpose of putting a high HDFS management stress on the NameNode. The benchmark can simulate requests for creating, reading, renaming and deleting files on HDFS.</p>

<p>I like to run this test simultaneously from several machines – e.g. from a set of DataNode boxes – in order to hit the NameNode from multiple locations at the same time.</p>

<p>The syntax of NNBench is as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class=""><span class="line">NameNode Benchmark 0.4
</span><span class="line">Usage: nnbench &lt;options&gt;
</span><span class="line">Options:
</span><span class="line">        -operation &lt;Available operations are create_write open_read rename delete. This option is mandatory&gt;
</span><span class="line">         * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
</span><span class="line">        -maps &lt;number of maps. default is 1. This is not mandatory&gt;
</span><span class="line">        -reduces &lt;number of reduces. default is 1. This is not mandatory&gt;
</span><span class="line">        -startTime &lt;time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time&gt;. default is launch time + 2 mins. This is not mandatory
</span><span class="line">        -blockSize &lt;Block size in bytes. default is 1. This is not mandatory&gt;
</span><span class="line">        -bytesToWrite &lt;Bytes to write. default is 0. This is not mandatory&gt;
</span><span class="line">        -bytesPerChecksum &lt;Bytes per checksum for the files. default is 1. This is not mandatory&gt;
</span><span class="line">        -numberOfFiles &lt;number of files to create. default is 1. This is not mandatory&gt;
</span><span class="line">        -replicationFactorPerFile &lt;Replication factor for the files. default is 1. This is not mandatory&gt;
</span><span class="line">        -baseDir &lt;base DFS path. default is /becnhmarks/NNBench. This is not mandatory&gt;
</span><span class="line">        -readFileAfterOpen &lt;true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory&gt;
</span><span class="line">        -help: Display the help statement</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The following command will run a NameNode benchmark that creates 1,000 files using 12 maps and 6 reducers. It uses a custom output directory based on the machine’s short hostname. This is a simple trick to ensure that one box does not accidentally write into the same output directory as another box running NNBench at the same time.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar nnbench -operation create_write <span class="se">\</span>
</span><span class="line">    -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 <span class="se">\</span>
</span><span class="line">    -replicationFactorPerFile 3 -readFileAfterOpen <span class="nb">true</span> <span class="se">\</span>
</span><span class="line">    -baseDir /benchmarks/NNBench-<span class="sb">`</span>hostname -s<span class="sb">`</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note that by default the benchmark waits 2 minutes before it actually starts!</p>
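<p>If you start NNBench from several machines, you may want all of them to use the same explicit <code>-startTime</code>, which the usage text above specifies in seconds since the epoch. A minimal Python sketch for computing such a value (the helper name and the five-minute delay are arbitrary choices of mine):</p>

```python
import time


def nnbench_start_time(delay_seconds, now=None):
    """Epoch seconds for a start time delay_seconds in the future."""
    if now is None:
        now = time.time()
    return int(now) + delay_seconds


# Example: schedule the start five minutes from now.
print(nnbench_start_time(300))
```

<p>Compute the value once and pass the same number to every box, and make sure the clocks of the machines are reasonably in sync.</p>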

<h1 id="mapreduce-benchmark-mrbench">MapReduce benchmark (mrbench)</h1>

<p><code>MRBench</code> (see <code>src/test/org/apache/hadoop/mapred/MRBench.java</code>) loops a small job a number of times. As such it nicely complements the “large-scale” TeraSort benchmark suite because MRBench checks whether <em>small</em> job runs are responsive and running efficiently on your cluster. It puts its focus on the MapReduce layer as its impact on the HDFS layer is very limited.</p>

<p>This test should be run from a single box (see caveat below). The command syntax can be displayed via <code>mrbench --help</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class=""><span class="line">MRBenchmark.0.0.2
</span><span class="line">Usage: mrbench [-baseDir &lt;base DFS path for output/input, default is /benchmarks/MRBench&gt;]
</span><span class="line">          [-jar &lt;local path to job jar file containing Mapper and Reducer implementations, default is current jar file&gt;]
</span><span class="line">          [-numRuns &lt;number of times to run the job, default is 1&gt;]
</span><span class="line">          [-maps &lt;number of maps for each run, default is 2&gt;]
</span><span class="line">          [-reduces &lt;number of reduces for each run, default is 1&gt;]
</span><span class="line">          [-inputLines &lt;number of input lines to generate, default is 1&gt;]
</span><span class="line">          [-inputType &lt;type of input to generate, one of ascending (default), descending, random&gt;]
</span><span class="line">          [-verbose]</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Important note: In Hadoop 0.20.2, setting the <code>-baseDir</code> parameter has no effect. This means that multiple parallel <code>MRBench</code> runs (e.g. started from different boxes) might interfere with each other. This is a known bug (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-2398">MAPREDUCE-2398</a>). I have submitted a patch but it has not been integrated yet.
</div>

<p>In Hadoop 0.20.2, the parameters default to:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class=""><span class="line">-baseDir: /benchmarks/MRBench  [*** see my note above ***]
</span><span class="line">-numRuns: 1
</span><span class="line">-maps: 2
</span><span class="line">-reduces: 1
</span><span class="line">-inputLines: 1
</span><span class="line">-inputType: ascending</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The command to run a loop of 50 small test jobs is:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-*test*.jar mrbench -numRuns 50
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Exemplary output of the above command:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">DataLines       Maps    Reduces AvgTime (milliseconds)
</span><span class="line">1               2       1       31414</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This means that the average finish time of the executed jobs was about 31 seconds (31,414 milliseconds).</p>

<h1 id="summary">Summary</h1>

<p>I hope you have found my quick overview of Hadoop’s benchmarking and testing tools useful! Feel free to provide your feedback, corrections and suggestions in the comments below.</p>

<h1 id="related-articles">Related Articles</h1>

<p>From yours truly:</p>

<ul>
  <li><a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/">Running Hadoop On Ubuntu Linux (Single-Node Cluster)</a></li>
  <li><a href="http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/">Running Hadoop On Ubuntu Linux (Multi-Node Cluster)</a></li>
  <li><a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">Writing An Hadoop MapReduce Program In Python</a></li>
</ul>

<p>From others:</p>

<ul>
  <li><a href="https://github.com/intel-hadoop/HiBench">HiBench</a> - a Hadoop benchmark suite developed by Intel</li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Hadoop space quotas, HDFS block size, replication and small files]]></title>
    <link href="http://www.michael-noll.com/blog/2011/03/28/hadoop-space-quotas-hdfs-block-size-replication-and-small-files/"/>
    <updated>2011-03-28T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2011/03/28/hadoop-space-quotas-hdfs-block-size-replication-and-small-files</id>
    <content type="html"><![CDATA[<p>A common way to put restrictions on the resource usage of your Hadoop cluster is to set quotas.  Recently, I ran into a
problem where I could not upload a small file to an empty HDFS directory for which a space quota of 200MB was set.</p>

<!--more-->

<p>I won’t go into the details of
<a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_quota_admin_guide.html">using quotas in Hadoop</a>, but here it is in a
nutshell.  Hadoop differentiates between two kinds of quotas:
<a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_quota_admin_guide.html#Name+Quotas">name quotas</a> and
<a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_quota_admin_guide.html#Space+Quotas">space quotas</a>.  The former limits
the <em>number</em> of file and (sub)directory names whereas the latter limits the HDFS “disk” space of the directory (i.e. the
number of bytes used by files under the tree rooted at this directory).</p>

<p>You can set HDFS name quotas with the command</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -setQuota &lt;max_number&gt; &lt;directory&gt;</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>and you can set HDFS space quotas with the command</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop dfsadmin -setSpaceQuota &lt;max_size&gt; &lt;directory&gt;</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>To clear quotas, use <code>-clrQuota</code> and <code>-clrSpaceQuota</code>, respectively.</p>

<p>So much for the introduction. Recently, I stumbled upon a problem where Hadoop (version 0.20.2) reported a quota
violation for a directory for which a space quota of 200 MB (209,715,200 bytes) was set but no name quota:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop fs -count -q /user/jsmith
</span><span class="line">none    inf    209715200    209715200    5   1   0</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The output columns for <a href="http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html#count"><code>fs -count -q</code></a> are:
QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.</p>
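
<p>To make the column order concrete, here is a minimal Python sketch that labels the fields of one line of <code>fs -count -q</code> output.  The helper name <code>parse_count_q</code> is hypothetical, and the sketch assumes whitespace-separated columns in the order listed above.  The sample line is the one from the quota report shown earlier, which carries no trailing FILE_NAME value.</p>

```python
# Column names as listed for `hadoop fs -count -q` (Hadoop 0.20.2).
COLUMNS = [
    "QUOTA", "REMAINING_QUOTA", "SPACE_QUOTA", "REMAINING_SPACE_QUOTA",
    "DIR_COUNT", "FILE_COUNT", "CONTENT_SIZE", "FILE_NAME",
]


def parse_count_q(line):
    """Map one whitespace-separated output line to a dict keyed by column name.

    Hypothetical helper for illustration only. Numeric fields become ints;
    "none" and "inf" (no quota set) are kept as strings. Columns missing at
    the end of the line are simply absent from the result.
    """
    record = {}
    for name, value in zip(COLUMNS, line.split()):
        record[name] = int(value) if value.isdigit() else value
    return record


report = parse_count_q("none    inf    209715200    209715200    5   1   0")
print(report["SPACE_QUOTA"], report["REMAINING_SPACE_QUOTA"])
```

<p>Read this way, the name quota is unset (<code>none</code>/<code>inf</code>) while both the space quota and the remaining space quota are 209,715,200 bytes.</p>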

<p>The directory <code>/user/jsmith</code> was empty and not a single byte was used yet (according to the quota report above). However, when I tried to upload a very small file to the directory, the upload failed with a (space) quota violation.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop fs -copyFromLocal small-file.txt /user/jsmith
</span><span class="line">11/03/25 13:09:16 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The DiskSpace quota of /user/jsmith is exceeded: quota=209715200 diskspace consumed=384.0m
</span><span class="line">at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
</span><span class="line">at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
</span><span class="line">at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
</span><span class="line">at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
</span><span class="line">at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
</span><span class="line">at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:58)
</span><span class="line">at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2939)
</span><span class="line">at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
</span><span class="line">at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
</span><span class="line">at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
</span><span class="line">11/03/25 13:09:16 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
</span><span class="line">11/03/25 13:09:16 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/jsmith/small-file.txt" - Aborting...
</span><span class="line">copyFromLocal: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The DiskSpace quota of /user/jsmith is exceeded: quota=209715200 diskspace consumed=384.0m
</span><span class="line">11/03/25 13:09:16 ERROR hdfs.DFSClient: Exception closing file /user/jsmith/small-file.txt :
</span><span class="line">org.apache.hadoop.hdfs.protocol.DSQuotaExceededException:
</span><span class="line">org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The DiskSpace quota of /user/jsmith is exceeded: quota=209715200 diskspace consumed=384.0m</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Clearly, I thought, the small file could not have exceeded the space quota of 200MB.  I also checked whether the
<a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html">Trash feature</a> of Hadoop could be the culprit but that
wasn’t the case.</p>

<p>Eventually, the mystery was solved: Hadoop checks space quotas during space <em>allocation</em>.  This means that HDFS block
size (here: 128MB) and the replication factor of the file (here: 3, i.e. the default value in the cluster set by the
<code>dfs.replication</code> property) play an important role.  In my case, this is what seems to have happened:  When I tried
to copy the local file to HDFS, Hadoop figured it would require a single block of 128MB size to store the small file.
With replication factored in, the total space would be 3 * 128MB = 384 MB. And this would violate the space quota of
200MB.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">  required_number_of_HDFS_blocks * HDFS_block_size * replication_count
</span><span class="line">= 1 * 128MB * 3 = 384MB &gt; 200MB.</span></code></pre></td></tr></table></div></figure></notextile></div>
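
<p>The same arithmetic can be written as a small Python sketch.  The function name and the 4 KB file size are assumptions made for illustration (the actual size of the small file is not stated here), and the function only mirrors the formula above by counting every block as a full block on each replica.  It is not how the NameNode implements the check.</p>

```python
import math

MB = 1024 * 1024


def space_needed_for_allocation(file_size, block_size, replication):
    """Bytes charged against the space quota if every block of the file is
    counted as a full block on each replica (the worst case at allocation)."""
    blocks = max(1, math.ceil(file_size / block_size))
    return blocks * block_size * replication


quota = 200 * MB            # 209,715,200 bytes, as set on /user/jsmith
small_file = 4 * 1024       # assumed size of the small file (not given in the text)

for replication in (3, 1):
    needed = space_needed_for_allocation(small_file, 128 * MB, replication)
    verdict = "exceeds" if needed > quota else "fits within"
    print("replication=%d: %d MB %s the 200 MB quota" % (replication, needed // MB, verdict))
```

<p>With a replication factor of 3 the sketch arrives at 384 MB, the same figure Hadoop reports in the exception above, whereas a replication factor of 1 needs only 128 MB.</p>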

<p>I verified this by manually overriding the default replication factor of 3 with 1 via</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ hadoop fs -D dfs.replication=1 -copyFromLocal small-file.txt /user/jsmith</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>which worked successfully.</p>

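<p>To be honest, I was a bit puzzled at first because – remembering Tom White’s Hadoop book – a file in HDFS that is
smaller than a single HDFS block does not occupy a full block’s worth of underlying storage (i.e. a kind of “sparse”
use of storage).  The quota check, however, happens when a block is allocated, and at that point HDFS must assume that a
full block may be written to each replica.</p>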

<h1 id="related-articles">Related Articles</h1>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/">Understanding HDFS quotas and Hadoop fs and fsck tools</a></li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Virtualenv Cheat Sheet]]></title>
    <link href="http://www.michael-noll.com/blog/2010/11/29/virtualenv-cheat-sheet/"/>
    <updated>2010-11-29T00:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2010/11/29/virtualenv-cheat-sheet</id>
    <content type="html"><![CDATA[<p><code>virtualenv</code> is a tool to create isolated Python environments.</p>

<p>Such virtual environments are very helpful to make sure that each of your Python applications runs in a healthy
environment.  They allow your applications to use different - even conflicting - versions of Python modules: say,
app A requires YourModule v1.0 whereas app B requires YourModule v3.1.  In addition, virtual environments make sure
that updating Python modules in one place/for one application does not break other applications.</p>

<!--more-->

<p>The full documentation is available at <a href="http://pypi.python.org/pypi/virtualenv">http://pypi.python.org/pypi/virtualenv</a>,
from which parts of this text have been taken/modified.</p>

<h1 id="installation">Installation</h1>

<p>These instructions might vary depending on your operating system. The next lines are the instructions for Ubuntu.</p>

<p>Install setuptools:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ sudo apt-get install python-setuptools</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Install <code>virtualenv</code> and <code>pip</code> (actually, pip should be installed automatically as a dependency of recent virtualenv
versions):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ sudo easy_install virtualenv pip</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="creating-a-python-sandbox">Creating a Python sandbox</h1>

<h2 id="create-a-virtual-environment">Create a virtual environment</h2>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ virtualenv yourenv</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>or</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ virtualenv --no-site-packages yourenv</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you build with the optional <code>--no-site-packages</code> switch, your virtual environment will not inherit any packages
from your “global” site-packages directory.  This can be used if you don’t have control over site-packages and don’t
want to depend on the packages there, or you just want more isolation from the global system.</p>

<h2 id="activate-the-virtual-environment">Activate the virtual environment</h2>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">$ source yourenv/bin/activate</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>This will change your <code>$PATH</code> to point to the virtualenv <code>bin/</code> directory, and update your prompt.  This is all
it does.  If you use the complete path like <code>/path/to/yourenv/bin/python myscript.py</code>, you <em>do not</em> need to activate
the environment first.  You must use <code>source</code> because it changes the environment in-place.</p>
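
<p>If you are unsure which interpreter a script ends up using, a small Python check can help.  This is a minimal sketch, and the function name <code>in_virtual_environment</code> is made up for illustration.  It relies on <code>sys.real_prefix</code>, which older virtualenv releases set, and on <code>sys.base_prefix</code>, which Python’s built-in <code>venv</code> module and newer virtualenv releases use instead.</p>

```python
import os
import sys


def in_virtual_environment():
    """Return True if this interpreter runs inside a virtual environment.

    Older virtualenv releases set sys.real_prefix; the standard-library venv
    module and newer virtualenv releases make sys.base_prefix differ from
    sys.prefix instead.
    """
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
print("first PATH entry:", os.environ.get("PATH", "").split(os.pathsep)[0])
```

<p>Save the sketch to a file and run it once with your system Python and once via <code>/path/to/yourenv/bin/python</code>: the reported interpreter path shows which environment is in effect, whether or not you activated it first.</p>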

<h2 id="adding-custom-python-modules-to-the-virtual-environment">Adding custom Python modules to the virtual environment</h2>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">(yourenv)$ pip install YourFancyModule</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you have activated the virtual environment, running <code>pip</code> within your environment will install
<code>YourFancyModule</code> only to this virtual environment.  After all, this was the purpose of using
<code>virtualenv</code> in the first place, right?</p>

<h2 id="deactivate-the-virtual-environment">Deactivate the virtual environment</h2>

<p>After activating an environment you can use the function <code>deactivate</code> to undo the changes.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">(yourenv)$ deactivate</span></code></pre></td></tr></table></div></figure></notextile></div>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Reference implementation of SPEAR algorithm released]]></title>
    <link href="http://www.michael-noll.com/blog/2010/07/10/reference-implementation-of-spear-algorithm-released/"/>
    <updated>2010-07-10T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2010/07/10/reference-implementation-of-spear-algorithm-released</id>
    <content type="html"><![CDATA[<p>I have just released the <a href="http://github.com/miguno/Spear">reference implementation</a> of our
<a href="http://www.spear-algorithm.org/">SPEAR ranking algorithm</a>.  The library is written in the Python programming language,
and should be straightforward to use.  You can install the library via Python’s setuptools/easy_install or download it
directly from <a href="http://github.com/miguno/Spear">GitHub</a>.</p>

<p>Here’s a quick example of how to use it from the Python interpreter:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">datetime</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">spear</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="n">activities</span> <span class="o">=</span> <span class="p">[</span>
</span><span class="line"><span class="o">...</span> <span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2010</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span> <span class="s">&quot;alice&quot;</span><span class="p">,</span> <span class="s">&quot;http://www.michael-noll.com/&quot;</span><span class="p">),</span>
</span><span class="line"><span class="o">...</span> <span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2010</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">12</span><span class="p">,</span><span class="mi">45</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span> <span class="s">&quot;bob&quot;</span><span class="p">,</span> <span class="s">&quot;http://www.michael-noll.com/&quot;</span><span class="p">),</span>
</span><span class="line"><span class="o">...</span> <span class="p">]</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="n">spear_algorithm</span> <span class="o">=</span> <span class="n">spear</span><span class="o">.</span><span class="n">Spear</span><span class="p">(</span><span class="n">activities</span><span class="p">)</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="n">expertise_results</span><span class="p">,</span> <span class="n">quality_results</span> <span class="o">=</span> <span class="n">spear_algorithm</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<!--more-->

<p>Get the top user and their expertise score:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="n">expertise_score</span><span class="p">,</span> <span class="n">user</span> <span class="o">=</span> <span class="n">expertise_results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="k">print</span> <span class="s">&quot;</span><span class="si">%s</span><span class="s"> =&gt; </span><span class="si">%.4f</span><span class="s">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">expertise_score</span><span class="p">)</span>
</span><span class="line"><span class="n">alice</span> <span class="o">=&gt;</span> <span class="mf">0.5858</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Get the top resource and its quality score:</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="python"><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="n">quality_score</span><span class="p">,</span> <span class="n">resource</span> <span class="o">=</span> <span class="n">quality_results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span><span class="line"><span class="o">&gt;&gt;&gt;</span> <span class="k">print</span> <span class="s">&quot;</span><span class="si">%s</span><span class="s"> =&gt; </span><span class="si">%.4f</span><span class="s">&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">resource</span><span class="p">,</span> <span class="n">quality_score</span><span class="p">)</span>
</span><span class="line"><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">www</span><span class="o">.</span><span class="n">michael</span><span class="o">-</span><span class="n">noll</span><span class="o">.</span><span class="n">com</span><span class="o">/</span> <span class="o">=&gt;</span> <span class="mf">1.0000</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>You can also use the library to simulate the <a href="http://en.wikipedia.org/wiki/HITS_algorithm">HITS algorithm</a> of Jon
Kleinberg.  Simply supply a credit score function <code>C(x) = 1</code> to the SPEAR algorithm (see the documentation of the
<code>Spear.run()</code> method).</p>
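
<p>To illustrate what swapping the credit score function does, here is a self-contained toy sketch in plain Python.  It is <em>not</em> the code of the <code>spear</code> library and needs neither SciPy nor NumPy.  The function <code>rank</code> and its parameters are made up for this example, and it assumes that a user’s weight for a resource is <code>C(1 + number of users who tagged the resource later)</code>, with the square root as the SPEAR-style credit function.</p>

```python
import datetime
import math


def rank(activities, credit=math.sqrt, iterations=50):
    """Toy mutual-reinforcement ranking over (timestamp, user, resource) tuples.

    Illustration only, not the code of the spear library. A user's weight
    for a resource is credit(1 + number of users who tagged it later), so
    credit=math.sqrt rewards early discoverers and a constant credit of 1
    reduces the update to plain HITS.
    """
    users = sorted({user for _, user, _ in activities})
    resources = sorted({res for _, _, res in activities})
    # weight[(user, resource)] = credit(1 + number of later taggers)
    weight = {}
    for when, user, res in activities:
        later = sum(1 for t, _, r in activities if r == res and t > when)
        weight[(user, res)] = credit(1 + later)

    expertise = {user: 1.0 for user in users}
    quality = {res: 1.0 for res in resources}
    for _ in range(iterations):
        # Experts are users who tagged high-quality resources ...
        expertise = {u: sum(w * quality[r] for (wu, r), w in weight.items() if wu == u)
                     for u in users}
        # ... and high-quality resources are those tagged by experts.
        quality = {r: sum(w * expertise[u] for (u, wr), w in weight.items() if wr == r)
                   for r in resources}
        # Normalize so that each score vector sums to 1.
        e_total, q_total = sum(expertise.values()), sum(quality.values())
        expertise = {u: s / e_total for u, s in expertise.items()}
        quality = {r: s / q_total for r, s in quality.items()}
    return expertise, quality


activities = [
    (datetime.datetime(2010, 7, 1, 9, 0, 0), "alice", "http://www.michael-noll.com/"),
    (datetime.datetime(2010, 8, 1, 12, 45, 0), "bob", "http://www.michael-noll.com/"),
]
spear_like, _ = rank(activities)                        # square-root credit
hits_like, _ = rank(activities, credit=lambda x: 1.0)   # C(x) = 1
print("%.4f %.4f" % (spear_like["alice"], hits_like["alice"]))
```

<p>With the square-root credit, alice is rewarded for bookmarking the page before bob and scores about 0.5858.  With <code>C(x) = 1</code> the timing no longer matters and both users end up with 0.5.</p>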

<p>Feel free to play around with it and send me feedback!</p>

<p>PS: The SPEAR Python library requires <a href="http://www.scipy.org/">SciPy/NumPy</a>.  If you don’t have these installed already,
here are <a href="http://www.scipy.org/Installing_SciPy">some installation instructions</a> to get you started.</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[How To Extract Audio From FLV Files Using VLC]]></title>
    <link href="http://www.michael-noll.com/blog/2010/01/20/how-to-extract-audio-from-flv-files-using-vlc/"/>
    <updated>2010-01-20T00:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2010/01/20/how-to-extract-audio-from-flv-files-using-vlc</id>
    <content type="html"><![CDATA[<p>This method allows you to extract the audio of FLV files (e.g. YouTube videos) using VLC and save it in MP3 format.
I tested the instructions below with VLC on Mac OS X.</p>

<!--more-->

<ol>
  <li>Open the FLV file with VLC and stop it as soon as it starts playing.</li>
  <li>Open the VLC Wizard by clicking on File &gt; Streaming/Exporting Wizard…</li>
  <li>Select Transcode/Save to file. Next.</li>
  <li>Select your file from the Playlist. Next.</li>
  <li>Check only the Transcode Audio checkbox (leave Video unchecked). Select 192 kb/s bitrate, and MP3 as the Audio codec.
Next.</li>
  <li>Select MPEG-1 as encapsulation method. Save the file with any name, but with the extension MPG. (Don’t use MP3 at this
time). Press Finish.</li>
</ol>

<p>Once the transcoding is finished, repeat steps 1 to 6, except for:</p>

<ul>
  <li>Choose the MPG file you just created as your input file.</li>
  <li>Change the encapsulation method to RAW. Save the new file with the extension MP3.</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Invited article for Yahoo! on SPEAR algorithm]]></title>
    <link href="http://www.michael-noll.com/blog/2009/09/03/invited-article-for-yahoo-on-spear-algorithm/"/>
    <updated>2009-09-03T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2009/09/03/invited-article-for-yahoo-on-spear-algorithm</id>
    <content type="html"><![CDATA[<p>A couple of days ago, my co-worker Ching-man Au Yeung and I were approached by <a href="http://www.yahoo.com/">Yahoo!</a> to write
a guest article about our work on the <a href="http://www.spear-algorithm.org/">SPEAR algorithm</a> for the
<a href="http://blog.delicious.com/blog/2009/08/how-spear-identifies-domain-experts-within-delicious.html">Delicious.com blog</a>.</p>

<p>Well, I’m happy to announce that the article is published now:
<a href="http://blog.delicious.com/blog/2009/08/how-spear-identifies-domain-experts-within-delicious.html">How SPEAR Identifies Domain Experts within Delicious</a>.
Check it out while it’s still hot ;-)</p>

<p>Thanks again to <a href="http://zooie.wordpress.com/bio/">Vik Singh</a> and <a href="http://aseidman.com/about/">Ariel Seidman</a> from Yahoo!
for this great opportunity and for their support!</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[German characters in VMware Fusion on Mac OS&nbsp;X]]></title>
    <link href="http://www.michael-noll.com/blog/2009/08/25/german-characters-in-vmware-fusion-on-mac-os-x/"/>
    <updated>2009-08-25T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2009/08/25/german-characters-in-vmware-fusion-on-mac-os-x</id>
    <content type="html"><![CDATA[<p>Here are some notes on using special characters on a German keyboard in VMware Fusion on Mac OS X.</p>

<p>The basic rule is <code>Ctrl-Alt</code> plus the special character as listed on my Logitech G15 keyboard (which has special
characters listed in the bottom right of each key).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Character</th>
      <th style="text-align: left">Shortcut (DE keyboard)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">@</td>
      <td style="text-align: left">Ctrl-Alt-Q</td>
    </tr>
    <tr>
      <td style="text-align: left">\ (backslash)</td>
      <td style="text-align: left">Ctrl-Alt-ß</td>
    </tr>
    <tr>
      <td style="text-align: left">€ (Euro symbol)</td>
      <td style="text-align: left">Ctrl-Alt-E</td>
    </tr>
    <tr>
      <td style="text-align: left">| (pipe symbol)</td>
      <td style="text-align: left">Ctrl-Alt-&lt;</td>
    </tr>
    <tr>
      <td style="text-align: left">~ (tilde)</td>
      <td style="text-align: left">Ctrl-Alt-+</td>
    </tr>
  </tbody>
</table>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Technology Review article on our expertise ranking approach from SIGIR '09]]></title>
    <link href="http://www.michael-noll.com/blog/2009/07/31/technology-review-article-on-our-expertise-ranking-approach-from-sigir-09/"/>
    <updated>2009-07-31T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2009/07/31/technology-review-article-on-our-expertise-ranking-approach-from-sigir-09</id>
    <content type="html"><![CDATA[The US magazine <a href="http://www.technologyreview.com/web/23100/">Technology Review</a> has <a href="http://www.technologyreview.com/web/23100/">published an article</a> about our <a href="http://www.michael-noll.com/blog/2009/06/05/telling-experts-from-spammers-expertise-ranking-in-folksonomies/">SPEAR algorithm</a> for expertise ranking in the Social Web. My co-worker <a href="http://www.albertauyeung.com/">Ching-man Au Yeung</a> from the University of Southampton and I just returned from this year&#8217;s <a href="http://www.sigir2009.org/">ACM SIGIR conference</a> where we presented this joint work.

The TR article <a href="http://www.technologyreview.com/web/23100/">A Better Way to Rank Expertise Online</a> summarizes our approach quite well, and we&#8217;re really happy about the very positive feedback therein from researchers like <a href="http://en.wikipedia.org/wiki/Jon_Kleinberg">Professor Jon Kleinberg</a> (the inventor of the <a href="http://en.wikipedia.org/wiki/HITS_algorithm">HITS algorithm</a> on which our SPEAR algorithm is based), <a href="http://www.redlog.net/">Scott Golder</a> (his publication together with Huberman on usage patterns in collaborative tagging systems is one of the earliest and currently most cited works in the area of the Social Web) and <a href="http://isiosf.isi.it/~cattuto/">Ciro Cattuto</a> (very nice work on social dynamics and network characteristics of folksonomies). Thanks!

<!--more-->
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Telling Experts from Spammers: Expertise Ranking in Folksonomies]]></title>
    <link href="http://www.michael-noll.com/blog/2009/06/05/telling-experts-from-spammers-expertise-ranking-in-folksonomies/"/>
    <updated>2009-06-05T00:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2009/06/05/telling-experts-from-spammers-expertise-ranking-in-folksonomies</id>
    <content type="html"><![CDATA[My paper <em>Telling Experts from Spammers: Expertise Ranking in Folksonomies</em>, a joint work written together with fellow Ph.D. candidate Ching-Man Au Yeung from the University of Southampton, has been accepted for publication and presentation at this year’s <a href="http://www.sigir2009.org/">International ACM SIGIR Conference</a> which will be held in Boston, USA, from July 19 to 23, 2009.<!--more-->
<br />
<br />
<br />
<div style="border:2px solid red; padding: 10px 5px;"><strong>Update:</strong> Head over to the dedicated <a href="http://www.spear-algorithm.org/">SPEAR algorithm website</a> for more information!</div>
<h1>Abstract</h1>
With a suitable algorithm for ranking the expertise of a user in a collaborative tagging system, we will be able to identify experts and discover useful and relevant resources through them. We propose that the level of expertise of a user with respect to a particular topic is mainly determined by two factors. Firstly, an expert should possess a high quality collection of resources, while the quality of a Web resource depends on the expertise of the users who have assigned tags to it. Secondly, an expert should be one who tends to identify interesting or useful resources before other users do.

We propose a graph-based algorithm, <em>SPEAR (SPamming-resistant Expertise Analysis and Ranking)</em>, which implements these ideas for ranking users in a folksonomy. We evaluate our method with experiments on data sets collected from Delicious.com comprising over 71,000 Web documents, 0.5 million users and 2 million shared bookmarks. We also show that the algorithm is more resistant to spammers than other methods such as the original HITS algorithm and simple statistical measures.
<h1>Full Paper &amp; Presentation</h1>
<ul>
	<li>M. G. Noll, C.-M. Au Yeung, N. Gibbins, C. Meinel, N. Shadbolt
<a href="http://www.michael-noll.com/blog/uploads/Michael-Noll_Telling-Experts-from-Spammers_SIGIR_2009.pdf">Telling Experts from Spammers: Expertise Ranking in Folksonomies</a>
Proceedings of 32nd ACM SIGIR Conference, Boston, USA, July 2009, pp. 612-619, <a href="../../wiki/Special:BookSources/9781605584836">ISBN 978-1-60558-483-6</a> (<a title="http://portal.acm.org/citation.cfm?id=1571941.1572046" href="http://portal.acm.org/citation.cfm?id=1571941.1572046">ACM Link</a>, <a href="http://www.michael-noll.com/blog/uploads/sigir2009.bib">BibTeX</a>)</li>
	<li>Presentation: <a href="https://s3.amazonaws.com/cdn.michael-noll.com/talks/Michael-Noll_Talk_Telling-Experts-from-Spammers_SIGIR_2009.pdf">Telling Experts from Spammers (Talk)</a>, our talk at SIGIR 2009</li>
</ul>
<h1>Related Links</h1>
<ul>
	<li><a href="http://www.spear-algorithm.org/">The Spear Algorithm</a> - website dedicated to SPEAR</li>
	<li><a href="../../wiki/Publications">List of my publications</a></li>
	<li><a href="http://www.technologyreview.com/web/23100/">Article on Technology Review</a> about this work: <a href="http://www.technologyreview.com/web/23100/" target="_blank">A Better Way to Rank Expertise Online</a>, July 2009</li>
	<li><a href="http://www.sigir2009.org/" target="_blank">32nd Annual ACM SIGIR Conference</a>, Boston, USA, July 2009</li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Article published in Python Magazine]]></title>
    <link href="http://www.michael-noll.com/blog/2009/03/13/article-published-in-python-magazine/"/>
    <updated>2009-03-13T00:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2009/03/13/article-published-in-python-magazine</id>
    <content type="html"><![CDATA[A couple of months ago Doug Hellmann, the chief editor of <a href="http://pythonmagazine.com/c/issue/view/92">Python Magazine</a>, invited me to write a featured article. I&#8217;m happy to share the good news that my article <a href="http://pythonmagazine.com/c/issue/view/92"><em>Writing a Personal Link Recommendation Engine</em></a> has finally been published in this year&#8217;s February issue.

<!--more--><img class="size-full wp-image-270" title="Cover of Python Magazine, February 2009" src="http://www.michael-noll.com/blog/uploads/pymag_february2009.jpg" alt="Cover of Python Magazine, Volume 3(2), February 2009" width="200" height="258" />

Here&#8217;s a brief teaser:
<blockquote><strong>Writing a Personal Link Recommendation Engine</strong>

There is so much going on in the Internet today that it’s hard to keep track of it all. Whether you need to stay up to date with job-related information, the current state of research, or the latest developments in the Open Source world, finding relevant information quickly on the Internet is difficult. So-called “social” services such as Delicious.com or Digg.com try to support users by allowing them to collaboratively share their knowledge about interesting web sites. In this article, we will design and implement a simple, yet powerful link recommender whose analysis is based solely on Delicious.com social bookmarks information.</blockquote>
Interested? <a href="http://pythonmagazine.com/c/issue/view/92">Head over to PyMag</a> and have a look!
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[WI 2008 and SITIS 2008 papers available for download]]></title>
    <link href="http://www.michael-noll.com/blog/2009/01/06/wi-2008-and-sitis-2008-papers-available-for-download/"/>
    <updated>2009-01-06T00:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2009/01/06/wi-2008-and-sitis-2008-papers-available-for-download</id>
    <content type="html"><![CDATA[As promised, my papers <a href="../2008/09/05/the-metadata-triumvirate-social-annotations-anchor-texts-and-search-queries/">The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries</a> (IEEE/WIC/ACM WI 2008) and <a href="../2008/09/17/building-a-scalable-collaborative-web-filter-with-free-and-open-source-software/">Building a Scalable Collaborative Web Filter with Free and Open Source Software</a> (IEEE SITIS 2008) are available for download. Enjoy!
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[CABS120k08: Data Corpus for Research in the Web 2.0, November 2008]]></title>
    <link href="http://www.michael-noll.com/blog/2008/12/02/cabs120k08-data-corpus-for-research-in-the-web-20-november-2008/"/>
    <updated>2008-12-02T00:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2008/12/02/cabs120k08-data-corpus-for-research-in-the-web-20-november-2008</id>
    <content type="html"><![CDATA[My <a href="http://www.michael-noll.com/wiki/CABS120k08">CABS120k08</a> research data set is now available for <a href="http://www.michael-noll.com/wiki/CABS120k08#Download">download</a>.

<a href="http://www.michael-noll.com/wiki/CABS120k08">CABS120k08</a> is a large research data set about Web metadata based on a sample of 120,000 web documents in 2008 (=120k08) with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks.

The data corpus is described in detail in my paper <a href="../2008/09/05/the-metadata-triumvirate-social-annotations-anchor-texts-and-search-queries/">The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries</a>, for which the corpus was built. Enjoy!

<!--more-->
]]></content>
  </entry>
  
</feed>
