Michael G. Noll

Applied Research. Big Data. Distributed Systems. Open Source.

Reading and Writing Avro Files From the Command Line


Apache Avro is becoming one of the most popular data serialization formats, particularly on Hadoop-based big data platforms, because tools like Pig, Hive and of course Hadoop itself natively support reading and writing data in Avro format. Many users seem to enjoy Avro, but I have heard many complaints about not being able to conveniently read or write Avro files with command line tools: “Avro is nice, but why do I have to write Java or Python code just to quickly see what’s in a binary Avro file, or at least discover its Avro schema?”

It comes as a surprise to those users that Avro actually ships with exactly such command line tools; apparently they are just not prominently advertised or documented. In this short article I will show a few hands-on examples of how to read, write, compress and convert data from and to binary Avro using Avro Tools 1.7.4.

Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node


In this article I describe how to install, configure and run a multi-broker Apache Kafka 0.8 (trunk) cluster on a single machine. The final setup consists of one local ZooKeeper instance and three local Kafka brokers. We will test-drive the setup by sending messages to the cluster via a console producer and receiving those messages via a console consumer. I will also describe how to build Kafka for Scala 2.9.2, which makes it much easier to integrate Kafka with other Scala-based frameworks and tools that require Scala 2.9 instead of Kafka’s default Scala 2.8.

Bootstrapping a Java Project With Gradle, TestNG, Mockito and Cobertura for Eclipse and Jenkins


When starting out with a fresh Java project, one of the nuisances you have to deal with is setting up your build and test environment. It’s even more troublesome if you are trying to switch from Maven to Gradle for your builds. In this article I will provide you with a bootstrap Java project that is backed by Gradle, TestNG, Mockito, FEST-Assert 2 and Cobertura. It took me quite a while to wire everything together and fix issues such as hidden dependencies and early-bird Cobertura support, so I hope you find this information useful. The included example code illustrates how each of these packages is used. Lastly, I will cover how the bootstrap project integrates with Eclipse and Jenkins.

Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm


A common pattern in real-time data workflows is performing rolling counts of incoming data points, also known as sliding window analysis. A typical use case for rolling counts is identifying trending topics in a user community – such as on Twitter – where a topic is considered trending when it has been among the top N topics in a given window of time. In this article I will describe how to implement such an algorithm in a distributed and scalable fashion using the Storm real-time data processing platform. The same code can also be used in other areas such as infrastructure and security monitoring.
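To make the idea of a rolling count concrete, here is a minimal single-process sketch in Python. It is not the distributed Storm implementation the article covers; it only illustrates the sliding-window idea under the assumption that the window is split into a fixed number of time slots. The names `RollingCounter`, `increment`, `advance` and `top_n` are hypothetical and chosen for illustration only.

```python
from collections import Counter, deque


class RollingCounter:
    """Counts topics over a sliding window made of a fixed number of slots.

    Hypothetical helper for illustration; not taken from Storm.
    """

    def __init__(self, num_slots):
        if num_slots < 1:
            raise ValueError("num_slots must be at least 1")
        # One Counter per slot. With maxlen set, appending to a full deque
        # silently discards the oldest slot, which is what slides the window.
        self.slots = deque([Counter()], maxlen=num_slots)

    def increment(self, topic):
        # New observations always land in the most recent slot.
        self.slots[-1][topic] += 1

    def advance(self):
        # Start a new slot, e.g. once per time interval.
        self.slots.append(Counter())

    def top_n(self, n):
        # Sum the counts of all slots currently inside the window.
        total = Counter()
        for slot in self.slots:
            total.update(slot)
        return total.most_common(n)
```

A caller would invoke `increment` for every incoming data point, call `advance` on a timer (one call per slot interval), and query `top_n` whenever the current trending topics are needed. Counts recorded more than `num_slots` intervals ago no longer contribute to the result.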

Understanding the Parallelism of a Storm Topology


In the past few days I have been test-driving Twitter’s Storm project, which is a distributed real-time data processing platform. One of my findings so far has been that the quality of Storm’s documentation and example code is pretty good: it is very easy to get up and running with Storm. Big props to the Storm developers! At the same time, I found the sections on how a Storm topology runs in a cluster not perfectly clear, and learned that recent releases of Storm changed some of its behavior in a way that is not yet fully reflected in the Storm wiki and in the API docs.

In this article I want to share my own understanding of the parallelism of a Storm topology after reading the documentation and writing some first prototype code. More specifically, I describe the relationships between worker processes, executors (threads) and tasks, and how you can configure them according to your needs. This article is based on Storm release 0.8.1, the latest version as of October 2012.