Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, a JVM library for working with such algebraic data structures. Algebird is already used in Big Data tools such as Scalding and Summingbird, which means you can use it as a mechanism to plug your own data structures (e.g. Bloom filters, HyperLogLog) directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question of why you should be interested in them in the first place.
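To give a flavor of why monoids matter for distributed processing, here is a minimal sketch in plain Python. It does not use Algebird or mirror its API; the names `Monoid`, `int_add`, `set_union`, and `chunked_sum` are made up for illustration. The idea it shows is that a monoid is an associative combine operation plus an identity element, so partial results computed on separate chunks (think mappers or bolts) can be merged later without changing the answer.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """Hypothetical helper: an identity element plus an associative operation."""
    zero: T
    plus: Callable[[T, T], T]

    def sum(self, items: Iterable[T]) -> T:
        # Fold all items together, starting from the identity element.
        return reduce(self.plus, items, self.zero)


# Two example monoids: integer addition and set union.
int_add = Monoid(0, lambda a, b: a + b)
set_union = Monoid(frozenset(), lambda a, b: a | b)


def chunked_sum(monoid: Monoid, items: List, chunk_size: int):
    """Simulate a distributed aggregation: sum each chunk, then merge the partials."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [monoid.sum(chunk) for chunk in chunks]
    return monoid.sum(partials)


numbers = list(range(1, 11))
# Associativity guarantees the chunked result equals the sequential result.
assert chunked_sum(int_add, numbers, 3) == int_add.sum(numbers)
```

Because the operation is associative, how the input is split into chunks does not affect the result. Data structures such as Bloom filters and HyperLogLog fit the same pattern, since two instances can be merged into one.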
So you have your first distributed Storm cluster installed and your first topologies up and running. Great! Now you want to integrate your Storm applications with your monitoring systems and begin tracking application-level metrics from your topologies. In this article I show you how to integrate Storm with the popular Graphite monitoring system. This, combined with the Storm UI, will provide you with actionable information to tune the performance of your topologies and also help you track key business and technical metrics.
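As a rough idea of what reporting a metric to Graphite involves, the sketch below formats a data point in Graphite's plaintext protocol (`<metric path> <value> <timestamp>` followed by a newline) and sends it over TCP. This is not the integration described in the article: the metric name, the `localhost` host, and the helper names `format_metric` and `send_metric` are assumptions for illustration, and port 2003 is the usual default of Carbon's plaintext listener, which your installation may have changed.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send a formatted line to a Carbon plaintext listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric name for a counter tracked inside a topology.
line = format_metric("storm.wordcount.tuples_processed", 42, 1700000000)
# send_metric(line)  # uncomment once a Carbon daemon is reachable
```

In a real topology you would typically batch such updates and send them periodically rather than opening a connection for every tuple.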
Understanding how a Hadoop cluster is actually used in practice is paramount to managing and operating it properly. In this article I introduce Replephant, an open source Clojure library for interactive analysis of Hadoop cluster usage via the REPL and for generating usage reports.
Apache Avro is a very popular data serialization format in the Hadoop technology stack. In this article I show code examples of MapReduce jobs in Java, Hadoop Streaming, Pig and Hive that read and/or write data in Avro format. We will use a small, Twitter-like data set as input for our example MapReduce jobs.
When you are optimizing the performance of your Storm topologies, it helps to understand how Storm’s internal message queues are configured and put to use. In this short article I will explain and illustrate how Storm versions 0.8 and 0.9 implement the intra-worker communication that happens within a worker process and its associated executor threads.