Hadoop
From Michael G. Noll
In my PhD project, I use Hadoop quite a lot. Hadoop is a Yahoo-sponsored open source framework for distributed computing and data storage, similar to Google's MapReduce and GFS. If you are a passionate software developer, you should definitely give it a try. I know you want it! ;-)
Tutorials
If you are interested in Hadoop, these tutorials will get you started. Enjoy!
- Writing An Hadoop MapReduce Program In Python
You don't need to use Java to write a Hadoop application -- any programming language will do. Here, I will showcase MapReduce in Python.
- Running Hadoop On Ubuntu Linux (Single-Node Cluster)
Quick start for those interested in running Hadoop for the first time, or those who want to run a test installation of Hadoop on their development machine. In addition, I've been told that students like to use the single-node setup for their course work.
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
How to set up a "real" Hadoop cluster for distributed computing and data storage, if only for bragging rights!
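To give a flavor of what MapReduce in Python can look like, here is a minimal word-count sketch. It is not the code from the tutorial: the function names (`mapper`, `reducer`, `run_local`) are hypothetical, and the sort step only simulates, in a single process, the grouping by key that Hadoop would perform between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    """Sum the counts per word; pairs must arrive sorted by word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce locally (assumption: no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))
```

On a real cluster, the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; the local version above is only meant for trying out the logic on a development machine.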
Use Cases
I have used Hadoop for a variety of things such as web crawling, creation of research data sets, and straightforward data analysis. Here are some examples from my personal experience. You might also want to check the Hadoop website or the Hadoop user mailing list to find out what other problems can be, and are being, solved with Hadoop.
- Building a Scalable Collaborative Web Filter with Free and Open Source Software
Publication at IEEE Signal-Image Technology & Internet-based Systems (SITIS), 2008. In this case study, I describe the design and architecture of a scalable collaborative web filtering service, which is powered by free and open source software. The described system components include Pylons, MySQL, Tokyo Cabinet/Tokyo Tyrant, Pylog (custom app based on Twisted), Hadoop, and on the client side a Firefox Add-On. Design and implementation done by yours truly.
- CABS120k08
Large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks. The creation and characteristics of CABS120k08 are described in detail in my paper "The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries" at IEEE/WIC/ACM Web Intelligence, 2008.
Note: I used my DeliciousAPI for scraping the data from Delicious.com.
- DMOZ100k06
Large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from Delicious.com, Google, and ICRA. The creation and characteristics of DMOZ100k06 are described in detail in my paper "Authors vs. Readers: A Comparative Study of Document Metadata and Content in the WWW" at ACM Symposium on Document Engineering, 2007.
Note: I used my DeliciousAPI for scraping the data from Delicious.com.