My CABS120k08 research data set is now available for download.

CABS120k08 is a large research data set about Web metadata, based on a sample of 120,000 web documents from 2008 (hence "120k08"). It combines data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, and anchor text from incoming hyperlinks.

The corpus is described in detail in my paper "The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries", for which it was built. Enjoy!
