Michael G. Noll

Applied Research. Big Data. Distributed Systems. Open Source.

CABS120k08: Data Corpus for Research in the Web 2.0, November 2008

My CABS120k08 research data set is now available for download. CABS120k08 is a large research data set of Web metadata, based on a sample of 120,000 web documents in 2008 (hence 120k08). It combines data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks. The corpus was built for my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries, which describes it in detail. Enjoy!