About this entry
- Author:
- Michael G. Noll
- Published:
- Dec 02, 2008
- Last updated:
- Jul 20, 2010
- Tags:
- acm, anchor texts, aol500k, backlinks, bookmarks, cabs120k078, cabs120k08, categories, categorization, corpus, del.icio.us, dmoz, doceng, google, incoming hyperlinks, inlinks, metadata, open directory project, pagerank, paper, papers, publication, Publications, random-sample, Research, search queries, social-bookmarking, social-tagging, study, tagging, web2.0
CABS120k08: Data Corpus for Research in the Web 2.0, November 2008
My CABS120k08 research data set is now available for download.
CABS120k08 is a large research data set about Web metadata, based on a sample of 120,000 web documents taken in 2008 (hence “120k08”). It combines data retrieved from the Open Directory Project, the AOL search query log corpus AOL500k, Google PageRank, and Delicious.com with anchor text from incoming hyperlinks.
The data corpus is described in detail in my paper “The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries”, for which the corpus was built. Enjoy!