CABS120k08 is a large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, and anchor text from incoming hyperlinks.

This web page describes the CABS120k08 data set as published in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries at the 7th IEEE/WIC/ACM International Conference on Web Intelligence (WI), Australia, December 2008.

The CABS120k08 data set contains 117,434 URLs with additional metadata, based on an intersection of the AOL Search query log corpus called AOL500k from 2006 and the Open Directory Project (DMOZ).

The randomly sampled AOL500k corpus is one of the largest publicly available collections of search queries today. It consists of 20 million web queries collected from 650,000 users on AOL Search over three months in 2006. As a result of these search queries, 1.6 million different web documents were visited by users.

The Open Directory Project is by its own account “the largest, most comprehensive human-edited directory of the Web”. At the time I built the CABS120k08 corpus, the Open Directory contained about 4.8 million web documents in about 590,000 categories.

To create CABS120k08, I intersected the URLs (= web documents) in AOL500k with those in the Open Directory. Only web documents that were both searched for and subsequently visited (AOL500k) and categorized (Open Directory) made it into the data set.
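
The intersection step can be illustrated with a minimal Python sketch. The URLs, variable names and the function `intersect_urls` are made up for illustration, and the sketch assumes URLs are compared as exact strings; any URL normalization applied when building the actual corpus is not described on this page.

```python
# Hypothetical sample URLs; the real corpora are far larger.
aol_clicked_urls = {
    "http://example.com/a",
    "http://example.org/b",
    "http://example.net/c",
}
dmoz_urls = {
    "http://example.org/b",
    "http://example.net/c",
    "http://example.edu/d",
}


def intersect_urls(searched, categorized):
    """Return the URLs that occur in both collections (exact string match)."""
    return set(searched) & set(categorized)


corpus_urls = intersect_urls(aol_clicked_urls, dmoz_urls)
print(sorted(corpus_urls))
```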

The Data Set

This section gives you a short overview of CABS120k08. The corpus is described in detail in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries, for which the corpus was built. The paper includes both a quantitative and qualitative analysis of CABS120k08.

Overview Total Comment
Total documents 117,434  
Total categories 84,663  
Total searches 2,617,326  
Total anchor texts 2,242,621  
Total users 3,383,571  
Total bookmarks 1,289,563 unique: 9.1 %
Total tags 3,383,571 unique: 26.3 %
Categorized documents* 117,434 100.0 %
Searched documents* 117,434 100.0 %
Anchored documents 95,230 81.1 %
Bookmarked documents 59,126 50.3 %
Tagged documents 56,457 48.1 %

*100% by definition, i.e. the way I created the CABS120k08 data set

Estimated probabilities prob.
P(bookmarked ∩ anchor text) 0.467
P(tagged ∩ anchor text) 0.447
P(bookmarked | anchor text) 0.575
P(tagged | anchor text) 0.552
P(anchor text | bookmarked) 0.927
P(anchor text | tagged) 0.930
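
These probabilities are related by P(A | B) = P(A ∩ B) / P(B), where 0.467 and 0.447 are the joint probabilities of a document being bookmarked (or tagged) and having anchor text. As a rough sanity check, the following sketch recomputes the conditional values from the document counts in the overview table. The variable names are my own, and small deviations in the third decimal place are expected because the published figures are rounded.

```python
# Document counts taken from the overview table.
total_docs = 117_434
anchored_docs = 95_230
bookmarked_docs = 59_126
tagged_docs = 56_457

# Marginal probabilities estimated from the counts.
p_anchor = anchored_docs / total_docs
p_bookmarked = bookmarked_docs / total_docs
p_tagged = tagged_docs / total_docs

# Joint probabilities as published (rounded to three decimals).
p_bookmarked_and_anchor = 0.467
p_tagged_and_anchor = 0.447


def conditional(p_joint, p_condition):
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_condition


print("P(bookmarked | anchor text) ~", round(conditional(p_bookmarked_and_anchor, p_anchor), 3))
print("P(tagged | anchor text)     ~", round(conditional(p_tagged_and_anchor, p_anchor), 3))
print("P(anchor text | bookmarked) ~", round(conditional(p_bookmarked_and_anchor, p_bookmarked), 3))
print("P(anchor text | tagged)     ~", round(conditional(p_tagged_and_anchor, p_tagged), 3))
```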

Information Sources

For each URL, I retrieved and compiled the following metadata:

  • popularity*: represented by Google PageRank
  • categorization: Open Directory categories (generally, only one category per URL)
  • search queries from AOL500k corpus
  • incoming hyperlinks with anchor text (via Google search for “link:…” followed by a targeted web crawl and HTML parsing)
  • metadata by readers/visitors: bookmarks and tags

* “general” popularity as computed/interpreted by search engines

Tools of the Trade

I implemented custom software tools which relied on the services’ official APIs where possible and fell back to alternative techniques for situations where the APIs did not provide the required functionality.

Data Format

The corpus is in XML format with the following structure.


Each web document is represented as a <document> with the following attributes:

url: URL of the web document
categories: number of Open Directory categories
searches: number of searches on AOL Search in AOL500k
inlinks: number of incoming hyperlinks with anchor text
users: number of users who have bookmarked this URL (equals the number of bookmarks of this URL)
top_tags: number of top tags for this URL, i.e. the most popular tags (maximum: 10)
tags: number of unique tags for this URL
Google PageRank of the URL
<category> elements
Open Directory categories (see below)
<top_tag> elements
the top tags in detail (see below)
<bookmark> elements
bookmarks of users including tags (see below)


Categories

Each <document> contains 1 or more <category> elements, which denote the document URL’s categories as assigned by the Open Directory Project.

name: full category information with sub-categories separated by "/", e.g. "top/sports/golf/courses"

Search Queries

Each <document> contains 1 or more <search> elements, which denote the search queries from AOL500k where this document was visited as a result of the search.

query: the search keywords for the query
aol500k_id: the searching user's anonymized ID in AOL500k
date: the date of the search in YYYY-MM-DD format
time: the time of day of the search in HH:MM:SS format
rank: the document's position in the corresponding search result list when it was clicked
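
For time-based analyses it can be convenient to merge the separate date and time values of a search into a single timestamp. The following is a minimal sketch using Python's standard library; the helper name `search_timestamp` is hypothetical, and the sketch makes no assumption about the time zone of the logged values, so the result is a naive `datetime`.

```python
from datetime import datetime


def search_timestamp(date_str, time_str):
    """Combine a YYYY-MM-DD date and an HH:MM:SS time into one naive datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")


# Values taken from one of the <search> elements in the example on this page.
ts = search_timestamp("2006-03-23", "18:31:58")
print(ts.isoformat())
```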

Incoming anchor texts

Each <document> may contain 1 or more <inlink> elements, which denote any incoming hyperlinks with anchor texts for this document.

anchor_text: the anchor text of the incoming hyperlink

Top tags

Each <document> may contain 1 or more <top_tag> elements, which represent the top tags for a URL in detail.

name: name of the tag, e.g. "news"
count: number of times this tag has been used by users

Note: the number of <top_tag> elements of a URL is equal to the document’s “top_tags” attribute value (see above).

User bookmarks

Each <document> may contain 1 or more <bookmark> elements, which represent the users’ bookmarks of the document.

user: name of the user who bookmarked the URL
tags: comma-separated list of tags with which the user annotated the bookmark
date: creation date of the bookmark in YYYY-MM format

Note that I could have stored the user’s tags as separate XML elements, but this would have been overkill and increased the corpus file size without a real benefit.
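
Because the tags of a bookmark are stored as one comma-separated string, they have to be split before use. A minimal sketch follows; the helper name `split_tags` is hypothetical, and the sketch assumes that individual tags do not themselves contain commas.

```python
def split_tags(tags_attr):
    """Split a comma-separated tag string into a list of trimmed, non-empty tags."""
    return [tag.strip() for tag in tags_attr.split(",") if tag.strip()]


# Tag strings as they appear in the example on this page.
print(split_tags("pedagogy, teaching"))
print(split_tags("edumags"))
```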


Example

<document url="" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" page
    <category name="top/reference/education/journals" />
    <search query="united states preschool teachers and statistics" aol500k_id="807613" date="2006-03-23" time="18:31:58" rank="12" />
    <search query="nclb and kindergarten" aol500k_id="7516545" date="2006-03-12" time="16:58:12" rank="16" />
    <search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="Home" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="" />
    <top_tag name="education" count="5" />
    <top_tag name="newsletter" count="2" />
    <top_tag name="research" count="3" />
    <bookmark user="mohandas" tags="edumags" date="2005-07" />
    <bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
    <bookmark user="lllnelson2004" tags="edl600, edl671" date="2006-02" />
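
A record like the one above can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is an abridged copy of the example, so its count attributes do not match the number of child elements shown; its `url` value is a placeholder, and the closing `</document>` tag is assumed. The function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged record; the url value is a placeholder, not taken from the corpus.
SAMPLE = """
<document url="http://example.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9">
    <category name="top/reference/education/journals" />
    <search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="Home" />
    <top_tag name="education" count="5" />
    <top_tag name="research" count="3" />
    <bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
</document>
"""


def summarize(doc):
    """Collect the main metadata of one <document> element into a dict."""
    return {
        "url": doc.get("url"),
        "categories": [c.get("name") for c in doc.findall("category")],
        "queries": [s.get("query") for s in doc.findall("search")],
        "anchor_texts": [a.get("anchor_text") for a in doc.findall("inlink")],
        "top_tags": {t.get("name"): int(t.get("count")) for t in doc.findall("top_tag")},
        "bookmark_users": [b.get("user") for b in doc.findall("bookmark")],
    }


summary = summarize(ET.fromstring(SAMPLE.strip()))
print(summary["categories"])
print(summary["top_tags"])
```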


By downloading, you acknowledge that:

  • The data has been compiled for the purposes of illustration for scientific research.
  • The copyright holders retain ownership and reserve all rights.

Download: CABS120k08 data corpus (86 MB, GZIP compressed)
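
Since the download is GZIP compressed and fairly large, it may be preferable to stream it rather than load it into memory at once. The sketch below uses Python's `gzip` module together with `ElementTree.iterparse`. It assumes that the `<document>` elements sit inside a single root element of the file; the function names and the file path passed to them are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_documents(path):
    """Yield <document> elements one at a time from a gzip-compressed XML file."""
    with gzip.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "document":
                yield elem
                # Drop the element's attributes and children once the caller
                # has moved on, to keep memory usage low.
                elem.clear()


def count_documents(path):
    """Count the <document> elements in the corpus file."""
    return sum(1 for _ in iter_documents(path))
```

Note that each yielded element is cleared as soon as the next one is requested, so any values needed later should be copied out inside the loop.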

How to reference

When you use CABS120k08 for your own research, please use my corresponding paper as reference.


Comments, questions and constructive feedback are always welcome. Just drop me a note.

Related Research Data Sets

  • DMOZ100k06 (published 2007)
    Large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from!, Google, and ICRA.