CABS120k08 is a large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks.

This web page describes the CABS120k08 data set as published in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries at the 7th IEEE/WIC/ACM International Conference on Web Intelligence (WI), Australia, December 2008.

The CABS120k08 data set contains 117,434 URLs with additional metadata, based on an intersection of the AOL Search query log corpus called AOL500k from 2006 and the Open Directory Project (DMOZ).

The randomly sampled AOL500k corpus is one of the largest publicly available collections of search queries today. It consists of 20 million web queries collected from 650,000 users on AOL Search over three months in 2006. As a result of these search queries, users visited 1.6 million different web documents.

The Open Directory Project is by its own account “the largest, most comprehensive human-edited directory of the Web”. At the time I built the CABS120k08 corpus, the Open Directory contained about 4.8 million web documents in about 590,000 categories.

To create CABS120k08, I computed the intersection of the URLs (= web documents) in AOL500k and in the Open Directory. A web document made it into the data set only if it was both searched for and subsequently visited (AOL500k) and categorized (Open Directory).
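
The intersection step can be pictured as a plain set operation on URLs. The following is only a minimal sketch and not the original tooling: the `normalize` rule and the sample URLs are assumptions made up for illustration.

```python
def normalize(url: str) -> str:
    """Hypothetical normalization: trim whitespace and a trailing slash."""
    return url.strip().rstrip("/")


def intersect_urls(aol_urls, dmoz_urls):
    """Return the URLs that occur in both collections after normalization."""
    return {normalize(u) for u in aol_urls} & {normalize(u) for u in dmoz_urls}


# Made-up sample data, not taken from either corpus.
aol_clicked = ["http://www.example.org/", "http://news.example.com"]
dmoz_listed = ["http://www.example.org", "http://directory.example.net/"]

common = intersect_urls(aol_clicked, dmoz_listed)
print(sorted(common))
```

How URLs are normalized before comparison directly affects the size of the intersection, so any real rule would need more care than this one.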

The Data Set

This section gives you a short overview of CABS120k08. The corpus is described in detail in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries, for which the corpus was built. The paper includes both a quantitative and qualitative analysis of CABS120k08.

Overview	Total	Comment
Total documents	117,434
Total categories	84,663
Total searches	2,617,326
Total anchor texts	2,242,621
Total users	3,383,571
Total bookmarks	1,289,563	unique: 9.1 %
Total tags	3,383,571	unique: 26.3 %
Categorized documents*	117,434	100.0 %
Searched documents*	117,434	100.0 %
Anchored documents	95,230	81.1 %
Bookmarked documents	59,126	50.3 %
Tagged documents	56,457	48.1 %

*100% by definition, i.e. the way I created the CABS120k08 data set

Estimated probabilities	prob.
P(bookmarked ∩ anchor text)	0.467
P(tagged ∩ anchor text)	0.447
P(bookmarked | anchor text)	0.575
P(tagged | anchor text)	0.552
P(anchor text | bookmarked)	0.927
P(anchor text | tagged)	0.930
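
The conditional probabilities are tied to the document counts in the overview table through Bayes' rule. The sketch below recomputes the values listed as 0.575 and 0.552 from the counts and the two P(anchor text | …) figures. Small deviations are to be expected because the inputs are already rounded to three decimals. The helper name `bayes` is my own.

```python
# Document counts from the overview table.
TOTAL = 117_434
ANCHORED = 95_230
BOOKMARKED = 59_126
TAGGED = 56_457

# Published conditionals, rounded to three decimals.
P_ANCHOR_GIVEN_BOOKMARKED = 0.927
P_ANCHOR_GIVEN_TAGGED = 0.930


def bayes(p_b_given_a: float, count_a: int, count_b: int) -> float:
    """P(A | B) = P(B | A) * P(A) / P(B), with P(A) and P(B) estimated from counts."""
    p_a = count_a / TOTAL
    p_b = count_b / TOTAL
    return p_b_given_a * p_a / p_b


p_bookmarked_given_anchor = bayes(P_ANCHOR_GIVEN_BOOKMARKED, BOOKMARKED, ANCHORED)
p_tagged_given_anchor = bayes(P_ANCHOR_GIVEN_TAGGED, TAGGED, ANCHORED)

print(f"P(bookmarked | anchor text) ~ {p_bookmarked_given_anchor:.3f}")
print(f"P(tagged | anchor text)     ~ {p_tagged_given_anchor:.3f}")
```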

Information Sources

For each URL, I retrieved and compiled the following metadata:

popularity*: represented by Google PageRank (www.google.com)
categorization: Open Directory categories (generally, only one category per URL)
search queries from AOL500k corpus
incoming hyperlinks with anchor text (via Google search for “link:…” followed by a targeted web crawl and HTML parsing)
metadata by readers/visitors: Delicious.com bookmarks and tags

* “general” popularity as computed/interpreted by search engines

Tools of the Trade

I implemented custom software tools which relied on the services’ official APIs where possible and fell back to alternative techniques for situations where the APIs did not provide the required functionality:

Python: custom scripts and modules, most notably my Python API for Delicious
Perl: small script using WWW::Google::PageRank
Unix Swiss Army knife: grep, sed, awk

Data Format

The corpus is in XML format with the following structure.

Documents

Each web document is represented as a <document> with the following attributes:

url: URL of the web document
categories: number of Open Directory categories
searches: number of searches on AOL Search in AOL500k
inlinks: number of incoming hyperlinks with anchor text
users: number of Delicious.com users who have bookmarked this URL (equals the number of bookmarks of this URL at Delicious.com)
top_tags: number of Delicious.com top tags for this URL, i.e. the most popular tags (maximum: 10)
tags: number of unique Delicious.com tags for this URL
pagerank: Google PageRank of the URL
<category> elements: Open Directory categories (see below)
<top_tag> elements: the Delicious.com top tags in detail (see below)
<bookmark> elements: bookmarks of Delicious.com users including tags (see below)

Search Queries

Each <document> contains 1 or more <search> elements, which denote the search queries from AOL500k where this document was visited as a result of the search.

query: the search keywords for the query
aol500k_id: the searching user's anonymized ID in AOL500k
date: the date of the search in YYYY-MM-DD format
time: the time of day of the search in HH:MM:SS format
rank: the document's position in the corresponding search result list when it was clicked
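
All attribute values arrive as strings, so a consumer will usually want to convert them. As a hedged illustration, the sketch below parses one <search> element (taken from the example further down this page) with Python's standard library and merges date and time into a single timestamp. The function name `parse_search` and the returned dictionary layout are my own choices for this sketch.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# One <search> element taken from the example on this page.
SAMPLE = ('<search query="harvard education letters" aol500k_id="2229594" '
          'date="2006-03-21" time="01:43:37" rank="4" />')


def parse_search(elem):
    """Convert the string attributes of a <search> element into typed values."""
    return {
        "query": elem.get("query"),
        "aol500k_id": elem.get("aol500k_id"),
        # date (YYYY-MM-DD) and time (HH:MM:SS) combined into one datetime
        "timestamp": datetime.strptime(
            f'{elem.get("date")} {elem.get("time")}', "%Y-%m-%d %H:%M:%S"),
        "rank": int(elem.get("rank")),
    }


search = parse_search(ET.fromstring(SAMPLE))
print(search["timestamp"].isoformat(), search["rank"])
```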

Incoming anchor texts

Each <document> may contain 1 or more <inlink> elements, which denote any incoming hyperlinks with anchor texts for this document.

anchor_text: the anchor text of the incoming hyperlink

Delicious.com top tags

Each <document> may contain 1 or more <top_tag> elements, which represent the Delicious.com top tags for a URL in detail.

name: name of the tag, e.g. "news"
count: number of times this tag has been used by users

Note: the number of <top_tag> elements of a URL is equal to the document’s “top_tags” attribute value (see above).

Delicious.com user bookmarks

Each <document> may contain 1 or more <bookmark> elements, which represent the users’ bookmarks of the document.

user: name of the Delicious.com user who bookmarked the URL
tags: comma-separated list of tags with which the user annotated the bookmark
date: creation date of the bookmark in YYYY-MM format

Note that I could have stored the user’s tags as separate XML elements, but this would have been overkill and increased the corpus file size without a real benefit.
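
Because the tags are packed into one attribute, they have to be split on the consumer side. A minimal sketch, assuming that individual tags never contain a comma themselves and that whitespace around the separators can be discarded. `parse_bookmark` is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# One <bookmark> element taken from the example on this page.
SAMPLE = '<bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />'


def parse_bookmark(elem):
    """Return (user, list of tags, (year, month)) for a <bookmark> element."""
    # Split the comma-separated list and drop surrounding whitespace and empties.
    tags = [t.strip() for t in elem.get("tags", "").split(",") if t.strip()]
    created = datetime.strptime(elem.get("date"), "%Y-%m")
    return elem.get("user"), tags, (created.year, created.month)


user, tags, year_month = parse_bookmark(ET.fromstring(SAMPLE))
print(user, tags, year_month)
```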

Example

<documents>
<document url="http://www.edletter.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" pagerank="6">
    <category name="top/reference/education/journals" />
    <search query="united states preschool teachers and statistics" aol500k_id="807613" date="2006-03-23" time="18:31:58" rank="12" />
    <search query="nclb and kindergarten" aol500k_id="7516545" date="2006-03-12" time="16:58:12" rank="16" />
    <search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
    ...
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="Home" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="www.edletter.org/" />
    ...
    <top_tag name="education" count="5" />
    <top_tag name="newsletter" count="2" />
    <top_tag name="research" count="3" />
    ...
    <bookmark user="mohandas" tags="edumags" date="2005-07" />
    <bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
    <bookmark user="lllnelson2004" tags="edl600, edl671" date="2006-02" />
    ...
</document>
    ...
    ...
    ...
</documents>
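
Since the uncompressed corpus is large, reading it as a stream is probably more practical than loading the whole tree at once. The following is a sketch using `xml.etree.ElementTree.iterparse` from the Python standard library. The function name `iter_documents`, the summary fields, and the file name in the final comment are assumptions for illustration, and the in-memory sample is an abbreviated version of the example above.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(source):
    """Yield one summary dict per <document> element found in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "document":
            continue
        yield {
            "url": elem.get("url"),
            "pagerank": elem.get("pagerank"),
            "queries": [s.get("query") for s in elem.findall("search")],
            "anchor_texts": [a.get("anchor_text") for a in elem.findall("inlink")],
            "top_tags": {t.get("name"): int(t.get("count"))
                         for t in elem.findall("top_tag")},
            "bookmarks": len(elem.findall("bookmark")),
        }
        elem.clear()  # drop the children of the processed document


# Tiny in-memory stand-in for the corpus, abbreviated from the example above.
SAMPLE = b"""<documents>
<document url="http://www.edletter.org/" pagerank="6">
    <category name="top/reference/education/journals" />
    <search query="nclb and kindergarten" aol500k_id="7516545" date="2006-03-12" time="16:58:12" rank="16" />
    <inlink anchor_text="Harvard Education Letter" />
    <top_tag name="education" count="5" />
    <bookmark user="mohandas" tags="edumags" date="2005-07" />
</document>
</documents>"""

docs = list(iter_documents(io.BytesIO(SAMPLE)))
print(docs[0]["url"], docs[0]["top_tags"])

# For the real download one would pass a gzip stream instead, e.g.
# gzip.open("CABS120k08.xml.gz", "rb")  # hypothetical file name
```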

Download

Legal information

By downloading, you acknowledge that:

The data has been compiled for the purposes of illustration for scientific research.
The copyright holders retain ownership and reserve all rights.

Download: CABS120k08 data corpus (86 MB, GZIP compressed)

How to reference

When you use CABS120k08 for your own research, please use my corresponding paper as reference.

Feedback

Comments, questions and constructive feedback are always welcome. Just drop me a note.

DMOZ100k06 (published 2007)
Large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from Delicious.com/Yahoo!, Google, and ICRA.