Michael G. Noll

Applied Research. Big Data. Distributed Systems. Open Source.

CABS120k08

CABS120k08 is a large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks.

This web page describes the CABS120k08 data set as published in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries at the 7th IEEE/WIC/ACM International Conference on Web Intelligence (WI), Australia, December 2008.

The CABS120k08 data set contains 117,434 URLs with additional metadata, based on an intersection of the AOL Search query log corpus called AOL500k from 2006 and the Open Directory Project (DMOZ).

The randomly sampled AOL500k corpus is one of the largest publicly available collections of search queries today. It consists of 20 million web queries collected from 650,000 users on AOL Search over three months in 2006. As a result of these search queries, 1.6 million different web documents were visited by users.

The Open Directory Project is by its own account “the largest, most comprehensive human-edited directory of the Web”. At the time I built the CABS120k08 corpus, the Open Directory contained about 4.8 million web documents in about 590,000 categories.

To create CABS120k08, I intersected the URLs (= web documents) in AOL500k with those in the Open Directory. A web document made it into the data set only if it was both searched for and subsequently visited (AOL500k) and categorized (Open Directory).
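
As an illustration only, the intersection step can be sketched as a set operation. The source does not state how URLs from the two collections were matched, so the `normalize` helper below (trim whitespace, drop trailing slashes) and the sample URLs are assumptions, not the method actually used for CABS120k08.

```python
def normalize(url: str) -> str:
    """Hypothetical normalization: trim whitespace and drop trailing slashes."""
    return url.strip().rstrip("/")


def intersect_urls(clicked_urls, directory_urls):
    """Return the URLs that occur in both collections after normalization."""
    clicked = {normalize(u) for u in clicked_urls}
    listed = {normalize(u) for u in directory_urls}
    return clicked & listed


# Made-up inputs: URLs clicked in a query log vs. URLs listed in a directory.
aol_clicked = ["http://www.edletter.org/", "http://example.com/a"]
dmoz_listed = ["http://www.edletter.org", "http://example.org/b"]

common = intersect_urls(aol_clicked, dmoz_listed)
print(sorted(common))
```

How aggressively URLs are normalized decides how many documents survive the intersection, so a real pipeline would have to fix that rule explicitly.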

The Data Set

This section gives you a short overview of CABS120k08. The corpus is described in detail in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries, for which the corpus was built. The paper includes both a quantitative and qualitative analysis of CABS120k08.

Overview                 Total        Comment
Total documents          117,434
Total categories         84,663
Total searches           2,617,326
Total anchor texts       2,242,621
Total users              3,383,571
Total bookmarks          1,289,563    unique: 9.1 %
Total tags               3,383,571    unique: 26.3 %
Categorized documents*   117,434      100.0 %
Searched documents*      117,434      100.0 %
Anchored documents       95,230       81.1 %
Bookmarked documents     59,126       50.3 %
Tagged documents         56,457       48.1 %

*100% by definition, i.e. the way I created the CABS120k08 data set

Estimated probabilities prob.
P(bookmarked ∧ anchor text) 0.467
P(tagged ∧ anchor text) 0.447
P(bookmarked | anchor text) 0.575
P(tagged | anchor text) 0.552
P(anchor text | bookmarked) 0.927
P(anchor text | tagged) 0.930
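
Estimates of this kind are presumably relative frequencies over the documents in the corpus. The sketch below shows one way to compute such a conditional probability. It assumes that "bookmarked" means `users > 0` and "anchor text" means `inlinks > 0`. The four toy records are invented for illustration and are not taken from CABS120k08.

```python
def conditional(docs, event, given):
    """Estimate P(event | given) as a relative frequency over docs."""
    given_docs = [d for d in docs if given(d)]
    if not given_docs:
        return float("nan")  # undefined when the condition never holds
    return sum(1 for d in given_docs if event(d)) / len(given_docs)


def bookmarked(doc):
    # Assumption: a document counts as bookmarked if at least one user saved it.
    return doc["users"] > 0


def anchored(doc):
    # Assumption: a document counts as anchored if it has at least one inlink.
    return doc["inlinks"] > 0


# Invented toy records, only to exercise the functions.
docs = [
    {"users": 10, "inlinks": 36},
    {"users": 0, "inlinks": 4},
    {"users": 3, "inlinks": 0},
    {"users": 0, "inlinks": 0},
]

p_bookmarked_given_anchor = conditional(docs, bookmarked, anchored)
p_anchor_given_bookmarked = conditional(docs, anchored, bookmarked)
print(p_bookmarked_given_anchor, p_anchor_given_bookmarked)
```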

Information Sources

For each URL, I retrieved and compiled the following metadata:

  • popularity*: represented by Google PageRank (www.google.com)
  • categorization: Open Directory categories (generally, only one category per URL)
  • search queries from AOL500k corpus
  • incoming hyperlinks with anchor text (via Google search for “link:…” followed by a targeted web crawl and HTML parsing)
  • metadata by readers/visitors: Delicious.com bookmarks and tags

* “general” popularity as computed/interpreted by search engines

Tools of the Trade

I implemented custom software tools that relied on the services’ official APIs where possible and fell back to alternative techniques where the APIs did not provide the required functionality.

Data Format

The corpus is in XML format with the following structure.

Documents

Each web document is represented as a <document> with the following attributes:

url
URL of the web document
categories
number of Open Directory categories
searches
number of searches on AOL Search in AOL500k
inlinks
number of incoming hyperlinks with anchor text
users
number of Delicious.com users who have bookmarked this URL (equals the number of bookmarks of this URL at Delicious.com)
top_tags
number of Delicious.com top tags for this URL, i.e. the most popular tags (maximum: 10)
tags
number of unique Delicious.com tags for this URL
pagerank
Google PageRank of the URL
<category> elements
Open Directory categories (see below)
<top_tag> elements
the Delicious.com top tags in detail (see below)
<bookmark> elements
bookmarks of Delicious.com users including tags (see below)

Categories

Each <document> contains 1 or more <category> elements, which denote the document URL’s categories as assigned by the Open Directory Project.

name
full category information with sub-categories separated by “/”, e.g. “top/sports/golf/courses”
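
Because the category path is a single “/”-separated string, the individual levels can be recovered with a plain split. A minimal sketch; `category_levels` is a hypothetical helper name.

```python
def category_levels(name: str) -> list[str]:
    """Split an Open Directory category path into its levels, top level first."""
    return [part for part in name.split("/") if part]


levels = category_levels("top/sports/golf/courses")
print(levels)       # the individual levels
print(len(levels))  # depth of the category in the hierarchy
```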

Search Queries

Each <document> contains 1 or more <search> elements, which denote any search queries from AOL500k where this document was visited as a result of the search.

query
the search keywords for the query
aol500k_id
the searching user’s anonymized ID in AOL500k
date
the date of the search in YYYY-MM-DD format
time
the time of day of the search in HH:MM:SS format
rank
the document’s position in the corresponding search result list when it was clicked
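
Since date and time are stored as separate attributes, analyses over time usually need them combined. The following sketch parses a single <search> element (copied from the example record on this page) with Python's standard library. `search_timestamp` is a hypothetical helper name.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# One <search> element as it appears in the example record.
search = ET.fromstring(
    '<search query="harvard education letters" aol500k_id="2229594" '
    'date="2006-03-21" time="01:43:37" rank="4" />'
)


def search_timestamp(elem):
    """Combine the date (YYYY-MM-DD) and time (HH:MM:SS) attributes."""
    stamp = f'{elem.get("date")} {elem.get("time")}'
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


ts = search_timestamp(search)
rank = int(search.get("rank"))  # attribute values are strings in the XML
print(ts.isoformat(), rank)
```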

Incoming anchor texts

Each <document> may contain 1 or more <inlink> elements, which denote any incoming hyperlinks with anchor texts for this document.

anchor_text
the anchor text of the incoming hyperlink

Delicious.com top tags

Each <document> may contain 1 or more <top_tag> elements, which represent the Delicious.com top tags for a URL in detail.

name
name of the tag, e.g. “news”
count
number of times this tag has been used by users

Note: the number of <top_tag> elements of a URL is equal to the document’s “top_tags” attribute value (see above).

Delicious.com user bookmarks

Each <document> may contain 1 or more <bookmark> elements, which represent the users’ bookmarks of the document.

user
name of the Delicious.com user who bookmarked the URL
tags
comma-separated list of tags with which the user annotated the bookmark
date
creation date of the bookmark in YYYY-MM format

Note that I could have stored the user’s tags as separate XML elements, but this would have been overkill and increased the corpus file size without a real benefit.
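
Because the tags are stored as one comma-separated attribute, consumers have to split them themselves. A minimal sketch, using the tag strings from the example record. `split_tags` is a hypothetical helper, and stripping the whitespace around each tag is an assumption based on how the example is formatted.

```python
from collections import Counter


def split_tags(value: str) -> list[str]:
    """Split a comma-separated tags attribute into individual tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


# "tags" attribute values of the three bookmarks in the example record.
bookmark_tags = ["edumags", "pedagogy, teaching", "edl600, edl671"]

# Count how often each tag occurs across the bookmarks.
tag_counts = Counter(tag for value in bookmark_tags for tag in split_tags(value))
print(tag_counts.most_common())
```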

Example

<documents>
<document url="http://www.edletter.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" pagerank="6">
    <category name="top/reference/education/journals" />
    <search query="united states preschool teachers and statistics" aol500k_id="807613" date="2006-03-23" time="18:31:58" rank="12" />
    <search query="nclb and kindergarten" aol500k_id="7516545" date="2006-03-12" time="16:58:12" rank="16" />
    <search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
    ...
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="Home" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="www.edletter.org/" />
    ...
    <top_tag name="education" count="5" />
    <top_tag name="newsletter" count="2" />
    <top_tag name="research" count="3" />
    ...
    <bookmark user="mohandas" tags="edumags" date="2005-07" />
    <bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
    <bookmark user="lllnelson2004" tags="edl600, edl671" date="2006-02" />
    ...
    </document>
    ...
    ...
    ...
</documents>
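
A record like this can be read with any XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example. The `summarize` function and the keys of the dictionary it returns are hypothetical choices for illustration. The shortened sample keeps only some child elements, so the counts in the attributes (which describe the full record) do not match the number of children shown here.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the example record.
SAMPLE = """<documents>
<document url="http://www.edletter.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" pagerank="6">
    <category name="top/reference/education/journals" />
    <search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
    <inlink anchor_text="Harvard Education Letter" />
    <inlink anchor_text="Home" />
    <top_tag name="education" count="5" />
    <bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
</document>
</documents>"""


def summarize(doc):
    """Collect the main metadata of one <document> element in a dict."""
    return {
        "url": doc.get("url"),
        "pagerank": int(doc.get("pagerank")),
        "categories": [c.get("name") for c in doc.findall("category")],
        "queries": [s.get("query") for s in doc.findall("search")],
        "anchor_texts": [a.get("anchor_text") for a in doc.findall("inlink")],
        "top_tags": {t.get("name"): int(t.get("count")) for t in doc.findall("top_tag")},
    }


root = ET.fromstring(SAMPLE)
summaries = [summarize(doc) for doc in root.findall("document")]
print(summaries[0]["url"], summaries[0]["anchor_texts"])
```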

Download

By downloading, you acknowledge that:

  • The data has been compiled for the purposes of illustration for scientific research.
  • The copyright holders retain ownership and reserve all rights.

Download: CABS120k08 data corpus (86 MB, GZIP compressed)
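
Given the size of the corpus, it may be preferable to stream the file rather than load it into memory at once. The following sketch assumes the download is a single gzip-compressed XML file. The file name `CABS120k08.xml.gz` in the usage comment is a placeholder, not the actual name of the download.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_documents(path):
    """Yield <document> elements one at a time from a gzip-compressed corpus file.

    Each element is cleared after the caller has moved on to the next one,
    so callers should extract what they need before advancing the iterator.
    """
    with gzip.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "document":
                yield elem
                elem.clear()  # free the children of the finished document


# Usage (the path is a placeholder):
# bookmarked = sum(1 for doc in iter_documents("CABS120k08.xml.gz")
#                  if int(doc.get("users", "0")) > 0)
```

Clearing each element after use keeps memory consumption low, at the cost that an element must not be stored and read later.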

How to reference

When you use CABS120k08 for your own research, please cite my corresponding paper.

Feedback

Comments, questions and constructive feedback are always welcome. Just drop me a note.

Related Research Data Sets

  • DMOZ100k06 (published 2007)
    Large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from Delicious.com/Yahoo!, Google, and ICRA.
