CABS120k08
CABS120k08 is a large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com, and anchor text from incoming hyperlinks.
This web page describes the CABS120k08 data set as published in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries at the 7th IEEE/WIC/ACM International Conference on Web Intelligence (WI), Australia, December 2008.
The CABS120k08 data set contains 117,434 URLs with additional metadata, based on an intersection of the AOL Search query log corpus called AOL500k from 2006 and the Open Directory Project (DMOZ).
The randomly sampled AOL500k corpus is one of the largest publicly available collections of search queries today. It consists of 20 million web queries collected from 650,000 users on AOL Search over three months in 2006. As a result of these search queries, users visited 1.6 million distinct web documents.
The Open Directory Project is by its own account “the largest, most comprehensive human-edited directory of the Web”. At the time I built the CABS120k08 corpus, the Open Directory contained about 4.8 million web documents in about 590,000 categories.
To create CABS120k08, I intersected the URLs (= web documents) in AOL500k with those in the Open Directory. Only web documents that were both searched for and subsequently visited (AOL500k) and categorized (Open Directory) made it into the data set.
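The intersection step can be sketched in a few lines of Python. This is only an illustration: the page does not say how URLs were matched, so the `normalize` rule (lowercasing scheme and host, dropping a trailing slash) and both function names are assumptions, not the procedure actually used to build the corpus.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop trailing slashes (an assumed, simplistic rule)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def intersect_urls(clicked_urls, directory_urls):
    """Return the URLs that occur both in the query log and in the directory."""
    clicked = {normalize(u) for u in clicked_urls}
    listed = {normalize(u) for u in directory_urls}
    return clicked & listed
```

A set intersection keeps the step linear in the size of both inputs, which matters when one side holds 1.6 million URLs and the other about 4.8 million.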
The Data Set
This section gives you a short overview of CABS120k08. The corpus is described in detail in my paper The Metadata Triumvirate: Social Annotations, Anchor Texts and Search Queries, for which the corpus was built. The paper includes both a quantitative and qualitative analysis of CABS120k08.
Overview | Total | Comment |
---|---|---|
Total documents | 117,434 | |
Total categories | 84,663 | |
Total searches | 2,617,326 | |
Total anchor texts | 2,242,621 | |
Total users | 3,383,571 | |
Total bookmarks | 1,289,563 | unique: 9.1 % |
Total tags | 3,383,571 | unique: 26.3 % |
Categorized documents* | 117,434 | 100.0 % |
Searched documents* | 117,434 | 100.0 % |
Anchored documents | 95,230 | 81.1 % |
Bookmarked documents | 59,126 | 50.3 % |
Tagged documents | 56,457 | 48.1 % |
*100% by definition, i.e. the way I created the CABS120k08 data set
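The percentages in the lower half of the overview table are the share of all 117,434 documents that carry the respective kind of metadata. A minimal sketch that recomputes them from the counts above (the variable and function names are illustrative only):

```python
TOTAL_DOCUMENTS = 117_434

# Document counts taken from the overview table.
counts = {"anchored": 95_230, "bookmarked": 59_126, "tagged": 56_457}


def coverage(count: int, total: int = TOTAL_DOCUMENTS) -> float:
    """Share of documents in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: coverage(n) for name, n in counts.items()}
print(shares)
```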
Estimated probabilities | prob. |
---|---|
P(bookmarked ∩ anchor text) | 0.467 |
P(tagged ∩ anchor text) | 0.447 |
P(bookmarked | anchor text) | 0.575 |
P(tagged | anchor text) | 0.552 |
P(anchor text | bookmarked) | 0.927 |
P(anchor text | tagged) | 0.930 |
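The values in this table are tied together by the definition of conditional probability, P(A | B) = P(A and B) / P(B). The sketch below assumes that 0.467 and 0.447 are the joint probabilities of a document being bookmarked (or tagged) and having anchor text; this reading is inferred from the arithmetic and is not stated explicitly on this page. The marginals are the rounded shares from the overview table.

```python
# Marginal probabilities, taken from the rounded percentages in the overview table.
P_ANCHOR = 0.811
P_BOOKMARKED = 0.503
P_TAGGED = 0.481

# Assumed joint probabilities (see the lead-in above).
P_BOOKMARKED_AND_ANCHOR = 0.467
P_TAGGED_AND_ANCHOR = 0.447


def conditional(p_joint: float, p_condition: float) -> float:
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_condition


estimates = {
    "P(bookmarked | anchor text)": conditional(P_BOOKMARKED_AND_ANCHOR, P_ANCHOR),
    "P(tagged | anchor text)": conditional(P_TAGGED_AND_ANCHOR, P_ANCHOR),
    "P(anchor text | bookmarked)": conditional(P_BOOKMARKED_AND_ANCHOR, P_BOOKMARKED),
    "P(anchor text | tagged)": conditional(P_TAGGED_AND_ANCHOR, P_TAGGED),
}
```

Because the inputs are rounded to three digits, the recomputed values match the table only to within about 0.002.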
Information Sources
For each URL, I retrieved and compiled the following metadata:
- popularity*: represented by Google PageRank (www.google.com)
- categorization: Open Directory categories (generally, only one category per URL)
- search queries from AOL500k corpus
- incoming hyperlinks with anchor text (via Google search for “link:…” followed by a targeted web crawl and HTML parsing)
- metadata by readers/visitors: Delicious.com bookmarks and tags
* “general” popularity as computed/interpreted by search engines
Tools of the Trade
I implemented custom software tools which relied on the services’ official APIs where possible and fell back to alternative techniques for situations where the APIs did not provide the required functionality:
- Python: custom scripts and modules, most notably my Python API for Delicious
- Perl: small script using WWW::Google::PageRank
- Unix Swiss Army knife: grep, sed, awk
Data Format
The corpus is in XML format with the following structure.
Documents
Each web document is represented as a <document> element with the following attributes and child elements:
- url: URL of the web document
- categories: number of Open Directory categories
- searches: number of searches on AOL Search in AOL500k
- inlinks: number of incoming hyperlinks with anchor text
- users: number of Delicious.com users who have bookmarked this URL (equals the number of bookmarks of this URL at Delicious.com)
- top_tags: number of Delicious.com top tags for this URL, i.e. the most popular tags (maximum: 10)
- tags: number of unique Delicious.com tags for this URL
- pagerank: Google PageRank of the URL
- <category> elements: Open Directory categories (see below)
- <search> elements: search queries from AOL500k (see below)
- <inlink> elements: incoming hyperlinks with anchor text (see below)
- <top_tag> elements: the Delicious.com top tags in detail (see below)
- <bookmark> elements: bookmarks of Delicious.com users including tags (see below)
Categories
Each <document> contains 1 or more <category> elements, which denote the document URL’s categories as assigned by the Open Directory Project.
- name: full category information with sub-categories separated by "/", e.g. "top/sports/golf/courses"
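Since the category path is a single "/"-separated string, the hierarchy levels can be recovered with a plain split. A minimal sketch (the helper names are hypothetical):

```python
def category_levels(name: str) -> list[str]:
    """Split an Open Directory category path into its hierarchy levels."""
    return [part for part in name.split("/") if part]


def category_ancestors(name: str) -> list[str]:
    """Return every prefix of the path, from the root down to the full category."""
    levels = category_levels(name)
    return ["/".join(levels[: i + 1]) for i in range(len(levels))]
```

The ancestor list is handy when documents should be aggregated at a coarser level, for example by their top two category levels.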
Search Queries
Each <document> contains 1 or more <search> elements, which denote any search queries from AOL500k where this document was visited as a result of the search.
- query: the search keywords for the query
- aol500k_id: the searching user's anonymized ID in AOL500k
- date: the date of the search in YYYY-MM-DD format
- time: the time of day of the search in HH:MM:SS format
- rank: the document's position in the corresponding search result list when it was clicked
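A <search> element can be turned into typed Python values with the standard library. The sketch below assumes that every element carries all five attributes in the formats listed above; `parse_search` and the dictionary keys are illustrative names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def parse_search(elem: ET.Element) -> dict:
    """Convert a <search> element into a dictionary with typed values."""
    timestamp = f'{elem.get("date")} {elem.get("time")}'
    return {
        "query": elem.get("query"),
        "user_id": elem.get("aol500k_id"),
        # Combine the separate date and time attributes into one datetime.
        "when": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "rank": int(elem.get("rank")),
    }


sample = ET.fromstring(
    '<search query="harvard education letters" aol500k_id="2229594" '
    'date="2006-03-21" time="01:43:37" rank="4" />'
)
record = parse_search(sample)
```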
Incoming anchor texts
Each <document> may contain 1 or more <inlink> elements, which denote any incoming hyperlinks with anchor texts for this document.
- anchor_text: the anchor text of the incoming hyperlink
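The same anchor text can occur on several incoming links, so a frequency count is often more useful than the raw list. A small sketch, in which lowercasing and whitespace stripping are assumed normalization choices rather than anything the corpus prescribes:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def anchor_text_counts(document: ET.Element) -> Counter:
    """Count how often each (lowercased, stripped) anchor text points at a document."""
    return Counter(
        link.get("anchor_text", "").strip().lower()
        for link in document.findall("inlink")
    )


doc = ET.fromstring(
    "<document>"
    '<inlink anchor_text="Harvard Education Letter" />'
    '<inlink anchor_text="Home" />'
    '<inlink anchor_text="Harvard Education Letter" />'
    '<inlink anchor_text="www.edletter.org/" />'
    "</document>"
)
counts_by_text = anchor_text_counts(doc)
```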
Delicious.com top tags
Each <document> may contain 1 or more <top_tag> elements, which represent the Delicious.com top tags for a URL in detail.
- name: name of the tag, e.g. "news"
- count: number of times this tag has been used by users
Note: the number of <top_tag> elements of a URL is equal to the document’s “top_tags” attribute value (see above).
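The <top_tag> elements are not necessarily stored in order of popularity, so a consumer may want to sort them by count. A minimal sketch (the function name and the alphabetical tie-break are assumptions):

```python
import xml.etree.ElementTree as ET


def ranked_top_tags(document: ET.Element) -> list[tuple[str, int]]:
    """Return (tag, count) pairs, most frequently used tag first."""
    pairs = [(t.get("name"), int(t.get("count"))) for t in document.findall("top_tag")]
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


doc = ET.fromstring(
    "<document>"
    '<top_tag name="education" count="5" />'
    '<top_tag name="newsletter" count="2" />'
    '<top_tag name="research" count="3" />'
    "</document>"
)
ranking = ranked_top_tags(doc)
```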
Delicious.com user bookmarks
Each <document> may contain 1 or more <bookmark> elements, which represent the users’ bookmarks of the document.
- user: name of the Delicious.com user who bookmarked the URL
- tags: comma-separated list of tags with which the user annotated the bookmark
- date: creation date of the bookmark in YYYY-MM format
Note that I could have stored the user’s tags as separate XML elements, but this would have been overkill and increased the corpus file size without a real benefit.
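Because the tags of a bookmark are packed into one attribute, they have to be split on the consumer side. A minimal sketch, assuming that tags themselves never contain commas; the helper names are hypothetical:

```python
from datetime import datetime


def split_tags(tags_attr: str) -> list[str]:
    """Split the comma-separated "tags" attribute into a list of clean tags."""
    return [tag.strip() for tag in tags_attr.split(",") if tag.strip()]


def bookmark_month(date_attr: str) -> datetime:
    """Parse the YYYY-MM creation date; the day defaults to the first of the month."""
    return datetime.strptime(date_attr, "%Y-%m")
```

For example, `split_tags("pedagogy, teaching")` yields the two tags without the space that follows the comma in the attribute value.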
Example
<documents>
<document url="http://www.edletter.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" pagerank="6">
<category name="top/reference/education/journals" />
<search query="united states preschool teachers and statistics" aol500k_id="807613" date="2006-03-23" time="18:31:58" rank="12" />
<search query="nclb and kindergarten" aol500k_id="7516545" date="2006-03-12" time="16:58:12" rank="16" />
<search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
...
<inlink anchor_text="Harvard Education Letter" />
<inlink anchor_text="Home" />
<inlink anchor_text="Harvard Education Letter" />
<inlink anchor_text="www.edletter.org/" />
...
<top_tag name="education" count="5" />
<top_tag name="newsletter" count="2" />
<top_tag name="research" count="3" />
...
<bookmark user="mohandas" tags="edumags" date="2005-07" />
<bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
<bookmark user="lllnelson2004" tags="edl600, edl671" date="2006-02" />
...
</document>
...
...
...
</documents>
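A document in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below uses a shortened copy of the example above as input; `summarize` and the keys of the returned dictionary are illustrative names, and the code assumes that every document has a numeric pagerank attribute.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<documents>
<document url="http://www.edletter.org/" users="10" categories="1" searches="29" inlinks="36" top_tags="5" tags="9" pagerank="6">
<category name="top/reference/education/journals" />
<search query="harvard education letters" aol500k_id="2229594" date="2006-03-21" time="01:43:37" rank="4" />
<inlink anchor_text="Harvard Education Letter" />
<top_tag name="education" count="5" />
<bookmark user="selahl" tags="pedagogy, teaching" date="2005-12" />
</document>
</documents>
"""


def summarize(document: ET.Element) -> dict:
    """Collect the main pieces of metadata of one <document> element."""
    return {
        "url": document.get("url"),
        "pagerank": int(document.get("pagerank")),
        "categories": [c.get("name") for c in document.findall("category")],
        "queries": [s.get("query") for s in document.findall("search")],
        "anchor_texts": [a.get("anchor_text") for a in document.findall("inlink")],
        "top_tags": {t.get("name"): int(t.get("count")) for t in document.findall("top_tag")},
        "bookmark_users": [b.get("user") for b in document.findall("bookmark")],
    }


root = ET.fromstring(SAMPLE.strip())
summaries = [summarize(doc) for doc in root.findall("document")]
```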
Download
Legal information
By downloading, you acknowledge that:
- The data has been compiled for the purposes of illustration for scientific research.
- The copyright holders retain ownership and reserve all rights.
Download: CABS120k08 data corpus (86 MB, GZIP compressed)
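Uncompressed, the corpus is fairly large, so streaming it is usually more comfortable than loading everything at once. The sketch below assumes that the download is a single gzip-compressed XML file; the file name in the usage note is a placeholder, not the actual name of the download.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_documents(path):
    """Yield one <document> element at a time from a gzip-compressed corpus file."""
    with gzip.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "document":
                yield elem
                # Drop the element's children and attributes once the caller
                # has moved on, so memory use stays modest.
                elem.clear()
```

A typical loop would look like `for doc in iter_documents("cabs120k08.xml.gz"): ...`, with the path adjusted to wherever the file was saved. Each element should be processed inside the loop body, because it is cleared as soon as the next one is requested.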
How to reference
When you use CABS120k08 for your own research, please use my corresponding paper as reference.
Feedback
Comments, questions and constructive feedback are always welcome. Just drop me a note.
Related Research Data Sets
- DMOZ100k06 (published 2007): Large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from Delicious.com/Yahoo!, Google, and ICRA.