Michael G. Noll

Applied Research. Big Data. Distributed Systems. Open Source.

DMOZ100k06

DMOZ100k06 is a large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from Delicious.com/Yahoo!, Google, and ICRA.

This web page describes the DMOZ100k06 data set as published in my paper Authors vs. Readers: A Comparative Study of Document Metadata and Content in the WWW at the 7th International ACM Symposium on Document Engineering, Canada, August 2007.

The Data Set

This section gives you a short overview of DMOZ100k06. The corpus is described in detail in my paper Authors vs. Readers: A Comparative Study of Document Metadata and Content in the WWW, for which the corpus was built. The paper includes both a quantitative and qualitative analysis of DMOZ100k06.

Figure 1: Frequency of metadata supplied by end users via social bookmarking and tagging by Google PageRank. For instance, 32.8% of all tag annotations in DMOZ100k06 were applied to documents with a PageRank of 6.
Overview | Total | Comment
documents* | 97,578 |
bookmarks | 180,246 |
bookmarked documents | 13,771 | 14.1%
(common) tags | 25,311 | 6,090 unique
tagged documents | 4,992 | 5.1%

* The remaining 2,422 documents of the 100,000-document sample could not be retrieved.

Statistics per document | Mean | Std. dev.
bookmarks | 1.85 | 47.68
tags | 0.26 | 1.80
PageRank | 3.13 | 1.66
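
The derived figures in the two tables can be cross-checked from the totals. The sketch below is only a sanity check; it assumes that the percentages and the per-document means are computed over the 97,578 retrieved documents, which is consistent with the published numbers but is not stated explicitly above. All variable names are illustrative.

```python
# Totals copied from the overview table.
SAMPLE_SIZE = 100_000
DOCUMENTS = 97_578
BOOKMARKS = 180_246
BOOKMARKED_DOCUMENTS = 13_771
TAGS = 25_311
TAGGED_DOCUMENTS = 4_992


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Documents from the sample that could not be retrieved.
unretrieved = SAMPLE_SIZE - DOCUMENTS

# Assumption: shares and means use the retrieved documents as denominator.
bookmarked_share = share(BOOKMARKED_DOCUMENTS, DOCUMENTS)
tagged_share = share(TAGGED_DOCUMENTS, DOCUMENTS)
mean_bookmarks = round(BOOKMARKS / DOCUMENTS, 2)
mean_tags = round(TAGS / DOCUMENTS, 2)

print(unretrieved, bookmarked_share, tagged_share, mean_bookmarks, mean_tags)
```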

Information Sources

For each web document in the sample, I retrieved the actual document from the WWW plus metadata from the social bookmarking service Delicious.com, from the Internet Content Rating Association (ICRA), and from Google, as shown in Figure 2. This means DMOZ100k06 provides the following types of information about a web document:

  • metadata by authors/webmasters: ICRA content labels
  • metadata by readers/visitors: del.icio.us bookmarks and tags
  • “popularity”: represented by Google PageRank
  • technical infrastructure: average HTTP response time of the web server

Figure 2: Information sources used for building the DMOZ100k06 corpus. The sources of the original web documents containing the HTML metadata and the actual content (gray blocks) are not included in the download at the moment.

Tools of the Trade

I implemented custom software tools that relied on the services’ official APIs where possible and fell back to alternative techniques where the APIs did not provide the required functionality.

Data Format

The corpus is in XML format with the following structure.

Documents

Each web document is represented as a <document> with the following attributes:

  • url: URL of the web document
  • users: Number of del.icio.us users who have bookmarked the URL
  • tags: Number of del.icio.us common tags for the URL, i.e. the most popular tags (0…25). For a description of common tags, see section “Tags” below.
  • pagerank: Google PageRank of the URL
  • icra: ICRA label status of the URL, with the following possible values: red (no label(s) or only incorrect label(s)), yellow (correct label but only partial coverage of the document content, i.e. some elements such as images or banners are not covered by the label), green (correct label, full coverage of the document content), error (error during label tests, such as server errors or network timeouts)
  • http_response_time: Mean HTTP response time of the web server serving the URL, averaged over several runs on different days and at different times, with DNS caching applied
  • <tag> elements (child elements, not an attribute): del.icio.us common tags in detail (see below)

Note that the ICRA label test checks only the syntactic correctness of a label, not whether the content of the URL actually matches the label’s description! For a semantic analysis of ICRA labels, read my paper Web Page Classification: An Exploratory Study of Internet Content Rating Systems.
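
Because the corpus holds close to 100,000 <document> elements, it can be convenient to stream it rather than load it in one go. The following is a minimal sketch, not the tooling used to build the corpus: it tallies the icra status values with Python's standard xml.etree.ElementTree.iterparse. The function name count_icra_status and the two-document sample (including the second, made-up document) are assumptions for illustration; with the real download you would pass the path of the corpus file instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-document sample in the corpus format described above.
SAMPLE = b"""<documents>
    <document url="http://www.example.com/" users="33" tags="1" pagerank="8" icra="red" http_response_time="0.541234">
        <tag name="dmoz" weight="5" />
    </document>
    <document url="http://www.example.org/" users="0" tags="0" pagerank="3" icra="green" http_response_time="0.25" />
</documents>"""


def count_icra_status(source):
    """Count icra attribute values (red, yellow, green, error) in a corpus.

    `source` is a file name or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "document":
            counts[elem.get("icra")] += 1
            elem.clear()  # drop the element's children to keep memory low
    return counts


icra_counts = count_icra_status(io.BytesIO(SAMPLE))
print(dict(icra_counts))
```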

Tags

Each <document> may contain up to 25 <tag> elements, which represent the Delicious.com common tags for the document. Delicious.com limits common tags to 25 per URL, which means that the full list of tags attached to a document might actually be (much) larger than 25. Only the common tags of a document were retrieved, rather than all tags, because of technical restrictions. Note that the number of <tag> elements of a document is always equal to the value of the <document>’s “tags” attribute (see above).

  • name: Name of the tag, e.g. “news” or “delicious”
  • weight: Weight of the tag (1…5) as returned by Delicious.com; higher values denote higher popularity of the tag
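
The constraints described in this section (at most 25 <tag> elements, a count matching the “tags” attribute, weights between 1 and 5) can be checked mechanically. The sketch below is one way to do that in Python; check_document and the two tiny test documents are hypothetical and only illustrate the rules stated above.

```python
import xml.etree.ElementTree as ET

MAX_COMMON_TAGS = 25        # Delicious.com's limit on common tags per URL
WEIGHT_RANGE = range(1, 6)  # valid tag weights: 1..5


def check_document(doc):
    """Return a list of human-readable problems found in one <document>."""
    problems = []
    tags = doc.findall("tag")
    declared = int(doc.get("tags", "0"))
    if declared != len(tags):
        problems.append(f"tags={declared} but {len(tags)} <tag> elements")
    if len(tags) > MAX_COMMON_TAGS:
        problems.append(f"more than {MAX_COMMON_TAGS} <tag> elements")
    for tag in tags:
        weight = int(tag.get("weight", "0"))
        if weight not in WEIGHT_RANGE:
            problems.append(f"tag {tag.get('name')!r} has weight {weight}")
    return problems


# Hypothetical documents: one consistent, one violating two of the rules.
good = ET.fromstring(
    '<document url="http://www.example.com/" tags="2">'
    '<tag name="news" weight="3" /><tag name="delicious" weight="1" />'
    '</document>'
)
bad = ET.fromstring(
    '<document url="http://www.example.com/" tags="2">'
    '<tag name="news" weight="7" />'
    '</document>'
)

good_problems = check_document(good)
bad_problems = check_document(bad)
print(good_problems, bad_problems)
```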

Example

<documents>
    <document url="http://www.example.com/" users="33" tags="9" pagerank="8" icra="red" http_response_time="0.541234">
        <tag name="delicious" weight="1" />
        <tag name="dmoz100k06" weight="4" />
        <tag name="dmoz" weight="5" />
        <tag name="google" weight="1" />
        <tag name="icra" weight="1" />
        <tag name="information retrieval" weight="1" />
        <tag name="research" weight="3" />
        <tag name="web2.0" weight="1" />
        <tag name="viIsBetterThanEmacs" weight="1" />
    </document>
    ...
    ...
    ...
</documents>
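
To show how the format might be consumed, here is a short Python sketch that turns <document> elements into typed records. It is an illustration under assumptions, not an official reader: the Document class and parse_documents function are names made up for this example, the embedded XML is a shortened variant of the example above (three tags instead of nine), and the code assumes that users and pagerank are always integers and http_response_time is always a decimal number.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Shortened variant of the example above (assumption: same attribute layout).
EXAMPLE = """<documents>
    <document url="http://www.example.com/" users="33" tags="3" pagerank="8" icra="red" http_response_time="0.541234">
        <tag name="dmoz100k06" weight="4" />
        <tag name="dmoz" weight="5" />
        <tag name="research" weight="3" />
    </document>
</documents>"""


@dataclass
class Document:
    url: str
    users: int
    pagerank: int
    icra: str
    http_response_time: float
    tags: dict = field(default_factory=dict)  # tag name -> weight (1..5)


def parse_documents(xml_text):
    """Parse a <documents> string into a list of Document records."""
    docs = []
    for node in ET.fromstring(xml_text).iter("document"):
        docs.append(Document(
            url=node.get("url"),
            users=int(node.get("users")),
            pagerank=int(node.get("pagerank")),
            icra=node.get("icra"),
            http_response_time=float(node.get("http_response_time")),
            tags={t.get("name"): int(t.get("weight"))
                  for t in node.findall("tag")},
        ))
    return docs


docs = parse_documents(EXAMPLE)
# Most popular common tag of the first document, by weight.
top_tag = max(docs[0].tags, key=docs[0].tags.get)
print(docs[0].url, docs[0].users, top_tag)
```

The number of common tags is not stored separately here because it equals len(doc.tags) for a consistent document.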

Download

By downloading, you acknowledge that:

  • The data has been compiled for the purposes of illustration for scientific research.
  • The copyright holders retain ownership and reserve all rights.

Download: DMOZ100k06 data corpus (version without web document sources)

How to reference

When you use DMOZ100k06 for your own research, please use my corresponding paper as a reference.

Feedback

Comments, questions and constructive feedback are always welcome. Just drop me a note.

Related Research Data Sets

  • CABS120k08 (published 2008)
    Large research data set about Web metadata based on a sample of 120,000 web documents with data retrieved from the Open Directory Project, the AOL Search query log corpus AOL500k, Google PageRank, Delicious.com/Yahoo!, and anchor text from incoming hyperlinks
