DMOZ100k06

From Michael G. Noll

“A large research data set about document metadata based on a random sample of 100,000 web documents from the Open Directory combined with data retrieved from del.icio.us, Google, and ICRA.”


The DMOZ100k06 data set is based on a random sample of 100,000 web documents from the Open Directory (aka DMOZ). At the time of the sampling in December 2006, the Open Directory RDF Dump contained 4,818,944 web documents in total (so the 100,000-document sample covers 2.1 %) in over 590,000 categories. The data set is the start of a long-term project for studying the impact of end users on the Internet, and how academic research and society can benefit from it.

The Corpus

This section gives you a short overview of DMOZ100k06. The corpus is described in detail in my paper Authors vs. Readers: A Comparative Study of Document Metadata and Content in the WWW [1], for which the corpus was built. The paper includes both a quantitative and qualitative analysis of DMOZ100k06.

Figure 1: Frequency of metadata supplied by end users via social bookmarking and tagging, broken down by Google PageRank. For instance, 32.8% of all tag annotations in DMOZ100k06 were applied to documents with a PageRank of 6.


overview                total      comment
documents*              97,578
bookmarks               180,246
bookmarked documents    13,771     14.1 %
(common) tags           25,311     6,090 unique
tagged documents        4,992      5.1 %

*the remaining 2,422 could not be retrieved

statistics per document    mean     std. dev.
bookmarks                  1.85     47.68
tags                       0.26     1.80
PageRank                   3.13     1.66
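
The percentages and per-document means above can be cross-checked against the raw counts. The following is a minimal sketch; the variable and function names are illustrative only, and it assumes that the percentages and means are taken over the 97,578 retrieved documents, which is consistent with the published figures.

```python
# Counts copied from the overview table and the introduction.
open_directory_total = 4_818_944   # documents in the RDF dump (December 2006)
sampled = 100_000                  # size of the random sample
retrieved = 97_578                 # documents that could be retrieved
bookmarks = 180_246
bookmarked_documents = 13_771
common_tags = 25_311
tagged_documents = 4_992


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


sample_pct = pct(sampled, open_directory_total)
not_retrieved = sampled - retrieved
bookmarked_pct = pct(bookmarked_documents, retrieved)
tagged_pct = pct(tagged_documents, retrieved)

# Assumption: the means are computed over all retrieved documents.
mean_bookmarks = round(bookmarks / retrieved, 2)
mean_tags = round(common_tags / retrieved, 2)

print(sample_pct, not_retrieved, bookmarked_pct, tagged_pct)
print(mean_bookmarks, mean_tags)
```

The standard deviations and the PageRank mean cannot be reproduced this way because they depend on the per-document values in the corpus itself.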

Information Sources

For each web document in the sample, we retrieved the actual document from the WWW plus metadata from the social bookmarking service del.icio.us, from the Internet Content Rating Association, and from Google as shown in figure 2. This means DMOZ100k06 provides the following types of information about a web document:

  • metadata by authors/webmasters: ICRA content labels
  • metadata by readers/visitors: del.icio.us bookmarks and tags
  • "popularity": represented by Google PageRank
  • technical infrastructure: average HTTP response time of web server

Figure 2: Information sources used for building the DMOZ100k06 corpus as shown in [1]. The sources of the original web documents containing the HTML metadata and the actual content (gray blocks) are not included in the download at the moment.

Tools of the Trade

I implemented custom software tools that relied on the services’ official APIs where possible and fell back to alternative techniques where the APIs did not provide the required functionality.

Data Format

The corpus is in XML format with the following structure.

Documents

Each web document is represented as a <document> with the following attributes:

url 
URL of the web document
users 
Number of del.icio.us users who have bookmarked the URL
tags 
Number of del.icio.us common tags for the URL, i.e. the most popular tags (0…25). For a description of common tags, see section “Tags” below.
pagerank 
Google PageRank of the URL
icra 
ICRA label status of the URL, with the following possible values:
  • red: no label(s) or only incorrect label(s)
  • yellow: correct label but only partial coverage of the document content, i.e. some elements such as images or banners are not covered by the label
  • green: correct label, full coverage of the document content
  • error: error during label tests, such as server errors or network timeouts
http_response_time 
Mean HTTP response time of the web server serving the URL, averaged over several runs on different days and at different times, with DNS caching applied
<tag> elements 
del.icio.us common tags in detail (see below)

Note that the ICRA label test checks only the syntactic correctness of a label, not whether the content of the URL actually matches the label’s description! For a semantic analysis of ICRA labels, read my paper Web Page Classification: An Exploratory Study of Internet Content Rating Systems [1].
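
As an illustration of the attribute list above, the following sketch converts one <document> element into a typed record. The names DocumentRecord and parse_document are hypothetical and not part of the corpus or its tools. The sketch also assumes that every attribute is present on every element and that pagerank is always an integer, neither of which is guaranteed by the description above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# The four ICRA status values listed in the attribute description.
ICRA_STATUSES = {"red", "yellow", "green", "error"}


@dataclass
class DocumentRecord:  # hypothetical name, for illustration only
    url: str
    users: int
    tags: int
    pagerank: int
    icra: str
    http_response_time: float


def parse_document(elem):
    """Convert the attributes of a <document> element into a DocumentRecord."""
    attrs = elem.attrib
    icra = attrs["icra"]
    if icra not in ICRA_STATUSES:
        raise ValueError(f"unexpected icra value: {icra!r}")
    return DocumentRecord(
        url=attrs["url"],
        users=int(attrs["users"]),
        tags=int(attrs["tags"]),
        pagerank=int(attrs["pagerank"]),  # assumption: always an integer
        icra=icra,
        http_response_time=float(attrs["http_response_time"]),
    )


sample = ET.fromstring(
    '<document url="http://www.example.com/" users="33" tags="9" '
    'pagerank="8" icra="red" http_response_time="0.541234" />'
)
record = parse_document(sample)
print(record)
```

If the corpus turns out to contain documents with missing attributes, attrs.get(...) with an explicit default would be the safer choice.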

Tags

Each <document> may contain up to 25 <tag> elements, which represent the del.icio.us common tags for the document. del.icio.us limits common tags to 25 per URL, so the full list of tags attached to a document may be (much) longer than 25. Only the common tags were retrieved, rather than all tags, because of technical restrictions. Note that the number of <tag> elements of a document always equals the value of the <document>’s tags attribute (see above).

name 
Name of the tag, e.g. “news” or “delicious”
weight 
Weight of the tag (1…5) as returned by del.icio.us; higher values denote higher popularity of the tag
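
The constraints described in this section (at most 25 <tag> elements, a count matching the tags attribute, and weights from 1 to 5) can be checked mechanically. The sketch below is one possible way to do so; check_tags is a hypothetical helper, and the two small documents are made-up test inputs rather than corpus data.

```python
import xml.etree.ElementTree as ET

MAX_COMMON_TAGS = 25  # del.icio.us limit on common tags per URL


def check_tags(doc):
    """Return a list of constraint violations for one <document> element."""
    problems = []
    tag_elems = doc.findall("tag")
    declared = int(doc.get("tags"))

    if len(tag_elems) != declared:
        problems.append(
            f"tags attribute is {declared} but {len(tag_elems)} <tag> elements found"
        )
    if len(tag_elems) > MAX_COMMON_TAGS:
        problems.append(f"more than {MAX_COMMON_TAGS} <tag> elements")
    for tag in tag_elems:
        weight = int(tag.get("weight"))
        if not 1 <= weight <= 5:
            problems.append(f"tag {tag.get('name')!r} has weight {weight}")
    return problems


# Made-up inputs: one consistent document and one that violates two rules.
good = ET.fromstring(
    '<document url="http://www.example.com/" tags="2">'
    '<tag name="dmoz" weight="5" /><tag name="research" weight="3" />'
    "</document>"
)
bad = ET.fromstring(
    '<document url="http://www.example.com/" tags="3">'
    '<tag name="dmoz" weight="7" />'
    "</document>"
)

print(check_tags(good))
print(check_tags(bad))
```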

Example

<documents>
    <document url="http://www.example.com/" users="33" tags="9" pagerank="8" icra="red" http_response_time="0.541234">
        <tag name="delicious" weight="1" />
        <tag name="dmoz100k06" weight="4" />
        <tag name="dmoz" weight="5" />
        <tag name="google" weight="1" />
        <tag name="icra" weight="1" />
        <tag name="information retrieval" weight="1" />
        <tag name="research" weight="3" />
        <tag name="web2.0" weight="1" />
        <tag name="viIsBetterThanEmacs" weight="1" />
    </document>
    …
    …
    …
</documents>
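
With roughly 100,000 <document> elements, it can be convenient to stream the file instead of loading it into memory at once. The following is a minimal sketch using Python's xml.etree.ElementTree.iterparse. The function name summarize and the inline sample (including the example.org and example.net entries) are made up for illustration; to run it on the real corpus, pass the path of the downloaded XML file instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample in the corpus format; not actual corpus data.
SAMPLE = b"""<documents>
    <document url="http://www.example.com/" users="33" tags="2" pagerank="8" icra="red" http_response_time="0.541234">
        <tag name="dmoz" weight="5" />
        <tag name="research" weight="3" />
    </document>
    <document url="http://www.example.org/" users="0" tags="0" pagerank="3" icra="green" http_response_time="0.2" />
    <document url="http://www.example.net/" users="4" tags="1" pagerank="6" icra="red" http_response_time="1.1">
        <tag name="research" weight="1" />
    </document>
</documents>"""


def summarize(source):
    """Stream over <document> elements and count ICRA statuses and tag names."""
    n_docs = 0
    icra_counts = Counter()
    tag_counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "document":
            continue  # <tag> children are handled when their parent ends
        n_docs += 1
        icra_counts[elem.get("icra")] += 1
        for tag in elem.iter("tag"):
            tag_counts[tag.get("name")] += 1
        elem.clear()  # drop the children and attributes to keep memory use low
    return n_docs, icra_counts, tag_counts


n_docs, icra_counts, tag_counts = summarize(io.BytesIO(SAMPLE))
print(n_docs, dict(icra_counts), dict(tag_counts))
```

The counters here ignore the weight attribute; weighting each tag by its weight would be a small change if popularity matters for the analysis.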

Download

Legal information

By downloading, you acknowledge that:

  • The data has been compiled for the purposes of illustration for scientific research.
  • The copyright holders retain ownership and reserve all rights.

Download: DMOZ100k06 data corpus (version without web document sources)

How to reference

When you use DMOZ100k06 for your own research, please cite my paper [1].

Feedback

Comments, questions and constructive feedback are always welcome. Just drop me a note.

References


Tags: acm, content rating, corpus, del.icio.us, delicious, dmoz, dmoz100k06, doceng, google, icra, metadata, pagerank, paper, papers, publication, publications, random sample, research, social bookmarking, social tagging, study, tagging, web2.0