Web Page Classification: An Exploratory Study of Internet Content Rating Systems

by Michael G. Noll on June 20, 2005 (last updated: September 21, 2007)

Published at HACK 2005 Conference, October 2005.

Abstract

The greatest use of the Internet and new online technologies today is for constructive purposes. However, the use of the same technologies to spread illegal and objectionable content has increased dramatically in recent years. Internet users have begun to protect themselves and their wards by using so-called web content filters, which allow access to legitimate content and block access to objectionable, illegal, and otherwise harmful content. Alongside active filtering technologies, which use heuristics, machine learning and similar techniques from the area of text and image classification to analyze web pages, there is the complementary category of passive content filters, which rely on (mostly voluntary) content rating systems to classify web pages. In recent years, content rating systems have received increased public attention and support, for example through initiatives such as the European Commission’s Safer Internet campaign. In this paper, we study the usage of Internet content rating systems in the context of web page classification, and in particular the filtering of pornographic web content. Based on 8,000,000 anonymized Internet requests collected over a 1-month period, we have tested more than 150,000 websites for the presence of content rating information, so-called content labels; a random subset of 5,000 websites has been manually classified and used as the basis for evaluating the classification performance of rating-dependent content filters for pornographic material in a real-world scenario. We show that the usage of Internet content rating systems is at best marginal and that, as a result, the classification performance of rating-dependent content filters is inadequate and their application is not yet recommended in practice.

1 Introduction

A recent web server survey1 counted more than 63 million active websites and an average increase of 1.2 million sites per month for 2005. The greatest use of the Internet and new online technologies today is for constructive purposes; however, the use of the same technologies to spread illegal and objectionable content has increased dramatically in recent years. Internet users have begun to protect themselves and their wards by using so-called web content filters, which allow access to legitimate content and disallow access to objectionable, illegal, pornographic, and otherwise problematic content; for example, parents can use filtering software such as NetNanny or CyberSitter to safeguard their children from harmful websites. The problem of unwanted e-mails (unsolicited commercial e-mails, spam) has received increased public attention, and appropriate tools for filtering spam e-mails have been steadily integrated into the Internet’s communication infrastructures and end-user applications. The area of web content filtering, by contrast, is still in its infancy. To improve this situation, various international, governmental and public initiatives have started campaigns such as the European Commission’s Safer Internet programme to increase public awareness of objectionable Internet content and to support the development of technologies and frameworks to tackle harmful material on the World Wide Web.

One prominent approach is the usage of rating systems for Internet content, similar to rating systems such as MPAA’s2 for movies or ESRB’s3 for computer software and games. Like many of these, most of the existing Internet content rating systems are legally voluntary. Interested content providers can manually classify their content with a common description framework and add the rating information in the form of digital content labels to their websites. Internet users can then use filtering software to allow or disallow access to websites based on this meta information. Obviously, the availability of such content labels makes the filtering task per se rather trivial and theoretically more reliable than heuristic methods for content classification. For the rest of this paper, we will use the term “rating-dependent content filter” for filtering software that relies only on this rating information to make filtering decisions.

Rating systems for Internet content sound promising on paper. But the viability and the success of content rating systems depend heavily on the actual usage of these systems by the involved parties, in particular those responsible for providing rating information. To the best of our knowledge, the work in this paper is the first study analyzing the availability and trustworthiness of content rating information on the Internet. We show that the usage of Internet content rating systems is at best marginal and, using the example of filtering pornographic websites based on rating information, that the resulting classification performance of rating-dependent content filters is inadequate and their application is not recommended in practice.

2 Related Work

The discussion about Internet content filtering has always been accompanied by censorship and privacy concerns. Recent studies [1] have shown that filtering technologies are one of the tools used by governments to restrict access to “inappropriate” Internet content. On the other hand, end users themselves have expressed their need for filtering technologies; for example, parents request better technical tools for protecting their children on the Internet [6], in particular for filtering pornography [5]. In this context, Ho and Lui analyzed the factors affecting the acceptance of Internet content filters, such as perceived usefulness [3].

In an attempt to promote self-regulation of Internet content, rating systems have been introduced to help users control which content they do and do not want to see on the World Wide Web. Some related research has discussed the benefits and drawbacks of content rating systems in general [4], [8], with most of the authors favoring voluntary rating systems [2]. To the best of our knowledge, the work in this paper is the first study analyzing the availability and trustworthiness of content rating information on the Internet.

3 Internet Content Rating Systems

3.1 Overview

Internet content rating systems define special metadata to describe web content, so-called content labels. The creation of this metadata is generally performed on a voluntary basis by the content providers themselves, who also technically integrate the rating information into their websites. Another, though less common, scenario involves third parties in the role of the content rating institution, who classify content on behalf of others and provide this rating information on request.

Most of the existing Internet content rating systems are based on PICS, the Platform for Internet Content Selection4. PICS enables metadata to be associated with Internet content and promotes voluntary self-rating of online material [7]. It was originally designed to help parents and teachers control what children access on the Internet, and it is a platform on which other rating services and filtering software have been built.

The most prominent content rating system on the Internet today is developed and maintained by the Internet Content Rating Association (ICRA)5, an independent non-profit organization established in 1999 by a group of international Internet companies and associations6. ICRA has been supported by the European Commission’s Safer Internet Action Plan7 and has participated in several EU-funded projects in the field of Internet security with a focus on content filtering. ICRA’s current rating system is based on PICS, but a successor using RDF (Resource Description Framework) is under development. The cornerstone of the rating system is the ICRA vocabulary8, which defines a set of descriptors9 used to classify online content. The vocabulary covers nudity and sexual content, violence, language, chat facilities, and other topics such as gambling, drugs, and alcohol. A selection of ICRA descriptors is listed in Table 1. In this paper, we focus on the ICRA content rating system for our studies.

Descriptor   Meaning
na 1         Erections and female genitals in detail
nd 1         Female breasts
ng 1         Obscure or implied sexual acts
nr 1         Appears in an artistic context and is suitable for young children
va 1         Sexual violence / rape
ve 1         Killing of human beings
vk 1         Deliberate damage to objects
lb 1         Crude words or profanity
oc 1         Promotion of drug use

Table 1: Selected ICRA content descriptors

3.2 Rating and filtering content

To rate material under their control, content providers use an online web form provided by the ICRA and check which of the (currently 45) elements in the ICRA vocabulary are present on or absent from their websites. At the end of this process, the ICRA content label is automatically generated and can be integrated into the content providers’ websites. The following label could be used to rate the content available at the LIASIT website, http://www.liasit.lu, and would be put into the HTML head section of every LIASIT web page for which it is valid.

The fictitious label describes the content of LIASIT’s website as rather innocuous:

  • No elements listed in the category “Nudity and sexual material”
  • No elements listed in the category “Violence”
  • No elements listed in the category “Language”
  • No elements listed in the category “Other topics”

In our example, we generated the most common kind of label, whose scope is valid for the whole website (“gen true”, an abbreviation of “generic true”) and not only for specific pages. After this procedure, the rating of the LIASIT website would be complete and we could use the ICRA label tester application to verify the (technical) correctness of the content label.
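To illustrate what filtering software has to extract from such a label, the following Python sketch parses a PICS-style label string into its scope flag, the rated URL, and the descriptor values. The label text, the rating service URL (http://example.org/ratings), the rated site, and the function name parse_label are assumptions made for this example; only the descriptor codes are taken from Table 1. It is a minimal sketch of the idea, not the ICRA label tester, and it handles only a single label of the simple form shown.

```python
import re

# Hypothetical label, loosely modelled on the PICS label syntax. The rating
# service URL and the rated site are placeholders, not real ICRA output.
EXAMPLE_LABEL = (
    '(pics-1.1 "http://example.org/ratings" '
    'l gen true for "http://www.example.org" '
    'r (nd 1 lb 1))'
)


def parse_label(label):
    """Return the scope flag, the rated URL and the descriptor values."""
    generic = re.search(r'\bgen\s+(true|false)\b', label)
    target = re.search(r'\bfor\s+"([^"]+)"', label)
    ratings = re.search(r'\br\s*\(([^)]*)\)', label)
    if ratings is None:
        raise ValueError("label has no ratings section")
    tokens = ratings.group(1).split()
    if len(tokens) % 2 != 0:
        raise ValueError("ratings must be descriptor/value pairs")
    return {
        # "gen true" means the label applies to the whole website.
        "generic": generic is not None and generic.group(1) == "true",
        "url": target.group(1) if target else None,
        "ratings": {
            tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)
        },
    }


parsed = parse_label(EXAMPLE_LABEL)
print(parsed["generic"], parsed["url"], parsed["ratings"])
```

A regular expression is enough for this simple form; a filter meant for real-world labels would need a proper parser for the full label syntax.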

Using this rating information, content filtering software can allow or block access to labeled websites based on the user’s preferences via a simple matching process. Similar to the labeling process performed by the content providers, users employ the rating vocabulary to specify which types of content are deemed appropriate or inappropriate for them. If the parents of a 10-year-old girl wanted to protect her from online pornography, they might decide to configure a rating-dependent content filter in such a way that it would block access to any website with content in the “Nudity and sexual material” category.
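The matching process can be sketched in a few lines of Python. The descriptor codes below are taken from Table 1; the choice of which codes form the parents’ block list, the dictionary representation of a label, and the function name is_blocked are assumptions made for illustration only.

```python
# Descriptor codes from Table 1 that the parents in the example might block.
# Which codes make up a user's block list is an assumption of this sketch.
BLOCKED = {"na", "nd", "ng"}


def is_blocked(ratings, blocked):
    """Block a labeled site if any blocked descriptor is marked as present."""
    return any(ratings.get(code, 0) == 1 for code in blocked)


print(is_blocked({"nd": 1, "lb": 1}, BLOCKED))  # True: "nd" is on the block list
print(is_blocked({"lb": 1}, BLOCKED))           # False: only language is rated
```

The filter never inspects the page content itself; its decision is only as good as the label supplied by the content provider.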

3.3 Dealing with unrated content

An ongoing point of discussion is how to deal with unrated content and which types of websites rating systems should focus on. The first and intuitive approach is that mainly unsuitable websites need to rate their content because the general goal of rating systems is to protect users from unwanted material. On the other hand, no regulatory jurisdiction can impose a rating system on content outside of its control, e.g. material from another country10, and criminal content providers are unlikely to care about content rating at all [2]. The second approach therefore argues that it is rather the legitimate websites that need to use rating systems in order to express their “innocence”. While the first alternative supports a policy of allowing access to unrated content, the second implies a policy of denying access to unrated content since it cannot be trusted. In this paper, we have studied the consequences of both policies for dealing with unrated content with regard to classification performance.
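The two policies differ only in the decision taken when no label is available, which the following Python sketch makes explicit. The function name decide, the allow_unrated flag, and the use of None to represent an unlabeled website are assumptions made for this example; the descriptor codes come from Table 1.

```python
def decide(ratings, blocked, allow_unrated):
    """Return True if access is allowed.

    ratings is a dict of descriptor values, or None for an unlabeled site.
    """
    if ratings is None:
        # Unrated content: the outcome depends purely on the chosen policy.
        return allow_unrated
    return not any(ratings.get(code, 0) == 1 for code in blocked)


blocked = {"na", "nd", "ng"}
print(decide(None, blocked, allow_unrated=True))       # True: first policy
print(decide(None, blocked, allow_unrated=False))      # False: second policy
print(decide({"nd": 1}, blocked, allow_unrated=True))  # False: labeled and blocked
```

When only a small share of websites carries a label, almost every decision falls into the first branch, so the choice of policy largely determines the behavior of the filter.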