My paper Exploring Social Annotations for Web Document Classification has been accepted for publication and presentation at the Semantic Web track of this year’s ACM Symposium on Applied Computing (SAC), which will be held in Fortaleza, Ceará, Brazil, from March 16-20, 2008.
Social annotation via collaborative tagging describes the process by which many users add metadata in the form of unstructured keywords to shared content. The recent success of web services with a tagging component, such as del.icio.us or Flickr, has provided a plethora of user-supplied metadata about web content for everyone to leverage.
In this paper, we explore and analyze social annotations and tagging with regard to their usefulness for web document classification. We are interested in finding out which kinds of documents are annotated more by end users than others, how users tend to annotate these documents, and in particular how this user-generated folksonomy compares with a top-down taxonomy maintained by classification experts for the same set of documents. We describe what can be deduced from the results for further research and development in the areas of document classification and information retrieval. Our work is based on large sets of real-world data, comprising a random sample of 100,000 web documents combined with data retrieved from the social bookmarking service del.icio.us, the Open Directory catalogue, and the search engine Google. The data set of our experiments is freely available for research.
- M. G. Noll, C. Meinel, Exploring Social Annotations for Web Document Classification (PDF), Proceedings of the 23rd International ACM Symposium on Applied Computing, Fortaleza, Ceará, Brazil, March 2008, pp. 2315-2320, ISBN 978-1-59593-753-7 (ACM Link, BibTeX)
- List of my publications
- Authors vs. Readers: A Comparative Study of Document Metadata and Content in the WWW, Proceedings of the 7th Int’l ACM Symposium on Document Engineering (ACM DocEng), Winnipeg, Canada, August 2007, pp. 177-186, ISBN 978-1-59593-776-6
- DMOZ100k06, a large research data set about document metadata based on a random sample of 100,000 web documents