Interested in users’ online queries? Ever wanted to cluster similar users or mine their data? Wait no more: AOL’s research team has published a huge data collection of 20,000,000 search queries from 650,000 users, sampled over three months, for the public to see, dig around in, and analyze.

Note: The AOL500k data set is no longer available from AOL directly. Instructions on how to download it are given near the end of this post.

According to the AOL research wiki (this link is not working anymore):

This collection consists of ~20M web queries collected from ~650k users over three months. Where the data is sorted by anonymized user id:

The data set includes (UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl).

The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research.

The data download is a 439 MB TGZ file in which the AOL screen names of the users have been obfuscated by name randomization. Please note that the file may (and does) contain sexually explicit data. The collection is distributed for non-commercial research use only (commercial usage is prohibited). Interested researchers can add their comments to the U500k community.

Data format

Here is an example entry of the user queries logfile, taken from user-ct-test-collection-01.txt:

AnonID  Query           QueryTime            ItemRank   ClickURL
8760    jojo the singer 2006-03-26 16:02:04  5          http://www.jojofan.com
8760    jennifer lopez  2006-03-26 16:05:29  4          http://www.allstarz.org
8760    jennifer lopez  2006-03-26 16:05:29  10         http://www.starpulse.com
8760    nicole richie   2006-03-26 17:28:58
8760    free porn       2006-03-28 16:43:16
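
To make the format concrete, here is a minimal Python sketch of a reader for these log files. It assumes the fields are tab-separated and that each file starts with the header line shown above; the sample here is displayed with aligned columns, so check the delimiter against your copy. The names LogEntry and read_entries are made up for this illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Optional, TextIO


@dataclass
class LogEntry:
    anon_id: int
    query: str
    query_time: datetime
    item_rank: Optional[int]  # None when the user did not click a result
    click_url: Optional[str]  # None when the user did not click a result


def read_entries(handle: TextIO) -> Iterator[LogEntry]:
    """Yield one LogEntry per data line, skipping the header line."""
    # Assumption: fields are tab-separated and never quoted.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the "AnonID Query QueryTime ..." header
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        row = row + [""] * (5 - len(row))  # pad query-only lines to five fields
        anon_id, query, query_time, item_rank, click_url = row[:5]
        yield LogEntry(
            anon_id=int(anon_id),
            query=query,
            query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
            item_rank=int(item_rank) if item_rank else None,
            click_url=click_url or None,
        )


# Two of the sample lines from above, rewritten with tab separators.
sample = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "8760\tjojo the singer\t2006-03-26 16:02:04\t5\thttp://www.jojofan.com\n"
    "8760\tnicole richie\t2006-03-26 17:28:58\t\t\n"
)
entries = list(read_entries(io.StringIO(sample)))
for entry in entries:
    print(entry.anon_id, entry.query, entry.item_rank, entry.click_url)
```

For a real file you would pass an open file handle instead of the in-memory sample, for example open("user-ct-test-collection-01.txt", newline="").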

If you ask me, it’s of course very interesting for scientists to get their hands on such a large sample of real-world user data. But I’m still completely flabbergasted that AOL published this data at all. Remember the big discussion in January when the US government demanded search data from companies such as Google? Even though AOL has obfuscated the data, there will still be a lot of valuable (read: sensitive) information in the sample file. I wonder whether AOL’s privacy policy has been violated by this publication.

Data description (README)

Here’s the content of the original U500k_README.txt:

500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
        AnonID - an anonymous user ID number.
        Query  - the query issued by the user, case shifted with
                 most punctuation removed.
        QueryTime - the time at which the query was submitted for search.
        ItemRank  - if the user clicked on a search result, the rank of the
                    item on which they clicked is listed.
        ClickURL  - if the user clicked on a search result, the domain portion of
                    the URL in the clicked result is listed.

Each line in the data represents one of two types of events:
        1. A query that was NOT followed by the user clicking on a result item.
        2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns.  For click through events, the query that preceded the click through is included.  Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events.  Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.

CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA!  Please be aware that these queries are not filtered to remove any content.  Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material.  There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE.  This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms.  If you are offended by sexually explicit language you should not read through this data.  Also be aware that in some states it may be illegal to expose a minor to this data.  Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

Basic Collection Statistics
Dates:
  01 March, 2006 - 31 May, 2006

Normalized queries:
  36,389,567 lines of data
  21,011,340 instances of new queries (w/ or w/o click-through)
   7,887,022 requests for "next page" of results
  19,442,629 user click-through events
  16,946,938 queries w/o user click-through
  10,154,742 unique (normalized) queries
     657,426 unique user ID's


Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson,  "A Picture of Search"  The First
International Conference on Scalable Information Systems, Hong Kong, June,
2006.

Copyright (2006) AOL
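
The two event types described in the README can be told apart by whether the click columns are filled. The following Python sketch tallies them for a sequence of rows. The labels new_query, repeated_later and additional_click come from a simple heuristic of my own (compare each line with the preceding one), not from AOL’s documented counting method, so do not expect it to reproduce the README statistics exactly. One thing that can be checked directly from the README numbers: click-through events plus queries without click-through add up to the total number of lines (19,442,629 + 16,946,938 = 36,389,567).

```python
from collections import Counter
from typing import Iterable, Tuple

# (AnonID, Query, QueryTime, ItemRank, ClickURL); the last two fields are
# empty strings for query-only lines.
Row = Tuple[str, str, str, str, str]


def tally_events(rows: Iterable[Row]) -> Counter:
    """Count event types, assuming rows are sorted by user and then by time."""
    counts: Counter = Counter()
    previous = None  # (AnonID, Query, QueryTime) of the preceding line
    for anon_id, query, query_time, item_rank, click_url in rows:
        counts["lines"] += 1
        # README: click-through lines fill all five columns.
        counts["click_through" if click_url else "query_only"] += 1
        if previous is None or previous[:2] != (anon_id, query):
            counts["new_query"] += 1
        elif query_time > previous[2]:
            # Same user and query with a later time stamp: a candidate
            # "next page" request (heuristic, it could also be a re-submission).
            counts["repeated_later"] += 1
        else:
            # Same user, query and time stamp: treated here as a further
            # click on the same result list.
            counts["additional_click"] += 1
        previous = (anon_id, query, query_time)
    return counts


# The first four sample lines shown in the "Data format" section.
rows = [
    ("8760", "jojo the singer", "2006-03-26 16:02:04", "5", "http://www.jojofan.com"),
    ("8760", "jennifer lopez", "2006-03-26 16:05:29", "4", "http://www.allstarz.org"),
    ("8760", "jennifer lopez", "2006-03-26 16:05:29", "10", "http://www.starpulse.com"),
    ("8760", "nicole richie", "2006-03-26 17:28:58", "", ""),
]
counts = tally_events(rows)
print(dict(counts))
```

Time stamps are compared as strings, which works because the YYYY-MM-DD HH:MM:SS format sorts chronologically.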

Update 08-Aug-2006: Seems like the wiki links are currently down.

Update 22-Aug-2006: The Chief Technology Officer of AOL, Maureen Govern, has resigned, and two other employees have been fired by AOL after the data “leak”. AOL’s corporate website has already removed Govern’s previous entry under “Other Corporate Operations” in the “Who’s Who” section.

Update 18-Jul-2007: You can still find the original AOL500k data set on mirror sites on the Internet. Check the MD5 checksum, though, to make sure you have a genuine copy.

The MD5 hash of the “full” TGZ file (439 MB) as published by AOL is as follows. The actual filename can differ depending on where you download it, of course. 500kusers.tgz is the original filename.

31cd27ce12c3a3f2df62a38050ce4c0a  500kusers.tgz

The MD5 hashes of the individual data set files are as follows:

d51e959cf4586b7f1664aa955045bd1f  user-ct-test-collection-01.txt
f14e7abcd259f3628b4c03213172bdea  user-ct-test-collection-02.txt
d4258d61bde74c05dfc77562dd8d3dfc  user-ct-test-collection-03.txt
5ed4cb71cb682e1cb701ff334b6b7d38  user-ct-test-collection-04.txt
8fc0eb2dcb8294eb1dc413e072a26efa  user-ct-test-collection-05.txt
9b66c339f574c45cf6887a4db50d4e69  user-ct-test-collection-06.txt
9cd709fa646fdb4308b7804ea171b88a  user-ct-test-collection-07.txt
e6c9a8ecce884a4fd8c63cecbacc0c9c  user-ct-test-collection-08.txt
a1f1812da837bdeb25f43db2a32c7d80  user-ct-test-collection-09.txt
1bf390ab0312a3d9fa1a1acff25deb74  user-ct-test-collection-10.txt
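
If you do not have an md5sum tool at hand, the check can be done with a few lines of Python. This is a minimal sketch using the standard hashlib module; the function names md5_of_file and matches_md5 are made up for this example, and the path 500kusers.tgz assumes the archive sits in the current directory under its original name.

```python
import hashlib
from pathlib import Path

# MD5 of the full archive as listed above.
EXPECTED_ARCHIVE_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()


archive = Path("500kusers.tgz")  # assumed location; adjust to your download
if archive.exists():
    print("genuine copy" if matches_md5(archive, EXPECTED_ARCHIVE_MD5) else "MD5 mismatch")
```

The same two functions work for the individual user-ct-test-collection files; just pass the matching hash from the list above.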

Update 19-Sep-2007: Admittedly, I could have added the following snippet a bit earlier. The following statement is from Andrew Weinstein, AOL Spokesman. He posted the linked message to several blogs like TechCrunch; however, I couldn’t find an official AOL reference to it. His statement gives some further details on the AOL500k data set.

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

Here was what was mistakenly released:

* Search data for roughly 658,000 anonymized users over a three month period from March to May.

* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.

* The searches included as part of this data only included U.S. searches conducted within the AOL client software.

We apologize again for the release.

Andrew Weinstein (AOL Spokesman)
