# AOL Research Publishes 650,000 User Queries

Interested in users’ online queries? Ever wanted to cluster similar users or mine their data? Wait no more: AOL’s research team has published a huge data collection of 20,000,000 search queries from 650,000 users, sampled over three months, for the public to see, dig around in, and analyze.

Note: The AOL500k data set is no longer available from AOL directly. Instructions on how to download it are given near the end of this post.

According to the AOL research wiki (this link is not working anymore):

This collection consists of ~20M web queries collected from ~650k users over three months. Where the data is sorted by anonymized user id:

The data set includes (UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl).

The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research.
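If you want to start digging, the five fields listed above map naturally onto a small record type. The following is only a minimal sketch: it assumes the log files are tab-separated text with one query per line and an optional header row, and that `ClickedRank` and `DestinationDomainUrl` are empty when the user did not click a result. The names `QueryRecord`, `parse_query_log`, and `has_header` are made up for this illustration, and the sample line is synthetic, not taken from the data set.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One log entry; field names follow the list in the wiki text."""
    user_id: str
    query: str
    query_time: str                        # kept as text; timestamp format not assumed
    clicked_rank: Optional[int]            # None when no result was clicked
    destination_domain_url: Optional[str]  # None when no result was clicked


def parse_query_log(lines: Iterable[str], has_header: bool = True) -> Iterator[QueryRecord]:
    """Yield QueryRecord objects from tab-separated lines (assumed layout)."""
    # QUOTE_NONE: queries may contain quote characters that must stay literal.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for index, row in enumerate(reader):
        if has_header and index == 0:
            continue  # skip the column names
        if not row:
            continue  # skip blank lines
        row = row + [""] * (5 - len(row))  # pad rows with missing click columns
        user_id, query, query_time, rank, url = row[:5]
        yield QueryRecord(
            user_id=user_id,
            query=query,
            query_time=query_time,
            clicked_rank=int(rank) if rank.strip().isdigit() else None,
            destination_domain_url=url.strip() or None,
        )


# Synthetic demo input, NOT real data from the collection.
demo = io.StringIO(
    "UserID\tQuery\tQueryTime\tClickedRank\tDestinationDomainUrl\n"
    "1\texample query\t2006-03-01 12:00:00\t\t\n"
    "1\texample query\t2006-03-01 12:01:00\t2\thttp://www.example.com\n"
)
for record in parse_query_log(demo):
    print(record)
```

Because the data is sorted by user ID, reading it as a stream like this lets you group consecutive records per user without loading a whole file into memory.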

The data download is a 439 MB TGZ file in which the AOL screen names of the users have been obfuscated by name randomization. Please note that the file may (and does) contain sexually explicit data. The collection is distributed for non-commercial research use only (commercial usage is prohibited). Interested researchers can add their comments to the U500k community.

# Data format

Here is an example entry of the user queries logfile, taken from user-ct-test-collection-01.txt:

If you ask me, it’s of course very interesting for scientists to get their hands on such a large sample of real-world user data. But I’m still completely flabbergasted that AOL published this data at all. Remember the big discussion in January when the US government demanded search data from companies such as Google? Even though AOL has obfuscated the data, there will still be a lot of valuable (read: sensitive) information in the sample file. I wonder whether AOL’s privacy policy has been violated by this publication.

Here’s the content of the original U500k_README.txt:

Update 08-Aug-2006: Seems like the wiki links are currently down.

Update 22-Aug-2006: The Chief Technology Officer of AOL, Maureen Govern, and two other employees have been fired by AOL after the data “leak”. AOL’s corporate website has already removed Govern’s previous entry under “Other Corporate Operations” in the “Who’s Who” section.

Update 18-Jul-2007: You can still find the original AOL500k data set on mirror sites on the Internet. Pay attention to the MD5 checksum, though, to make sure you have a genuine copy.
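If you don’t have an `md5sum` tool at hand, checking a downloaded copy takes only a few lines of Python. This is a minimal sketch using the standard `hashlib` module; the helper names `md5_of_file` and `matches_published_md5` are made up for this illustration, and you have to supply the path of your own download and the published hash yourself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low even for a 439 MB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, published_hex: str) -> bool:
    """Compare a file's MD5 against a published hex string, ignoring case and whitespace."""
    return md5_of_file(path) == published_hex.strip().lower()
```

Call `matches_published_md5` with the local filename and the hash string you want to compare against. Keep in mind that a matching MD5 mainly guards against truncated or corrupted downloads; MD5 is not collision-resistant, so it is weak protection against deliberate tampering.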

The MD5 hash of the “full” TGZ file (439 MB) as published by AOL is as follows. The actual filename can differ depending on where you download it, of course. 500kusers.tgz is the original filename.

The MD5 hashes of the individual data set files are as follows:

Update 19-Sep-2007: Admittedly, I could have added the following snippet a bit earlier. The following statement is from Andrew Weinstein, AOL Spokesman. He posted the linked message to several blogs like TechCrunch; however, I couldn’t find an official AOL reference to it. His statement gives some further details on the AOL500k data set.

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

Here was what was mistakenly released:

* Search data for roughly 658,000 anonymized users over a three month period from March to May.

* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.

* The searches included as part of this data only included U.S. searches conducted within the AOL client software.

We apologize again for the release.

Andrew Weinstein (AOL Spokesman)
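The figures in the statement can be cross-checked with a bit of arithmetic. The sketch below uses only the numbers quoted above (658,000 users, 42.7 million unique visitors in May, roughly 20 million records, one third of one percent of all searches); the variable names are mine, and the derived values are rough estimates, not official AOL numbers.

```python
# Figures quoted in the statement above.
users_in_sample = 658_000
may_unique_visitors = 42_700_000
records_in_sample = 20_000_000
share_of_all_searches = (1 / 3) / 100  # "1/3 of one percent"

# Share of May search users covered by the sample.
user_share = users_in_sample / may_unique_visitors

# Total searches over the three months implied by the quoted share.
implied_total_searches = records_in_sample / share_of_all_searches

# Average number of records per sampled user.
records_per_user = records_in_sample / users_in_sample

print(f"Share of May users:       {user_share:.2%}")
print(f"Implied total searches:   {implied_total_searches / 1e9:.1f} billion")
print(f"Records per sampled user: {records_per_user:.1f}")
```

The user share comes out at about 1.54%, consistent with the “roughly 1.5%” in the statement. The quoted search share implies on the order of 6 billion searches on the AOL network over the three months, and the sample averages about 30 records per user.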