Interested in users’ online queries? Ever wanted to cluster similar users or mine their data? Wait no more: AOL’s research team has published a huge collection of 20,000,000 search queries from 650,000 users, sampled over three months, for the public to see, dig around in and analyze.
According to the AOL research wiki (this link is not working anymore):
This collection consists of ~20M web queries collected from ~650k users over three months. Where the data is sorted by anonymized user id:
The data set includes (UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl).
The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research.
The data download is a 439 MB TGZ file in which the AOL screen names of the users have been obfuscated by name randomization. Please note that the file may (and does) contain sexually explicit data. The collection is distributed for non-commercial research use only (commercial usage is prohibited). Interested researchers can add their comments to the U500k community.
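To get a feel for the five fields listed above, here is a minimal Python sketch of a reader for such a log. It assumes the files are tab-separated with one record per line in the order (UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl), and that the last two fields are empty when the user did not click a result. The `QueryRecord` class, the `read_records` function and the sample rows are made up for illustration and are not part of the AOL distribution.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class QueryRecord:
    user_id: str
    query: str
    query_time: str
    clicked_rank: Optional[int]      # None when no result was clicked (assumption)
    destination_url: Optional[str]   # None when no result was clicked (assumption)


def read_records(handle: TextIO, has_header: bool = False) -> Iterator[QueryRecord]:
    """Yield one QueryRecord per tab-separated line (assumed layout)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip a column-name line if the file has one
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        row = row + [""] * (5 - len(row))  # pad missing click fields
        user_id, query, query_time, rank, url = row[:5]
        yield QueryRecord(
            user_id=user_id,
            query=query,
            query_time=query_time,
            clicked_rank=int(rank) if rank.isdigit() else None,
            destination_url=url or None,
        )


# Invented sample rows, NOT taken from the real data set.
sample = (
    "u1\tcheap flights\t2006-03-01 10:00:00\t1\thttp://example.com\n"
    "u1\tweather\t2006-03-01 10:05:00\t\t\n"
)
records = list(read_records(io.StringIO(sample)))
print(len(records), records[0].clicked_rank, records[1].destination_url)
```

Because the data is sorted by anonymized user id, a reader like this can be combined with `itertools.groupby` on `user_id` to process one user's query history at a time without loading the whole collection into memory.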
Here is an example entry of the user queries logfile, taken from
Data description (README)
Here’s the content of the original U500k_README.txt:
Update 08-Aug-2006: Seems like the wiki links are currently down.
Update 22-Aug-2006: AOL’s Chief Technology Officer, Maureen Govern, and two other employees have been fired by AOL after the data “leak”. AOL’s corporate website has already removed Govern’s previous entry under “Other Corporate Operations” in the “Who’s Who” section.
Update 18-Jul-2007: You can still find the original AOL500k data set on mirror sites on the Internet. Check the MD5 checksum, though, to make sure you have a genuine copy.
The MD5 hash of the “full” TGZ file (439 MB) as published by AOL is as follows. The actual filename can differ depending on where you download it, of course.
500kusers.tgz is the original filename.
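If you want to check a mirrored copy yourself, the following Python sketch computes the MD5 hash of a file in chunks, so the 439 MB archive does not have to fit in memory, and compares it with a published value. The function names `md5_of_file` and `matches_published_md5` are hypothetical helpers, and the expected hash is passed in as a parameter: take it from the published checksum, not from this snippet.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with a published hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage (the path depends on where you saved the download):
# print(md5_of_file("500kusers.tgz"))
```

A matching MD5 only tells you that the copy is identical to the file the hash was computed from; it is a check against corrupted or altered mirrors, not a security guarantee.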
The MD5 hashes of the individual data set files are as follows:
Update 19-Sep-2007: Admittedly, I could have added the following snippet a bit earlier. The following statement is from Andrew Weinstein, AOL Spokesman. He posted the linked message to several blogs like TechCrunch; however, I couldn’t find an official AOL reference to it. His statement gives some further details on the AOL500k data set.
This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.
Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.
Here was what was mistakenly released:
* Search data for roughly 658,000 anonymized users over a three month period from March to May.
* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.
* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.
* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.
* The searches included as part of this data only included U.S. searches conducted within the AOL client software.
We apologize again for the release.
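The percentages in the statement are easy to check with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted above; the implied total number of searches is derived from the “1/3 of one percent” remark and is not a number AOL stated.

```python
# Figures quoted in the statement above.
sampled_users = 658_000
may_unique_visitors = 42_700_000
sampled_records = 20_000_000

# Share of May search users covered by the data set.
user_share = sampled_users / may_unique_visitors
print(f"{user_share:.2%}")  # prints 1.54%, i.e. "roughly 1.5%"

# If 20 million records are 1/3 of one percent (1/300) of all searches,
# the implied total over the three months is 300 times the sample.
implied_total_searches = sampled_records * 300
print(f"{implied_total_searches:,}")  # prints 6,000,000,000
```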
- Web Search Personalization via Social Bookmarking and Tagging, 6th Int’l Semantic Web Conference (ISWC) & 2nd Asian Semantic Web Conference (ASWC), Busan, South Korea, November 2007
- AOL500k mirror sites