Michael G. Noll


GooDiff Project Up'n'running

Alexandre Dulaunoy and I have finally been able to put our GooDiff website online after I fixed the last problems with the Python monitor script and its integration into the RCS backend. We’re still working on some parts, but most of it is done and tested. Maybe we should put a “beta” flag on it? Seems like it’s not possible to launch a service without that buzzword nowadays.

The GooDiff idea came up during a discussion about Google’s mail service, Gmail. Alexandre found out that the Google Bot was crawling a “honeypot” web page on his website, which he had created only to test whether hyperlinks in Gmail are analyzed by Google (needless to say, the hyperlink’s address was very cryptic and not published anywhere else). Alex had sent an email with the prepared hyperlink to his Gmail account and had clicked on the link from within Gmail. As it turned out, Google not only processed the test email to show advertisements but apparently also fed the included hyperlink to the Google search bot.
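For readers who want to try a similar test, here is a minimal sketch of the idea in Python. It is not Alexandre's actual setup: the helper names `make_honeypot_path` and `honeypot_hits`, the path format, and the sample log lines are all assumptions made up for illustration.

```python
import secrets


def make_honeypot_path() -> str:
    """Build an unguessable path that is never published or linked anywhere."""
    return f"/{secrets.token_urlsafe(24)}.html"


def honeypot_hits(log_lines, path):
    """Return the access-log lines that mention the honeypot path."""
    return [line for line in log_lines if path in line]


path = make_honeypot_path()

# Hypothetical log lines in a combined-log style; the addresses, dates and
# user agents are invented sample data, not real observations.
sample_log = [
    f'192.0.2.10 - - [01/Jan/2006:10:00:00 +0000] "GET {path} HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.11 - - [01/Jan/2006:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    f'192.0.2.12 - - [01/Jan/2006:12:00:00 +0000] "GET {path} HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

for line in honeypot_hits(sample_log, path):
    print(line)
```

Since the path is random and appears only in the test email, any hit that does not come from your own browser is worth a closer look.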

So we asked ourselves if this was compliant with Gmail’s privacy policy (GooDiff’d Gmail privacy policy as of March 30, 06). There were (are) several sections about which Gmail data is processed and how. Read the relevant sections and judge for yourself:

Personal information


  • When you use Gmail, Google’s servers automatically record certain information about your use of Gmail. Similar
    to other web services, Google records information such as account activity (including storage usage, number of
    log-ins), data displayed or clicked on (including UI elements, ads, links); and other log information (including
    browser type, IP-address, date and time of access, cookie ID, and referrer URL).


  • Google maintains and processes your Gmail account and its contents to provide the Gmail service to you and to
    improve our services. The Gmail service includes relevant advertising and related links based on the IP address,
    content of messages and other information related to your use of Gmail.

  • Google’s computers process the information in your messages for various purposes, including formatting and
    displaying the information to you, delivering advertisements and related links, preventing unsolicited bulk email
    (spam), backing up your messages, and other purposes relating to offering you Gmail.

  • Google may send you information related to your Gmail account or other Google services.

We guessed it could be the part “records information such as […] data clicked on (including UI elements, ads, links)”. But would you intuitively realize that when you send an email to your spouse containing the private link to your web gallery with honeymoon photos, the pictures will end up in Google’s search index, and thus in Google Cache?

The Google document More on Gmail and Privacy has more descriptive explanations than the aforementioned privacy policy, so we checked its section Scanning email content.

Google scans the text of Gmail messages in order to filter spam and detect viruses, just as all major webmail services
do. Google also uses this scanning technology to deliver targeted text ads and other related information. This is
completely automated and involves no humans.

When a user opens an email message, computers scan the text and then instantaneously display relevant information that
is matched to the text of the message. Once the message is closed, ads are no longer displayed. It is important to note
that the ads generated by this matching process are dynamically generated each time a message is opened by the user –
in other words, Google does not attach particular ads to individual messages or to users’ accounts.


On the other hand, delivering information gathered through email scanning to a third party would be a violation of
privacy. Google does not do this. Neither email content nor any personal information is ever shared with other parties
as a result of our ad-targeting process.

Hmm, reading the text above - in particular the last passage - reassured us that the hyperlinks you get from friends via email should not end up as Google Bot fodder. Isn’t putting web pages into Google’s search index some kind of “delivering information to third parties”?

Maybe it’s a good idea to think about what Google would do in our situation. Do you remember how Google’s very own CEO Eric Schmidt reacted when CNET journalist Elinor Mills retrieved personal information about him via Google search? Google was so upset that its first reaction was to blackball all CNET reporters for one year, even though the actual “boycott” only lasted for some weeks. All personal information about Schmidt was retrieved by using Google’s own search engine and index.

Regardless of whether Google is allowed to do that or not, and regardless of whether you think it’s morally correct or not, the first thing we did was to ask ourselves whether Google had changed the Gmail legal documents (privacy policy, terms of use, you name it) in the meantime, and whether a small change to the privacy policy would allow them to do it now. The bad thing is that it’s normally up to the end user to keep track of any changes to a service’s legal documents, and Gmail is neither an exception nor the only service doing so.

Although we may attempt to notify you via your Gmail address when major changes are made, you should visit this page
periodically to review the terms. Google may, in its sole discretion, modify or revise these terms and conditions and
policies at any time, and you agree to be bound by such modifications or revisions.

To stress it again: we first asked ourselves if it was our fault because we did not keep an eye on the Gmail legal documents.

That was the moment when I joked around and said, “Well, let’s put the privacy policy in Subversion [a revision control system for computer code] and see if it detects any changes over time”. Though we were laughing at first, we started to think seriously about the idea. Later that evening, Alexandre and I were talking about how to set up such a service and we began working on the required software for it. So it came to pass that the GooDiff project was born.
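To give a rough feeling for the core idea, here is a minimal sketch in Python that compares two snapshots of a document and prints what changed. It is not the GooDiff monitor script: it uses Python's standard `difflib` module instead of a revision control backend, and the function name `policy_diff` and the two sample snapshots are invented for illustration.

```python
import difflib


def policy_diff(old_text: str, new_text: str, name: str = "policy.txt") -> str:
    """Return a unified diff between two snapshots of a document.

    An empty string means the two snapshots are identical.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{name} (previous)",
        tofile=f"{name} (current)",
    )
    return "".join(diff)


# Hypothetical snapshots; a real monitor would fetch the page periodically
# and commit each snapshot to a revision control system.
previous = "We record links you click on.\nWe do not share email content.\n"
current = "We record links you click on.\nWe may share email content.\n"

changes = policy_diff(previous, current)
print(changes if changes else "No changes detected.")
```

A revision control system does the same comparison for you and also keeps the full history, which is why storing each snapshot there is attractive for this kind of monitoring.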

Note that the “Goo” in GooDiff is and was not a reference to Google. It is derived from Gray Goo in science-fiction literature, a term which refers to a hypothetical end-of-the-world event involving molecular nanotechnology in which out-of-control self-replicating robots (very little, but not very cute) consume all living matter on Earth while building more of themselves. The attentive reader should see some parallels here: robots vs. search bots, nanotechnology vs. small changes to documents, end of the world vs. misery of the end user, for example. We were also convinced that it was much better to build a service which is able to monitor not only one but a multitude of service providers - what if you’re using Yahoo! Messenger and not Google Talk?

For now, GooDiff seems to be running fine. If you have any comments or suggestions, feel free to drop me an email.

Update on April 6, 06: I’ve been notified about the first blog article about GooDiff. We’re quite anxious to get feedback on GooDiff, in particular because it’s been created for the benefit of every web user. After we fixed a problem with robots.txt caused by GooDiff’s Trac installation, some search engines are listing us, too. A Yahoo! search for GooDiff already shows us in 4th place!

Update on April 12, 06: Alexandre has started a topic in the Google Group for Gmail Help Discussion where he asks about Privacy and url(s) in the mail - Are they included in the Google public index ?.

Update on April 24, 06: We’re in 1st place on Google search for GooDiff. Sorry, Henry Goodiff!