The vision of the new, simple, beautiful world of information, where everyone can gather information from anywhere, select it, customise it, and reformat it (perhaps even read it sometimes) is a seductive one. At one end you have RSS or Atom feeds generating streams of news items, blog entries, or whatever; at the other end you have personalised home pages that display those streams, or stand-alone newsfeed-reading tools, or plug-in feed readers such as Sage for Firefox. The tools get cleverer and cleverer, both for formatting and for searching: a recent blog post by David Tebbutt shows a nice tool that takes a search term, passes it to Google’s blog search page, and creates a live window that shows a list of matching blog entries. “Try it,” he says.
So I did.
It doesn’t work.
I went to the blog post, clicked in the right place and typed “Slidey”. I expected to find some postings from Slidey’s Training Log. Instead, I got an error message.
What happened next is unrealistic. I emailed David. David emailed the person responsible for the tool he was using. The person responsible investigated and discovered the cause.
Why is this unrealistic? Because email doesn’t work in real life. Suppose that David’s blog has a thousand readers, or a million. Given how much he is charging us for his (free) blog, he cannot afford to field email requests for technical support even if the problem has nothing to do with him. Given how much he has paid for this (free) tool, the creator of the tool can’t afford to answer emails either, especially once the tool he created is a few years old and he has long since got bored and moved on to other things.
The vision of the new, simple, beautiful world of information is a seductive one. The vision of a new, simple, beautiful world of information as long as you don’t say the word “slidey” is less seductive. The paradox is this: the newness, the simplicity and the beauty all come from the fact that anyone can create these tools quickly and easily, and even give them away if he wants. Other people can then incorporate those tools into their own tools, and so on ad infinitum. Because no money has changed hands, the creator has no obligation to provide any sort of support, which makes it risk-free to release a useful-seeming tool, which in turn makes for an abundance of these tools.
The shining vision of the future only works if everything works: if one thing goes wrong, everything goes wrong. A future in which our information lives depend on an arbitrary collection of XML tools but we have no idea of whom to contact if something goes wrong – or even how to work out whose fault it is – is a future that will collapse under the weight of its own contradictions.
What actually happened
On 30 October 2006 someone calling himself Geisrud visited gizmodo.com’s web site and commented on an article called “PQI Creates World’s First Click-Style Retractable USB Flash Drive”. He said:
I think I prefer SanDisk’s slidey retractable method. Click thing is just waiting to get broken.
gizmodo.com offers an RSS feed for its site. In that RSS feed, the URL for this comment is:
When I said “slidey” to David’s tool and David’s tool asked Google’s blog search system to search for “slidey”, Google found this entry. Rather than returning its search results as a web page, it tried to return them as a newsfeed so that David’s tool could process them further.
The Atom standard for entries in feeds includes an item called <link>, which is intended to be the URL for information relevant to the feed entry: in Google’s case, <link> points to the original feed article, like this:
&#39; is what you put into your XML in order to represent a single-quote or apostrophe in real life (because single-quotes have a special meaning in XML). Using this code means that the “link” item will display correctly when you look at it in your browser.
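To see how a reader of the feed is supposed to treat that code, here is a minimal sketch in Python using the standard library's XML parser. The entry and its example.com URL are hypothetical stand-ins, not the real Gizmodo data, and I have left out the Atom namespace for brevity and assumed that the URL sits in an href attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: example.com stands in for the real article URL,
# and the Atom namespace is omitted to keep the sketch short.
ENTRY_XML = '<entry><link href="http://example.com/world&#39;s-first"/></entry>'


def link_href(entry_xml: str) -> str:
    """Return the href of the first <link> element, as decoded by the parser."""
    entry = ET.fromstring(entry_xml)
    return entry.find("link").get("href")


if __name__ == "__main__":
    # The parser turns &#39; back into a plain apostrophe.
    print(link_href(ENTRY_XML))
```

The point is that the XML code only exists in the raw text of the feed: once a parser has read it, the program sees an ordinary apostrophe.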
The Atom standard also includes an item called <id>, which identifies the feed entry uniquely, whether it happens to have an associated URL or not. There are no rules for <id> except that it should be valid XML and uniquely identify the entry. Google creates the <id> item based on the URL of the entry:
You’ll see that in the <id> item # has been replaced by %23. I suppose this is because Google wanted to have a valid URL for some reason, and # has a special meaning in URLs (it marks the start of a fragment), so it needed to be translated. The rules for translating special characters are different in XML and in URLs: XML uses ampersands and URLs use the % sign. What Google’s programmer should have done if he wanted a URL was to use the original URL; or, failing that, he should have de-XML-ified the <link> item he’d been given, and used that as the URL. What he has done instead is to take the contents of <link> and apply a “format as URL” function that he happened to have lying around. This function thought that # meant the character #, and accordingly replaced # with %23. In fact there is no # in &#39; at all: &#39; is a single indivisible XML entity that represents the character '. By doing his “format as URL” the programmer has turned &#39; into &%2339;, which is not an entity at all, and so has turned “world’s-first” into a piece of unreadable XML. Any tool that reads the output of a Google blog search will encounter this unreadable XML. It will either collapse with an error message or pass the tainted entry further on to some other tool.
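The mistake is easy to reproduce. The sketch below is my guess at the shape of the bug, not Google's actual code: the function names and the example.com URL are invented for illustration. It applies a URL-style replacement to text that is still XML-escaped, and then shows the order I would have expected instead: undo the XML escaping, percent-encode the real URL, and escape the result for XML again.

```python
import html
import xml.etree.ElementTree as ET
from urllib.parse import quote
from xml.sax.saxutils import escape

# Hypothetical link text exactly as it appears in the raw XML of the feed.
ESCAPED_LINK = "http://example.com/world&#39;s-first"


def buggy_id(xml_escaped: str) -> str:
    """Percent-encode '#' in text that is still XML-escaped (the suspected bug)."""
    return xml_escaped.replace("#", "%23")


def safer_id(xml_escaped: str) -> str:
    """Unescape the XML first, percent-encode the real URL, then re-escape for XML."""
    real_url = html.unescape(xml_escaped)
    encoded = quote(real_url, safe=":/")  # keeps the scheme and path separators
    return escape(encoded)


def is_well_formed(fragment: str) -> bool:
    """Report whether an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for make_id in (buggy_id, safer_id):
        value = make_id(ESCAPED_LINK)
        print(make_id.__name__, value, is_well_formed(f"<id>{value}</id>"))
```

The buggy version produces an ampersand followed by something that is not an entity, which a strict parser refuses outright. The safer version encodes the apostrophe itself as %27, so nothing in the result needs special treatment in XML.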
Tracking this bug down to Google was possible because of a combination of clear reporting (not “it sometimes doesn’t work” but “it produces an error message with ‘slidey’”) and direct helpfulness from the creator and user of the tool that was reading what Google had produced. Imagine now that you had set up your Yahoo! home page to contain some tool that did a Google blog search and displayed the results in a box on the page, and one day (a year later) one of the blog entries that Google found turned out to have an apostrophe in its URL. Without warning, your home page would stop working. Would you know whom to ask at Yahoo? Would they respond? Would they be able to track it down to Google? Would they know whom to tell at Google? Would they respond? Why should they even bother? If you get your money from advertising, alienating one in a thousand, or even one in a hundred, of your users doesn’t matter.
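None of this solves the problem of whom to tell, but a tool in the middle of the chain can at least say where the bad data came from. Here is a minimal sketch, assuming a hypothetical parse_feed helper and a made-up source label; it is an illustration of clear reporting, not a description of how any real feed reader behaves.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def parse_feed(source_name: str, xml_text: str) -> Tuple[Optional[ET.Element], Optional[str]]:
    """Parse a feed, returning (root, None) on success or (None, report) on failure.

    The report names the upstream source and the position of the fault, so the
    person who sees it knows which link in the chain produced the bad XML.
    """
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        line, column = err.position
        report = (
            f"Feed from {source_name} is not well-formed XML "
            f"at line {line}, column {column}: {err}"
        )
        return None, report


if __name__ == "__main__":
    # A hypothetical feed containing the kind of broken <id> discussed above.
    broken = "<feed><id>http://example.com/world&%2339;s-first</id></feed>"
    root, report = parse_feed("blog search", broken)
    print(report)
```

A message like this is the difference between “it sometimes doesn’t work” and something that the next person along can act on.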