BlackBerry users across the world were cross with RIM over two email outages that hit a fair number of them in as many weeks. (See article here) In addition, some grinches were busy trying to steal Christmas from a number of last-minute Amazon and Walmart shoppers. (See article here) And Google had an outage earlier this year when Michael Jackson died: the search engine got so many queries for the singer that it thought it was under attack and stopped responding. (See article here) These outages reflect one of the great technology design challenges: single points of failure. In the BlackBerry’s case, the basic method for getting email from a desktop to the handset requires that messages be copied from the local computer and transferred through a RIM-controlled relay server to the user’s BlackBerry. That relay server becomes a single point of failure for the RIM network.
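To put rough numbers on why a single relay matters, here is a minimal sketch in Python. The availability figures are made up for illustration and are not RIM's actual numbers, the function names are my own, and the math assumes components fail independently, which real systems often do not.

```python
def serial_availability(*components):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(*replicas):
    """Availability of redundant replicas where one survivor is enough.

    Assumes the replicas fail independently of each other.
    """
    all_down = 1.0
    for availability in replicas:
        all_down *= (1.0 - availability)
    return 1.0 - all_down


# Hypothetical figures: mail server, one relay, wireless carrier.
single_relay = serial_availability(0.999, 0.99, 0.999)

# Same chain, but with two independent relays in place of one.
redundant_relay = serial_availability(
    0.999, parallel_availability(0.99, 0.99), 0.999
)

print(f"one relay:  {single_relay:.4%}")
print(f"two relays: {redundant_relay:.4%}")
```

The chain can never be more available than its weakest link, so the lone relay caps the whole service. Doubling up only that one component removes most of the gap, at least under the independence assumption.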
With the Amazon outage this holiday season, the cause was a distributed denial of service (DDoS) attack aimed at the Domain Name System (DNS) hosting company responsible for telling users looking for http://www.amazon.com that the domain is located at the IP address 188.8.131.52. By design, there can be only one “authoritative” group of DNS servers for a domain, and that group answers, for the entire internet, queries that ask for the number behind the name. When that group is overwhelmed, new visitors cannot find the site, even though the web servers themselves may be running fine.
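The failure mode can be sketched with a toy model in Python. Nothing here touches the network or uses a real DNS library: the `NameServer` class, the `resolve` function, the server names, and the address 192.0.2.10 (from a range reserved for documentation) are all hypothetical stand-ins, and real resolvers add caching and retries that this sketch leaves out.

```python
class NameServer:
    """Toy stand-in for one authoritative DNS server (not a real API)."""

    def __init__(self, name, records, reachable=True):
        self.name = name
        self.records = records      # maps domain name -> IP address
        self.reachable = reachable  # False simulates a server under attack

    def query(self, domain):
        if not self.reachable:
            raise TimeoutError(f"{self.name} did not respond")
        return self.records.get(domain)


def resolve(domain, authoritative_servers):
    """Try each authoritative server in turn; fail only if all are down."""
    for server in authoritative_servers:
        try:
            return server.query(domain)
        except TimeoutError:
            continue  # this server is unreachable, try the next one
    raise LookupError(f"no authoritative server answered for {domain}")


records = {"shop.example.com": "192.0.2.10"}
servers = [
    NameServer("ns1", records, reachable=False),  # knocked out
    NameServer("ns2", records, reachable=True),   # still answering
]

# One server down: the lookup still succeeds through ns2.
print(resolve("shop.example.com", servers))

# Whole group down: the name cannot be resolved at all.
servers[1].reachable = False
try:
    resolve("shop.example.com", servers)
except LookupError as err:
    print(err)
```

Redundancy inside the group helps against a single machine failing, but an attack that floods every server in the group at once still takes the name off the map.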
These single points of failure are targeted by Murphy’s Law and malicious hackers alike, and network engineers and security experts have made careers out of designing better mousetraps to mitigate these fundamental weaknesses in their computer systems. When you consider the amount of money and talent that some of these very large companies have, it underscores for me how fragile our existing information infrastructure really is. Tremendous resources have been focused on making the amazon.com web site highly available and highly accurate, but in spite of that extraordinary effort, there are still outages around Amazon’s busiest time of year.
A challenge for the new decade will be making fundamental improvements in the reliability of our computer networks, so that “High Availability As A Service” becomes one of the new ‘net offerings for computer systems of all sizes. Maybe you all should put that on your list for Santa for next Christmas!