Gmail Outage – How To Address Issues Like An Adult

Posted on 02 September 2009 by



If Apple has taught me anything lately it is that any company can go from warm, welcoming underdog to Microsoft-esque arrogance after a period of growth and market domination. Still, while I have no illusions about Google’s “goodness”, I could not help but be impressed with how they responded to yesterday’s outage.

They did, in fact, display a degree of maturity and integrity that is all too lacking in most corners of the world these days. How so?

1. They responded almost immediately.

Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.

2. They responded without secrecy. In fact, they were remarkably transparent.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
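For the curious, the domino effect Google describes can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not Google's actual system: the function name `simulate_cascade`, the router capacities, and the load figures are all made up for illustration. Traffic is split evenly across the healthy routers, any router whose share exceeds its capacity drops out, and the survivors inherit its load.

```python
def simulate_cascade(capacities, total_load):
    """Return the count of healthy routers after each round of failures.

    Hypothetical model: load is shared evenly, and a router whose share
    exceeds its capacity stops accepting traffic entirely.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # nobody is overloaded, so the system is stable
        healthy = survivors
        history.append(len(healthy))
    return history


# Made-up capacities (requests per second) for ten routers.
capacities = [100, 100, 120, 120, 120, 140, 140, 160, 160, 200]

print(simulate_cascade(capacities, 900))   # every router copes
print(simulate_cascade(capacities, 1050))  # the weakest fail first, then the rest
```

In this toy run a load increase of under 17 percent is the difference between all ten routers staying up and none of them surviving, which is the "within minutes" collapse in miniature.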

3. A senior officer with bottom-line responsibility authored it.


That’s right, no anonymous authorship and “We’re looking into it.” A key player took ownership.

4. They explained what they were doing to ensure this issue does not recur.

What’s next: We’ve turned our full attention to helping ensure this kind of event doesn’t happen again. Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We’ll be hard at work over the next few weeks implementing these and other Gmail reliability improvements — Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.
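The "degrade gracefully" idea is easy to picture with a small Python sketch. Again, this is my own illustration under stated assumptions, not Google's implementation: the function `slowdown_factors` and every number in it are hypothetical. Instead of refusing traffic when its share exceeds its capacity, each router keeps its share and simply answers more slowly.

```python
def slowdown_factors(capacities, total_load):
    """Return a per-router slowdown factor (1.0 means full speed).

    Hypothetical model: load is shared evenly and never shifted, so an
    overloaded router responds share/capacity times slower instead of
    refusing traffic.
    """
    share = total_load / len(capacities)
    return [max(1.0, share / c) for c in capacities]


# Made-up capacities (requests per second) for ten routers.
capacities = [100, 100, 120, 120, 120, 140, 140, 160, 160, 200]

factors = slowdown_factors(capacities, 1050)
print(max(factors))                       # worst-case slowdown across routers
print(sum(1 for f in factors if f > 1))   # how many routers are running slow
```

With these invented numbers, only the two smallest routers run slightly slow and everyone still gets their mail, because no load is pushed onto a neighbor. Extra headroom, the other fix mentioned, amounts to raising the capacities so that the factor stays at 1.0 even at peak.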

5. They LED with the apology.

…right up front, I’d like to apologize to all of you — today’s outage was a Big Deal, and we’re treating it as such.

So here they were with a major screw-up, and they said, “We messed up.” They did not make excuses but instead explained the circumstances. They were transparent about the issues and immediately explained where they would go next.

In a world where, too often, the “buck stops anywhere but here,” this is awfully refreshing: rare, but refreshing.

Yeah Google, you screwed up. But thanks for explaining it and apologizing. All is forgiven.

This post was written by:

- who has written 2793 posts on Gear Diary.

Having a father who was heavily involved in early laser and fiber-optical research, Dan grew up surrounded by technology and gadgets. Dan’s father brought home one of the very first video games when he was young and Dan remembers seeing a “pre-release” touchtone phone. (When he asked his father what the “#” and “*” buttons were his dad said, “Some day, far in the future, we’ll have some use for them.”) Technology seemed to be in Dan’s blood but at some point he took a different path and ended up in the clergy. His passion for technology and gadgets never left him. +Dan Cohen



  • doogald

    My one major problem with Google’s response to this: their error message when you tried to connect simply said that there was an error and that you should try again in 30 seconds. No link to a status page; no explanation that there was a system-wide problem. Sure, their published response was fantastic, but you needed to go searching for it. I had two not very technical friends with gmail accounts text me during the outage asking if I knew what was going on.