Saturday, 27 October 2012

A Cascading Router Failure Caused the Massive Google App Engine Outage


Google's App Engine outage took down several major sites a few hours ago. The problem has been traced to the service's traffic routers. It proved so severe that Google decided to restart the entire service and then restore it gradually.

A surge in traffic at one of the data centers prompted Google to do a global restart of all the traffic routers. The restart was supposed to fix the load imbalance.

Instead, the restart left fewer routers available to service requests, and the surplus load overwhelmed the ones still running, causing access problems for customers. The problems began at around 7:30 a.m. Pacific Time.

Almost four hours later, Google concluded that the load problem couldn't be fixed in place because the routers were stuck in a cascading failure, so it decided to do a full restart of the service. Half an hour after that, Google App Engine was finally operating as usual.
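
To make the idea of a cascading failure concrete, here is a minimal toy sketch in Python. It is not a model of Google's actual routing layer: the function name `simulate_cascade`, the even split of load, and the capacity numbers are all assumptions chosen for illustration. The only point it shows is that once a few routers drop out, the same total load can push the remaining ones over their limits one group at a time.

```python
def simulate_cascade(total_load, capacities):
    """Toy cascade model (illustrative assumption, not Google's system).

    The total load is split evenly across the routers that are online.
    Any router whose capacity is below its share fails, and the load is
    then re-split across the survivors. Returns the number of routers
    online after each round.
    """
    online = list(capacities)
    history = [len(online)]
    while online:
        share = total_load / len(online)
        survivors = [c for c in online if c >= share]
        if len(survivors) == len(online):
            break  # stable: every router can carry its share
        online = survivors
        history.append(len(online))
    return history


# Hypothetical capacities (requests per second) for six routers.
full_fleet = [100, 120, 140, 160, 180, 200]

# With all six routers online, a load of 600 is stable.
print(simulate_cascade(600, full_fleet))       # [6]

# With one router missing, the same load collapses the fleet.
print(simulate_cascade(600, full_fleet[:-1]))  # [5, 4, 2, 0]
```

In this sketch, removing a single router raises every other router's share just enough to knock out the weakest one, which raises the share again, and so on until nothing is left. That is the general shape of the problem described above, and it suggests why a full restart can be simpler than trying to recover the routers while they are still under load.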

Google will be giving everyone a 10 percent discount on their November bill for all of this and has said that it increased routing capacity and updated the configurations to avoid problems like this in the future.

Via: A Cascading Router Failure Caused the Massive Google App Engine Outage
