About That Outage: A Case of the Mondays
Things were a little bumpy for users that tried to log into Second Life on Monday morning as a result of a scheduled code deploy. I wanted to share with you what happened, and what we're going to do to try and prevent this in future.
That morning, I attempted to deploy a database change to an internal service. Without going into too much detail, the deploy was to modify an existing database table in order to add an extra column. These changes had been reviewed multiple times, had passed the relevant QA tests in our development and staging environments, and had met all criteria for a production deploy. Although this service isn't directly exposed to end-users, it is used as part of the login process and it is designed to fail open, i.e. if the service is unavailable, users should still be able to log in to Second Life without a problem.
During the database change, the table being altered was locked to prevent changes to it while it was being altered. This table turned out to be almost a billion rows in size and the alteration took significantly longer than expected. Furthermore, the service did not fail open as designed, and caused logins to Second Life to fail, along with a handful of other ancillary services. Our investigation was further complicated by other problems seen on the Internet on Monday due to a configuration issue at one of the big ISPs in North America. Many of us work remotely and while we saw problems early on, it wasn't immediately clear to us that it was internal, rather than one caused by a third party service. After some investigation, the lock on the database was removed, and services slowly began to recover. We did have to do some additional work to restore the login service, as the Next Generation login servers (as described by April here) are not yet fully deployed.
I'm still looking to complete this deploy in the near future, but this time we'll be using another method which doesn't require locking the database tables, and won't cause a similar problem. We're also investigating exactly why the service didn't fail open as it was designed to, and how we can prevent it from happening in the future.
There are no comments to display.