The Story Behind Last Week's Unexpected Downtime
Hi! I’m a member of the Second Life Operations team. On Friday afternoon, major parts of Second Life had some unplanned downtime, and I want to take a few minutes to explain what happened.
Shortly before 4:15pm PDT/SLT last Friday (May 6th, 2016), the primary node for one of the central databases that drive Second Life crashed. The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.
When the primary node in this database is offline we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.
My team quickly sprung into action, and we were able to promote one of the replica nodes up the chain to replace the primary node that had crashed. All services were fully restored and turned back on in just under an hour.
One additional (and totally unexpected) problem that came up is that for the first part of the outage, our status blog was inaccessible. Our support team uses our status blog to inform Residents of what’s going on when there are problems, and the amount of traffic it receives during an outage is pretty impressive!
A few weeks ago we moved our status blog to new servers. It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We see we now have some additional tuning we need to do with the status blog now that it’s in its new home. (Don’t forget that you can also follow us on Twitter at @SLGridStatus. It’s really handy when the status blog in inaccessible!)
As Landon Linden wrote a year ago, being around my team during an outage is like watching “a ballet in a war zone.” We work hard to restore Second Life services the moment they break, and this outage was no exception. It can be pretty crazy at times!
We’re really sorry for the unexpected downtime late last week. There’s a lot of fun things that happen inworld on Friday night, and the last thing we want is for technical issues to get in the way.