My Birthday "Bash"
Here's the scoop
So it's the Friday before my birthday, and I'm saying goodbye to most of the people I work with, who won't be back until after the LISA conference. I'm not feeling stressed about their departure: I've been in the department for a little over three years now, and I feel fairly confident that I can muddle through pretty much anything that comes along. Simple resource requests from users, troubleshooting wireless networking or DSL issues, and debugging services in our environment are all pretty much old hat. So I leave to begin celebrating my 31st birthday.
Sunday the 14th started as a very relaxing day. I had been out quite late Saturday night, and didn't rise until 10:30. On purpose. It was my birthday, and nobody was gonna rush me into doing anything. I walk past my office down to the kitchen, make some coffee, have some breakfast, and walk back upstairs to see what's new in my inbox. Only there is no inbox, per se, just an error message from Evolution saying it couldn't reach my IMAP server. I'm not shaken a bit. I recently moved into a new place, and was forced to give up the awesome Speakeasy DSL service I had for Comcast cable internet. I figured it was a simple matter of rebooting my cable modem to pick up a new IP address or something. But wait... I can get to Slashdot... I can get to Google... I can get to sites I've never visited before, so I know it's not cached... Odd... But I'm still not quite worried.
Hey, I can't get to the department website! Hey, I can't ping the web server! ACK! I can't ping any of our servers! Oh no. My new BlackBerry 7290 lets out a buzz in the background of my office, and I perk up. "Mail! Someone got through to the mail server, so all is not lost!" Wrong. BlackBerrys can send PIN-to-PIN messages using nothing more than the cellular network, bypassing any notion of a mail server, and that's exactly how this message came to me, from a member of our group who was about to board a plane for the LISA conference in Atlanta. It read something like "Hey, something amiss here, anyone around?" Well, it turns out I was around, and that was pretty much it.
As a group, we keep a pretty close watch on our infrastructure even when we're not around. We get emails from our UPS to let us know if there's an outage, emails from our syslog server, emails from our air handlers, even emails from various key entry points to our machine rooms and networking cabinets. There was an email from our UPS earlier about a power hit, but then another email came that seemed to indicate a recovery. I was hopeful that things weren't in some "day after" state when I suited up and headed into the office.
The first thing I noticed was via my ears, not my eyes. The UPS has a horn on it that is louder than I had remembered. I silenced it, and then noticed it was still in some faulty state which I hadn't seen before. Flipping through the menus on the LCD screen, I saw that the battery voltage was at 0%. This was bad. However, the room, and the building, had power.
Next, I logged into our console server to connect to our file server. My heart sank when I saw the prompt on the Sun 4500 server: "hit ctrl-d to continue booting." Oh no. Why hadn't it booted? A quick glance over at the racks of disk showed that an entire T3 storage array was black. I quickly power cycled the array, continued booting the file server, and then switched consoles to check on another machine that I knew would be lost without the file server. It was also in a weird, half-booted, broken state, and running uptime on both machines showed that they had been up only a couple of hours.
By now I'm thinking that, at some point, our entire machine room was black. Just then a member of the user community comes by and says his machine rebooted some time ago, and he's since been unable to get an IP address. So now I know the entire building lost power, which leads me to the only reasonable conclusion, and a scary one, because it had never happened before: our UPS completely and catastrophically failed. Oh joy. I call the UPS tech support line, and they get somebody on the road to visit within the hour.
Of course, the fact that the power is back means that something else pretty catastrophic happened: every machine in our entire machine room was powered on and booted at the exact same time. Service dependencies be damned!
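Bringing services up in dependency order is really just a topological sort. As a sketch (the service names and dependency map below are hypothetical, just to illustrate the idea, not our actual machine room):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: each service lists the services that
# must already be running before it can start.
deps = {
    "fileserver": [],                       # NFS home directories come first
    "nis":        ["fileserver"],
    "mail":       ["fileserver", "nis"],
    "web":        ["fileserver"],
}

# static_order() yields the services in a safe power-on sequence.
boot_order = list(TopologicalSorter(deps).static_order())
print(boot_order)  # fileserver first, mail only after nis
```

When every box powers on at once after an outage, this ordering is exactly what gets lost: dependents race their prerequisites and come up half-broken.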
Calling for Backup
At this point, it's pretty clear that I'll be needing backup. By now, a Sun tech is on his way to replace a disk in our array, and a UPS tech is coming out to get our UPS back to normal. It's also clear that the entire world will have to be brought back down, if only to ensure that it comes back up in the order it's supposed to. My boss and a coworker are still in town, and they come in to lend a hand.
It should also be noted that all this while, although we were without a mail server, 50 or so messages passed through my BlackBerry as I communicated with the members of our group who were out of town. This was invaluable, since there had been slight changes and additions to our services that were not yet reflected in the documentation.
The UPS tech says a battery in the UPS failed, and shows us splatterings of battery acid inside the battery compartment. There are 20 batteries, so I'm a little baffled, as I'm not an expert on UPSes. It turns out that UPS batteries are wired in series, which makes some sense if you think about it: the inverter needs a high DC bus voltage, and chaining the batteries in series adds their voltages together to reach it (wiring them in parallel would add capacity instead, but only at the voltage of a single battery). Of course, the side effect is that UPS batteries resemble old Christmas lights: one goes out, they all go out.
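The arithmetic makes the trade-off obvious. With purely hypothetical numbers (twenty 12 V, 100 Ah batteries in one cabinet):

```python
# Hypothetical figures, just to illustrate series vs. parallel wiring.
volts_per_battery = 12    # volts
amp_hours = 100           # capacity per battery
n = 20

# In series, voltages add: one 240 V string at the single-battery capacity.
series_voltage = n * volts_per_battery    # 240 V DC bus for the inverter
# In parallel, capacities add: still 12 V, but 2000 Ah.
parallel_capacity = n * amp_hours

print(series_voltage, parallel_capacity)  # 240 2000

# The catch: a series string is only as good as its weakest cell.
# One failed (open) battery breaks the whole 240 V circuit.
string_ok = all([True] * 19 + [False])
print(string_ok)  # False: one bad battery takes out the string
```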
The Sun guy shows up a little later and, though he seems a bit confused by some of the errors in the logs, replaces the drive, and things are well with the world. At this point it's after 5 PM, and my boss instructs me to go home for my birthday dinner. I follow his orders and leave them there to bring the rest of the service machines back up.
After all is said and done, we're not happy that this event took place, but the reality is that UPS batteries don't often blow up, and we don't often get simultaneous disk failures in T3 arrays that cause a file server not to boot. At the same time, we are still discussing ways to shorten both routine and emergency downtimes, and how to make the process smoother. The machine room is constantly evolving, and along with it our process for keeping up with the care and feeding of our systems and services.
Some investments, like the Blackberries, which we had upgraded only a couple of days before, were totally justified. Others, like a call-in number for downtime (similar to a school's inclement weather line), are being explored. Still others, like a paid-for inspection of our UPS batteries only a couple of weeks before the failure, are being questioned.
These are all signs of a healthy admin team. There was no infighting, no blaming, no finger pointing, no mumbling under our breath. Things needed doing, and they got done. What didn't get done is getting done with the help of others. What couldn't be done is being researched, and all is peaceful in the user community.