So I'm sitting in the data center minding my own business when I hear all the CRACs spin down, followed immediately by the shrieking alarms on the two UPSes (one 300 kVA, one 120 kVA). "Crap", I thought to myself. "Another power bump on campus." It had been rainy the last few weeks and Oklahoma and lightning are old friends.<BR><BR>CRACs restarted as the transfer switch put us on the generator. I went over and silenced the UPS alarms and made sure we were no longer on battery discharge. Called Ops and asked the trouble desk if we had taken a power hit. "Not that we know of," they said. Hmmm... Come to think of it, I hadn't noticed the lights flicker like they usually did during a hit. Crap.<BR><BR>On the way to the basement, I ran into the lead electrician working on our major power upgrade. "Hey!" he shouted as I ran by. "Your generator's running. And no, we didn't do anything!" Double crap. There went my first theory...<BR><BR>Got downstairs and the first thing I noticed was the 500 kW natural gas genset spooling up and down like a kid's yoyo on a three- or four-second cycle. WTF?!? Ran over to the transfer switch and saw we were on Emergency power and Normal was not available. Freqs were bouncing between 54 and 66. Triple crap. If we hadn’t taken a hit, why wasn’t Normal available?!?<BR><BR>Went over to the building’s main switchboard and found our 800a main feed breaker. Sure enough, it was tripped. WTF again. We hadn't done anything and load hasn't changed substantially in the last couple months. We'd even had the monthly genset run and load transfer test a couple weeks ago and all was fine then. But it sure as hell wasn't fine now.<BR><BR>Grabbed my trusty Cisco wireless phone and called the campus trouble desk. "We need help badly. And we need it right now!" Guess I was a little panicky...<BR><BR>Generator sounded like it was getting worse, and I suddenly thought, "Crap! 
What's that doing to the inputs of the UPSes?"<BR><BR>Ran back upstairs and heard the damnedest sound ever -- the CRAC fans were spooling up and down in sync with the generator (not much, but a little bit), but the big UPS was freaking out. The normal, steady, droning rectifier buzz was oscillating up and down like a banshee trying to hit the right note.<BR><BR>First thing I thought was "Why hasn't the UPS dropped to bypass? Or at least gone into battery discharge? The input power has got to be <I>way</I> out of spec." Then I thought, "Well, at least we're <I>not</I> on battery discharge. Yet..." Considered a manual transfer to bypass, then figured however badly the UPS was running, at least it <I>was</I> running. Inverter output was still good and stable, so I'll just leave it the hell alone for now. Off to the basement I ran.<BR><BR>Electrical guys had arrived and first thing we all agreed on was generator was boinked. Since nobody knew why the breaker had tripped, we suspected a nuisance trip and tried a reset. Normal power up, transfer switch was starting timeout for the re-transfer. We all looked at each other and agreed we didn't want to stay on the generator any longer than absolutely necessary, said generator now sounding like a 16-year-old gunning the engine on his first hot rod. So we hit the manual re-transfer and put the data center back on normal power.<BR><BR>Generator stopped being silly and settled down as it began the cool-down run, now with no load. Ran back upstairs to the data center and checked the UPSes. No problem with the little one, but the rectifier buzz from the big one sounded kinda muffled. Checked the panel - yep, we're still on inverter (good), batteries were recharging (good), and E's and I's looked about right (good). No idea why the change in sound. Wasn’t a big change, and I was wondering if it was just my ears after listening to that whacked-out generator. 
Didn’t see anything obvious, in any case.<BR><BR>Was just catching my breath and starting to wonder what the hell had just happened (not to mention WTF was up with that generator) when all the CRACs spun down again and the UPSes went back into battery discharge once more. About 30 seconds later, generator back on line, the CRACs wound back up, and the UPS showed power available and started its walk-in. (Foul, dark thoughts...)<BR><BR>Back to the basement and found the electrical crew trying to decide whether or not to attempt another reset. (Was a big breaker – 1600a frame, dialed back by 0.5 to an 800a trip.) We were all ignoring the generator, having gotten used to its foibles, and were now wondering why the breaker was tripping. (By this time, I was also tripping...)<BR><BR>Okay, one more time. Cranked the handle and hit the reset. “Ka-chunk” went the breaker and once again normal power was restored and the generator settled down and began another cool-down run.<BR><BR>Back upstairs to the big UPS, completely ignoring the little one by now (out-of-sight, out-of-mind). With the exception of the still muffled-sounding inverter buzz, all was apparently well. It had completed its retransfer walk-in and was running more-or-less normally, but drawing a bit more input current now since we were pulling a 30a battery recharge.<BR><BR>Was headed back downstairs to discuss things with electrical guys when CRACs spun down again and UPSes went into battery-discharge mode for a third time. FUCK!!! A 30-second wait and power was <I>not</I> returning and we <I>were</I> staying in battery discharge mode. DOUBLE FUCK!<BR><BR>Ran back downstairs (really getting tired of those freaking stairs by now) and found crew gathered around transfer switch which had apparently decided to no longer acknowledge the existence of the badly-behaving generator. Performed some Ohm’s law bullshit with a rooster head and a cigar and got the generator back online. 
Wasn’t sure it was worth it, but a quick run back up those #%#& stairs and an equally quick check of the UPS showed that it was once again accepting the wildly out-of-spec input and was vainly trying to run in normal mode. Input power was totally boinked, but output still appeared good and unit was not on battery discharge, so back to the basement I went.<BR><BR>Electrical guys had got to the point of accusing me of hiding a load increase. I was protesting my innocence when “Crash!” Down again. After swearing on my first-born’s life that, other than onesies or twosies, the last big thing we added was an 8 kVA Sun rack three months ago, they agreed to bump up the trip point one notch to keep us up while they sent for some test equipment. Retransfer. Another UPS battery discharge cycle, back to normal, generator cool-down. Yada, yada.<BR><BR>Up for 15 minutes, thought we had it. CRASH! SHIT! Back to basement. OMFG! Transfer switch: back to ignoring generator. Crew: forgetting transfer switch and now working feverishly to change breaker frame before batteries run down. Breaker frame: several hundred pounds, the size of a small refrigerator. Me: back up stairs, staring at DC bus voltage. 489...488...487... (Mental picture: thread holding up Sword of Damocles slowly unraveling...)<BR><BR>Back downstairs. Old frame out. Spare frame going in. Back upstairs. ...456...455...454... SHIT! Back downstairs. Almost done. “Two minutes,” they say. Back upstairs. ...446...445... Normal DC bus: 540. ...442...441... Jeebus! How low can it get and keep the inverters running? Blink! Normal power back. Walk-in started. Oh, thank you, Lord! (You may replace with deity of your choice...)<BR><BR>Now I know we haven’t changed any load, so it’s obviously a circuit breaker problem, right? They are some 20 years old, after all. Been in that basement for a long, long time. And now that we’re on a spare circuit breaker, everything will be okay, right? Ahem. Right? 
(Neglecting the fact that spare breaker was just as old, had been in basement just as long... Hindsight always 20-20...)<BR><BR>CRACs spin down. UPS back on battery discharge. Thinking George Carlin’s Seven Words You Can’t Say on TV. Back downstairs. Oops. Forgot to set the new breaker’s trip point. Got the transfer switch to take the generator input this time and feed the UPS. Generator still surging like crazy. Back upstairs to see how the UPS is doing.<BR><BR>Opened data center door and was assaulted by the most unGodly noise I have ever heard. WTF? Shit! It’s the UPS!!! Now what?!? The normal buzz of the rectifiers was replaced by a raspy roar from hell with deviant, demonic, diabolical overtones. The harmonics were rising and falling with the generator’s surges, and there was a definite basso profundo note that had never been there before. Not to mention a brand-new sizzling sound that boded ill for the future of the unit. Having seen way too many Google videos about misbehaving electrical equipment (search for “480v arc flash”), I was literally trying to decide whether or not I even wanted to approach the thing or if it would simply be better to let it self-destruct and then sweep up the ashes. (If I procrastinated long enough, I figured the decision would be made for me...)<BR><BR>The sound got worse. And now there was the acrid smell of burning wiring. Big wiring. Very nasty smell, and growing stronger by the second, too...<BR><BR>Well, shit. Figuring nobody else was about to do it, and feeling guilty just standing there (not to mention cowardly, craven, and possessed of no fortitude), I gingerly approached the now-screaming, sizzling unit from the battery-rack side, closest to the control panel. (No way was I going to stand in front of the input filters, rectifiers, or inverters. See aforementioned Google videos. AFAIC, not enough thickness to the sheet metal, by far.)<BR><BR>Anyway, I gritted (grit? Whatever...) 
my teeth and reached out blindly (you never look at the source when it might flash – good way to get retinal burns from hell) and stabbed at the CTRL-BYPASS buttons. KA-CHUNK, KA-CHUNK, KA-CHUNK, went the UPS as the Main Input and both Battery Rack breakers tripped. I jumped three feet into the air and damn near soiled my shorts as one of the battery rack breakers was about four inches from my right ear. The howling stopped and the fans spun down. Other than the ticking and creaking of cooling metal, blessed silence from the UPS. Thankfully, <I>not</I> blessed silence from the server racks. We were on Bypass, which is campus commercial power. Scary, but not as scary as a dark and silent data center. NASTY smell of burning insulation, and had to have set a world record for the most out-of-tolerance power transfer ever. Was bad enough to force the machine into hardware shutdown, which actually was good, because that meant <I>I</I> didn’t have to go over and manually trip the main input CB located right in front of the aforementioned filters/rectifiers/inverters/thin sheet metal.<BR><BR>Well, here we are, running the data center on raw commercial power in Oklahoma during thunderstorm season with storms forecast for later that day. There’s a warm fuzzy for you. Looked around and remembered with some sense of foreboding the smaller UPS which I had basically been ignoring. It hadn’t been complaining and the big one had, and the squeaky wheel gets the grease. So I checked it out. It apparently took all the shenanigans with nary a peep other than a bunch of battery discharge cycles. Newer tech? Good tech at any rate.<BR><BR>I finally had time to talk to the HVAC guys who had arrived in the middle of all this and who had been checking the 20-year-old CRACs for misbehaving motors and compressors and keeping them running on that crazy generator power. 
No apparent problems there.<BR><BR>Thinking it just couldn’t have lasted for another 6 months until we got all of our shiny new infrastructure on-line, I called in a ticket on the UPS. Engineer came out PDQ and said, “Yep, you fried the input filter assy.” That’s all?!? Holy shit, I expected to find the whole inside melted down. “That’s what the input filter does,” he said. Well, it sure gave its all for the cause – the three chokes were black and crunchy and damn near still sizzling from the heat. “We’ll FedEx Overnight a replacement from our warehouse in Tennessee and I’ll be out in the morning to replace it.” Well, fine. Here’s the next 18 hours on commercial power during thunderstorm season.<BR><BR>Now starts the management suggestions: Can we bypass the UPS and run on our generator? (WTF?!?) No, our generator is as boinked as our UPS. And even if our generator <I>was</I> working, without the UPS ride-through we would drop power for a few seconds during the transfer, and we don’t want to drop power, do we? Oh. Well, could we... (Sigh...)<BR><BR>Generator folks show up on-site an hour later to do a full-load test. Thankfully, the generator does surge a bit (so no, we weren’t imagining it...) and they poke around with fuel-pressure regulators and such. They get it fairly smooth. There’s a big dump and surge when a 400 kW slam-dunk load hits it, but it does recover after about 15 seconds and natural gas gensets are vulnerable to slam-dunk loads just by their nature. (Not as much quick torque availability as diesels...) And besides, the data center isn’t a slam-dunk load. Well, the CRACs all come back up at once, but the big loads (the UPSes) idle for about 30 seconds and then walk back in over the next 30 seconds or so.<BR><BR>They’re looking for a replacement governor, but the thing is 10 years old and they’re having a hard time finding one. (NG gensets in that size are much less common than diesel...) 
So they did what they could for the generator and called it good.<BR><BR>And we stuck a couple SAs in the data center overnight to do preemptive shut-downs of the big financial and personnel systems if lightning got too close. Of course if we’d taken a power hit we’d have had exactly zero seconds to respond and things would have gotten ugly, but we lucked out.<BR><BR>END OF BAD DAY ONE