Bad Day in the Data Center - Part I


llib

Ars Scholae Palatinae
649
So I'm sitting in the data center minding my own business when I hear all the CRACs spin down, followed immediately by the shrieking alarms on the two UPSes (one 300 kVA, one 120 kVA). "Crap," I thought to myself. "Another power bump on campus." It had been rainy the last few weeks, and Oklahoma and lightning are old friends.

The CRACs restarted as the transfer switch put us on the generator. I went over and silenced the UPS alarms and made sure we were no longer on battery discharge. Called Ops and asked the trouble desk if we had taken a power hit. "Not that we know of," they said. Hmmm... Come to think of it, I hadn't noticed the lights flicker like they usually did during a hit. Crap.

On the way to the basement, I ran into the lead electrician working on our major power upgrade. "Hey!" he shouted as I ran by. "Your generator's running. And no, we didn't do anything!" Double crap. There went my first theory...

Got downstairs and the first thing I noticed was the 500 kW natural gas genset spooling up and down like a kid's yoyo on a three- or four-second cycle. WTF?!? Ran over to the transfer switch and saw we were on Emergency power and Normal was not available. Freqs were bouncing between 54 and 66 Hz. Triple crap. If we hadn't taken a hit, why wasn't Normal available?!?

Went over to the building's main switchboard and found our 800a main feed breaker. Sure enough, it was tripped. WTF again. We hadn't done anything, and the load hadn't changed substantially in the last couple of months. We'd even had the monthly genset run and load-transfer test a couple weeks ago and all was fine then. But it sure as hell wasn't fine now.

Grabbed my trusty Cisco wireless phone and called the campus trouble desk. "We need help badly. And we need it right now!" Guess I was a little panicky...

The generator sounded like it was getting worse, and I suddenly thought, "Crap! What's that doing to the inputs of the UPSes?"

Ran back upstairs and heard the damndest sound ever: the CRAC fans were spooling up and down in sync with the generator (not much, but a little bit), but the big UPS was freaking out. The normal, steady, droning rectifier buzz was oscillating up and down like a banshee trying to hit the right note.

First thing I thought was, "Why hasn't the UPS dropped to bypass? Or at least gone into battery discharge? The input power has got to be *way* out of spec." Then I thought, "Well, at least we're *not* on battery discharge. Yet..." Considered a manual transfer to bypass, then figured however badly the UPS was running, at least it *was* running. Inverter output was still good and stable, so I decided to just leave it the hell alone for now. Off to the basement I ran.

The electrical guys had arrived, and the first thing we all agreed on was that the generator was boinked. Since nobody knew why the breaker had tripped, we suspected a nuisance trip and tried a reset. Normal power came up, and the transfer switch started its timeout for the re-transfer. We all looked at each other and agreed we didn't want to stay on the generator any longer than absolutely necessary, said generator now sounding like a 16-year-old gunning the engine on his first hot rod. So we hit the manual re-transfer and put the data center back on normal power.

The generator stopped being silly and settled down as it began the cool-down run, now with no load. Ran back upstairs to the data center and checked the UPSes.
No problem with the little one, but the rectifier buzz from the big one sounded kinda muffled. Checked the panel: yep, we're still on inverter (good), batteries were recharging (good), and E's and I's looked about right (good). No idea why the change in sound. Wasn't a big change, and I was wondering if it was just my ears after listening to that whacked-out generator. Didn't see anything obvious, in any case.

Was just catching my breath and starting to wonder what the hell had just happened (not to mention WTF was up with that generator) when all the CRACs spun down again and the UPSes went back into battery discharge once more. About 30 seconds later, the generator was back on line, the CRACs wound back up, and the UPS showed power available and started its walk-in. (Foul, dark thoughts...)

Back to the basement, where I found the electrical crew trying to decide whether or not to attempt another reset. (It was a big breaker: a 1600a frame, dialed back by 0.5 to an 800a trip.) We were all ignoring the generator, having gotten used to its foibles, and were now wondering why the breaker was tripping. (By this time, I was also tripping...)

Okay, one more time. Cranked the handle and hit the reset. "Ka-chunk" went the breaker, and once again normal power was restored and the generator settled down and began another cool-down run.

Back upstairs to the big UPS, completely ignoring the little one by now (out of sight, out of mind). With the exception of the still muffled-sounding inverter buzz, all was apparently well. It had completed its retransfer walk-in and was running more or less normally, but drawing a bit more input current now since we were pulling a 30a battery recharge.

Was headed back downstairs to discuss things with the electrical guys when the CRACs spun down again and the UPSes went into battery-discharge mode for a third time. FUCK!!! A 30-second wait and power was *not* returning and we *were* staying in battery-discharge mode. DOUBLE FUCK!

Ran back downstairs (really getting tired of those freaking stairs by now) and found the crew gathered around the transfer switch, which had apparently decided to no longer acknowledge the existence of the badly-behaving generator. Performed some Ohm's law bullshit with a rooster head and a cigar and got the generator back online. Wasn't sure it was worth it, but a quick run back up those #%#& stairs and an equally quick check of the UPS showed that it was once again accepting the wildly out-of-spec input and was vainly trying to run in normal mode. Input power was totally boinked, but output still appeared good and the unit was not on battery discharge, so back to the basement I went.

The electrical guys had got to the point of accusing me of hiding a load increase. I was protesting my innocence when, "Crash!" Down again. After swearing on my first-born's life that, other than onesies or twosies, the last big thing we added was an 8 kVA Sun rack three months ago, they agreed to bump up the trip point one notch to keep us up while they sent for some test equipment. Retransfer. Another UPS battery-discharge cycle, back to normal, generator cool-down. Yada, yada.

Up for 15 minutes, thought we had it. CRASH! SHIT! Back to basement. OMFG! Transfer switch: back to ignoring the generator. Crew: forgetting the transfer switch and now working feverishly to change the breaker frame before the batteries ran down. Breaker frame: several hundred pounds, the size of a small refrigerator. Me: back up the stairs, staring at the DC bus voltage.
489...488...487... (Mental picture: the thread holding up the Sword of Damocles slowly unraveling...)

Back downstairs. Old frame out. Spare frame going in. Back upstairs. ...456...455...454... SHIT! Back downstairs. Almost done. "Two minutes," they say. Back upstairs. ...446...445... Normal DC bus: 540. ...442...441... Jeebus! How low can it get and keep the inverters running? Blink! Normal power back. Walk-in started. Oh, thank you, Lord! (You may replace with deity of your choice...)

Now I know we haven't changed any load, so it's obviously a circuit breaker problem, right? They are some 20 years old, after all. Been in that basement for a long, long time. And now that we're on a spare circuit breaker, everything will be okay, right? Ahem. Right? (Neglecting the fact that the spare breaker was just as old and had been in that basement just as long... Hindsight is always 20/20...)

CRACs spin down. UPS back on battery discharge. Thinking George Carlin's Seven Words You Can't Say on TV. Back downstairs. Oops. Forgot to set the new breaker's trip point. Got the transfer switch to take the generator input this time and feed the UPS. Generator still surging like crazy. Back upstairs to see how the UPS is doing.

Opened the data center door and was assaulted by the most ungodly noise I have ever heard. WTF? Shit! It's the UPS!!! Now what?!? The normal buzz of the rectifiers had been replaced by a raspy roar from hell with deviant, demonic, diabolical overtones. The harmonics were rising and falling with the generator's surges, and there was a definite basso profundo note that had never been there before. Not to mention a brand-new sizzling sound that boded ill for the future of the unit. Having seen way too many Google videos about misbehaving electrical equipment (search for "480v arc flash"), I was literally trying to decide whether I even wanted to approach the thing or whether it would simply be better to let it self-destruct and then sweep up the ashes. (If I procrastinated long enough, I figured the decision would be made for me...)

The sound got worse. And now there was the acrid smell of burning wiring. Big wiring. Very nasty smell, and growing stronger by the second, too...

Well, shit. Figuring nobody else was about to do it, and feeling guilty just standing there (not to mention cowardly, craven, and possessed of no fortitude), I gingerly approached the now-screaming, sizzling unit from the battery-rack side, closest to the control panel. (No way was I going to stand in front of the input filters, rectifiers, or inverters. See aforementioned Google videos. AFAIC, there's not enough thickness to that sheet metal, by far.)

Anyway, I gritted (grit? Whatever...) my teeth, reached out blindly (you never look at the source when it might flash: good way to get retinal burns from hell), and stabbed at the CTRL-BYPASS buttons. KA-CHUNK, KA-CHUNK, KA-CHUNK went the UPS as the Main Input and both Battery Rack breakers tripped. I jumped three feet into the air and damn near soiled my shorts, as one of the battery rack breakers was about four inches from my right ear. The howling stopped and the fans spun down. Other than the ticking and creaking of cooling metal, blessed silence from the UPS. Thankfully, *not* blessed silence from the server racks. We were on Bypass, which is campus commercial power. Scary, but not as scary as a dark and silent data center. NASTY smell of burning insulation, and we had to have set a world record for the most out-of-tolerance power transfer ever.
It was bad enough to force the machine into hardware shutdown, which actually was good, because that meant *I* didn't have to go over and manually trip the main input CB located right in front of the afore-mentioned filters/rectifiers/inverters/thin sheet metal.

Well, here we are, running the data center on raw commercial power in Oklahoma during thunderstorm season, with storms forecast for later that day. There's a warm fuzzy for you. Looked around and remembered with some sense of foreboding the smaller UPS, which I had basically been ignoring. It hadn't been complaining and the big one had, and the squeaky wheel gets the grease. So I checked it out. It apparently took all the shenanigans with nary a peep other than a bunch of battery-discharge cycles. Newer tech? Good tech, at any rate.

I finally had time to talk to the HVAC guys, who had arrived in the middle of all this and who had been checking the 20-year-old CRACs for misbehaving motors and compressors and keeping them running on that crazy generator power. No apparent problems there.

Thinking it couldn't have just lasted another six months until we got all of our shiny new infrastructure on-line, I called in a ticket on the UPS. The engineer came out PDQ and said, "Yep, you fried the input filter assy." That's all?!? Holy shit, I expected to find the whole inside melted down. "That's what the input filter does," he said. Well, it sure gave its all for the cause: the three chokes were black and crunchy and damn near still sizzling from the heat. "We'll FedEx Overnight a replacement from our warehouse in Tennessee and I'll be out in the morning to replace it." Well, fine. Here's to the next 18 hours on commercial power during thunderstorm season.

Now start the management suggestions: Can we bypass the UPS and run on our generator? (WTF?!?) No, our generator is as boinked as our UPS. And even if our generator *was* working, without the UPS ride-through we would drop power for a few seconds during the transfer, and we don't want to drop power, do we? Oh. Well, could we... (Sigh...)

The generator folks showed up on-site an hour later to do a full-load test. Thankfully, the generator does surge a bit (so no, we weren't imagining it...) and they poked around with fuel-pressure regulators and such. They got it fairly smooth. There's a big dump and surge when a 400 kW slam-dunk load hits it, but it does recover after about 15 seconds, and natural gas gensets are vulnerable to slam-dunk loads just by their nature. (Not as much quick torque availability as diesels...) And besides, the data center isn't a slam-dunk load. Well, the CRACs all come back up at once, but the big loads (the UPSes) idle for about 30 seconds and then walk back in over the next 30 seconds or so.

They're looking for a replacement governor, but the thing is 10 years old and they're having a hard time finding one. (NG gensets in that size are much less common than diesel...) So they did what they could for the generator and called it good.

And we stuck a couple of SAs in the data center overnight to do preemptive shutdowns of the big financial and personnel systems if lightning got too close. Of course, if we'd taken a power hit we'd have had exactly zero seconds to respond and things would have gotten ugly, but we lucked out.

END OF BAD DAY ONE
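PS: For anyone trying to keep the UPS's moods straight, here's the textbook decision logic for a double-conversion unit, sketched in Python. The thresholds and mode names are my own illustrative guesses, not Liebert's firmware; part of what made the day so baffling is that our unit wasn't following this script:

```python
# Textbook mode logic for a generic double-conversion UPS. All names and
# thresholds here are illustrative guesses, not Liebert's actual firmware.

NOMINAL_HZ = 60.0
INPUT_HZ_WINDOW = 3.0      # assume input is accepted within +/- 3 Hz
DC_BUS_SHUTDOWN_V = 400.0  # assume the inverter drops out somewhere around here

def select_mode(input_ok: bool, input_hz: float, dc_bus_v: float) -> str:
    """Pick an operating mode from the readings the front panel shows."""
    if dc_bus_v < DC_BUS_SHUTDOWN_V:
        return "bypass"             # DC bus too low to keep the inverter up
    if not input_ok or abs(input_hz - NOMINAL_HZ) > INPUT_HZ_WINDOW:
        return "battery discharge"  # input unusable; ride on the batteries
    return "normal"                 # rectifier feeds DC bus, inverter feeds load

# By this logic, a generator hunting between 54 and 66 Hz should have pushed
# the unit into battery discharge. Instead, the big UPS kept trying to run in
# normal mode on that input, which is why it sounded like a dying banshee.
print(select_mode(True, 66.0, 540.0))   # -> battery discharge
print(select_mode(True, 60.0, 540.0))   # -> normal
```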
 

ronelson

Ars Legatus Legionis
21,399
Subscriptor
quote:
Originally posted by llib:
Performed some Ohm's law bullshit with a rooster head and a cigar and got the generator back online.

Pics, or it didn't happen!

Sorry to hear about your bad day, but thank you for the interesting story. I hope for your sake there never is a day two!
 

Accs

Ars Legatus Legionis
12,557
Subscriptor
quote:
Originally posted by llib:
Back downstairs. Old frame out. Spare frame going in. Back upstairs. ...456...455...454... SHIT! Back downstairs. Almost done. "Two minutes," they say. Back upstairs. ...446...445... Normal DC bus: 540. ...442...441... Jeebus! How low can it get and keep the inverters running?

Usually around 390-400V. The "normal DC bus" is the charging voltage. The no-load voltage without a charge is usually ~480V.

quote:
Originally posted by llib:
Engineer came out PDQ and said, "Yep, you fried the input filter assy."

The UPS may have been the initial cause of the problem.

I don't think the oscillating input power would be enough to fry the monster coils they use in these things, and it shouldn't have caused the major changes in the rectifier buzz that you mentioned. There was probably a short somewhere further in (after the rectifiers). When the UPS guy comes back in, try to get the electricians to crank up the main breaker (to ~0.8) for a few minutes while this is tested. I expect he'll be replacing one or two DC filter cap banks, and possibly some rectifiers if they got overheated (actually, I think most of these are IGBTs).

EDIT: Added information on DC voltages.
 
I feel your pain. There is a housing construction project going on. The network side of our IT has always managed the 30 kVA and 60 kVA UPS support. Found out the 30 is dead. Very dead batteries. And the APC support has been dropped (contract screwup, not on our end). Found out from the power folks (after a few outages) that they need to replace the line between the main junction box and ours. BAAAD. I also found out that our fiber is buried God only knows where, but along the same run they have to dig for the new power. Then, to top it off, our NG Kohler went apeshit. We were able to get a diesel in until they could fix the NG, but it has not been a pleasant three weeks. Whiskey is a good painkiller after work, though...
 

llib

Ars Scholae Palatinae
649
Day II (Fri)

The input filter assembly arrived on site early morning. Two local lady delivery drivers picked up the part from FedEx Freight and spent 40 minutes driving around campus trying to find the loading dock on our building, where I was waiting. And waiting. And waiting. Turned out they went to the front door and tried to find me in my office. And this was *after* I told them twice I'd meet them at the dock. Then they backed in at a 45º angle and stopped 10 feet out from the lift. Guess they expected me to pick the 185-pound pallet up with my bare hands. Finally convinced them that they could actually back their van (not a truck, but a *utility van*. With open back doors! Sheesh!) a few more feet closer to the dock. Finally did end up just grabbing the damn thing and man-handling it across to the lift. Then giggly-little-driver couldn't find the shipping invoice for me to sign. Then finally did find it, but didn't have a pen... By that time I was mad enough to just carry it upstairs by myself. Another great start to a great day...

Anyway, the engineer and I got the old unit out and the new unit in with a little huffing and puffing. Torqued all the connections, buttoned up the unit, called the facility electrical crews to stand by, and got ready to fire it up.

The engineer put on the manual control box to walk it in very slowly. AC up. DC up. Batteries charging a bit. Left it like that for about 30 minutes, since the battery charge was way low from yesterday. Also gave the new input filter some time to "burn in". Lots of capacitors in that thing. Finally said okay, took it back down, removed the control box, and let it come up automatically. Everything normal. Good, good. Started to get a warm fuzzy.

Went to transfer load from bypass to inverter, and everything went straight to shit. Ground-fault alarms on the UPS and Total Harmonic Distortion (THD) alarms on all eight downstream PPCs. Short-circuit alarm. Loud buzz from the unit. Crap, here we go again...

Okay, the buzz was not quite as bad as yesterday, and at least there were no sizzling sounds. But, but... Battery discharge mode! WTH?!? Input power was there, but it indicated battery discharge at the same time! Makes no sense. Okay, let's go back to Bypass while we figure this out. SHIT! The control panel is frozen in a "Not OK To Transfer" status. ...532...531... (Getting to be routine by now.) The engineer called support while I went around silencing all the PPC alarms. (Hey, might as well do *something* useful...) Keeping an eye on the discharge... 520...519... Reset the last PPC. Blessed silence. Damn little piezo alarms can really shriek...

Walked back by the UPS and everything was normal. Everything. Thought I was seeing things. (Well, technically I was, but you know what I mean...) Looked again. Input current was a bit high, but we were pulling almost 40a of recharge current, so that was to be expected. No battery discharge, load on inverter, bypass available, sound back to normal, major big grin on my face. Went over to the engineer, still on the phone at my portable desk, and asked, "Hey! Great! What did you do?" His reply: "Huh?" Say goodbye to grin.

He told the call to hang on, and we went back to the unit. The fault had cleared, input loads were coming back down, and it still sounded great. We looked at each other and just didn't have a freaking clue what to think. It was like the last two days were a bad dream.

Further discussions with engineering support.
We came to the conclusion that (1) we really didn't want to leave it in a state where we weren't sure of its reliability, and (2) we really didn't want to screw around troubleshooting it further without having some spare parts on hand.

The prime suspect at this point was the switching SCRs, since the problem happened while transferring load to inverter. So some SCRs and a snubber board were shipped in overnight, and we would do further troubleshooting on Sunday morning (our normal maintenance window). Kinda screws up Mommie's Day, but hey, nobody ever said IT was a nine-to-five job. 'Specially if you run the damn data center... Also, the regional high mucky-muck engineer will come in to help. Warm fuzzies all over...

Day III (Sat)

No problem. UPS cranking right along like always. Stayed home and started putting my Hot Rod together. Installed CPU, cooler, memory, put the mobo into the case, installed disk drives and optical drives. Figured I'd install the PSU last to keep all the cable clutter out of the way. (You can see this comin', right?) PSU won't fit into the case with mobo/cpu/cooler installed. Counted to 1,000 by sevens. Disassembled most of what I had just assembled, installed the PSU, then reassembled what I had just disassembled. Put the damn thing away before I threw it through the window. Watched the race while drinking Crown and Coke. *Lots* of Crown and Coke...

Day IV (Sun)

Engineers on site for further troubleshooting. Parts came in Saturday night, so we're good to go.

Started by putting the load on Bypass (cringing, but it went okay). Powered down the unit and started by inspecting the three switching SCRs (one for each phase). These solid-state switches handle the transfers between inverter and bypass power for brief (sub one-second) moments until the mechanical switches catch up. Found one reading a little flaky, so they replaced it. Then found it read differently from the remaining two originals, so they replaced them, too. (Good idea, IMO. I like to replace things in sets.) Then went one step farther and replaced the snubber PCB as well. There went all the parts they just got!

Brought the UPS up through walk-in. Things looked okay. The facility electrical guys were monitoring current down in the basement, so we sucked it up and initiated load transfer to inverter. The UPS immediately faulted hard: battery discharge with input still available, funky sounds, and a couple seconds later all the PPCs tripped on THD alarms again. The sound got nastier and we started to smell hot wiring again, so we quickly took a few readings and went back to Bypass. Or rather, tried to go back to Bypass. The control panel screwed up again. By the time we figured out that we might need to cut the input power to the system, the breaker downstairs did it for us. Down went the CRACs, down went the UPS input power, down went... Jumped on the now-unlocked control panel and dropped the load to bypass before the transfer switch brought input power back up. Took the UPS down.

Well, shit. Okay, time for some research and a few calls back to the factory.

The assumptions are these: the inverter and batteries are probably okay, since they kept the load up while the problem was occurring. The input filter *shelf* is probably okay, since it's brand new and we think we got power off before it got too toasty. That leaves some large series inductors that are also a part of the input filter system, and the rectifiers.

Next step: wire around the input filter inductors, eliminating them from the system.
(I have no problem with a methodical approach to troubleshooting, I just wish we weren't doing it while we were on commercial power. I mean, these are some *large* wires and this will take a bit of time. /me looks warily at the clock, wondering how much luck we have left...)

An hour later and we're ready to try again.

Lather, rinse, repeat. Except we didn't wait for the feed breaker to trip; we manually tripped the UPS input breaker. No sense having the generator keep starting. Crap.

Well, now we're down to the rectum-friers. We're thinking one or possibly two rectifiers aren't firing correctly, and the other phase(s) are trying to compensate. This would explain the significant difference in input current when load transfer is attempted. (There's a rough sketch of what that imbalance can look like at the end of this post.) So rectifiers and a control board are on order, priority one, airport counter-to-counter. Should be in tonight. Not much to do other than button up and wait for the pieces to show up, so we let everybody go home to celebrate the last third of Mother's Day. Back home for some dinner and a nap, and we'll see what happens tonight.

And BTW: the generator is *still* boinked. They replaced the gas regulator valve on Saturday, but it didn't help; we were still surging when we first went to Emergency power. Can't find a replacement governor, the unit is apparently too old.

Other factors: Yes, we're overloading both the generator and the circuit breaker. Facilities bumped up the trip point a bit to keep us up during troubleshooting. We're looking at renting a 750 kW or so portable generator for a few months until our new infrastructure is up. Also, some of the generator problem is likely a result of the wildly unbalanced input power the UPS is apparently drawing.

Bottom line: This ain't fun no more. I sure hope these rectifiers/control boards fix it...

Also, please excuse any minor inconsistencies or inaccuracies. My mind is fried and I'm having a great deal of trouble keeping all this straight. I haven't even *begun* to mention the finger-pointing between engineers and facilities. And the management second-guessing... lol
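PS: For anyone wondering why a weak rectifier leg shows up as lopsided input current, here's a crude back-of-envelope sketch in Python. The numbers are invented for illustration (real three-phase rectifier math is messier), but the proportions are the point:

```python
# Crude illustration of why one dead or weak rectifier leg shows up as a big
# input-current imbalance. All numbers are made up for illustration.

load_kw = 240.0                          # constant load the DC bus must feed
phase_kw_healthy = load_kw / 3           # ~80 kW per phase when all legs fire

weak_leg_fraction = 0.25                 # assume one leg only conducts 25%
weak_kw = phase_kw_healthy * weak_leg_fraction
other_kw = (load_kw - weak_kw) / 2       # the other two legs pick up the rest

print(round(weak_kw), round(other_kw))   # 20 kW vs 110 kW per phase
# At 277 V line-to-neutral that's roughly 72 A on the weak leg versus ~397 A
# on each of the others: the kind of lopsided draw that trips breakers and
# makes generator governors hunt.
```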
 

oikjn

Ars Scholae Palatinae
1,015
Subscriptor++
With all that power going through your lines, you'll want to check for ground faults from the main feeds to the UPS. I'm guessing that one of the wires is starting to have its insulation break down and it's shorting enough to cause your issues, but not enough to trip the GFI instantly.

I've had issues with loose terminals in some high-power equipment... first it killed the area around the termination (like I think you described), and then, after we replaced the terminal blocks, we still had problems... turned out that while the terminals were loose, the wires literally vibrated like they were dancing (about a 3" shake... it was scary seeing these 300 MCM wires move as much as they did when we turned on the power). Simply by tightening the lugs we were able to stop the wires "dancing" (the wire movement we saw was ~100 ft from the terminals in question), but I was scared by all the movement we saw and I decided to have the wire pulled and replaced... turned out the insulation on one of the wires was cut through on a conduit elbow simply by the friction... if your wiring is 20 years old, 500 kW is an awful lot of power to be sending through 20-year-old wire... btw, what wiring size is it? Are you running close to the wire capacity?
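To put some rough numbers on that capacity question, here's a quick sketch. The ampacities are typical NEC 75°C copper values and the feeder example is hypothetical; a real check needs the actual table plus derates for conduit fill, ambient temperature, and so on:

```python
# Quick margin check: feeder load versus conductor ampacity.
# Typical NEC 75C copper ampacities for a few common sizes; a real install
# needs the actual table plus derates for conduit fill, ambient, etc.
AMPACITY_75C_CU = {
    "250 kcmil": 255,
    "300 kcmil": 285,   # the "300 MCM" size mentioned above
    "500 kcmil": 380,
}

def margin(load_amps: float, size: str, parallel_runs: int = 1) -> float:
    """Return load as a fraction of ampacity; anything over 0.8 deserves a hard look."""
    return load_amps / (AMPACITY_75C_CU[size] * parallel_runs)

# e.g. a hypothetical 600 A feeder on two parallel 500 kcmil conductors per phase:
print(round(margin(600, "500 kcmil", parallel_runs=2), 2))  # 0.79, already tight
```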
 

llib

Ars Scholae Palatinae
649
Still Day IV (Sun Night)

Parts on site. First option is a quick change of the Rectifier Control PCB. About 30 minutes and we're ready to try again. We're getting pretty good at this by now: expecting the failure and having a tech stand by at the switchboard downstairs to drop input power if the UPS misbehaves again. Experience has shown that it's the easiest way to get the UPS to respond to front-panel "Transfer to Bypass" commands. No sense letting the generator start or the input filter fry. Or the data center drop.

Anyway, UPS power-up. Do the whole manually-controlled slow walk-in thing. All appearances normal with no load. Bring the UPS back down, remove the manual control box, then power back up and let it walk in normally. AC up. DC up. Idle loads look acceptable. Ready for transfer.

As Yogi said, deja vu all over again. The UPS immediately faulted hard: battery discharge with input still available, DC ground fault, funky sounds, and a quick call to the tech to drop input so we could get back to bypass. Did all that *before* the generator was called or the PPCs tripped on THD alarms. Like I said, we're getting pretty good at this by now.

Well, not much left but the rectifiers. Nowhere near as big as I had thought. Little cube-like thingies, maybe six inches long and a couple inches square. Flat plate to mount on a heat sink on one side, a couple of *big* terminals for the inputs, and four small terminals for the control. Problem was that each corner had a cutout for the mounting screw, a 10mm bolt. Clearance between bolt head and cutout wall was so tight we couldn't get a socket in there. WTH? Stupid design. Finally had to take a socket over to the machine shop and grind the sides down until you could damn near crunch it with your fingers. Newer units use Allen bolts, but this UPS was vintage 2001.

Anyway, wrestled with the rectifiers for an hour and finally got them all replaced. Hooked up the manual control box again, buttoned up the doors, and got set to try one last time. Asked the engineer what the next step would be if this didn't work; he said, "Forklift." Now *that* was funny!

Tech back at the switchboard, walking in real slow to let the new rectifiers break in easily. AC up. DC up slooowly. Let it idle for a few minutes, then took it back down. Removed the control box, buttoned it up. The input filter series chokes were still bypassed, but no matter. Face shields down, main breaker on, walk-in started. Watched AC come up, DC come up, idle load currents looked good. Adjusted inverter phase lead to 2º. Okay to transfer. Well, here goes nothin'...

Hit control-transfer and ka-chunk (the sound of the transfer switches). No alarms, no ground faults, no THD, no... HEY! It's freaking working!! Nice normal buzz from the rectifiers/inverters, and readings were actually better than they were before all this started. Stood there staring at it, just daring it to break. But it just kept buzzin' along. It's 2300, thunderstorms due in around 0200. Considered going outside, shaking my fist at the cloudy sky, and yelling, "Bring it on!", but figured I'd better not press my luck.

Ran for about 10 minutes, then put the load back on bypass so we could shut it down and put the filters back together before they got too hot. (Toasty little things, even at the best of times. I've seen household furnaces blow out less heat than the UPS exhaust fans do...)

Anyway, got it all put back together, brought it back up, transferred load, smooth as silk.
Turned out the thunderstorms didn't hit until about 0300, so we had three hours to spare in our little four-day ordeal. Wasn't even close!

Learned a *lot* of lessons in all this. I'll post them next time and answer some questions, but for now I'm gonna get a good night's sleep for the first time in the last few days... :)
 

llib

Ars Scholae Palatinae
649
COMMENTS/QUESTIONS

Accs had it pretty much nailed when he suggested rectifiers. And yes, they are IGBTs. And the UPS did indeed turn out to be the initial cause of the problem.

And although the battery strings are 540v, they are really 40ea 12.6v 6-cell VRLAs. That gives 504v. But in battery discharge mode (with no input DC from the rectifiers), the string shows a starting value of 540 at the front panel. Haven't quite figured that out yet. Should be 504 max, and actually quite a bit lower since they'd be under heavy load. Still an open question here.

Xantathar also was correct in pointing out that we should have been monitoring our feed. That's one of the disadvantages of having the data center engineering staff and the campus electrical staff not communicating very well. Actual measured numbers were: big UPS, 310a; small UPS, 85a; HVAC load, 200a (!!). That last one was a big, unpleasant surprise. We'd been assuming (there's that word again...) about 125a for the cooling load, since our main 30t CRAC was normally running on building chilled water. But it does have two compressors as backups, and these do run from time to time, so their load must be considered. (And make sure they're all running when you measure!) So we basically had a 310a + 85a + 200a = 595a load *under normal conditions*.

And here's a kicker: the 800a main breaker is really a 1600a frame with an offset adjustment of 0.5, giving an 800a trip. BUT... when we pulled the breaker to change it in early troubleshooting, they found a current transformer (CT) back inside set at 1200a. So what we really had was 0.5 of 1200a, giving an effective breaker size of 600a. And that's right at what we were pulling *normally*. (FYI: big breakers have four trip modes: long-time, short-time, instantaneous (or peak), and ground-fault. Each can usually be adjusted separately.) When they tested the breaker, they ran 800a for 2.5 minutes before it tripped. But when the UPS faulted, the power monitor showed we were pulling over 1100a/phase peak. You don't have to be an engineer to see that wasn't going to work. (Power monitors are way cool. Our new infrastructure has the capability built in.) All the arithmetic is collected in the sketch at the end of this post.

As Zaphod suggests, we did consider shedding load. But the problem wasn't load related (at least we didn't *think* it was load related), and the decision was to keep systems up against a possible outage rather than bring some down for a certain outage. And we do have extensive backups, and we all know how absolutely reliable tape backups are... ;)

As for the UPSes, the big one was a Liebert 610-series, 300 kVA. The small one (which we basically ignored, it never having complained even once (other than battery discharges, of course...)) was a Liebert 120 kVA NX. And in spite of the wacky problems the 610 gave us, I would like to point out that even with all the rough handling and abuse we were giving it (and it was giving us), it never once dropped the production load. If that ain't fault tolerance, I don't know what is... And the engineering support we received was second to none.

Our problem right now is the generator (Cummins 500 kW natural gas).
I was going by the 675 kVA rating on the alternator housing and was thinking I was drawing 225 kVA on the 610 and 80 kVA on the NX, and guesstimating (bad mistake, guesstimating) maybe 150 on the HVAC, leaving plenty of room to put another 40-50 kVA of additional load on for a new customer we're about to start hosting. Guesstimating was bad, and a worse mistake was playing PF/kVA/kW games. But the worst mistake (and I won't get over this one for a while) was simply not going out and actually measuring the current we were drawing. To hell with kVAs and kWs, it's *amps* that trip the breakers! So the generator can supply 600a *at full capacity*. We were routinely drawing 600-620a. But you *never* want to load electrical equipment past 80%. That's the industry-standard derate, and it should never be ignored. So I ignored it. Actually, I never even considered it. We were derating the UPSes (actually, the Lieberts force it by giving an overload alarm at 80%), but we weren't derating the generator. Don't ask me why. Major brain fart on my part. And facilities didn't tell us (or didn't notice) that we were that loaded. I think a lot of us will be paying a lot more attention from now on...

And with the genset maximally loaded, of course the governor had no ability to handle the UPS fault input current of 477a (normal was 310 or so). And the power monitor recorded a half-second spike of over 1100a, just on the UPS. Ain't no wonder we brought the generator to its knees! We did find some water in the gas regulator that might have limited the fuel flow to some extent, but given the loads we were putting on it, I don't think it would have mattered.

What we're going to do now is lease a 350 kW unit and offload the HVAC onto the new set. That'll cut about 200a peak off the old generator, but we'll put about 50 or so right back on with the new customer. But we'll still have a net reduction of at least 100a, possibly 150 or so. That should get us by for another few months until our new infrastructure is up and running. Having a megawatt and a half on the new generators will be heaven! :)

Anyway, a few lessons learned and a bullet dodged. The biggest reason I posted all this (other than the stress relief it offered) was in the hope someone else might read this and say, "What a dumb ass. I'd never do anything like that. But maybe we'd better check..."

Cheers, y'all...

PS: My apologies for the thread name. I was going to put each part in a separate thread, but it didn't work out that way... 8)

PPS: And you're correct, borrell. There was a lot of stuff happening that I didn't even mention. There were more than a few really confusing times, and we found that it's very hard to think straight when you're on battery discharge, the UPS is frozen, and you're watching the DC bus drop, and drop, and drop...
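PPPS: Here's all the breaker and generator arithmetic from this post in one place, as a quick Python sketch. It assumes 480 V three-phase service; that's the assumption under which 500 kW works out to roughly 600a:

```python
# Breaker and generator arithmetic from this post, collected in one place.
# Assumes 480 V three-phase service.
SQRT3 = 3 ** 0.5
V_LL = 480.0

# Effective breaker trip: 1600a frame, 0.5 dial... but the CT inside was 1200a.
frame_dial = 0.5
ct_rating = 1200
effective_trip = frame_dial * ct_rating     # 600a, not the 800a we assumed

# Measured normal load: big UPS + small UPS + HVAC.
load_amps = 310 + 85 + 200                  # 595a

# 500 kW genset: amps at full kW output, then the 80% industry derate.
gen_amps = 500_000 / (SQRT3 * V_LL)         # ~601a
usable_amps = 0.8 * gen_amps                # ~481a we should have planned around

print(f"effective trip: {effective_trip:.0f} A")
print(f"normal load:    {load_amps} A")
print(f"generator max:  {gen_amps:.0f} A, usable after 80% derate: {usable_amps:.0f} A")
# 600 / 595 / 601 / 481 -> running the breaker AND the genset at ~100%, every day.
```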
 

Accs

Ars Legatus Legionis
12,557
Subscriptor
quote:
Originally posted by llib:
And although the battery strings are 540v, they are really 40ea 12.6v 6-cell VRLAs. That gives 504v. But in battery discharge mode (with no input DC from the rectifiers), the string shows a starting value of 540 at the front panel. Haven't quite figured that out yet. Should be 504 max, and actually quite a bit lower since they'd be under heavy load. Still an open question here.

First, the 540V is the "float" voltage. This is the voltage that the AC->DC converter tries to create. It's the voltage that's used to keep the batteries charged. When the batteries are charging, that float voltage will vary, to keep the charge current from being too high. The inverter will compensate for this.

The 504V (I thought lead-acid batteries were 2V per cell; your numbers are based on 2.1V per cell) is the voltage that the batteries will stabilize at with no load or charging. It may take a while to get there after the charging current is removed. This is something that you (being in a production data center) should only see briefly.

When the charging current is removed, the batteries act like capacitors for a bit. The voltage will quickly drop to the 504V point (as the capacitive effect holds the float charge, and the load bleeds it off), then it will drop more slowly (as the batteries discharge).

quote:
Originally posted by llib:
To hell with kVAs and kWs, it's amps that trip the breakers!

Correct.

quote:
Originally posted by llib:
you never want to load electrical equipment past 80%.

I've seen this incorrect statement time and time again. As per the NEC, you don't want a NORMAL load to exceed 80% of rated load. Occasional peaks above this are acceptable.

With your main breaker settings, you were WAY out of where you should have been. Even taking readings would have left you thinking that all was well when, in fact, you were in trouble.

There's one important take-away from this: EVERY critical setting should be recorded as the gear is installed. If it's already installed and you don't have the numbers, get them the next time there's a scheduled power outage (it may take a few such outages to get ALL the numbers).
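For reference, the whole string arithmetic in one short sketch (the per-cell values are typical lead-acid figures, not measurements from your unit):

```python
# String arithmetic for a nominal-540V DC bus.
cells = 40 * 6   # 40 twelve-volt blocks x 6 cells each = 240 cells

per_cell = {
    "float (charging target)":  2.25,  # 240 * 2.25 = 540 V, the front-panel "normal"
    "open circuit, rested":     2.10,  # 240 * 2.10 = 504 V, llib's 40 x 12.6 V
    "open circuit, 2.0 V/cell": 2.00,  # 240 * 2.00 = 480 V, my no-load figure
    "inverter dropout (typ.)":  1.67,  # 240 * 1.67 = ~401 V, the 390-400 V floor
}
for label, v in per_cell.items():
    print(f"{label:26s} {cells * v:6.1f} V")
```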
 

llib

Ars Scholae Palatinae
649
quote:
Originally posted by Accs:
First, the 540V is the "float" voltage.

Yes indeed. The charger voltage will always have to be higher than the battery voltage, or there would be no charge. So you're saying that the batteries will act similarly to capacitors for some period of time after the 540v charge is removed, but will eventually fall back to the nominal 504v? Gotcha. I think I remember that: the initial voltage dropped pretty quickly (and scared the crap outta us, too!) but then the discharge rate slowed some...

quote:
Originally posted by Accs:
I've seen this incorrect statement time and time again. As per the NEC, you don't want a NORMAL load to exceed 80% of rated load. Occasional peaks above this are acceptable.

I thought that's what I said. Or at least meant. Yes, your normal (or sustained, long-term) load should not exceed 80%. That gives you the 20% margin to handle the unexpected. Encroach upon it at your own risk. Although the Liebert UPSes will *not* let you encroach upon it: OL alarms at 80%. Also, they will not transfer from bypass to inverter *unless* you're at or below 80%. That's a little gotcha a lot of folks might not realize.

quote:
Originally posted by Accs:
With your main breaker settings, you were WAY out of where you should have been

And that's the understatement of the year! We were routinely running both our genset *and* our main CB at 100% sustained load. It was quite the surprise to find our supposedly 800a breaker was really 600. And an even bigger surprise to find that we were routinely drawing 600. We were very lucky. Actually, we found that we were routinely drawing about 560a. The 30t CRAC is dual-source and normally runs on chilled water. Only if both compressors are running do we draw the full 600-odd amps. Of course, you must consider the worst case, with everything running, when computing max loads.

As you said, we were very lucky in many ways...

And I now spend time every day taking all the UPS displayed readings. Only takes a few minutes. I'm not sure how often we'll take clamp-on readings, etc., but I'd like to do it at least quarterly, and on an ad hoc basis after any significant change...

afidel, that's kinda what I thought, too. But our facility electricians want to derate it as well. I'm willing to go along with that just to keep peace in the family.
 

Arbelac

Ars Tribunus Angusticlavius
7,654
quote:
Originally posted by afidel:
Generators are generally good to about 110% of rated load, especially if it's just peak. We unknowingly ran ours above that for a while because there were quite a few building circuits on generator we didn't know about; our datacenter draw was about 85% of rated load. It was fun to find that out with a power survey.

If it's a 'prime' generator, then generally yes, it'll have the 110% capacity rating.
 

Baeocystin

Ars Legatus Legionis
17,424
Subscriptor
quote:
Originally posted by llib:
And although the battery strings are 540v, they are really 40ea 12.6v 6-cell VRLAs. That gives 504v. But in battery discharge mode (with no input DC from the rectifiers), the string shows a starting value of 540 at the front panel. Haven't quite figured that out yet. Should be 504 max, and actually quite a bit lower since they'd be under heavy load. Still an open question here.
I studied lead-acid batteries a few years back when working on an electric car project. One of the things I learned was that the voltage per cell isn't fixed; there's actually a pretty wide range you can set it at via chemistry and cell construction. You hit the gassing threshold at 2.4V/cell, so almost all applications keep it below that. Even then, occasionally you need to drive the cells past the gassing threshold to equalize them; if you don't, the batteries' lifetime will be compromised.

When taken off charge, a 12-volt lead-acid battery will actually be pushing out ~13.6 volts. This drops pretty rapidly as the surface lead on both plates reacts, forming a skin that inhibits the chemistry slightly, and gives us the 12 volts we expect.

Here's an Excel sheet that graphs the available power from a lead-acid battery as a function of discharge rate. Peukert's law made visible, and as good an example as any of why it's always better to have more battery than you need.

I'm glad everything worked out ok. :)
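For anyone without the Excel sheet handy, the same idea in a few lines of Python. The constants are illustrative lead-acid values, not from any particular battery:

```python
# Peukert's law: available runtime shrinks faster than linearly as the
# discharge current rises. Constants are illustrative lead-acid values.
H = 20.0   # hours of the rated discharge (the 20-hour rate is common)
C = 100.0  # rated amp-hours at that rate
k = 1.2    # Peukert exponent, typically ~1.1-1.3 for lead-acid

def runtime_hours(current_amps: float) -> float:
    """Peukert's law: t = H * (C / (I * H)) ** k"""
    return H * (C / (current_amps * H)) ** k

for amps in (5, 10, 25, 50, 100):
    t = runtime_hours(amps)
    print(f"{amps:4d} A -> {t:6.2f} h, effective capacity {amps * t:6.1f} Ah")
# At the 20-hour rate (5 A) you get the full 100 Ah; at 100 A you get barely
# half of it, which is why you always want more battery than you think you need.
```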
 

ronelson

Ars Legatus Legionis
21,399
Subscriptor
llib,

I realize that your UPS is much larger than, say, a desktop PC, but when you had that many components fail on the first and second tries, what was your logic for avoiding the "forklift"? It worked out in the end, but if you had not had the correct part on hand on the third try, I guarantee your manager would have asked why you did not do it. Support issue? Building access? Just hate to do that kind of forklift replacement?
 
quote:
Originally posted by calis00:
My guess is that it's a full rack UPS. Those suckers are no fun to move. Especially if you have one that's got a transformer in there to bring 480 down to 208.

I'm guessing this is the sort of situation where "forklift" is not a metaphor, euphemism, or shorthand. You'd be needing a real forklift.
 