[Beowulf] UPS & power supply instability

David Kewley kewley at gps.caltech.edu
Thu Sep 29 09:46:39 PDT 2005


On Thursday 29 September 2005 07:46, Mark Hahn wrote:
> > We have a Liebert 600 Series 500kVA UPS feeding two Liebert PDUs.  The
> > PDUs then have a fanout of whips to the computer racks.
>
> the PDU's are just simple networks, not matching transformers or
> harmonic mitigators?  (for our new ~2k cpus machineroom, the local
> physical plant people required us to put in Liebert harmonic mitigators,
> even though we told them all PS's would be PFC.  from HP, if that
> matters.)

I'm afraid I don't know that level of detail (yet :).  I believe the Liebert 
folks said yesterday that our PDUs have surge suppression circuitry, but no 
other filter caps.  And I believe the transformer is a simple isolation & 
stepdown, not filtering.  Certainly there are no caps except associated 
with the surge suppressor.

Someone else on this thread mentioned K20 transformers, which are designed 
to tolerate relatively high harmonics.  These PDU transformers are indeed 
K20, and if I recall correctly, the THD measured by the PDUs is well under 
the K20 limit.  Certainly the PDUs haven't alarmed on THD, nor have the 
onsite techs raised any issue with the THD.

> > The UPS voltage in/out and the PDU voltage in is 480V 3ph.  The PDU out
> > is 208V 3ph +neutral, 120V wrt neutral.  The whips are 5-conductor
> > (3ph, ground, neutral), and they feed APC AP7960 switchable rack PDUs.
> > The computers are fed 120V from the AP7960s.
>
> it shouldn't be relevant, but did you choose against 208 to the nodes
> for a reason?  (nearly everything is auto-ranging nowadays, and tends
> to run a little more efficiently at 208).

We now wish we had 208.  Unfortunately, when the room was designed & the 
power infrastructure built, we anticipated having an unknown mix of 
equipment in the room, some of which might not handle 208 gracefully.  
(Actually the "we" in the design phase didn't include me; I arrived after 
the room was fully built but not yet populated with computers.)

By the way, I have no previous experience with medium or large data centers, 
so I had no idea until recently that it was common to supply 208 to 
machines.  120 had always been sufficient in my earlier experience.

> > I've balanced the loads on the three phases about as well as possible.
> > We still have neutral current, about 1/3 to 1/2 the magnitude of any of
> > the per-phase currents.
>
> yow!  isn't that very high?  we had an anti-neutral-current squad on
> campus earlier this year, and they freaked out over our old machineroom
> which had neutral that was about 10% of the others...

Yeah, it seems high to me.  The PDUs aren't alarming on it though, and the 
Liebert folks haven't raised any concerns about it.

I have to learn more about where this current is likely coming from, but I 
think I'm hearing that it could be caused by harmonics?
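
A rough sketch of how harmonics alone could produce neutral current even with 
perfectly balanced phases. It assumes each phase draws an identical current 
made of a fundamental plus a third harmonic; the 15% third-harmonic fraction 
is a made-up illustrative number, not something measured in this room. The 
third-harmonic components of the three phases line up in time, so they add in 
the neutral instead of cancelling.

```python
import math

def neutral_to_phase_ratio(third_harmonic_fraction, samples=3600):
    """RMS neutral current divided by RMS phase current for a balanced load.

    Hypothetical model: each phase draws sin(x) + h*sin(3x), with the
    three phases shifted by 120 degrees from one another.
    """
    h = third_harmonic_fraction
    phase_sq = 0.0
    neutral_sq = 0.0
    for n in range(samples):
        x = 2 * math.pi * n / samples
        currents = []
        for k in range(3):
            angle = x - k * 2 * math.pi / 3
            currents.append(math.sin(angle) + h * math.sin(3 * angle))
        phase_sq += currents[0] ** 2      # accumulate one phase for its RMS
        neutral_sq += sum(currents) ** 2  # neutral carries the sum of all three
    return math.sqrt(neutral_sq / phase_sq)

ratio_clean = neutral_to_phase_ratio(0.0)   # pure sine waves: neutral cancels
ratio_15pct = neutral_to_phase_ratio(0.15)  # assumed 15% third harmonic
print(f"no harmonics: {ratio_clean:.3f}, 15% third harmonic: {ratio_15pct:.3f}")
```

In this toy model the ratio works out to 3h/sqrt(1+h^2), so a third-harmonic 
content of roughly 11% to 17% would give neutral current of 1/3 to 1/2 the 
per-phase current. Real power supplies have other harmonics too, so treat 
this only as a plausibility check.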

> > The problem is this: We can fire up our cluster to about 40% of maximum
> > load and everything is fine.  But if we go over some threshold right
> > around 40% of max, the output currents from the PDUs go unstable.  It's
>
> "fire up" means power on at the same time?  what happens if you sneak up
> the load (say, one node per minute to be conservative.)?  I'm wondering
> whether part of your problem is inrush/spinup load.

At this moment, we don't yet have power-up/down automated over the network, 
so we actually go press all the buttons.  It's not as bad as it sounds -- 
it takes me probably 20 seconds to hit the buttons on a rack of 40 
computers, or 2 nodes per second roughly.

One node per minute would take most of a day to power up the whole cluster, 
so that's out. :)  In my brief discussions with others, 2/second seemed OK 
-- inrush should take less than 1/2 second, right?
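
The arithmetic behind that comparison, as a quick sketch. The node count is a 
placeholder chosen for illustration (the actual count isn't stated here); the 
two rates are the ones discussed above.

```python
def power_up_seconds(node_count, nodes_per_second):
    """Time to power on node_count machines at a steady rate."""
    return node_count / nodes_per_second

HYPOTHETICAL_NODES = 1000      # assumption, not the real cluster size

rack_rate = 40 / 20            # 40 nodes per rack in about 20 seconds
slow = power_up_seconds(HYPOTHETICAL_NODES, 1 / 60)     # one node per minute
fast = power_up_seconds(HYPOTHETICAL_NODES, rack_rate)  # button-pressing rate

print(f"one per minute: {slow / 3600:.1f} hours")
print(f"two per second: {fast / 60:.1f} minutes")
```

This ignores the time spent walking between racks, so the fast figure is a 
lower bound.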

These computers & power supplies have standby power.  The inrush from 
"unplugged" to "plugged in", that is, the standby power inrush, is 3A max.  
I've not measured whether there is any inrush associated with powering up 
fully, but I'd expect that not to matter much after .5 second.

As for hard drive spinup, I can't imagine the additional 20W or less per 
node would matter that much at 2 nodes per second.  Our room has been 
powered up to about 300kW on UPS bypass without a hitch.

> > This instability only happens when the UPS is online.  If we put the
> > UPS in bypass, we can go up to around 70% of max load with no
> > instability (all computers on but idling in the OS; we haven't tested
> > all nodes at 100% CPU yet).
>
> ouch!

Yeah.  Tell me about it. :/  Until this problem is solved, we have two 
reasonable choices:

a) Run fulltime with only 40% capacity.

b) Run overnight with 40% capacity with the UPS online, and 100% during the 
workday in UPS bypass.  Cross our fingers that our luck for the past year 
and a half holds, and we get no significant power events.  Oh, and make 
backups religiously (which should go without saying anyway).

David


