[Beowulf] Re: UPS & power supply instability - ongoing discussions

David Kewley kewley at gps.caltech.edu
Thu Sep 29 15:57:55 PDT 2005


On Thursday 29 September 2005 10:54, Maurice Hilarius wrote:
> David Kewley wrote:
> OTOH, I am appalled by the fact that it has been 4 weeks since you
> reported the problem to Dell and Liebert and apparently they have
> done close to nothing about it.
> I know that on a job of this size, if you had bought from us, and
> reported it, we would be all over it.
> Even Dell can not afford this kind of bad PR..

I am disappointed that Liebert has known for four weeks that we have 
this problem, and until yesterday did essentially nothing that was 
visible to us, aside from sending field techs out to take measurements, 
and giving us assurances that they're working very hard on the problem.  
We haven't gotten the feeling that they'd been putting enough urgency 
on the problem.  If in fact they have, then they need to work on their 
communications with us.

After talking to Liebert engineers yesterday, I feel better about their 
response.  All the same, this is the only issue in the room that I'm 
actually *worried* about.  The other, non-power issues clearly will get 
addressed when I have the resources to do so.

Dell has been very responsive, proactively so, on the problems that seem 
to be in their domain.  It has seemed to us and to Dell, however, that 
the power problems are between Liebert and Caltech, so Dell has kept 
informed but has not attempted to address the perceived room problems.

In light of discussions about power supplies, however, I will keep an 
eye out for things I need to alert Dell to.  If it ends up that their 
power supplies should have different electrical characteristics, I will 
make sure the right people at Dell get the message.  They've been very 
welcoming of my constructive criticism to date, and I've already seen 
them improve their processes.

> So, DID you get any useful response from either Dell or Liebert 4
> weeks ago, and in the interim?

The response has been to send techs out a handful of times & have the 
District Manager talk with us on the phone.  I'd like to think that 
lots more is going on in the background, but I see few direct results 
so far from such activity.

"Useful" response?  Not useful as in solving the problem, no.  Just 
"working on it".  That's really not good enough for us, even if it is 
all they can do.  At least keeping us in the loop with their engineers 
increases my comfort level, while we wait for a solution.

> So, delays are partially because the staffing at your site is short
> and you simply do not have enough time to do what it takes to make it
> run? If so, I offer sympathy.
> I see this far too often.
> A budget of a million dollars for a cluster, but no cash to implement
> it or maintain it.
> That must be very frustrating!

Thanks for your sympathy.  I was very frustrated about understaffing 
three months ago or so, but it has been stated very clearly to me that 
I will be the only staff member.  In addition to my best efforts, we 
will continue to expect a lot from our vendors and from Caltech staff.  
My local colleagues and our vendors have on the whole been excellent -- 
we've gotten many times more work done in the past months than I could 
have done myself.  If I can automate enough pieces, we might be OK in 
the coming years. :)

I've accepted that "me plus staff and vendors" is the way it's going to 
be, and I do my best within those constraints, letting management know 
exactly where we are and exactly what is too much to expect.  It's 
worked OK so far.

> >To the best of my knowledge, Liebert has not studied these exact
> > power supplies, but they say they understand PSes that are similar
> > enough that they can work out a model of our specific problem. 
> > Until I have time to run experiments myself, I am going to trust
> > them to cover these bases.
>
> I would; in my experience they have a heck of a good rep.

Ditto for me.  Glad to hear you've had excellent experience with them.

> >>I have seen power regulation equipment fail in a similar fashion
> >> before, where the power supplies are pulling down too much current
> >> to the neutral phase,
> >>and making the power feed overload on one phase, driving it into
> >>instability.
> >>This is a classic symptom of cheap, poorly designed and made power
> >>supplies. Or bad room wiring, with undersized neutral lines.
> >
> >The PDUs have a front panel that displays lots of diagnostic
> > measurements, and they sound a rather piercing alarm when any
> > measurement goes over its Liebert-defined limit (they are the only
> > alarms I've heard in that room that can reliably be heard over the
> > room noise, from any part of the room :).  The PDUs also have
> > suitably sized breakers and suitably sized conductors on each of
> > the 93 branch circuits.
> >
> >The three output phase currents all stay well under their limits,
> > even when they begin to become unstable (at the low-power end of
> > the instability, and well into the instability domain).  Toward the
> > high-power end of the instability domain that we've tested, the
> > current oscillations become large enough, and sit on top of a large
> > enough average current, so the PDUs *do* give overcurrent alarms
> > (plus other alarms due to the wild oscillations).
> >
> >Unless something is going on that is not alarmed for, the PDUs and
> > the Liebert techs who've been onsite don't indicate any problem with
> > the neutral wiring or the power supplies per se.
>
> So, what DO they think is causing this? I am really curious..

I believe at the moment they're looking at it as a control theory 
problem, with multiple poles in the system, including the UPS, the 
PDUs, and the computer power supplies.  I suppose it's possible that 
the APC strips and wiring play significant roles as well, but that 
seems less likely.  Liebert's speculative workarounds involve reducing 
the magnitude of the poles.
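
To illustrate what "poles" means here, below is a minimal textbook-style 
sketch, not Liebert's model.  It assumes a source with series resistance 
R and inductance L feeding a bus capacitance C and a constant-power load 
P (a common simplification of a switching power supply).  All component 
values and the function names poles() and is_stable() are hypothetical 
placeholders, not measurements from our room.  The system is stable only 
if every pole has a negative real part.

```python
import cmath

# All component values are hypothetical placeholders, not measurements.
R = 0.05     # source resistance, ohms
L = 100e-6   # source inductance, henries
C = 1e-3     # capacitance at the load bus, farads
V0 = 208.0   # operating voltage, volts


def poles(P, R=R, L=L, C=C, V0=V0):
    """Poles of the linearized source + constant-power-load model.

    States are inductor current and bus voltage.  The load draws
    i = P / v, so its small-signal conductance is -P / V0**2 (negative).
    """
    g = P / V0**2                      # magnitude of the negative conductance
    trace = -R / L + g / C             # sum of the two poles
    det = (1.0 - R * g) / (L * C)      # product of the two poles
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def is_stable(P, **kw):
    """True if both poles lie strictly in the left half-plane."""
    return all(p.real < 0 for p in poles(P, **kw))


if __name__ == "__main__":
    for P in (10_000.0, 30_000.0):
        print(P, poles(P), "stable" if is_stable(P) else "unstable")
```

In this toy model the poles cross into the right half-plane once P 
exceeds R*C*V0**2/L, so raising the load alone can tip a previously 
stable system into oscillation.  Whether anything like this applies to 
our UPS/PDU/power-supply chain is exactly what remains to be determined.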

Ask me again for details when the problem is solved.

> >>Liebert make big UPS and power units, and those are their "bread &
> >>butter"
> >>
> >>Frankly I am surprised they have not yet dispatched a tech down to
> >> your site with test equipment by now..
> >
> >When did I say they haven't dispatched a tech to our site?  In fact
> > they have, multiple times; I just hadn't mentioned that up to this
> > point in this thread.
>
> Ah.. that paints one very different picture.
> So basically Liebert are on it, you have not mentioned what, if
> anything Dell have done, but you are coming to this list because
> after some weeks you still are not seeing a solution happening?

Liebert is on it, but results (anything more than assurances) have been 
much too slow.  So I came to this list to get ideas.

> If you had mentioned things like what you say about Liebert's actions
> to date, as you have in this message, it would have painted a
> different story entirely.

Yes.  I could not realistically include in my first emails all the 
details that you or anyone else might possibly consider important.  I 
waited for your questions before I elaborated further.  Hopefully all 
the important elements are on the table now.

> To meet CSA, CE, and UL one gets what is called a "site inspection".
> Often the best and cheapest way is to take the piece to a certified
> test lab and they do the test, provide a short report, and a sticker
> certifying it is electrically safe and acceptable.
> It is not an FCC radio emissions test and certification, but you can
> ask for that too, albeit at a higher cost.
> They measure power characteristics, PF, current leakage, consumption,
> stability, load maximum, etc.

Interesting.  I'll consider doing this, thanks.

David


