[Beowulf] Problems with Dell M620 and CPU power throttling
hahn at mcmaster.ca
Fri Aug 30 09:00:05 PDT 2013
> Of course we have done system tuning.
sorry for the unintentional condescension -
what I actually meant was "tuning of knobs located in /sys" :)
> Instrumenting temperature probes on individual CPUs has not been performed.
> When we look at temperatures from both the chassis and ipmitool, we see no
> drastic peaks. Maybe we are getting a 60C peak that we don't detect and that
> is the cause. But I doubt it.
could you try "modprobe coretemp",
and see whether interesting things appear under:
afaik, reading the coretemp*/temp*_input values would let you do
higher-resolution monitoring to see whether you're getting spikes.
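fwiw, here is a minimal sketch of that kind of polling in Python. it is not a
tested tool, just an illustration: the function names (find_sensors,
read_celsius, sample_peaks) are made up for this example, and the directory you
pass as "base" is an assumption, since where coretemp shows up in sysfs depends
on the kernel. the one thing I'm relying on is that the temp*_input files
report millidegrees Celsius.

```python
import glob
import os
import time


def find_sensors(base):
    """Return sorted paths matching coretemp*/.../temp*_input under base.

    'base' is an assumption supplied by the caller; the sysfs location of
    the coretemp directories varies between kernel versions.
    """
    pattern = os.path.join(base, "coretemp*", "**", "temp*_input")
    # recursive=True lets "**" match zero or more intermediate directories
    return sorted(glob.glob(pattern, recursive=True))


def read_celsius(path):
    """Read one temp*_input file (millidegrees C) and return degrees C."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def sample_peaks(base, seconds=10.0, interval=0.1):
    """Poll every sensor for 'seconds' and return {path: peak degrees C}."""
    sensors = find_sensors(base)
    peaks = {}
    deadline = time.monotonic() + seconds
    while sensors and time.monotonic() < deadline:
        for path in sensors:
            try:
                temp = read_celsius(path)
            except (OSError, ValueError):
                continue  # unreadable or malformed sample; skip it
            if temp > peaks.get(path, float("-inf")):
                peaks[path] = temp
        time.sleep(interval)
    return peaks
```

run sample_peaks(base, seconds=60, interval=0.1) while the job is going, with
base pointing at whatever directory holds the coretemp* entries on your nodes,
then print the returned dict. if a core briefly hits a peak that the
chassis/ipmitool numbers smooth over, it should show up there.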
> power consumption is around 80W. That tells me that the system is cool
> enough. Should I not believe those values? i have no reason to from past
I'm not casting aspersions, just that chassis temps don't tell the
whole story. is your exact model of CPU actually rated for higher
power? we've got some ProLiant SL230s Gen8 with E5-2680's - rated
for 130W, and they don't seem to be throttling.
> Input air is about 22C. For our data center, you'd have a better chance of
> getting this adjusted to 15C than I would! As for fans, these don't have
yes, well, it is nice to have one's own datacenter ;)
but seriously, I find it sometimes makes a difference to open front
and back doors of the rack (if any), do some manual sampling of air
flow and temperatures (wave hand around)...
> For heat sink thermal grease problems, I'd expect this to be visible using
> the ipmitools but maybe that is not where the temperatures are being
> measured. I don't know about that issue. I'd expect that a bad thermal
> grease issue would manifest itself by showing up on a per socket level and
> not on both sockets. It seems odd that every node exhibiting this problem
> would have both sockets having the same issue.
well, if both sockets have poor thermal contact with heatsinks...
I'm not trying to FUD up any particular vendor(s), but mistakes do happen.
I was imagining, for instance, that an assembly line might be set up
with heatsinks and thermal compound tuned for E5-2637 systems (80W/socket),
but then pressed into service for some E5-2690 nodes (135W/socket).
> Again, the magnitude of the problem is about 5-10% at any time. Given 600
if I understand you, the prevalence is only 5-10%, but the magnitude (effect)
is much larger, right?