[Beowulf] gpu+server health monitoring -- ensure system cooling
eliote at microway.com
Tue Jun 9 15:20:33 PDT 2015
I can confirm that Ganglia supports Tesla K80 GPU monitoring just fine.
Regarding GPU temperatures, I'm seeing ~60C in one of NVIDIA's
officially-certified servers for Tesla K80 (4U Supermicro
SYS-7048GR-TR). You might not want to use Tesla K20/K40 as comparisons,
because they had lower levels of GPU Boost (and thus might not push the
TDP envelope as much).
On 06/08/2015 12:07 AM, Kevin Abbey wrote:
> Thank you each for the notes. The current host BIOS/BMC appears to
> read data from a MIC card but not the Nvidia. I'm considering finding
> a way to simply force an increased fan speed in the server for jobs
> using the GPU. I'll also ask Intel again if they can help, perhaps
> with a custom SDR file. I assume they have done this on their current
> generation of hardware, which would hopefully be portable to a
> Sandy Bridge board.
> Are there published average running temperatures for the K20, K40, and
> K80 GPUs? nvidia-smi reported 66C during a few test jobs. This is below
> the power throttle temperature on the GPU, but the utilization was still
> below 75%.
> Thanks, I'll check for the ECC errors too.
> On 6/7/2015 9:14 PM, Paul McIntosh wrote:
>> We use nvidia-smi also.
>> You should also keep an eye out for GPU ECC errors as we have found
>> these are good predictors of bad things happening due to heat.
>> Generally you should see none.
>> In the past we had major issues with the node heat sensors being
>> designed around detecting CPU heat and not the GPUs living in the
>> same box. A firmware upgrade fixed the issue, but the ECC checks were
>> the thing that best found the problem nodes.
>> ----- Original Message -----
>> From: "Michael Di Domenico" <mdidomenico4 at gmail.com>
>> To: "Beowulf Mailing List" <Beowulf at beowulf.org>
>> Sent: Monday, 8 June, 2015 7:50:40 AM
>> Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system
>> cooling
>> nvidia-smi will also show the current temperature of the card. You
>> could script it to save the results over time. It even offers XML
>> output if you're comfortable parsing it.
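One way to script nvidia-smi to save results over time is sketched below, assuming Python 3 and a driver whose nvidia-smi supports `--query-gpu` with CSV output. The function names and the 80C warning threshold are hypothetical placeholders, not vendor figures; pick a limit that suits your hardware.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; `nvidia-smi --help-query-gpu` lists the rest.
FIELDS = ["timestamp", "index", "name", "temperature.gpu", "utilization.gpu"]
WARN_TEMP_C = 80  # hypothetical alert threshold, not a vendor figure


def _to_int(value):
    """Return an int, or None when nvidia-smi reports something like [N/A]."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_samples(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        row["temperature.gpu"] = _to_int(row["temperature.gpu"])
        rows.append(row)
    return rows


def too_hot(rows, limit=WARN_TEMP_C):
    """Return the rows whose temperature is at or above `limit` degrees C."""
    return [r for r in rows
            if r["temperature.gpu"] is not None and r["temperature.gpu"] >= limit]


def read_gpus():
    """Take one sample from every GPU (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return parse_samples(subprocess.check_output(cmd, universal_newlines=True))
```

Running `read_gpus()` from cron or a sleep loop and appending the rows to a file gives a temperature history per card; `too_hot()` can feed whatever alerting is already in place.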
>> On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajdecon at ajdecon.org>
>>> Hi Kevin,
>>> nvidia-healthmon is the tool I've used for this kind of thing in the
>>> past. It can do temperature checks as well as some sanity checks for
>>> things like PCIe connectivity.
>>> For more general monitoring (i.e., compute and memory usage), I've used
>>> Ganglia with the NVML plugins. Not sure how well maintained these are
>>> On Friday, June 5, 2015, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>> I recently installed an NVIDIA K80 GPU in a server. Can anyone share
>>>> methods and procedures for monitoring and ensuring the card is cooled
>>>> sufficiently by the server fans? I need to set this up and test
>>>> running any compute tests.
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Center for Computational and Integrative Biology (CCIB)