[Beowulf] gpu+server health monitoring -- ensure system cooling

Kevin Abbey kevin.abbey at rutgers.edu
Sun Jun 7 21:07:43 PDT 2015


Thank you each for the notes.  The current host BIOS/BMC appears to read 
data from a MIC card but not from the Nvidia card.  I'm considering 
finding a way to simply force a higher fan speed in the server for jobs 
using the GPU.  I'll also ask Intel again if they can help, perhaps with 
a custom SDR file.  I assume they have done this on their current 
generation of hardware, which would hopefully be portable to a Sandy 
Bridge board.


Are there published average running temperatures for the K20, K40, and 
K80 GPUs?

nvidia-smi reported 66C during a few test jobs.  This is below the GPU's 
power throttle temperature, but utilization was still below 75%, so this 
is not a full-load figure.
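
A minimal sketch of how those readings could be collected over time, 
assuming nvidia-smi is on the PATH and supports the --query-gpu CSV 
output.  The function names and the 80C warning threshold are 
illustrative assumptions, not values from Nvidia's documentation.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "index", "name", "temperature.gpu", "utilization.gpu"]

# Hypothetical warning threshold in degrees C; choose one suited to the card.
WARN_TEMP_C = 80


def to_int(value):
    """Convert a reading to int, or None for values like '[Not Supported]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        row["temperature.gpu"] = to_int(row["temperature.gpu"])
        row["utilization.gpu"] = to_int(row["utilization.gpu"])
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_gpu_csv(result.stdout)


def hot_gpus(rows, limit=WARN_TEMP_C):
    """Return the rows whose temperature is at or above the limit."""
    return [
        row for row in rows
        if row["temperature.gpu"] is not None and row["temperature.gpu"] >= limit
    ]
```

Calling query_gpus() from cron or a job prolog and appending the rows to 
a file would give a temperature history per card; hot_gpus() could then 
be used to flag a node before a job starts.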

Thanks, I'll check for the ECC errors too.
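
The ECC check could be scripted along these lines.  This is only a 
sketch: it assumes the ecc.errors.* query fields are available for the 
card and driver in use, and the helper names are my own.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
ECC_FIELDS = [
    "index",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]


def to_count(value):
    """Convert an error count to int, or None for '[N/A]' style values."""
    return int(value) if value.isdigit() else None


def parse_ecc_csv(text):
    """Map GPU index -> (corrected, uncorrected) from CSV output."""
    counts = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != len(ECC_FIELDS):
            continue  # skip blank or malformed lines
        index, corrected, uncorrected = parts
        counts[index] = (to_count(corrected), to_count(uncorrected))
    return counts


def query_ecc():
    """Run nvidia-smi once and return the ECC counts per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(ECC_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_ecc_csv(result.stdout)


def gpus_with_errors(counts):
    """Return the sorted GPU indices that report any nonzero ECC count."""
    return sorted(index for index, pair in counts.items() if any(pair))
```

Since the expectation is to see no errors at all, any index returned by 
gpus_with_errors(query_ecc()) would be worth a closer look.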
Kevin


On 6/7/2015 9:14 PM, Paul McIntosh wrote:
> We also use nvidia-smi.
>
> You should also keep an eye out for GPU ECC errors as we have found these are good predictors of bad things happening due to heat. Generally you should see none.
>
> In the past we had major issues with the node heat sensors being designed around detecting CPU heat and not the GPUs living in the same box. A firmware upgrade fixed the issue, but the ECC checks were the thing that best found the problem nodes.
>
> Cheers,
>
> Paul
>
>
> ----- Original Message -----
> From: "Michael Di Domenico" <mdidomenico4 at gmail.com>
> To: "Beowulf Mailing List" <Beowulf at beowulf.org>
> Sent: Monday, 8 June, 2015 7:50:40 AM
> Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system cooling
>
> nvidia-smi will also show the current temperature of the card.  you
> could script it to save the results over time.  it even includes xml
> output if you're savvy at parsing it
>
> On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajdecon at ajdecon.org> wrote:
>> Hi Kevin,
>>
>> nvidia-healthmon is the tool I've used for this kind of thing in the past.
>> It can do temperature checks as well as some sanity checks for things like
>> PCIe connectivity.
>>
>> http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
>>
>> For more general monitoring (i.e. compute and memory usage), I've used
>> Ganglia with the NVML plugins. Not sure how well maintained these are
>> though.
>>
>> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>>
>> Adam
>>
>>
>> On Friday, June 5, 2015, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>> Hi,
>>>
>>> I recently installed an Nvidia K80 GPU in a server. Can anyone share
>>> methods and procedures for monitoring and ensuring the card is cooled
>>> sufficiently by the server fans?  I need to set this up and test before
>>> running any compute tests.
>>>
>>>
>>> Thanks,
>>> Kevin
>>>
>>> --
>>> Kevin Abbey
>>> Systems Administrator
>>> Center for Computational and Integrative Biology (CCIB)
>>> http://ccib.camden.rutgers.edu/
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>

-- 
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey at rutgers.edu


