[Beowulf] GPU diagnostics?
Joe Landman landman at scalableinformatics.com
Mon Mar 30 15:31:17 PDT 2009
David Mathog wrote:
> Donald Becker wrote:
>> On Mon, 30 Mar 2009, David Mathog wrote:
>>
>>> Joe Landman wrote:
>>>> Vendors have an nVidia supplied *GEMM based burn in test. Been thinking
>>>> about a set of diagnostics end users can run as a sanity check.
>>> My suspicion is that vendors run such burn in tests only for a very
>>> brief time. That time being "the minimum time required to find the
>>> percentage of failed units above which it would cost us more if they
>>> were found to be bad in the field" - and not a second longer.
>> I don't know about other vendors, but that's not Penguin's approach.
>
> By "vendor" I meant graphics card vendors, not cluster or HPC vendors.
> My interest in this sort of diagnostic arose in relation to an
> inexpensive graphics card bought at Newegg. I was asking here
> specifically because it seemed likely that HPC vendors _would_ have
> the sort of GPU diagnostic I was seeking, and might be willing to share
> it. (As opposed to the tool Joe referred to, which seems not to be
> generally available.)

FWIW, we agree with (and implement something similar to) Don's burn in procedure, and yes, it sometimes annoys customers who want it *now*. But it also (massively) reduces infant mortality rates (and we have even designed new disk packaging to reduce the impact of the sometimes fatal disk malady named UPS/Fedex-osis).

This said, there really isn't a memory checker for GPUs just yet. Could be done, and probably should be ...

Also, likely we should have a long term crunching diagnostic, where we already know the answer to a computational problem, and simply have it burn cycles. But GPUs are more complex than this; we need to worry about PCIe bus transfers, several different flavors of memory, etc.

Really, since there is very little you can do if a GPU card is toast, other than replace it, it might be better to have the test done at this granularity.
-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
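
Not part of the original message: a minimal host-side sketch of the two diagnostics the post proposes, a pattern-based memory checker and a "known answer" crunching loop. Everything here is an assumption made for illustration. The names `fill`, `verify`, `run_pattern_test` and `known_answer_burn` are hypothetical, the bit patterns are just common choices, and the buffer is an ordinary `bytearray` in host memory. A real GPU version would presumably allocate device memory, push the patterns across the PCIe bus and read them back, which this sketch does not attempt.

```python
import struct

# Assumed test patterns: all zeros, all ones, and two alternating-bit words.
PATTERNS = (0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555)


def fill(buf, pattern):
    """Write a 32-bit little-endian pattern into every whole word of buf."""
    words = len(buf) // 4
    buf[: words * 4] = struct.pack("<I", pattern) * words


def verify(buf, pattern):
    """Read buf back word by word; return (offset, expected, got) per mismatch."""
    errors = []
    for offset in range(0, len(buf) - len(buf) % 4, 4):
        (got,) = struct.unpack_from("<I", buf, offset)
        if got != pattern:
            errors.append((offset, pattern, got))
    return errors


def run_pattern_test(buf):
    """Fill and verify buf once for each pattern; return all mismatches found."""
    errors = []
    for pattern in PATTERNS:
        fill(buf, pattern)
        errors.extend(verify(buf, pattern))
    return errors


def known_answer_burn(passes, n=10_000):
    """Burn cycles on a problem with a known closed-form answer.

    Sums i*i for i = 1..n and compares the result with n(n+1)(2n+1)/6.
    Returns the number of passes whose result was wrong.
    """
    expected = n * (n + 1) * (2 * n + 1) // 6
    failures = 0
    for _ in range(passes):
        total = 0
        for i in range(1, n + 1):
            total += i * i
        if total != expected:
            failures += 1
    return failures


if __name__ == "__main__":
    buffer = bytearray(1 << 20)  # 1 MiB stand-in for a block of card memory
    print("memory errors:", len(run_pattern_test(buffer)))
    print("compute failures:", known_answer_burn(passes=10))
```

Keeping `fill` and `verify` separate means a fault can be injected between them to confirm that the checker actually reports it, and it matches the write-then-read-back shape a device-memory test would likely have.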
