[Beowulf] Node Drop-Off
Tony Ladd ladd at che.ufl.edu
Mon Dec 4 07:35:29 PST 2006
Tim,

Our university HPC cluster had similar problems with dual-core Opteron 275s. They had about 20 bad ones out of a batch of 400. The nodes would run OK for a while and then die. It took many months to track down the source of the problem. AMD gave the same lame excuse: bad QA.

Tony

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Tim Moore
Sent: Monday, December 04, 2006 10:16 AM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Node Drop-Off

Update to node drop-off:

I wrote a few weeks ago to ask about node drop-off. A quick note... I had a cluster run for 3 years without failure, and then I upgraded the Opteron 240 CPUs to 250s. The upgrade required a BIOS upgrade and, while I was at it, I upgraded the OS and security. Some readers provided good suggestions for diagnosis. As it turned out, of the 16-CPU batch, two were flawed. No success came from replacing power supplies and HDDs or from resetting the memory and the cooling solution.

The CPU flaw only manifested itself (at first) after several hours of CPU usage. With each failure, the time before the next failure shortened, and by the time I figured it out it was down to about 2 minutes. The AMD engineer with whom I talked was amazed that such CPUs made it beyond quality control. He also suggested that the vendor may have inadvertently mixed returned processors (previously determined to be flawed) with the new ones and sent them out (again) as new.

Just for future reference... is there an easy way to determine if a CPU is flawed without 2 weeks of down time and extensive hair extraction?

Tim
