[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
Frank Gruellich frank.gruellich at navteq.com
Mon Apr 6 11:32:56 PDT 2009
Hi,

Rahul Nabar wrote:
> We had bought 23 Dell-SC1435-PowerEdge servers for our latest cluster
> addition mid-2008.

IMHO the SC1435 is some kind of low-cost metal from DELL. I would not use them if I wanted a reliable system, especially in HPC, where one failed system ruins your whole (maybe long-running) job. I found some SC1435s we bought before April 2008 and some before July 2008, but both deliveries had only minor problems after initial setup (as usual). From time to time a PDU or memory goes bad. But we do not have the load of HPC, so maybe our hardware is under less stress.

We had a cluster of 5 machines (DELL PE1950) from one delivery where we had to replace most of the PDUs (redundant, so no big deal). These machines have a bit more load, though I don't believe there is a connection. Sometimes DELL just has bad days.

> I contact Dell. Responses range from the clueless to absurd.

DELL support is a bit tricky. We have Silver or Gold support for most systems; I don't know how they work at lower levels. I can't complain about Gold. With Silver they always try to make us do things like cross-testing memory, CPUs, or other parts. (The most interesting request was to do a BIOS update to cure an obvious memory problem. The machine ran fine for 2 years with the old BIOS and memory combination, and suddenly it complains about it?) While I really like playing such hardware games, I just don't have the time for it. If you keep refusing these requests, eventually they give up and send a technician to replace various pieces of hardware.

> First, they convinced us it was Fedora. So I shifted to CentOS.
> They still claim CentOS is "unvalidated" but I refuse to spend a
> fortune to move over to RHEL like they want me to.

We use CentOS for most installations and DELL support has never complained about it. And IMHO the OS should not be able to cause an error detected by the management board.

> Then I go through the whole circus running dset,

I have dset reports in place before calling support, because they always request them. That speeds up the chit-chat a bit.

> Now I have a new machine go down and it's back to wasting my time
> going all over those debugging procedures.

IMHO that's not your job. I don't know about your setup, but if you have some dozens more machines next to this particular set, all running without problems, I would blame DELL.

> In spite of having paid for next-day service each time we have
> waited more than a month while Dell goes through all the debugging
> circus.

Keep refusing to do that. You don't have the time. They can do this debugging after they have replaced your hardware and given you back 100% of what you paid for.

> "The next day (service) refers to normal break fix issues that
> involve normal parts, since we may be replacing an entire server it
> may take longer".

I would not accept that.

> I'm just a grad-student here and without a dedicated sys-admin it
> takes a lot of time running all these testing etc. that Dell demands;
> I am ok running basic tests but if machines go bad during the warranty
> is such testing within my domain or the vendors?

That's another problem: IMHO your university should have a dedicated person taking care of the computer systems, someone who has the time to deal with DELL support and so on. 23 machines are not a full-time job, but maybe there is someone who is already taking care of some other Linux installation. It's not a good idea to have just a grad student doing that job part-time (no offense). I know that reality looks bad.

Kind regards,
-- 
Navteq (DE) GmbH
Frank Gruellich
Map24 Systems and Networks
Duesseldorfer Strasse 40a
65760 Eschborn
Germany
Phone: +49 6196 77756-414
Fax: +49 6196 77756-100
USt-ID-No.: DE 197947163
Managing Directors: Thomas Golob, Alexander Wiegand, Hans Pieter Gieszen, Martin Robert Stockman
