[Beowulf] A couple of interesting comments
gerry.creager at tamu.edu
Wed Sep 24 07:33:57 PDT 2008
Prentice Bisbal wrote:
> Oops. e-mailed to the wrong address. The cat's out of the bag now! No
> big deal. I was 50/50 about CC-ing the list, anyway. Just remove the
> phrase "off-list" in the first sentence, and that last bit about not
> posting to the list because...
> Great. I'll never get a job that requires security clearance now! ;)
> Prentice <---- still can't figure out how to use e-mail properly
Recently, I was proven to be unable to handle spreadsheets. That can be
embarrassing when I claim to be able to manage and write numerical models...
> Prentice Bisbal wrote:
>> I wanted to let you know off-list that I'm going through the same
>> problems right now. I thought you'd like to know you're not alone. We
>> purchased a cluster from the *allegedly* same vendor. The PXE boot and
>> keyboard errors were the least of our problems.
>> First, our cluster was delayed 2 months due to shortages of the network
>> hardware we specified. It was not the vendor standard for clustering,
>> but still a brand they resold.
>> When it did arrive, the doors were damaged by the inadequately equipped
>> delivery co.
>> When the technician arrived to finish setting up the cluster, he
>> discovered that the IB cables provided were too short to be within spec:
>> the bend radius would be too tight, and they were too short to be supported
>> from above the connectors.
>> And, the final problem I'm going to mention: the fiber network cables to
>> connect our ethernet switches to each other (we have Ethernet and IB
>> networks in this cluster) were missing.
>> It's been over two weeks since our cluster arrived, and one week since
>> the technician noticed these shortages and reported them. Still haven't
>> had these problems rectified, and the technician will have to fly to our
>> site again in a couple weeks to complete the installation.
>> I'm writing an article about this experience for Doug to publish. I
>> haven't posted this to the mailing list b/c I'm not sure what my
>> management will be happy with me sharing (the article will be reviewed
>> by them before publishing).
I'll add that we paid for next-day service, but I continue to be amazed
that this means Matt or I have to evaluate and troubleshoot the node
before the vendor sends out service. We can manage to drag "next
business day" out a few more days, somehow.
Only some of our iSCSI cables were sent, but we were told we'd gotten what
they interpreted to be the right number; we bought more and it only took
a week or so to get them in. We discovered that the RAID shelves we'd
gotten weren't capable of hardware RAID6, even though the RFQ specifically
called for it, so we're doing JBOD/software RAID6 (our experience has proven
that we NEED RAID6). When we enquired about giving back the RAID
shelves we were told that wasn't a possibility.
My impression is that the vendor is well-suited for small-medium
business-based clusters, but unfamiliar with how things work in the *nix
world, overall (I know there are exceptions). I am concerned that each
of our compute nodes is, to them, just another webserver, and if it's
mission critical, we should have bought all sorts of additional services
and a shelf-spare server. Or maybe we should just virtualize (yeah!
that's the ticket! a virtual HPC cluster?).
We're starting to look again for HPC resources, but I doubt this vendor
will be asked to bid.
>>> We recently purchased a set of hardware for a cluster from a hardware
>>> vendor. We've encountered a couple of interesting issues with bringing
>>> the thing up that I'd like to get group comments on. Note that the RFP
>>> and negotiations specified this system was for a cluster installation,
>>> so there would be no misunderstanding...
>>> 1. We specified "No OS" in the purchase so that we could install CentOS
>>> as our base. We got a set of systems with a stub OS, and an EULA for
>>> the diagnostics embedded on the disk. After clicking thru the EULA, it
>>> tells us we have no OS on the disk, but does not fail over to PXE.
>>> 2. BIOS had a couple of interesting defaults, including warn on
>>> keyboard error (Keyboard? Not intentionally. This is a compute node,
>>> and should never require a keyboard. Ever.) We also find the BIOS is
>>> set to boot from hard disk THEN PXE. But due to item 1, above, we never
>>> can fail over to PXE unless we load up a keyboard and monitor, and hit
>>> F12 to drop to PXE.
>>> In discussions with our sales rep, I'm told that we'd have had to pay
>>> extra to get a real bare hard disk, and that, for a fee, they'd have
>>> been willing to custom-configure the BIOS. OK, with the BIOS this isn't
>>> too unreasonable: They have a standard BIOS for all systems and if you
>>> want something special, paying for it's the norm... But, still, this is
>>> a CLUSTER installation we were quoted, not a desktop.
>>> Also, I'm now told that "almost every customer" ordered their cluster
>>> configuration service at several kilobucks per rack. Since the team I'm
>>> working with has some degree of experience in configuring and installing
>>> hardware and software on computational clusters, now measured in at
>>> least 10 separate cluster installations, this seemed like an unnecessary
>>> expense. However, we're finding vendor gotchas that are annoying at the
>>> least, and sometimes cause significant work-around time/effort.
>>> Finally, our sales guy yesterday was somewhat baffled as to why we'd
>>> ordered without OS, and further why we were using Linux over Windows for
>>> HPC. Not trying to revive the recent rant-fest about Windows HPC
>>> capabilities, can anyone cite real HPC applications generally run on
>>> significant clusters (I'll accept Cornell's work, although I remain
>>> personally convinced that the bulk of their Windows HPC work has been
>>> dedicated to maintaining grant funding rather than doing real work)?
>>> No, I won't identify the vendor.
>>> Gerry Creager -- gerry.creager at tamu.edu
>>> Texas Mesonet -- AATLT, Texas A&M University
>>> Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
>>> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843