[Beowulf] Building new cluster - estimate
bill at cse.ucdavis.edu
Tue Jul 29 23:42:19 PDT 2008
stephen mulcahy wrote:
> Bill Broadley wrote:
>> In general I'd say that the new kernels do much better on modern
>> hardware than the ugly situation of downloading a random RPM, or
>> waiting for official support. Seems like quite a few companies (ati,
>> 3ware, areca, intel, amd, and many others I'm sure) are trying hard to
>> improve the mainline kernel drivers.
>> I understand why RHEL doesn't change the kernel (stability, testing,
>> etc.), but not sure it's the best fit for HPC type applications,
>> especially with the pace of hardware changes these days.
> Hi Bill,
> My take on recent (2.6.x) mainline kernels was that there isn't as clear
> a distinction between production quality and developer quality kernels
Yup, pretty much all the mainline kernel.org releases receive a fair bit of
testing and, percentage-wise, change very little. Occasionally there is an
exception, like what happened in, er, I think it was 2.6.10, when they changed
either the MMU or scheduler.
> these days as there used to be in the previous even/odd
> production/developer kernels. From scanning the kernel releases, it
> looks like you'd want to stay a minor revision or two behind the
> bleeding edge if you want some stability.
Sure, although I'm not sure whether you mean 2.6.24 when 2.6.26 is out, or an
older 2.6.25.x point release when a newer one is out. It seems pretty rare that
any mainline kernel is outright unstable, and even when it is, it's usually a
particular problem that affects a relatively small fraction of users...
something I'd hope would be exposed by relatively simple testing.
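To make that rule of thumb easy to check, here's a small sketch of comparing
kernel version strings numerically (the helper names are hypothetical, not
anything from kernel.org tooling):

```python
# Hypothetical helpers for the "stay a point release or two behind"
# rule of thumb; nothing here comes from kernel.org itself.

def parse_kver(version):
    """Parse a version string like '2.6.25.4' into an int tuple, so
    versions compare correctly ('2.6.9' < '2.6.10')."""
    return tuple(int(p) for p in version.split("."))

def same_series(a, b):
    """True if two versions are in the same 2.6.x stable series."""
    return parse_kver(a)[:3] == parse_kver(b)[:3]

def point_releases_behind(candidate, latest):
    """How many point releases candidate trails latest within one
    stable series (e.g. 2.6.25.4 is 6 behind 2.6.25.10)."""
    c, l = parse_kver(candidate), parse_kver(latest)
    assert c[:3] == l[:3], "only meaningful within one stable series"
    c_patch = c[3] if len(c) > 3 else 0
    l_patch = l[3] if len(l) > 3 else 0
    return l_patch - c_patch
```

The tuple comparison matters because naive string comparison gets it wrong:
"2.6.9" sorts after "2.6.10" as text, but not as a version.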
With HPC-type use, if a kernel dies in production I'll revert. Sure, I like to
run reliable clusters, but I'm usually abandoning the CentOS kernel because of
a major win like more reliable RAID.
But sure, I'd recommend joining the kernel list if you run a kernel.org kernel,
to see if people start screaming bloody murder. I'd strongly recommend a mail
reader that supports threads; it's basically impossible to read all of it
otherwise.
> Has this been your experience or do you have extensive test facilities
> before rolling out mainline kernels onto production systems?
Extensive test facilities... no, definitely not. Enough to see that the CentOS
kernels are completely broken on my hardware... often. RAID corruption,
dropped disks, horrible network performance, unsupported cards, poor memory
performance, assuming the wrong defaults for a CPU, missing PCI IDs, a driver
disabled because someone somewhere on the planet made a broken motherboard,
NUMA issues, CPU frequency issues, CPU temperature sensor issues, etc.
But for the 10 clusters I run, I usually make decisions for the file servers
and compute nodes separately, and I have a workload that I use to decide
whether a kernel is good enough to try in small production runs. It's not
particularly comprehensive, but it definitely tests the stuff I use heavily.
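As a rough illustration of that kind of pre-production check (the paths and
steps here are placeholders, not the actual harness described above), a
minimal smoke test might look like:

```shell
#!/bin/sh
# Hypothetical kernel smoke-test sketch; workload steps and paths are
# placeholders. The idea is to run a few quick checks on a candidate
# kernel and keep the numbers to diff against a known-good baseline.
set -e

out=/tmp/kernel-smoke-test
mkdir -p "$out"

# Record which kernel is under test.
uname -r > "$out/kver"

# Quick sequential-write sanity check (64 x 1 MB); a real run would
# target the actual RAID volume, not /tmp.
dd if=/dev/zero of="$out/ddtest" bs=1M count=64 2> "$out/dd.log"

# A real harness would continue with memory bandwidth (STREAM), network
# throughput (iperf), and a short burst of the site's own jobs.
echo "smoke-test artifacts in $out"
```

The point is less the specific tests than keeping the output around, so a new
kernel's numbers can be compared against the last one that worked.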
After all, I'm using something like less than 1% of the kernel: very few
drivers, and my hardware is identical (at least within a cluster).