[Beowulf] File server dual opteron suggestions?
landman at scalableinformatics.com
Fri Aug 4 06:12:34 PDT 2006
Mike Davis wrote:
> I don't mean to hijack the thread, but if Dave's users can fit the db's
> that they are running against (Blast, for instance) in /tmp on the
> compute nodes, overall performance increases.
Yes, I agree with that. I like reminding customers that there is
nothing as fast in aggregate as a local file system.
> This certainly doesn't
> work with genbank (unless you have 130+GB of /tmp). But it does work well
> with nr, uniprot, and the other protein db's. I run relatively large
> /tmp filesystems on my nodes (55-100GB). But my nodes are more general
When we build systems for workloads dominated by large-block sequential
access (read/write), we focus on building a faster local IO capability. Like
RAM, compute node disk space is cheap, though for sequential-access
dominated loads, more spindles are again almost always better.
> purpose and may be running blast one day, Gaussian 03 or VASP the next,
> and Fluent or abaqus after that.
> The performance increase will depend on the size of the db, the size of
> client and server caches, and the number of spindles.
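A minimal Python sketch of the staging idea Mike describes above: check that the database files fit in a node's local scratch space before copying them there, and keep using the shared copy otherwise. The function name `stage_db`, the directory arguments, and the 10% headroom are illustrative assumptions, not anything specified in this thread.

```python
import os
import shutil


def stage_db(src_dir, scratch_dir, headroom=0.10):
    """Copy the files in src_dir to scratch_dir if they fit.

    Returns the directory the job should read from: scratch_dir when the
    copy succeeded, src_dir when the database is too big for local disk.
    """
    files = [os.path.join(src_dir, name) for name in os.listdir(src_dir)]
    files = [path for path in files if os.path.isfile(path)]
    needed = sum(os.path.getsize(path) for path in files)

    os.makedirs(scratch_dir, exist_ok=True)
    free = shutil.disk_usage(scratch_dir).free

    # Leave some headroom so the job's own temporary files still fit.
    if needed > free * (1.0 - headroom):
        return src_dir  # too big: keep reading over the network

    for path in files:
        dest = os.path.join(scratch_dir, os.path.basename(path))
        # Skip files that are already staged with the same size.
        already = os.path.exists(dest) and os.path.getsize(dest) == os.path.getsize(path)
        if not already:
            shutil.copy2(path, dest)
    return scratch_dir
```

A job prologue could call something like `stage_db("/shared/db/nr", "/tmp/db/nr")` (both paths hypothetical) and point the search at whatever directory comes back. A size-only check will not notice an updated database of identical size, so a real deployment would probably compare timestamps or checksums as well.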
> Joe Landman wrote:
>> Mark Hahn wrote:
>>>> I would recommend upping the memory. Computing or not, large buffer
>>>> caches on file servers are with very rare exception, a preferred
>>> unclear. the FS's memory does act as an excellent cache, but then
>>> the client memory does too. do you have a pattern of file accesses
>>> in which
>>> the same files are frequently re-read and would fit in memory? the
>>> I've looked at closely have had mostly write and attribute activity,
>>> since the client's own cache already has a high hit-rate. for
>>> writes, of
>>> course, more FS memory is not important unless you have extremely high
>> I was actually assuming read-dominated. Dave does informatics as I
>> remember, and most of the informatics we have dealt with tends to be
>> read dominated. Doesn't mean much without the workload info
>> though. So I agree with the caution, though I humbly note that a 1GB
>> stick costs about $120 +/- a bit these days. I.e., it is not a large
>> price, and the potential impact on performance is much higher than for
>> 10k RPM drives.
>> FWIW I have a pair of 10k RPM SATA raptors and I am not all that
>> impressed with them.
>>> bandwidth net and disks. in fact, I've been using the following
>>> # delay writing dirty blocks hoping to collect further writes
>>> (default 30s)
>>> vm.dirty_expire_centisecs = 1000
>>> # try writing back every 1s (default 500=5s)
>>> vm.dirty_writeback_centisecs = 100
>>> in short, don't bother working at write caching much. with a lot of
>>> an untuned machine will exhibit unpleasant oscillations of delaying
>>> then frantically flushing.
>> Yup. I had my dirty around 250 for a long time. Write caching is
>> harder because if you really want to play it safe, you shouldn't cache
>> the write ...
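The two sysctl values Mark quotes are in hundredths of a second, which is easy to misread. As a small illustration (the helper name `centisecs_to_seconds` is made up for this sketch), the following Python parses sysctl-style `key = value` lines and reports the `*_centisecs` settings in seconds:

```python
def centisecs_to_seconds(settings_text):
    """Parse 'key = value' lines and return the *_centisecs values in seconds."""
    result = {}
    for line in settings_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.endswith("_centisecs"):
            result[key] = int(value) / 100.0
    return result


# The settings quoted in the thread.
quoted = """\
# delay writing dirty blocks hoping to collect further writes (default 30s)
vm.dirty_expire_centisecs = 1000
# try writing back every 1s (default 500=5s)
vm.dirty_writeback_centisecs = 100
"""

print(centisecs_to_seconds(quoted))
# {'vm.dirty_expire_centisecs': 10.0, 'vm.dirty_writeback_centisecs': 1.0}
```

So with these settings dirty blocks become eligible for writeback after 10 s rather than the 30 s default, and the writeback thread wakes every 1 s rather than every 5 s.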
>>>> 2GB/socket minimum. Nothing serves files faster than having them
>>>> already sitting in ram.
>>> true, but is that actually your working set size? it would be rather
>>> embarrassing if 3 of the 4 GB were files read once a month...
>> Hmmm... again, this is a good workload problem. If Dave's users are
>> going through big "databases" from NCBI, lots of ram is a good thing.
>> If it is just a buncha small files, yeah, could be overkill.
>> But if I had to spend extra $$ on ram versus 10kRPM drives, I know
>> where I would spend it ...
>>>>> 4 x 74 GB disks Ultra320 (or make an argument for a particular SATA)
>>> SATA disks are SATA disks, of course. dumb controllers are all pretty
>>> similar as well (cheap, fast, not-cpu-consuming). if you have your
>>> heart set on HW raid, at least get a 3ware 9550, which is quite fast.
>>> (most other HW raid are surprisingly bad.)
>> The LSI SAS unit is pretty good. I like the 3ware, the Areca, and a
>> few others. We just created a nice 500+ MB/s "file server" for a
>> large customer out of an Areca card, 16 spindles and some tweaking. I
>> haven't seen production performance data for it yet, but our in house
>> testing exceeded the 500 MB/s by a little bit.
>>>>> dual 10/100/1000 ethernet on the mobo
>>>> Careful on this... we and our customers have been badly bitten by
>>>> tg3 and broadcom NICs. If the MB doesn't have Intel NICs, get an
>>>> Intel 1000/MT dual gigabit card. You won't regret that, and it is
>>>> money well spent.
>>> that's odd; I have quite a few of both tg3 and bcm nics, and can't
>>> say I've had any complaints. what are the problems?
>> Interrupted to death. The tg3 doesn't seem to have NAPI turned on by
>> default in the standard distro kernels. Haven't tried the FC* with
>> this, hopefully it is saner there. Under heavy load, we see
>> interrupts climb past 40k/s, and it context switches like mad. Seen
>> this from early 2.6 through 2.6.13 on SuSE and RHEL. Makes using AOE
>> (Coraid) nearly useless with Broadcom, formatting the unit with ext3
>> renders the server unusable for hours. Drop a nice Intel unit in
>> there, do the same thing and it works great, server is responsive
>> during formatting. Same issues for file service and heavy load.
>> Seen this on Tyan, iWill, Arima?, MSI(ibm e32*), and others.
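For anyone wanting to check the interrupt rate on their own NICs, here is a rough Python sketch that computes interrupts per second for one device from two snapshots of /proc/interrupts. The function names `irq_count` and `irq_rate` are illustrative, and the parsing assumes the usual layout (a header row of CPU columns, then one row per IRQ with the device names at the end).

```python
def irq_count(interrupts_text, device):
    """Sum the per-CPU counters on every /proc/interrupts row naming the device."""
    lines = interrupts_text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        fields = line.split()
        # After the "IRQ:" label and ncpus counters come the controller and
        # device names; shared IRQs list several names separated by commas.
        names = [name.rstrip(",") for name in fields[ncpus + 1:]]
        if device in names:
            total += sum(int(count) for count in fields[1:ncpus + 1])
    return total


def irq_rate(before, after, seconds, device):
    """Interrupts per second for the device between two snapshots."""
    return (irq_count(after, device) - irq_count(before, device)) / seconds
```

On a live server you would read /proc/interrupts twice, a known number of seconds apart while the load is running, and pass both snapshots to `irq_rate` with the interface name (for example `eth0`). Where an IRQ line is shared, the figure includes the other devices on that line.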
>>>>> case - 2U (big enough for adequate ventilation, right?)
>>>> Yeah, just make sure you have good airflow.
>>> 2U still requires a custom PS, doesn't it? it's kind of nice to be
>>> able to put in an ATX-ish PS. and is 2U tall enough for stock/standard
>> Don't know if it is custom. I like the redundant PS, but the small
>> redundant PSes tend not to supply enough current to boot the system.
>> Need a 3U case for that.
>> Best cooling designs I have seen involve baffles, and a pull or
>> push-pull config. We have used some units where under load the
>> processors are happily working around 22-28C. Fans are loud though.
>> Case (1U) is very cool to the touch.
>> For 2U you still need to worry about flow. I find it hard to believe
>> that most people get efficient flow out the back grating on 2U and
>> larger without a helper fan of some sort.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615