From john.hearns at streamline-computing.com  Tue Feb  1 09:34:32 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Tue, 01 Feb 2005 17:34:32 +0000
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <Pine.LNX.4.44.0501310936090.27702-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0501310936090.27702-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>

On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
>  
> on that note, though - does anyone have comments about booting 
> machines from flash?
> 
I've booted a mini-ITX system from flash,
the distribution in question was a wireless access point.
All you need is a CF to IDE adapter.

It's common for firewall distributions, such as ipcop,
to boot from flash.
http://www.ipcop.org/1.4.0/en/install/html/mkflash.html

I believe one wrinkle is logging: either log to a remote host or,
if you log locally, log to a ramdisk and write to the CF card
only at infrequent intervals.
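
The buffering idea above can be sketched in Python with the standard
library's logging.handlers.MemoryHandler, which holds records in RAM and
only passes them to a file handler once the buffer fills. This is a minimal
sketch, not a recipe for any particular distribution: the function name, the
logger name and the temporary log path are hypothetical, and a real node
would more likely flush on a timer or at shutdown than on a record count.

```python
import logging
import logging.handlers
import os
import tempfile


def make_buffered_logger(path, capacity):
    """Return (logger, handler); records stay in RAM until `capacity` is reached."""
    file_handler = logging.FileHandler(path)  # stands in for a file on the CF card
    memory_handler = logging.handlers.MemoryHandler(
        capacity=capacity,
        # Above CRITICAL, so only a full buffer (or close) triggers a write.
        flushLevel=logging.CRITICAL + 1,
        target=file_handler,
    )
    # Hypothetical logger name, made unique so repeated calls do not share handlers.
    logger = logging.getLogger("cf_node_%d" % id(memory_handler))
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(memory_handler)
    return logger, memory_handler


# Usage: three records are buffered; nothing reaches the file until the
# fifth record arrives or the handler is closed.
log_path = os.path.join(tempfile.mkdtemp(), "messages.log")
logger, handler = make_buffered_logger(log_path, capacity=5)
for i in range(3):
    logger.info("event %d", i)
handler.close()  # closing flushes whatever is still buffered
```

The trade-off is the usual one: anything still in the buffer is lost if the
node loses power before the next flush.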

John Hearns

ps.
>sounds like putting mudflaps and a cattle bar on a city-SUV.

Called Chelsea Tractors in the part of the world I live in.


From hahn at physics.mcmaster.ca  Tue Feb  1 10:07:37 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 1 Feb 2005 13:07:37 -0500 (EST)
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID: <Pine.LNX.4.44.0502011306080.13445-100000@coffee.psychology.mcmaster.ca>

> > on that note, though - does anyone have comments about booting 
> > machines from flash?
> > 
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.

I don't really see those much at all.  perhaps I'm not using 
the right search terms.

have you looked into booting from usb-flash?  that would be very 
much dependent on bios, of course, but far more accessible.

thanks, mark hahn.


From James.P.Lux at jpl.nasa.gov  Tue Feb  1 10:32:35 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Tue, 01 Feb 2005 10:32:35 -0800
Subject: [Beowulf] Re: real hard drive failures
References: <Pine.LNX.4.44.0501310936090.27702-100000@coffee.psychology.mcmaster.ca>
Message-ID: <6.1.1.1.2.20050201103123.0417a670@mail.jpl.nasa.gov>

At 09:34 AM 2/1/2005, John Hearns wrote:
>On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
> >
> > on that note, though - does anyone have comments about booting
> > machines from flash?
> >
>I've booted a mini-ITX system from flash,
>the distribution in question was a wireless access point.
>All you need is a CF to IDE adapter.
>
>Its common to have firewall distributions, such as ipcop,
>to boot from flash.
>http://www.ipcop.org/1.4.0/en/install/html/mkflash.html
>
>I believe one wrinkle is to either log to a remote host,
>or if you log locally to log to a ramdisk and only write to
>the CF card at infrequent intervals.
>
>John Hearns
>
>ps.
> >sounds like putting mudflaps and a cattle bar on a city-SUV.
>
>Called Chelsea Tractors in the part of the world I live in.
>

I boot mini-ITX systems from flash, and also via PXE, both wireless and wired.
As John says, you need a CF to IDE adapter, which in my case is combined 
with the unregulated 12VDC to ATX power supply, a watchdog timer, and some 
other hardware.


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From James.P.Lux at jpl.nasa.gov  Tue Feb  1 10:50:28 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Tue, 01 Feb 2005 10:50:28 -0800
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID: <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>

At 10:07 AM 2/1/2005, Mark Hahn wrote:
> > > on that note, though - does anyone have comments about booting
> > > machines from flash?
> > >
> > I've booted a mini-ITX system from flash,
> > the distribution in question was a wireless access point.
> > All you need is a CF to IDE adapter.
>
>I don't really see those much at all.  perhaps I'm not using
>the right search terms.

Try JKMicrodevices or ituner.com or www.mini-itx.com or 
www.damnsmalllinux.org or www.logicsupply.com or www.epiacenter.com

(google for "compact flash mini-itx" )


They run about $15-$20, depending on configuration, and there's nothing 
special about them for MiniITX.. they should work on anything.

There ARE rumored to be "difficulties" with how the CF is formatted in some 
contexts, but I don't know any details.  Maybe it has to do with whether 
the BIOS supports the "virtual" head, track, sector details?

I've also heard that one cannot boot "Win xx" from CF, but have no reason 
to see why this would be so (it's a disk drive, after all...)  Maybe with a 
PCI<>CF adapter it's a problem?


>have you looked into booting from usb-flash?  that would be very
>much dependent on bios, of course, but far more accessible.

Oooooh... that didn't work so well for me on the various machines I tried 
it on.  The IDE/CF is essentially bios independent (to the BIOS, it just 
looks like another IDE drive).  The USB drive has to have all the USB stuff 
up and running first.


>thanks, mark hahn.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From alvin at Mail.Linux-Consulting.com  Tue Feb  1 13:58:51 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Tue, 1 Feb 2005 13:58:51 -0800 (PST)
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID: <Pine.LNX.3.96.1050201134807.25366B-100000@Maggie.Linux-Consulting.com>

On Tue, 1 Feb 2005, John Hearns wrote:

> > on that note, though - does anyone have comments about booting 
> > machines from flash?
> > 
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.

ANY system can be booted from CF ...

and for an AP, you'd probably want to boot off a usb stick
since those are presumably hotswappable whereas CF is not

there are lots of "CF - ide adaptors"

	pcengine.ch makes um and resells to the list of folks
	in the list jim posted
	( ituner(mini-box), logicsupply, etc ... )

	they also have those that plug the CF into the ide port
	on the motherboard

	- but, i havent seen any hotswap cf-ide adaptors yet though

> Its common to have firewall distributions, such as ipcop,
> to boot from flash.
> http://www.ipcop.org/1.4.0/en/install/html/mkflash.html

installing to 128MB or 256MB CF implies that you
install the minimum packages ( glibc + networking ) and
have the rest of your binaries on nfs-server:/usr/local/cluster-stuff
which gets automounted onto the CF-based nodes

it'd be good to keep a master CF node ( minimal system install ) so that 
it can be updated and patched as needed in one place, and those patch
files also make it to the next CF release for the other nodes  -- or
dont patch the cf after its made :-)
 
> I believe one wrinkle is to either log to a remote host,
> or if you log locally to log to a ramdisk and only write to
> the CF card at infrequent intervals.

writing to CF is good and bad ... since it has limited write
capabilities, but there's not much writing that needs to 
be done, and even if there is, one can write all the system
data to /dev/ramdisk instead of CF

the CF can be mounted read-only

c ya
alvin
 

From john.hearns at streamline-computing.com  Wed Feb  2 08:38:38 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed, 2 Feb 2005 16:38:38 -0000 (GMT)
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <Pine.LNX.4.44.0502011306080.13445-100000@coffee.psychology.mcmaster.ca>
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
	<Pine.LNX.4.44.0502011306080.13445-100000@coffee.psychology.mcmaster.ca>
Message-ID: <33904.143.167.3.70.1107362318.squirrel@webmail.streamline-computing.com>

>> > on that note, though - does anyone have comments about booting
>> > machines from flash?
>>
>
> have you looked into booting from usb-flash?  that would be very
> much dependent on bios, of course, but far more accessible.
>
Indeed, as Alvin says any system can be booted from a CF.

Some mini-ITX Cases come with a little slot, which makes changing
the CF card easy.

I agree with the USB comment - I always travel with a USB stick
which has Stresslinux on it.
www.stresslinux.org
This is a little distro which has lm_sensors, cpu_burn etc. on it,
plus memtest.
Invaluable for the roaming engineer :-)


From list-beowulf at onerussian.com  Thu Feb  3 19:20:05 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Thu, 3 Feb 2005 22:20:05 -0500
Subject: [Beowulf] NFS over TCP or smth else... WHAT I've done wrong?
Message-ID: <20050204032005.GB2444@washoe.rutgers.edu>

Dear Beowulfers,

Today is a sad day for our 25-node cluster: I decided to improve its
performance and as a result I crippled it quite a lot.

The story is that for some reason many nodes started losing their connection
to the NFS server node. I started looking for a solution and decided
to try NFS over TCP. I adjusted the configs across the cluster
(cfengine rulez), even rebooted the nodes (except the main one) for
the sake of it, and put a slight load on the cluster (6 nodes running
I/O-intensive jobs that read and write data on the NFS server). Pretty much
all of the 60 nfsd instances started using CPU on the main node, so the load
reached around 20 or 30, which is a very high number... The main node (NFS
server) became unresponsive and started killing applications for
"running out of memory". 

So what is wrong with the following config:
vana:/raid        /raid   nfs defaults,tcp,hard,rw,nosuid,wsize=8192,rsize=8192
?
Later I added bg,timeo=60,noatime to reduce the load, but
it didn't help much.

details about cluster: 23 active nodes at the moment running 2.6.8.1 SMP,
main node with 8GB, RPCNFSDCOUNT=70, nfs-kernel-server

What would be the best NFS config for it if we provide two directories
from the NFS server: 

/raid as rw,sync
and
/share/apps as ro,async

Thank you in advance

P.S. BTW - here is the dump from "killing mess"

Fixed up OOM kill of mm-less task
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16

Free pages:     2969440kB (2966528kB HighMem)
Active:506964 inactive:611412 dirty:461 writeback:0 unstable:0 free:742360 slab:193835 mapped:269296 pagetables:2983
DMA free:1048kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
protections[]: 8 476 732
Normal free:1864kB min:936kB low:1872kB high:2808kB active:32632kB inactive:21288kB present:901120kB
protections[]: 0 468 724
HighMem free:2966528kB min:512kB low:1024kB high:1536kB active:1995096kB inactive:2424488kB present:7471104kB
protections[]: 0 0 256
DMA: 0*4kB 15*8kB 10*16kB 8*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1048kB
Normal: 14*4kB 2*8kB 0*16kB 38*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1864kB
HighMem: 0*4kB 0*8kB 0*16kB 48126*32kB 16915*64kB 2081*128kB 109*256kB 57*512kB 20*1024kB 0*2048kB 0*4096kB = 2966528kB
Swap cache: add 538373, delete 522525, find 54148646/54172304, race 0+5
Out of Memory: Killed process 17465 (gnome-settings-).
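
As a reading aid for the dump above, here is a minimal Python sketch that
compares each zone's "free" figure with its "low" watermark. The three zone
lines are copied from the dump; the regular expression and the function name
are illustrative assumptions. It only reports what the numbers say and is
not a diagnosis of the NFS problem.

```python
import re

# Zone lines copied verbatim from the oom-killer dump.
OOM_DUMP = """\
DMA free:1048kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
Normal free:1864kB min:936kB low:1872kB high:2808kB active:32632kB inactive:21288kB present:901120kB
HighMem free:2966528kB min:512kB low:1024kB high:1536kB active:1995096kB inactive:2424488kB present:7471104kB
"""

# Matches "<zone> free:<n>kB min:<n>kB low:<n>kB" at the start of a line.
ZONE_LINE = re.compile(r"^(\w+) free:(\d+)kB min:(\d+)kB low:(\d+)kB", re.MULTILINE)


def zones_below_low(dump):
    """Return the names of zones whose free memory is under the 'low' watermark."""
    short = []
    for name, free, _min, low in ZONE_LINE.findall(dump):
        if int(free) < int(low):
            short.append(name)
    return short


print(zones_below_low(OOM_DUMP))
```

On these figures only the Normal zone is under its low mark (1864kB free
against 1872kB), while HighMem still reports about 2.9 GB free.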


-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
             Key  http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint   3BB6 E124 0643 A615 6F00  6854 8D11 4563 75C0 24C8


From wytsang at clustertech.com  Tue Feb  1 02:42:37 2005
From: wytsang at clustertech.com (Clotho)
Date: Tue, 01 Feb 2005 18:42:37 +0800
Subject: [Beowulf] IntelMPITEST-1.0 compiled with icc in heterogeneous
	environment
Message-ID: <41FF5D1D.50903@clustertech.com>

Hi,

I would like to ask a question about using icc to compile IntelMPITEST-1.0
and running the program in a heterogeneous environment.

I have an i386 node and an x86_64 node.
I configured and compiled the IntelMPITEST-1.0 test suite on the i386 node.
I run the test suite on the i386 node, and use "mpirun -machinefile" to run
the binary on both nodes.

I have tried the test with the gcc and pgi compilers, and they work.
But with icc8, I encounter errors in c/blocking/functional/MPI_Ssend_ator

The error output is very long, but follows a pattern like:

MPITEST error (3): i=0, long double value=  -0.0000000000, expected    0.0000000000
MPITEST error (3): 10 errors in buffer (3,0,13) len 8 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/8,  Receiver(node/length)=3/-32766
MPITEST error (3): i=0, long double value=  -0.0000000000, expected    0.0000000000
MPITEST error (3): 117 errors in buffer (4,0,13) len 83 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/83,  Receiver(node/length)=3/-32766

All the errors are related to data_type 13 and 14.
This error does not happen when I run the tests on 2 i386 nodes.

Do you have any idea what the problem is? Thank you.

PS.
I found that the error message is produced at "libmpitest.c" line 2361.

One of the many compilation warnings refers to the same line:
./libmpitest.c(2361): warning #181: argument is incompatible with corresponding format string conversion
                          i, ((derived1 *)buffer)[i].LongDouble[k],

Maybe it's related, I am not sure.


From denis.che at gmail.com  Tue Feb  1 07:18:22 2005
From: denis.che at gmail.com (Denis)
Date: Tue, 1 Feb 2005 10:18:22 -0500
Subject: [Beowulf] Re: Max common block size, global array size on ia32
Message-ID: <d9a0a6da0502010718602d666d@mail.gmail.com>

>A more involved fix is to change the location of the shared
>libraries in memory by changing kernel. Look for the variable
>__PAGE_OFFSET in the kernel header files.


How exactly do you go about doing this?  I know how to
compile/recompile a kernel, but I have no idea as to how to implement
this fix...

I have a similar machine... Dual Xeon 2.2-GHz with 2GB RAM and exactly
the same problem with mem limitations for a single fixed-size array...

Thanks


From Kris.Boutilier at scrd.bc.ca  Tue Feb  1 10:23:05 2005
From: Kris.Boutilier at scrd.bc.ca (Kris Boutilier)
Date: Tue, 1 Feb 2005 10:23:05 -0800 
Subject: [Beowulf] Re: real hard drive failures
Message-ID: <C56ABDBB289D224AA7EA7EFF754E522201418F@terra.secure.scrd.bc.ca>

There is quite an elegant set of scripts available at
http://gate-bunker.p6.msu.ru/~berk/router.html to tweak a standard Debian
installation to boot from an IDE device and run entirely from tmpfs from
that point on, thereby avoiding the 'worn out' compact flash problem.
It is targeted at router applications but certainly useful for other
semi-embedded applications.

> -----Original Message-----
> From:	John Hearns [SMTP:john.hearns at streamline-computing.com]
> Sent:	Tuesday, February 01, 2005 9:35 AM
> To:	beowulf at beowulf.org
> Subject:	Re: [Beowulf] Re: real hard drive failures
> 
> On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
> >  
> > on that note, though - does anyone have comments about booting 
> > machines from flash?
> > 
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.
> 
> Its common to have firewall distributions, such as ipcop,
> to boot from flash.
> http://www.ipcop.org/1.4.0/en/install/html/mkflash.html
> 
	{clip}


From dwu at swales.com  Tue Feb  1 11:18:05 2005
From: dwu at swales.com (Dominic Wu)
Date: Tue, 1 Feb 2005 11:18:05 -0800
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
	<6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>
Message-ID: <003701c50892$c2876df0$69704e89@jpl.nasa.gov>

It is motherboard dependent: if your BIOS supports USB boot, and most
newer ones do, there should be no problem in theory.  That said, booting from
CF (or any solid-state/microdrive device) via an IDE interface is still
probably easier, with fewer drivers to load.

>
> >have you looked into booting from usb-flash?  that would be very
> >much dependent on bios, of course, but far more accessible.
>
> Oooooh... that didn't work so well for me on the various machines I tried
> it on.  The IDE/CF is essentially bios independent (to the BIOS, it just
> looks like another IDE drive).  The USB stuff
> up and running first.
>
>
> >thanks, mark hahn.
>
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From Glen.Gardner at verizon.net  Tue Feb  1 17:20:00 2005
From: Glen.Gardner at verizon.net (Glen Gardner)
Date: Tue, 01 Feb 2005 20:20:00 -0500
Subject: [Beowulf] Re: real hard drive failures
References: <Pine.LNX.4.44.0502011306080.13445-100000@coffee.psychology.mcmaster.ca>
Message-ID: <42002AC0.3090806@verizon.net>

USB flash is really slow. Regular CF (@ 128 KB/s writes) on a cf to ide 
adapter is a lot faster (particularly write speed) than USB flash (@ 64 
KB/s write speed) "thumb drives".

I've had good luck with IBM microdrives, but CF is getting cheaper than 
microdrives.  Of course , the microdrives are a lot faster (@ 1MB/s R/W) 
than CF on write.  But CF is pretty fast on read (10 MB/s ??).

CF has a limited number of writes before it fails , anywhere from 100K 
to 1M write cycles. The time for write cycles is typically anywhere from 
300 milliseconds to 500 milliseconds for a 32 KB chunk for regular CF. 
 Typically you write a chunk of CF at once in each write cycle, and 32KB 
is a typical figure for that (but it varies with the particular memory 
chips used). This is why CF is so awfully slow when writing. Using 
serial CF makes it even worse, which is one reason why USB thumb drives 
are even slower than regular CF cards.
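
A rough worked version of the figures above, as a minimal Python sketch.
The chunk size, chunk timings and cycle counts are the ones quoted in this
message; the once-a-minute rewrite rate and the absence of wear levelling
are assumptions made purely for illustration, and the function names are
hypothetical.

```python
def chunk_write_rate_kb_s(chunk_kb, seconds_per_chunk):
    """Sustained write rate in KB/s when each chunk takes seconds_per_chunk."""
    return chunk_kb / seconds_per_chunk


def days_until_wearout(write_cycles, rewrites_per_day):
    """Days until one block reaches its cycle limit, assuming no wear levelling."""
    return write_cycles / rewrites_per_day


# 32 KB chunks at 300 ms and 500 ms per write cycle.
for seconds in (0.3, 0.5):
    print("%.1f s/chunk -> %.1f KB/s" % (seconds, chunk_write_rate_kb_s(32, seconds)))

# Assumption: the same block is rewritten once a minute (1440 times a day).
for cycles in (100_000, 1_000_000):
    print("%d cycles -> %.1f days" % (cycles, days_until_wearout(cycles, 1440)))
```

Under those assumptions the write rate works out to roughly 64 to 107 KB/s,
and a block rewritten every minute would reach its limit after about 69 to
694 days, which is why frequently written paths are worth moving off the
card.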

CF is okay for booting a system from, but things like /tmp and /var are 
best mounted in a memory file system and only written to CF when 
shutting down.
Swap partitions and /home need to be mounted via NFS.  

At present, I have two nodes of a 14 node cluster booting from CF, and 
/home is mounted on another machine with a proper hard drive via NFS. 
 Ten of the nodes are booting from microdrives, and two nodes have ata 
133 hard drives for /home, development and backups.

/var  /tmp and swap are actually mounted on the cf card, and I'm waiting 
to see  how long before the cf actually expires.  These nodes have been 
up 24/7 for over a month now, with no problems. I have not tried to 
force the nodes to swap. For saving power and reducing heat, CF is going 
to be the best you can get. Microdrives are almost as good, laptop 
drives are pretty good, and a regular IDE drive is a pig in comparison.

I use a USB thumb drive with a bootable OS on it as an emergency boot 
drive.  It comes in handy when installing a node.
Since I use microdrives, all I do is shut down the node, plug the new 
microdrive into the CF adapter and the USB thumb drive into the USB port, 
and turn the node on. It boots from USB so I can then install a 
system image stored on the development node onto the new microdrive via 
an NFS mount. It takes about 5 minutes to install and configure a new 
node in this fashion. Writing the disk image to a 512 MB CF card is 
going to take up to an hour, and plan on at least twice that to write a 
disk image to a 512 MB USB flash drive. (CF is just plain slow)
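
The image-writing times can be checked against the write speeds quoted at
the top of this message (128 KB/s for CF, 64 KB/s for USB flash). A minimal
Python sketch, assuming 1 MB = 1024 KB and a constant sustained rate; the
function name is hypothetical.

```python
def write_time_minutes(image_mb, rate_kb_per_s):
    """Minutes needed to write image_mb megabytes at rate_kb_per_s KB/s."""
    return image_mb * 1024 / rate_kb_per_s / 60


for label, rate in (("CF", 128), ("USB flash", 64)):
    print("%-9s %6.1f min" % (label, write_time_minutes(512, rate)))
```

That gives about 68 minutes for CF and about 137 minutes for USB flash,
in line with "up to an hour" and "at least twice that".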

Glen


Mark Hahn wrote:

>>>on that note, though - does anyone have comments about booting 
>>>machines from flash?
>>>
>>>      
>>>
>>I've booted a mini-ITX system from flash,
>>the distribution in question was a wireless access point.
>>All you need is a CF to IDE adapter.
>>    
>>
>
>I don't really see those much at all.  perhaps I'm not using 
>the right search terms.
>
>have you looked into booting from usb-flash?  that would be very 
>much dependent on bios, of course, but far more accessible.
>
>thanks, mark hahn.
>

-- 
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net


http://members.bellatlantic.net/~vze24qhw/index.html


From award at andorra.ad  Tue Feb  1 23:25:01 2005
From: award at andorra.ad (Alan Ward i Koeck)
Date: Wed, 02 Feb 2005 08:25:01 +0100
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
	<6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>
Message-ID: <4200804D.66B07502@andorra.ad>

Jim Lux wrote:
> 
> At 10:07 AM 2/1/2005, Mark Hahn wrote:
> 
> >have you looked into booting from usb-flash?  that would be very
> >much dependent on bios, of course, but far more accessible.
> 
> Oooooh... that didn't work so well for me on the various machines I tried
> it on.  The IDE/CF is essentially bios independent (to the BIOS, it just
> looks like another IDE drive).  The USB drive has to have all the USB stuff
> up and running first.

Done that, though I had to use a kernel diskette with USB et al compiled
in.
My BIOS could only boot from a USB external hard drive/CD, not flash.

Alan Ward
 
> >thanks, mark hahn.
> 
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
> 


From mtpratol at cs.sfu.ca  Wed Feb  2 12:54:28 2005
From: mtpratol at cs.sfu.ca (Matthew Pratola)
Date: Wed, 2 Feb 2005 12:54:28 -0800 (PST)
Subject: [Beowulf] SGE web frontends
Message-ID: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>

Hi all,

Can anyone recommend a simple web frontend for submitting SGE jobs?

Thanks,

Matthew Pratola
M.Sc. Candidate
Dept. of Statistics and Actuarial Science
Simon Fraser University
Vancouver, BC, CANADA


From diep at xs4all.nl  Wed Feb  2 19:53:27 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Thu, 03 Feb 2005 04:53:27 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050203045323.01002100@pop.xs4all.nl>

Good morning!

With the intention to run my chessprogram on a beowulf to be constructed
here (starting with 2 dual-k7 machines here) i better get some good advice
on which network to buy. Only interesting thing is how fast each node can
read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
simultaneously.

The faster this can be done the better the algorithmic speedup for parallel
search in a chess program (property of YBW, see publications in journal of
icga: www.icga.org). This speedup is exponential (or better you get
punished exponential compared to single cpu performance).

Which network cards have the lowest latencies, considering my small
budget?

quadrics/dolphin seem a bit out of my price range. Myrinet is about 684 euro
per card when I altavista'ed online, and I wonder how to get more than 2 nodes
to work without a switch. Perhaps there are low-cost switches with reasonably
low latency?

Please note MPI is probably what I'll use, though I keep finding online
information about 'gamma'. Does it have lower latency than MPI implementations?

Note that normal 1Gbit cards handle the normal network traffic.

Each node is an SMP or NUMA node, and is not only multiprocessor but also
multithreaded.

I welcome any advice,
Best regards,
Vincent

Vincent Diepeveen


From rhamann at uccs.edu  Wed Feb  2 23:56:13 2005
From: rhamann at uccs.edu (R Hamann)
Date: Thu, 03 Feb 2005 00:56:13 -0700
Subject: [Beowulf] MPICH2: Handle Limit?
Message-ID: <web-21778622@uccs.edu>

I've been having some strange problems with a program using the MPICH2 
library.  When I added some new datatypes for ghost cell exchange, the 
program would hang.  I figured out that any number of handles over 84 
would cause this.  Fortunately, I could delete some handles that I no 
longer needed, but it still seemed strange.  Are my calculations 
correct that for each process there is an 84 handle limit? or am I 
seeing some other problem?

Ron


From maurice at harddata.com  Thu Feb  3 16:05:19 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Thu, 03 Feb 2005 17:05:19 -0700
Subject: [Beowulf] Re: Booting from flash ( was Re: Re: real hard drive
	failures)
In-Reply-To: <200501311938.j0VJc3lt003632@bluewest.scyld.com>
References: <200501311938.j0VJc3lt003632@bluewest.scyld.com>
Message-ID: <4202BC3F.4070401@harddata.com>


>From: Mark Hahn <hahn at physics.mcmaster.ca>
>Subject: Re: [Beowulf] Re: real hard drive failures
>...
>
>on that note, though - does anyone have comments about booting 
>machines from flash?
>  
>
Compact Flash (CF) IS an ATA device, and requires no specific drivers 
other than standard kernel ATA driver.
CF slot reader/writers are now under $25, and as a matter of fact we 
offer this as an option both on our tower workstations and in our rack 
chassis.
Recent prices on CF are at $50 or less for 512MB, so a "CD sized" boot 
image flash device is now trivial.

If you look inside a Force10 network switch you will see the OS and 
firmware are loaded on a flash card.

You can even buy CF packaged in a device that is a 40 pin female 
"dongle" that plugs directly to the motherboard HD IDE slot.
These go for around $100 for 512MB

Other flash types, like SD, xD, Memory Stick, etc. do not have the ATA 
interface built in, so a chip and driver are needed to use them, pretty 
well ruling them out as boot devices, unless you write the 
driver into the BIOS, for example with LinuxBIOS.


With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.  FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3


This email, message, and content, should be considered confidential,
and is the copyrighted property of Hard Data Ltd., unless stated otherwise.


From rgb at phy.duke.edu  Fri Feb  4 03:55:48 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 4 Feb 2005 06:55:48 -0500 (EST)
Subject: [Beowulf] SGE web frontends
In-Reply-To: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
Message-ID: <Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>

On Wed, 2 Feb 2005, Matthew Pratola wrote:

> Hi all,
> 
> Can anyone recommend a simple web frontend for submitting SGE jobs?

  http://www.globus.org/

One stop shopping.

    rgb

> 
> Thanks,
> 
> Matthew Pratola
> M.Sc. Candidate
> Dept. of Statistics and Actuarial Science
> Simon Fraser University
> Vancouver, BC, CANADA
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From landman at scalableinformatics.com  Fri Feb  4 05:20:48 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 04 Feb 2005 08:20:48 -0500
Subject: [Beowulf] SGE web frontends
In-Reply-To: <Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
Message-ID: <420376B0.7000107@scalableinformatics.com>


Robert G. Brown wrote:
> On Wed, 2 Feb 2005, Matthew Pratola wrote:
> 
> 
>>Hi all,
>>
>>Can anyone recommend a simple web frontend for submitting SGE jobs?
> 
> 
>   http://www.globus.org/
> 
> One stop shopping.

Did I miss something?  Was a tongue planted in cheek with this reply?

As far as I know there are very few web interfaces to running SGE (or 
LSF, or ...) jobs.  If I am wrong please do provide links/references.

Globus is not a web interface (last I checked), but a large group of 
middleware to manage something that looks a lot closer to the definition 
of a grid than SGE.  SGE is a job scheduler (with a name "engineered" to 
make you think it is a one-stop-shop as a grid-in-a-box).

My company is interested in (and we are developing) web portals for end 
user cluster work, so if you know of any, we would like to hear about 
them.  Good open-source platforms that are current/supported could be 
worth looking at (and will save us time/development effort).  There seem 
to be lots of bits of abandonware in the grid portal/user-interface 
area.  We don't want to re-invent wheels, but at the same time, we don't 
want to adopt abandoned ones either.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


From rene at renestorm.de  Fri Feb  4 03:12:39 2005
From: rene at renestorm.de (rene)
Date: Fri, 4 Feb 2005 12:12:39 +0100
Subject: [Beowulf] SGE web frontends
In-Reply-To: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
Message-ID: <200502041212.39511.rene@renestorm.de>

Hi Matthew,

I was working on and thinking about that a year ago, and my opinion is:
"You don't want to do that."

But there are some questions:
Do you want to go public with that little webpage?
Do you want to execute common sge jobs or is it just one application?
Do these jobs have input data?
How complex is your authorization hierarchy?
What do you do with the next sge release?
How do you share the results and the status with the users?

There are web frontends for cluster applications out there, e.g.
 NCBI's blast, but I have never heard of one for common jobs.

Cya


> Hi all,
>
> Can anyone recommend a simple web frontend for submitting SGE jobs?

>
> Thanks,
>
> Matthew Pratola
> M.Sc. Candidate
> Dept. of Statistics and Actuarial Science
> Simon Fraser University
> Vancouver, BC, CANADA
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Rene Storm
@Cluster


From diep at xs4all.nl  Fri Feb  4 04:35:22 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 04 Feb 2005 13:35:22 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204133518.01007860@pop.xs4all.nl>

At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> Good morning!
>> 
>> With the intention to run my chessprogram on a beowulf to be constructed
>> here (starting with 2 dual-k7 machines here) i better get some good advice
>> on which network to buy. Only interesting thing is how fast each node can
>> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>> simultaneously.
>
>Is there any way to do this less often with a larger transfer?  
>If you
>wrote a small benchmark that did only that (send 64 bytes randomly
>from a large array in memory) and make it easy to download, build, run,
>and report results, I suspect some people would.

A one-way pingpong with 64 bytes will do great.

I have plenty of shared-memory examples, but a one-way pingpong
approximates it very well. Just multiply the time by 2 and you know the
bound :)
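
A minimal sketch of such a 64-byte pingpong measurement, in Python purely
for illustration. It bounces messages over a local socket pair, so it only
shows the method; a real test would run the two ends on two nodes (for
example over MPI or the NIC under test). The names recv_exact, echo_server
and pingpong are made up for this sketch.

```python
import socket
import threading
import time

MSG_SIZE = 64  # bytes per message, as in the post


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(sock, rounds):
    """Bounce every 64-byte message straight back to the sender."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, MSG_SIZE))


def pingpong(rounds=1000):
    """Return the mean one-way time in microseconds (half the round trip)."""
    a, b = socket.socketpair()  # local stand-in for a link between two nodes
    server = threading.Thread(target=echo_server, args=(b, rounds))
    server.start()
    payload = bytes(MSG_SIZE)
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        recv_exact(a, MSG_SIZE)
    elapsed = time.perf_counter() - start
    server.join()
    a.close()
    b.close()
    return elapsed / rounds / 2 * 1e6


if __name__ == "__main__":
    print(f"one-way time: {pingpong():.1f} us")
```

The reported figure is half the measured round trip; doubling it again gives
the cost of one remote 64-byte read, which is the bound referred to above.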

>> The faster this can be done the better the algorithmic speedup for parallel
>> search in a chess program (property of YBW, see publications in journal of
>> icga: www.icga.org). This speedup is exponential (or better you get
>> punished exponential compared to single cpu performance).
>> 
>> Which network cards considering my small budget are having lowest latencies
>> can be used?
>
>Define small budget.  For more than 2 nodes myrinet needs a switch.
>Do you expect to be totally network latency bound?  How low is enough
>to keep the processors busy?

The CPUs are 100% busy. Once I know how many requests per second the
network can handle in theory, I will do more probes per second to the
hashtable. The more probes I can do, the better for the game tree search.

>> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>> card when i altavista'ed online and i wonder how to get more than 2 nodes
>> to work without switch. Perhaps there is low cost switches with reasonable
>> low latency?
>Do you know that gigabit is too high latency?

The few one-way pingpong times I can find online for gigabit cards are not
exactly promising, to put it politely. With something on the order of 50 us
one-way pingpong time, I don't even consider it worth taking a look at.

Each year CPUs get faster. For small networks 10 us really is the upper
limit.
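
To make the two figures concrete, here is a small worked calculation in
Python. It assumes each hashtable probe is a blocking remote read costing
two one-way times; the function name and the "outstanding" parameter
(number of probes in flight at once) are hypothetical, added only for
illustration.

```python
def max_probes_per_second(one_way_us, outstanding=1):
    """Upper bound on remote probes per second for one CPU.

    one_way_us:  one-way pingpong time in microseconds
    outstanding: probes kept in flight at once (1 = fully blocking)
    """
    round_trip_us = 2 * one_way_us
    return outstanding * 1_000_000 / round_trip_us


for label, us in [("50 us (gigabit figure quoted)", 50),
                  ("10 us (stated upper limit)", 10)]:
    print(f"{label}: {max_probes_per_second(us):,.0f} probes/s")
```

Under these assumptions a 50 us link allows at most 10,000 blocking probes
per second per CPU, and a 10 us link at most 50,000.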

>Can't you send enough
>work, like say search 3 moves ahead on the head node, then for each legal
>move send that search tree to a different node?  Each node would reply with
>the highest ranked moves when done.

Let's not discuss parallel chess algorithms in too much depth. 100 different
algorithms/enhancements get combined with each other. They are not the
biggest latency problem. The latency problem is caused by the hashtable.
The hashtable is a big cache. The bigger the better. It avoids searching
the same tree again.

In games like chess, and in every search terrain (even simulated flight), you
can get back to the same spot by different means, causing a transposition.
Suppose you start the game with 1.e4,e5 2.d4; that leads to the same
position as 1.d4,e5 2.e4. So if we have already searched 1.e4,e5 2.d4,
we store that position P in a large cache. Other CPUs first want to know
whether we already searched that position.
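
The transposition above can be sketched with a toy board in Python. This
is an illustration only: it tracks just the three pawns involved, ignores
castling and en passant rights, and uses a plain dict where a real engine
would typically use a fixed-size table indexed by a 64-bit hash key. The
stored depth and score are made-up values.

```python
def play(moves):
    """Apply (from, to) moves to a toy board; return a hashable position key."""
    board = {"e2": "P", "d2": "P", "e7": "p"}  # only the pawns that move
    for src, dst in moves:
        board[dst] = board.pop(src)
    # The side to move is part of the position: white moves on even plies.
    side = "w" if len(moves) % 2 == 0 else "b"
    return (frozenset(board.items()), side)


line_a = [("e2", "e4"), ("e7", "e5"), ("d2", "d4")]  # 1.e4,e5 2.d4
line_b = [("d2", "d4"), ("e7", "e5"), ("e2", "e4")]  # 1.d4,e5 2.e4

table = {}                                          # the "hashtable"
table[play(line_a)] = {"depth": 8, "score": 0.3}    # store position P

hit = table.get(play(line_b))                       # probe before searching
print("already searched" if hit else "must search", hit)
```

Both move orders produce the same key, so the second CPU finds the stored
result instead of searching the subtree again.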

Those hashtable positions get created quite quickly. Deep Blue created them
at 100 million positions a second and simply didn't store the vast majority
in the hashtable (that would be hard, as it was in hardware). That's one of
the reasons why it searched only 10-12 ply; already in 1999 that was no
longer spectacular when 4-processor PCs showed up at the world champs.

On a PC with a shared hashtable nowadays I get 10-12 ply (ply = half move,
full move is when both sides make a move) in a few seconds, searching
100,000 positions per second per CPU.

So before we start searching every node (=position) we quickly want to find
out whether other cpu's already searched it.

On the Origin 3800 with 512 processors I used a 115 GB hashtable (I started
the search on 460 processors), simply because the machine has 512 GB RAM.

So in short you take everything you can get.

The search works with internal iterative deepening, which means we first
search 1 ply, then 2 ply, then 3 ply and so on.

The time it takes to get to the next iteration I hereby define as the
branching factor (Knuth has a different definition, as he took only 1
algorithm into account; today's definition looks more appropriate).
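
A minimal Python sketch of iterative deepening and of this branching
factor, read here as the ratio between the cost of one iteration and the
cost of the previous one. Two assumptions: the "game" is a uniform toy tree
with 3 moves everywhere, and node counts stand in for time. The function
names are made up for the example.

```python
def search(depth, width):
    """Visit a uniform toy tree to the given depth; return nodes visited."""
    if depth == 0:
        return 1
    return 1 + sum(search(depth - 1, width) for _ in range(width))


def iterative_deepening(max_depth, width=3):
    """Search 1 ply, then 2 ply, and so on; return the cost of each iteration."""
    return [search(depth, width) for depth in range(1, max_depth + 1)]


counts = iterative_deepening(6)
ratios = [later / earlier for earlier, later in zip(counts, counts[1:])]

print(counts)                         # [4, 13, 40, 121, 364, 1093]
print([round(r, 2) for r in ratios])  # approaches the tree width, 3
```

Without a hashtable the ratio settles at the number of moves per position;
anything that lets the search skip subtrees pushes it lower.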

In order to search 1 ply deeper it is obviously important to maintain a good
branching factor. I'm very bad at writing out mathematical proofs, but it's
obvious that the more memory we use, the more we can reduce the number of
legal moves to search in this position P, as the next few ply might be in
the hashtable, which trivially shortens the time needed to search 1 ply
deeper.

Storing closer to the root (position where we started searching) is of
course more important than near the leafs of the search tree.

For example, when the last 10 ply near the leafs were not stored in the
hashtable in an overnight experiment, the search depth at 460 processors
dropped from 20 ply to 13 ply.

Of course each processor of supercomputers is deadslow for game tree search
(it's branchy 100% integer work completely knocking down the caches), so
compared to pc's you already start at a disadvantage of a factor 16 or so
very quickly, before you start searching (in case of TERAS i had to fight
with outdated 500Mhz MIPS processors against opterons and high clocked quad
Xeons), so upgrading my own networkcards is more clever. 

Yet getting yourself a network even between a few nodes as quick as those
supercomputers is not so easy...

Additionally, you can first test properly on your own Beowulf network before
playing at a tournament, and without good testing on the machine you play
on in tournaments you have a hard 0% chance that it plays well.

The only thing in software that matters is testing.

>-- 
>Bill Broadley
>Computational Science and Engineering
>UC Davis


From ilumb at platform.com  Fri Feb  4 06:06:14 2005
From: ilumb at platform.com (Ian Lumb)
Date: Fri, 4 Feb 2005 09:06:14 -0500
Subject: [Beowulf] SGE web frontends
Message-ID: <4AB0624F069DAD4E90F18B13A818EEFE016B50A6@catoexm04.noam.corp.platform.com>

Open Source GridPort (www.gridport.net) merits consideration. It interfaces with SGE, LSF, PBS, etc., via Globus.

And NICE EnginFrame (http://www.enginframe.com) is a commercial offering which already has customizations for the Life Sciences.

For the record, we provide our own Web GUI with Platform LSF, and make use of these portals as required.

-Ian


-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]On
Behalf Of Joe Landman
Sent: Friday, February 04, 2005 8:21 AM
To: Robert G. Brown
Cc: Matthew Pratola; beowulf at beowulf.org
Subject: Re: [Beowulf] SGE web frontends


Robert G. Brown wrote:
> On Wed, 2 Feb 2005, Matthew Pratola wrote:
> 
> 
>>Hi all,
>>
>>Can anyone recommend a simple web frontend for submitting SGE jobs?
> 
> 
>   http://www.globus.org/
> 
> One stop shopping.

Did I miss something?  Was a tongue planted in cheek with this reply?

As far as I know there are very few web interfaces to running SGE (or 
LSF, or ...) jobs.  If I am wrong please do provide links/references.

Globus is not a web interface (last I checked), but a large group of 
middleware to manage something that looks a lot closer to the definition 
of a grid than SGE.  SGE is a job scheduler (with a name "engineered" to 
make you think it is a one-stop-shop as a grid-in-a-box).

My company is interested in (and we are developing) web portals for end 
user cluster work, so if you know of any, we would like to hear about 
them.  Good open-source platforms that are current/supported could be 
worth looking at (and will save us time/development effort).  There seem 
to be lots of bits of abandonware in the grid portal/user-interface 
area.  We don't want to re-invent wheels, but at the same time, we don't 
want to adopt abandoned ones either.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615



From laurence at scalablesystems.com  Fri Feb  4 06:08:09 2005
From: laurence at scalablesystems.com (Laurence Liew)
Date: Fri, 04 Feb 2005 22:08:09 +0800
Subject: [Beowulf] SGE web frontends
In-Reply-To: <420376B0.7000107@scalableinformatics.com>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
Message-ID: <420381C9.808@scalablesystems.com>

Hi all

We have the SGE web interface.

It integrates into our Rocks cluster management web interface. That is, 
you NEED to use Rocks (www.rockscluster.org)

You can download RxC from www.scalablesystems.com

It is free for non-commercial, academic use.

It provides web based:
- SGE management
- SGE job submission
- some basic reporting
- and of course managing a Rocks cluster via a web interface.

Have fun.

Laurence

-- 
Laurence Liew, CTO		Email: laurence at scalablesystems.com
Scalable Systems Pte Ltd	Web  : http://www.scalablesystems.com
(Reg. No: 200310328D)
7 Bedok South Road		Tel  : 65 6827 3953
Singapore 469272		Fax  : 65 6827 3922


From brian at cmrl.wustl.edu  Fri Feb  4 07:14:02 2005
From: brian at cmrl.wustl.edu (Brian Henerey)
Date: Fri, 04 Feb 2005 09:14:02 -0600
Subject: [Beowulf] SGE web frontends
In-Reply-To: <420376B0.7000107@scalableinformatics.com>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
Message-ID: <4203913A.1030202@cmrl.wustl.edu>


I don't mean to hijack this thread, but I'd also be interested to know 
if there are any open source web frontends for launching jobs on 
clusters. I've mostly written my own anyway, but if something's out 
there I'd like to know.

Thanks,
Brian Henerey



From rgb at phy.duke.edu  Fri Feb  4 10:02:48 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 4 Feb 2005 13:02:48 -0500 (EST)
Subject: [Beowulf] SGE web frontends
In-Reply-To: <420376B0.7000107@scalableinformatics.com>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
Message-ID: <Pine.LNX.4.58.0502041153170.18382@ganesh.phy.duke.edu>

On Fri, 4 Feb 2005, Joe Landman wrote:

> 
> 
> Robert G. Brown wrote:
> > On Wed, 2 Feb 2005, Matthew Pratola wrote:
> > 
> > 
> >>Hi all,
> >>
> >>Can anyone recommend a simple web frontend for submitting SGE jobs?
> > 
> > 
> >   http://www.globus.org/
> > 
> > One stop shopping.
> 
> Did I miss something?  Was a tongue planted in cheek with this reply?

Actually, it was a reply I snapped off on my way out the door on the
edge of late for teaching.

Let me reconsider my answer.

You don't like yes/globus, how about "no".

At least if you mean really really simple by simple.

I would argue that a cluster designed to run primarily embarrassingly
parallel jobs, fronted by a web portal/interface, is a not uncommon form
of a grid, although perhaps the definition is large enough to include a
union of such clusters or some more general structure (certainly access
to other kinds of resources than strictly "a cluster"). So I read this
question as "I want to make my local cluster into a grid, so users
faraway with no direct LAN accounts or access can submit jobs into my
local SGE queue after being properly authenticated".  And, of course, be
notified (with messages) when the jobs crash or terminate normally,
facilitate data transfer and resource allocation requests, etc.  Not
exactly simple...

Globus TK is as I understand it a toolkit from which one can build a web
interface for generalized remote task submission to "a grid".  It has to
have lots of moving parts to do that well -- just AUTHENTICATING data
transfer and job execution via a web interface isn't really terribly
"simple", because to do it decently generally requires e.g. stuff like
kerberos, ssl, ssh that aren't terribly simple either.  So I definitely
failed on the "simple" bit.

However, simple or not, I believe that Globus does contain the
components to do what you want -- provide a very generic web interface
for people far away who don't share any LAN components such as mounted
filespace, authentication/userid mappings, etc to transfer data and job
execution instructions to a system.  That system, if it is a front end
running SGE and/or stuff like condor (policy, load balance, batch job
tools) can then put the job into a queue, run it, and let globus know
when it is finished so it can tell the original user.

If you look over just their security layer (GSI -- Grid Security
Infrastructure) you rapidly come to realize that to run any sort of
remote job execution service you NEED most of its components --
authentication (including a Certificate Authority CA), encryption
(public/private key, managed with certificates), permissions, etc.  Some
grid designs I've seen use just this component of Globus and use other
tools (like PBS or SGE or custom designed stuff) for other components.

Ian Foster seems to have a list of at least some of the major grid
projects around the world -- enough to be able to google on them by
name -- here:

  http://www-fp.mcs.anl.gov/~foster/grid-projects/

Perhaps you can find a reusable interface at one of their project
websites.  You can also check out e.g. the Grid Portal Development Kit:

  http://doesciencegrid.org/projects/GPDK/

or The Grid Portal Kit:

  https://gridport.npaci.edu/

or the Open Grid Computing Environment:

  http://www.ogce.org/index.php

all of which I believe use globus as at least part of their middleware
for e.g. authentication etc.  Some of these are (e.g. the DOE's GPDK)
currently unsupported although still available and possibly still
reasonably functional.  I don't really know the status of the rest of
them, and I doubt that this is all of them.

So you're right, I should have answered "no" because it isn't simple to
offer a web interface to any active service, ESPECIALLY one that permits
a remote user to upload arbitrary programs for execution on arbitrary
data of arbitrary size where authentication, encryption, data transport,
and remote job management become absolutely essential components of the
solution.

AFAIK, Globus is one of the if not the only middleware toolkits of
choice for people who run the big grids -- they probably write their own
actual web portal, but they use Globus to do at least some of the heavy
lifting that goes on behind the scenes.  Maybe one of the "portal
projects" above (all open source) will be of use in setting up a
"simple" portal to your cluster, but be aware that the problem itself is
far from simple.

However, I could be wrong and as always cherish being corrected.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From mwill at penguincomputing.com  Fri Feb  4 10:36:51 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Fri, 4 Feb 2005 10:36:51 -0800
Subject: [Beowulf] Information Reseach Lab
In-Reply-To: <200502010848.19789.bvanhaer@sckcen.be>
References: <E1CvYAs-00070B-00@cool.aub.edu.lb>
	<200502010848.19789.bvanhaer@sckcen.be>
Message-ID: <200502041036.52020.mwill@penguincomputing.com>

It depends on the GIS software used. There was some work
to mpi-enable GRASS modules a while back, no idea where it
went. Here is something about a parallel version of s.surf.rst:

http://skagit.meas.ncsu.edu/~helena/grasswork/grasscontrib/

And of course if you program against the GIS api's you
might be able to take advantage of a cluster as well.

There is a paper that mentions they used MPI for parallelizing
their GIS/EM4 software at http://www.colorado.edu/research/cires/banff/pubpapers/104/

Michael
On Monday 31 January 2005 11:48 pm, Ben Vanhaeren wrote:
> On Monday 31 January 2005 11:46, Ziad Shaaban wrote:
> > Dear All,
> >
> > I am planning to have an information lab in our faculty built of: Dell,
> > Linux, Oracle and GIS.
> >
> > Can I use Beowulf to analyze GIS Data and display them on the web using
> > ArcIMS, all three vendors said yes, but can I use Beowulf?
> >
> I think you should read the Beowulf FAQ: 
> http://www.beowulf.org/overview/faq.html#1
> Beowulf is a concept not a piece of software.
> 
> I don't think you are going to need a beowulf cluster for the kind of 
> application you want to run (analyzing GIS data). If you want to guarantee 
> availability of your GIS data or do loadbalancing (distribute the load to 
> several servers) you should take a look at linux HA project:
> http://www.linux-ha.org/
> Apache loadbalancing with mod_backhand:
> http://www.backhand.org/ApacheCon2001/US/backhand_course_notes.pdf
> and Oracle Real Application Clusters (RAC).
> 
> 

-- 
Michael Will, Linux Sales Engineer
Tel:  415-954-2822  Toll Free:  888-PENGUIN   
Fax:  415-954-2899  www.penguincomputing.com
Visit us at LinuxWorld 2005!
Hynes Convention Center, Boston, MA
February 15th-17th, 2005
Booth 609


From rgb at phy.duke.edu  Fri Feb  4 10:37:02 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 4 Feb 2005 13:37:02 -0500 (EST)
Subject: [Beowulf] SGE web frontends
In-Reply-To: <4203913A.1030202@cmrl.wustl.edu>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
	<4203913A.1030202@cmrl.wustl.edu>
Message-ID: <Pine.LNX.4.58.0502041335460.18382@ganesh.phy.duke.edu>

On Fri, 4 Feb 2005, Brian Henerey wrote:

> 
> I don't mean to hijack this thread, but I'd also be interested to know 
> if there are any open source web frontends for launching jobs on 
> clusters. I've mostly written my own anyway, but if something's out 
> there I'd like to know.

Same topic.  The issue is having a web "portal" that manages stuff like
authentication, data transport, job submission/status etc.  Running the
submissions through SGE rather than something else is just a detail.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From landman at scalableinformatics.com  Fri Feb  4 11:33:30 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 4 Feb 2005 14:33:30 -0500 (EST)
Subject: [Beowulf] SGE web frontends
In-Reply-To: <Pine.LNX.4.58.0502041153170.18382@ganesh.phy.duke.edu>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
	<Pine.LNX.4.58.0502041153170.18382@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.58.0502041428560.15044@crunch.scalableinformatics.com>


There are web interfaces to SGE, and there are web interfaces to grids ...

I think the important aspect of this is the marketing use of the term 
"Grid" in a name.

Way back in high school, they used to teach us that what was in a 
name was exactly opposite of what it really was...  A bit cynical, but 
amazingly effective at cutting through marketing.

Globus is glue.  Middleware.  There are portals atop globus.  SGE 
(despite its name) is a job scheduler.  As is LSF.  And others.

The short version is that in order to get a web interface to 
SGE, one need not go through the joy of Globus, especially as Globus will 
not in and of itself get you where you want to go.
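
As a rough illustration of how thin the SGE-specific part of such a web
frontend can be, here is a Python sketch that turns two web form fields
into a qsub command line. Everything except the qsub options themselves
(-N for the job name, -cwd to run in the current directory) is an
assumption: the function name, the script allow-list and its path are
hypothetical, and authentication, data transfer and status reporting are
left out entirely.

```python
import re

# Hypothetical allow-list: the portal only runs scripts it ships itself.
ALLOWED_SCRIPTS = {"blast": "/opt/portal/jobs/blast.sh"}

# Deliberately strict pattern for anything typed into the web form.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,31}$")


def build_qsub_command(job_name, application):
    """Build an argv list for qsub from untrusted web form fields."""
    if not SAFE_NAME.match(job_name):
        raise ValueError("job name must be a short alphanumeric identifier")
    if application not in ALLOWED_SCRIPTS:
        raise ValueError("unknown application")
    # A list, not a string: no shell is involved, so nothing is interpolated.
    return ["qsub", "-N", job_name, "-cwd", ALLOWED_SCRIPTS[application]]


print(build_qsub_command("test1", "blast"))
```

The resulting list could be handed to subprocess.run() on the submit host.
The validation and the allow-list are the part that needs care, which is
where the "not exactly simple" comments elsewhere in this thread apply.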

GridPort I knew of.  The other I did not.  

Joe


From josip at lanl.gov  Fri Feb  4 11:57:28 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Fri, 04 Feb 2005 12:57:28 -0700
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
References: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
Message-ID: <4203D3A8.4070702@lanl.gov>

Vincent Diepeveen wrote:
> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>>
>>Do you know that gigabit is too high latency?

Gigabit Ethernet adapters often need tweaking to deliver reasonable 
latency, bandwidth, and CPU utilization.

For example, if your system uses the e1000 driver (Intel's gigabit 
Ethernet), the default setting is "dynamic Interrupt Throttle Rate" -- 
which means that the card will delay interrupting the CPU by up to about 
130 microseconds after receiving a packet.  Moreover, the "dynamic" part 
causes the network chip microcode to vary this delay in multiples of 
about 16 microseconds, so that different packets will generally 
experience different receive delays.

For the e1000 driver, 
https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html 
recommends using "options e1000 InterruptThrottleRate=80000" (add this 
line to /etc/modules.conf).  Users of this driver may also want to check 
Intel's parameters for e1000 listed at 
http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just 
don't assume that the default values are appropriate for cluster use.
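
For a feel of what that setting means, a one-line calculation in Python.
It assumes, as Intel's e1000 parameter documentation describes it, that
InterruptThrottleRate is a cap on interrupts per second; the function name
is made up for the example.

```python
def min_interrupt_gap_us(interrupts_per_second):
    """Shortest spacing between interrupts, in microseconds, at a given cap."""
    return 1_000_000 / interrupts_per_second


print(min_interrupt_gap_us(80000))  # 12.5
```

Under that assumption, a cap of 80000 means a received packet waits at most
about 12.5 microseconds for its interrupt because of throttling.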

Other gigabit Ethernet adapters have similar interrupt mitigation 
strategies, all designed to gracefully cope with high packet rates at 
high network speeds.  For cluster use, adjustments are usually advisable.

The basic Rx interrupt mitigation scheme is this: the receiver's CPU 
won't be interrupted until at least N packets have arrived or M 
microseconds have elapsed (whichever comes first).  This clearly adds up 
to M microseconds to network latency.  BTW, one often sees N=6 
(otherwise NFS performance can seriously degrade) and M>=16.  Other 
variants of this basic scheme are possible; but they all mean increased 
latencies.
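
The scheme above can be simulated in a few lines of Python. This is a
simplified model, not any particular driver's behaviour: it assumes the M
microsecond timer starts when the first pending packet arrives, and the
function names are invented for the sketch.

```python
def interrupt_times(arrivals_us, n_packets, m_us):
    """Return, per packet, the time at which the CPU is interrupted for it.

    An interrupt fires once n_packets are pending, or m_us after the first
    pending packet arrived, whichever comes first.
    """
    delivered, pending = [], []
    for t in arrivals_us:
        # The timer of an earlier batch may expire before this packet arrives.
        if pending and t >= pending[0] + m_us:
            delivered += [pending[0] + m_us] * len(pending)
            pending = []
        pending.append(t)
        if len(pending) == n_packets:      # packet-count threshold reached
            delivered += [t] * len(pending)
            pending = []
    if pending:                            # last batch waits for the timer
        delivered += [pending[0] + m_us] * len(pending)
    return delivered


def added_latency(arrivals_us, n_packets=6, m_us=16):
    """Extra delay per packet caused by interrupt mitigation."""
    fired = interrupt_times(arrivals_us, n_packets, m_us)
    return [f - a for f, a in zip(fired, arrivals_us)]


print(added_latency([0]))                 # lone packet: [16]
print(added_latency([0, 1, 2, 3, 4, 5]))  # burst of 6: [5, 4, 3, 2, 1, 0]
```

In this model a lone small packet always pays the full M microseconds,
which is the pattern a request/reply workload with one message in flight
would see.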

Finally, don't forget the Tx side interrupt mitigation, or else the 
sending CPU might not be told promptly that it's OK to send more.  The 
default Tx settings are probably fine for full size packets, but if your 
applications send lots of small packets, tweaking your network driver's 
Tx settings may help.

Sincerely,
Josip


From atp at piskorski.com  Fri Feb  4 12:20:23 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Fri, 4 Feb 2005 15:20:23 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
References: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
Message-ID: <20050204202023.GA32459@piskorski.com>

On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:

> Please note MPI is probably what i'll use, though i keep finding
> online information about 'gamma'. Is that faster latency than MPI
> implementations?

  http://www.disi.unige.it/project/gamma/

Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000
gigabit ethernet cards, for use with MPI.  It offers much better
latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us),
but worse than the specialized HPC interconnects (maybe 3 us).

The attraction of GAMMA, is that Intel Pro/1000 cards can be had for
$11 to $60 or so each (depending on exact model, etc.), and gigabit
switches are also pretty cheap, while SCI or Myrinet is somewhere in
the $500 to $1500 per node range (I don't keep track).

So if your application can benefit from lower latency, but you want
something really cheap, GAMMA should be well worth trying.

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From lindahl at pathscale.com  Fri Feb  4 12:50:34 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 4 Feb 2005 12:50:34 -0800
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <20050204202023.GA32459@piskorski.com>
References: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
	<20050204202023.GA32459@piskorski.com>
Message-ID: <20050204205034.GA18717@greglaptop.internal.keyresearch.com>

On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote:
> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> 
> > Please note MPI is probably what i'll use, though i keep finding
> > online information about 'gamma'. Is that faster latency than MPI
> > implementations?
> 
>   http://www.disi.unige.it/project/gamma/

In addition to gamma, there's also MVAPICH from LBL, and at least two
commercial products, one from Scali, and one from the Cluster
Competence Center.

-- greg


From ctierney at HPTI.com  Fri Feb  4 14:13:19 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Fri, 04 Feb 2005 15:13:19 -0700
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <20050204202023.GA32459@piskorski.com>
References: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
	<20050204202023.GA32459@piskorski.com>
Message-ID: <1107555198.2916.4.camel@localhost.localdomain>

On Fri, 2005-02-04 at 13:20, Andrew Piskorski wrote:
> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> 
> > Please note MPI is probably what i'll use, though i keep finding
> > online information about 'gamma'. Is that faster latency than MPI
> > implementations?
> 
>   http://www.disi.unige.it/project/gamma/
> 
> Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000
> gigabit ethernet cards, for use with MPI.  It offers much better
> latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us),
> but worse than the specialized HPC interconnects (maybe 3 us).

See Josip's post on tweaking interrupts on gigE drivers, but
I have a small system with Intel gigE cards and a Dell gigE switch.
Latency between two nodes through the switch is 30 us.  This is
typical of what I see for other gigE cards.  A latency of 60-100 us
is a bit high.  

Avoiding TCP/IP is still a big improvement.

Craig

> 
> The attraction of GAMMA, is that Intel Pro/1000 cards can be had for
> $11 to $60 or so each (depending on exact model, etc.), and gigabit
> switches are also pretty cheap, while SCI or Myrinet is somewhere in
> the $500 to $1500 per node range (I don't keep track).
> 
> So if your application can benefit from lower latency, but you want
> something really cheap, GAMMA should be well worth trying.


From john.hearns at streamline-computing.com  Sat Feb  5 00:58:49 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat, 05 Feb 2005 08:58:49 +0000
Subject: [Beowulf] SGE web frontends
In-Reply-To: <Pine.LNX.4.58.0502041335460.18382@ganesh.phy.duke.edu>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>
	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>
	<420376B0.7000107@scalableinformatics.com>
	<4203913A.1030202@cmrl.wustl.edu>
	<Pine.LNX.4.58.0502041335460.18382@ganesh.phy.duke.edu>
Message-ID: <1107593929.5504.1.camel@Vigor45>

On Fri, 2005-02-04 at 13:37 -0500, Robert G. Brown wrote:
> On Fri, 4 Feb 2005, Brian Henerey wrote:
> 
> > 
> > I don't mean to hijack this thread, but I'd also be interested to know 
> > if there are any open source web frontends for launching jobs on 
> > clusters. I've mostly written my own anyway, but if something's out 
> > there I'd like to know.
> 
> Same topic.  The issue is having a web "portal" that manages stuff like
> authentication, data transport, job submission/status etc.  Running the
> submissions through SGE rather than something else is just a detail.
I know that the London E-science centre do work in that area.

Have a look at GridSAM
http://www.lesc.ic.ac.uk/gridsam/index.html

Haven't used it myself, mind - it was only out in beta
last week!
And sadly: "The DRMConnector for launching to Grid Engine resource using
DRMAA is currently in development and not yet released. "


Also, you could ask the same question on the SGE list.
http://gridengine.sunsource.net/project/gridengine/maillist.html


From kus at free.net  Fri Feb  4 09:06:01 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Fri, 04 Feb 2005 20:06:01 +0300
Subject: [Beowulf] Home Beowulf - NIC latencies
Message-ID: <web-507063@free.net>


>Good morning!

>With the intention to run my chessprogram on a beowulf to be constructed
>here (starting with 2 dual-k7 machines here) i better get some good advice
>on which network to buy. Only interesting thing is how fast each node can
>read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>simultaneously.
I'm very glad that parallelised chess programs are being developed,
but I regret that chess programs don't have coarse-grained
parallelism ... :-( I thought that every processor could handle
some big part of the move tree. Unfortunately I can't beat Deep Fritz 8
even w/o parallelization :-)

>The faster this can be done the better the algorithmic speedup for parallel
>search in a chess program (property of YBW, see publications in journal of
>icga: www.icga.org). This speedup is exponential (or better you get
>punished exponential compared to single cpu performance).

>Which network cards considering my small budget are having lowest latencies
>can be used?
>quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>card when i altavista'ed online and i wonder how to get more than 2 nodes
>to work without switch. Perhaps there is low cost switches with reasonable
>low latency?
   One idea for a "low price & low latency interconnect infrastructure"
may be ATOLL (//www.atoll-net.de), because it needs no "external"
switches. But I don't know about the commercial availability of ATOLL
hardware just now.
>Please note MPI is probably what i'll use, though i keep finding online
>information about 'gamma'. Is that faster latency than MPI implementations?
   You can run MPI over GAMMA, which gives lower latencies.

Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow

>Note normal 1Gbit cards for normal network traffic.
>Each node is a SMP or NUMA node and not only multiprocessor also
>multithreaded.

>I welcome any advice,
>Best regards,
>Vincent

Vincent Diepeveen


From nj at hemeris.com  Fri Feb  4 09:21:24 2005
From: nj at hemeris.com (Nicolas Jungers)
Date: Fri, 04 Feb 2005 18:21:24 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
References: <3.0.32.20050203045323.01002100@pop.xs4all.nl>
Message-ID: <1107537685.6224.12.camel@lcube.bxl.jungers.net>

On Thu, 2005-02-03 at 04:53 +0100, Vincent Diepeveen wrote:
> Good morning!
> 
> With the intention to run my chessprogram on a beowulf to be constructed
> here (starting with 2 dual-k7 machines here) i better get some good advice
> on which network to buy. Only interesting thing is how fast each node can
> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
> simultaneously.
> 
> The faster this can be done the better the algorithmic speedup for parallel
> search in a chess program (property of YBW, see publications in journal of
> icga: www.icga.org). This speedup is exponential (or better you get
> punished exponential compared to single cpu performance).
> 
> Which network cards considering my small budget are having lowest latencies
> can be used?
> 
> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
> card when i altavista'ed online and i wonder how to get more than 2 nodes
> to work without switch. Perhaps there is low cost switches with reasonable
> low latency?
> 
> Please note MPI is probably what i'll use, though i keep finding online
> information about 'gamma'. Is that faster latency than MPI implementations?

gamma bypasses tcp/ip, thus shaving off most of the latency. Unfortunately
it's not very actively developed, though they "recently" (last year)
updated their stack to the e1000 (intel gigabit ethernet) NIC. I know that
some people at CERN use their own communication stack on e1000, similar to
gamma, with impressive results. I dunno if it's widely available.

Nicolas


From ashley at quadrics.com  Fri Feb  4 09:31:01 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Fri, 04 Feb 2005 17:31:01 +0000
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
References: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
Message-ID: <1107538261.13957.10.camel@localhost.localdomain>

On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> >> Good morning!
> >> 
> >> With the intention to run my chessprogram on a beowulf to be constructed
> >> here (starting with 2 dual-k7 machines here) i better get some good advice
> >> on which network to buy. Only interesting thing is how fast each node can
> >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
> >> simultaneously.
> >
> >Is there any way to do this less often with a larger transfer?  
> >If you
> >wrote a small benchmark that did only that (send 64 bytes randomly
> >from a large array in memory) and make it easy to download, build, run,
> >and report results, I suspect some people would.
> 
> One way pingpong with 64 bytes will do great.

pingpong is not really the same, adding a random element can slow down
comms and ideally it sounds like you want a one-sided operation.
Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a
benchmark.

> CPU's are 100% busy and after i know how many times a second the network
> can handle in theory requests i will do more probes per second to the
> hashtable. The more probes i can do the better for the game tree search.

Are you overlapping comms and compute or doing blocking reads?  If you
are overlapping then the issue rate for reads is more important than the
raw latency.

> >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
> >> card when i altavista'ed online and i wonder how to get more than 2 nodes
> >> to work without switch. Perhaps there is low cost switches with reasonable
> >> low latency?
> >Do you know that gigabit is too high latency?
> 
> The few one way pingpong times i can find online from gigabit cards are not
> exactly promising, to say it very polite. Something in the order or 50 us
> one way pingpong time i don't even consider worth taking a look at at the
> picture.
> 
> Each years cpu's get faster. For small networks 10 us really is the upper
> limit.

10us is easily achievable, I've just measured a read time of a little
over 3us and an issue rate of 1.33us.

> So before we start searching every node (=position) we quickly want to find
> out whether other cpu's already searched it.
> 
> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
> search at 460 processors). Simply because the machine has 512GB ram.
> 
> So in short you take everything you can get.

So is this a parallel algorithm or simply a big "memory farm" you are
after?  You don't hear much of clusters being used for the latter but in
some cases it's an eminently sensible thing to do.

Ashley,


From monang at gmail.com  Fri Feb  4 11:23:13 2005
From: monang at gmail.com (Monang Setyawan)
Date: Sat, 5 Feb 2005 02:23:13 +0700
Subject: [Beowulf] Newbie Question
Message-ID: <5dc04bbf050204112319fe7fbf@mail.gmail.com>

Hi. I'm a newbie in this parallel computing thing. 
(sorry for my bad english, I'm Indonesian)

My current project is software that analyzes DNA/protein sequence
data and needs high performance. I plan to deploy this
software on a network of workstations (maybe just about 10 PCs on
the network). Am I in the wrong place?

I am going to use the message passing paradigm (MPI) to write the
software. I've read that there are several choices of MPI
implementation. The problem is, I'm bad at both C and Fortran (I
usually use Java as my favorite language). Some sources say that Java
(or its MPI wrappers or pure-Java MPI implementations) isn't good enough
to implement a parallel computing solution. Is that right?

My third question is, is there any pdf/ps/one file version of
"Engineering a Beowulf-style Compute Cluster''?

Thanks in advance.

-- 
For the sake of time..


From rhamann at uccs.edu  Fri Feb  4 11:31:01 2005
From: rhamann at uccs.edu (R Hamann)
Date: Fri, 04 Feb 2005 12:31:01 -0700
Subject: [Beowulf] MPICH2: Handle Limit?
In-Reply-To: <Pine.LNX.4.58.0502041214470.29537@terra.mcs.anl.gov>
References: <web-21778622@uccs.edu>
	<Pine.LNX.4.58.0502041214470.29537@terra.mcs.anl.gov>
Message-ID: <web-21848472@uccs.edu>

Rob,

I thought any limit would be weird, let alone something like 84 (7 X 
12?)  Anyway, I thought it was based on the number of MPI variables 
declared (data_types, windows, requests) because every time I added 
new declarations, it would hang on Fedora core 2, but run to 
completion on Scyld (but with erroneous results). If I deleted unused 
MPI declarations, it would start to work again.  I counted all my 
handles and came up with 84.

However, after deleting two 26 element arrays of handles, I thought it 
would work.  When I added more handles, it bombed again.  I started to 
try other things.  I added 4 junk ints.  I didn't use the variables I 
declared, but it still bombed.  When I converted them to chars, it 
started working again.  Very strange.

Have you ever encountered this before?   I'm doing a 3d cellular 
automaton, so I need a lot of datatypes for exchange of ghost cells. 
It's obviously some strange error I've made that's manifesting itself 
in MPI instead of as a runtime or syntax error.  I'm gonna try looking 
for any buffer overruns now, but other than that I'm stumped.

GCC on Fedora Core 2 and on Scyld Beowulf
MPICH 2 1.0

Thanks,

R


On Fri, 4 Feb 2005 12:16:17 -0600 (CST)
  Rob Ross <rross at mcs.anl.gov> wrote:
> Hi Ron,
> 
> There should not be an 84 handle limit.
> 
> Can you tell me what version of MPICH2 this is, and what architecture and 
> OS you're running on?  Do you have a simple test that exhibits the 
> problem?
> 
> Thanks,
> 
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
> 
> 
> On Thu, 3 Feb 2005, R Hamann wrote:
> 
>> I've been having some strange problems with a program using the MPICH2 
>> library.  When I added some new datatypes for ghost cell exchange, the 
>> program would hang.  I figured out that any number of handles over 84 
>> would cause this.  Fortunately, I could delete some handles that I no 
>> longer needed, but it still seemed strange.  Are my calculations 
>> correct that for each process there is an 84 handle limit? or am I 
>> seeing some other problem?
>> 
>> Ron


From rodmur at maybe.org  Fri Feb  4 12:29:55 2005
From: rodmur at maybe.org (Dale Harris)
Date: Fri, 4 Feb 2005 12:29:55 -0800
Subject: [Beowulf] scyld's beorun
Message-ID: <20050204202955.GS32046@maybe.org>


Hey, I was looking at some web page talking about schedulers and Scyld's
beowulf, and using the beorun command.  I'm not able to find much
documentation out there about what this command is or does.  Anyone
familiar with it?

-- 
Dale Harris   
rodmur at maybe.org
/.-)


From diep at xs4all.nl  Fri Feb  4 11:39:12 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 04 Feb 2005 20:39:12 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204203911.01006630@pop.xs4all.nl>

At 17:31 4-2-2005 +0000, Ashley Pittman wrote:
>On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
>> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> >> Good morning!
>> >> 
>> >> With the intention to run my chessprogram on a beowulf to be constructed
>> >> here (starting with 2 dual-k7 machines here) i better get some good advice
>> >> on which network to buy. Only interesting thing is how fast each node can
>> >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>> >> simultaneously.
>> >
>> >Is there any way to do this less often with a larger transfer?  
>> >If you
>> >wrote a small benchmark that did only that (send 64 bytes randomly
>> >from a large array in memory) and make it easy to download, build, run,
>> >and report results, I suspect some people would.
>> 
>> One way pingpong with 64 bytes will do great.

>pingpong is not really the same, adding a random element can slow down
>comms and ideally it sounds like you want a one-sided operation.
>Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a
>benchmark.

Thank you for your answer. I have indeed investigated quadrics cards
intensively. Ask your colleague Daniel Kidger. 

Shmem is an ideal solution for what searching algorithms are doing.

Regrettably, it seems no one is willing to sell old quadrics cards (QM400).

>> CPU's are 100% busy and after i know how many times a second the network
>> can handle in theory requests i will do more probes per second to the
>> hashtable. The more probes i can do the better for the game tree search.
>
>Are you overlapping comms and compute or doing blocking reads?  If you
>are overlapping then the issue rate for reads is more important than the
>raw latency.

A node (a chess position in this case) eats on average 10 us; sometimes
that's 50 us, other times it's 1 us. That's the time a cpu is busy
calculating a chess-technical value for how good the position is, applying
human chess patterns. This is called the evaluation function in the search
world.

Before applying the evaluation function one does a lookup in the cache to
see whether this position was already searched. In case of a 2 node beowulf
that means you have 50% odds that this position is in local memory and a 50%
chance it's a remote lookup.

The reason is simple once you look at the hash function, which also gets
used in a lot of other software (not only search, but also encryption,
string matching and all types of caches).

For each piece at each square take a random value ( long long
randomvalue[12][64] ).
XOR the values of all pieces on the board with each other and you have what
is called the Zobrist hash of a position. It is very effective. Nothing
beats the speed of Zobrist, as you can compute it incrementally.

Now suppose we use the lower 20 bits to look up in a table of 1 million
entries. So we AND the hash number with 2^20 - 1 and look up at that address
in the hashtable.
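
A minimal Python sketch of the Zobrist scheme just described. The seed, the
piece codes and the square numbering (a1 = 0 ... h8 = 63) are assumptions made
for illustration, not DIEP's actual tables, and side to move is left out.

```python
import random

PIECES = 12                    # 6 piece types x 2 colours
SQUARES = 64
INDEX_MASK = (1 << 20) - 1     # lower 20 bits -> 2^20 table slots

# Fixed seed only so the sketch is reproducible (an assumption).
_rng = random.Random(2005)
RANDOM_VALUE = [[_rng.getrandbits(64) for _ in range(SQUARES)]
                for _ in range(PIECES)]

def zobrist(position):
    """Hash a position given as (piece, square) pairs: XOR one value per piece."""
    h = 0
    for piece, square in position:
        h ^= RANDOM_VALUE[piece][square]
    return h

def move_piece(h, piece, from_sq, to_sq):
    """Incremental update: XOR the piece out of from_sq and into to_sq."""
    return h ^ RANDOM_VALUE[piece][from_sq] ^ RANDOM_VALUE[piece][to_sq]

def table_index(h):
    """Slot in a 2^20-entry table: AND with 2^20 - 1."""
    return h & INDEX_MASK

# 1.e4 e5 2.d4 versus 1.d4 e5 2.e4, reduced to the three pawns that move.
WP, BP = 0, 6                                     # hypothetical piece codes
E2, D2, E4, D4, E7, E5 = 12, 11, 28, 27, 52, 36
start = zobrist([(WP, E2), (WP, D2), (BP, E7)])
via_e4 = move_piece(move_piece(move_piece(start, WP, E2, E4), BP, E7, E5), WP, D2, D4)
via_d4 = move_piece(move_piece(move_piece(start, WP, D2, D4), BP, E7, E5), WP, E2, E4)
print(via_e4 == via_d4, table_index(via_e4))
```

Because XOR is commutative, both move orders end on the same hash and hence
the same table slot, which is what makes a transposition a cache hit.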

Obviously such a cache is distributed across the nodes, each node having an
equal share of the global transposition table, as it is officially called.
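
One way to picture the equal shares, as a hedged sketch: split the global
index range into contiguous blocks, one per node. The block layout and the
function names are assumptions; any even split gives the same remote
fraction.

```python
TABLE_BITS = 20
TABLE_ENTRIES = 1 << TABLE_BITS

def locate(hash_value, num_nodes):
    """Return (owner_node, local_slot) for a hash value.

    Assumes num_nodes divides the table size evenly.
    """
    per_node = TABLE_ENTRIES // num_nodes
    index = hash_value & (TABLE_ENTRIES - 1)
    return index // per_node, index % per_node

def remote_fraction(num_nodes):
    """Fraction of uniformly spread lookups that land on another node."""
    return 1.0 - 1.0 / num_nodes
```

With 2 nodes remote_fraction gives 0.5, the 50% odds mentioned above; with 8
nodes it is already 0.875, so almost every probe crosses the network.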

Trivially, doing this every 10 us would put too much stress on the network.
So usually one doesn't do it at the leaves themselves (called quiescence
search). That means a lookup only gets tried in 20% of the nodes. That's
already on average once every 100 us. The slower the network card, the fewer
remote hashtable lookups one tries, obviously. Finding for each cluster the
optimum search depth at which to try it is of course not so difficult to
figure out.
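
The 100 us figure can be reproduced with a little arithmetic. This is only
one reading of the numbers above (10 us per node, probes at 20% of the nodes,
half of them remote on a 2 node cluster), and the function name is made up
for the sketch.

```python
def remote_lookup_interval_us(node_time_us, probed_fraction, num_nodes):
    """Average time between remote probes issued by one cpu."""
    probe_interval = node_time_us / probed_fraction   # time between any two probes
    remote_share = 1.0 - 1.0 / num_nodes              # probes that leave the node
    return probe_interval / remote_share

print(remote_lookup_interval_us(10, 0.2, 2))
```

Under these assumptions a cpu probes every 50 us and goes remote every 100 us.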

1 lookup reads 64 bytes and that's 4 entries where the position could be
stored.

1 entry is 16 bytes and stores quite some information. Apart from a lot of
bits to identify a chessposition, the score is there (20 bits) and what the
best move was in this position. 
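
A sketch of how four 16-byte entries could fill one 64-byte lookup. Only the
sizes and the 20-bit score come from the description above; the 64-bit key,
the 12-bit depth field and the 32-bit move field are assumptions chosen to
make the bytes add up.

```python
import struct

ENTRY = struct.Struct("<QII")        # 8 + 4 + 4 = 16 bytes, no padding
ENTRIES_PER_LOOKUP = 4               # 4 x 16 = 64 bytes per probe
SCORE_BITS = 20
SCORE_MASK = (1 << SCORE_BITS) - 1
SCORE_BIAS = 1 << (SCORE_BITS - 1)   # store signed scores as unsigned

def pack_entry(key, score, depth, best_move):
    """Pack one entry; depth (12 bits, assumed) shares a word with the score."""
    word = ((depth & 0xFFF) << SCORE_BITS) | ((score + SCORE_BIAS) & SCORE_MASK)
    return ENTRY.pack(key, word, best_move)

def unpack_entry(raw):
    key, word, best_move = ENTRY.unpack(raw)
    return key, (word & SCORE_MASK) - SCORE_BIAS, word >> SCORE_BITS, best_move

def probe(bucket, key):
    """Scan the four entries of one 64-byte bucket for a matching key."""
    for i in range(ENTRIES_PER_LOOKUP):
        entry = unpack_entry(bucket[i * ENTRY.size:(i + 1) * ENTRY.size])
        if entry[0] == key:
            return entry
    return None
```

Empty slots are all zero bytes here, so a real table would need to reserve
key 0 or carry a validity flag; the sketch ignores that.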

>> >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>> >> card when i altavista'ed online and i wonder how to get more than 2 nodes
>> >> to work without switch. Perhaps there is low cost switches with reasonable
>> >> low latency?
>> >Do you know that gigabit is too high latency?
>> 
>> The few one way pingpong times i can find online from gigabit cards are not
>> exactly promising, to say it very polite. Something in the order or 50 us
>> one way pingpong time i don't even consider worth taking a look at at the
>> picture.
>> 
>> Each years cpu's get faster. For small networks 10 us really is the upper
>> limit.
>
>10us is easily achievable, I've just measured a read time of a little
>over 3us and a issue rate of 1.33us.

Suppose an 8 node quadrics setup, so with a switch in the middle, and all
processors trying to read nonstop over the network from a random remote
processor, each processor reading out of the 64MB on the card. So never from
the physical memory of a processor, just from the remote network cards.

What speed is achievable then to read 64 bytes?

SGI with their supercomputers never gets better than 5.8 us there (that's
reading 8 bytes) on average (origin3800) when the numaflex routers kick in.
Altix3000 is way worse there. More bandwidth optimized, i guess.

>> So before we start searching every node (=position) we quickly want to find
>> out whether other cpu's already searched it.
>> 
>> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
>> search at 460 processors). Simply because the machine has 512GB ram.
>> 
>> So in short you take everything you can get.
>
>So is this a parallel algorithm or simply a big "memory farm" you are
>after?  You don't hear much of clusters being used for the latter but in
>some cases it's a eminently sensible thing to do.

I take care that the cpu's get nearly 100% load and am prepared to
sacrifice, say, 10% of the scaling on a network to the read/write latency of
the hashtable. So i just figure out how many reads i can do in 10% of the
system time and fill that with reads. The other 90% of the system time it
has to evaluate chess positions and be busy with the real stuff.

By setting the depth in the search at which it is allowed to read higher or
lower, i can manually adjust the traffic over the network.
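
The 10% budget translates into a probe rate directly. A small sketch,
assuming blocking reads where the cpu waits out the full latency; the
function name is hypothetical.

```python
def reads_per_second(read_latency_us, budget=0.10):
    """Blocking remote reads per second that fit in `budget` of cpu time."""
    return budget * 1_000_000 / read_latency_us

for latency_us in (3, 10, 50):       # latencies mentioned in this thread
    print(latency_us, "us ->", round(reads_per_second(latency_us)), "reads/s")
```

So a 3 us read allows roughly 33,000 probes a second per cpu, while a 50 us
gigabit round leaves only 2,000.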

Best regards,
Vincent

>Ashley,


From diep at xs4all.nl  Fri Feb  4 12:39:47 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 04 Feb 2005 21:39:47 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>

At 11:38 4-2-2005 -0800, Bill Broadley wrote:
>> 
>> One way pingpong with 64 bytes will do great.
>> 
>
>A very similar number I build a circularly linked list and read a value,
>add 1 to it, and send it to the next host, with a GigE network:
>
>compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local
>compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local
>size=   10, 131072 hops, 8 nodes in  5.30 sec ( 40.4 us/hop)    966 KB/sec
>
>Oh, you said 64 (I'm sending INTs, so 16):
>size=   16, 131072 hops, 8 nodes in  5.35 sec ( 40.8 us/hop)   1531 KB/sec
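
The per-hop and KB/sec columns follow from the hop count and the wall time.
A quick check in Python, assuming 4-byte ints and KB = 1024 bytes (both
assumptions, but they reproduce the printed figures):

```python
INT_BYTES = 4   # assumed size of one int

def per_hop_us(seconds, hops):
    """Average time per hop in microseconds."""
    return seconds / hops * 1e6

def throughput_kb_s(ints, hops, seconds):
    """Payload moved per second, in units of 1024 bytes."""
    return ints * INT_BYTES * hops / seconds / 1024

print(round(per_hop_us(5.30, 131072), 1), int(throughput_kb_s(10, 131072, 5.30)))
print(round(per_hop_us(5.35, 131072), 1), int(throughput_kb_s(16, 131072, 5.35)))
```

Note that size=16 is 16 ints, so 64 bytes per hop, which is the message size
asked about.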

I'm amazed you get it to 40.8 us. Probably you tested on an idle network?

How fast is it when the cpu's are 100% busy doing integer work?

>> CPU's are 100% busy and after i know how many times a second the network
>> can handle in theory requests i will do more probes per second to the
>> hashtable. The more probes i can do the better for the game tree search.
>
>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>it's in the 4-6us range.  If you bought dual opterons with the special

At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.

For example :
  http://www.dolphinics.com/pdf/datasheet/Dolphin_socket_4p.pdf

>hypertransport slot you could get it down to 1.5us or so.  SGI
>altix machines can get that down again to around 1.0us.  Of course
>speed isn't cheap.

Altix3000 has worse latency than origin3800, if i interpret the results
correctly. Altix3000 is 3-4 us one way pingpong at 64 processors, which
origin3800 gets at 512 processors.

For 64 processors see the extensive benchmarking by prof Aad v/d Steen for
dutch government organisations. His results are at www.sara.nl in pdf format.
Look for his presentation of 1 july 2003. 

When i ran my latency tests (using shared memory) on a limited number of
cpu's, the origin3800 really was a lot faster in latency than altix3000.

A problem of the altix3000 design is that scheduling is very hard because of
the complex routing, as each brick is connected to 2 routers which each
connect to other parts of the machine.

This causes immense scheduling problems when there are, as is normal, 150
users simultaneously on the machine; those problems are not there when you
can benchmark an entirely empty machine with just 1 user.

>> The few one way pingpong times i can find online from gigabit cards are not
>> exactly promising, to say it very polite. Something in the order or 50 us
>> one way pingpong time i don't even consider worth taking a look at at the
>> picture.
>> 
>> Each years cpu's get faster. For small networks 10 us really is the upper
>> limit.
>
>Okay, so dolphin, myrinet, or IB.

Do you have URLs where IB can be bought without needing to buy an entire system?

>> Let's not discuss parallel chess algorithm too much in depth. 100 different
>> algorithms/enhancements get combined with each other. They are not the
>> biggest latency problem. The latency problem is caused by the hashtable.
>> Hashtable is a big cache. The bigger the better. It avoids researching the
>> same tree again.
>
>Okay, so my question is, which would be better:
>* 8 4GB caches that you could query 80 million times a second?

This one by far.

Actually for the top searches such big caches are not needed. Locally i may
allocate 200-400MB a cpu for cache, but a shared cache can easily be as low
as 4MB a cpu, no problem. Could get it even lower than that if needed.

99% of all nodes (chess positions) that get searched are near the leaves. So
if i move the variable that controls where it may also look up at remote
cpu's from 0 up to 2, then already 99% of all nodes don't get looked up
remotely.

>* 1 64GB cache that you could query 200,000 times a second?
 
>> In games like chess and every search terrain (even simulated flight) you
>> can get back to the same spot by different means causing a transposition.
>> Like suppose you start the game with 1.e4,e5 2.d4 that leads to the same
>> position like 1.d4,e5 2.e4. So if we have searched already 1.e4,e5 2.d4
>> that position P we store into a large cache. Other cpu's first want to know
>> whether we already searched that position. 
>
>Right.  But if you can calculate a few Billion operations per second
>sometimes it is faster to recalculate then wait 10-20us for an answer.

Looking 1 ply deeper in search is exponential. At a 460 cpu search
(origin3800), moving the variable from 1 (the default, so it was already not
storing/looking up the leaves remotely) to 10 lost me 7 ply of search depth.

That's a factor of about 3^7 = 2187.

To answer the question: YES, 1 fast pc processor would in such a case
outsearch a 512 processor supercomputer hands down.

Supercomputers are of course notorious here. It takes a year or so to
deliver them, and the processor chosen at the time of buying already wasn't
the fastest, so when they finally work fine for users the processors are at
least 2 times slower than pc processors (for integer work).

Clusters are far superior in that respect.

>> Those hashtable positions get created quite quickly. Deep Blue created them
>> at a 100 million positions a second and simply didn't store vaste majority
>> in hashtable (would be hard as it was in hardware). That's one of the
>> reasons why it searched only 10-12 ply, already in 1999 that was no longer
>> spectacular when 4 processor pc's showed up at world champs. 
>
>Indeed, better algorithms can allow a 4 cpu to compete with a 2000.

The Sheikh (one of the princes of the united arab emirates, see
www.hydrachess.com) plans on building a 1024 processor chess computer, he
told me over MSN. He has bad advisors IMHO. He's using myrinet and a
bad parallel search (speedup less than the square root of the total number
of cpu's). Objectivity and desert sand are a bad combination.

>> At a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move,
>> full move is when both sides make a move) in a few seconds, searching a
>> 100000 positions per second a cpu.
>> 
>> So before we start searching every node (=position) we quickly want to find
>> out whether other cpu's already searched it.
>
>So that operation will cost around 80us with GigE, and 10-16us with IB
>or Myri.

80 us is what i read elsewhere too for GigE, yes. 

Is it so hard to make a card with lower latency for a few dollars?

I mean, if i spend 135 euro on a cpu i can get myself an opteron 1.4Ghz or
something. If i spend 1000 euro i get myself say a 2.4Ghz opteron.

Less than a factor 2 faster.

If you buy a network card for 135 euro it is 80 us. When you buy a highend
network card it's a factor 10 faster from the user's viewpoint.

That's quite a lot!

>> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
>> search at 460 processors). Simply because the machine has 512GB ram.
>
>The origin 3800 has a very healthy interconnect, shared memory lookups
>are in the few 100 ns range, and MPI with the newest libraries are
>in the 1-2us range.

If the interconnects (hubs) of the origin are fine, then they must use
really slow routers.

A shared memory lookup is 5.8 us on average at 460 processors on the
origin3800, with no one else on the system (looking up 8 bytes). One way
pingpong is 3-4 us.

That machine is equipped with so-called 35ns routers.

Lookup to local memory is 280 ns by the way, on itanium2 as well as
origin. Of course everything is randomized. It's complete TLB thrashing.

>> So in short you take everything you can get.
>
>Of course.
>
>> The search works with internal iterative deepending which means we first
>> search 1 ply, then 2 ply, then 3 ply and so on.
>> 
>> The time it takes to get to the next iteration i hereby define as the
>> branching factor (Knuth has a different definition as he just took into
>> account 1 algorithm, the 'todays' definition looks more appropriate).
>> 
>> In order to search 1 ply deeper obvious it's important to maintain a good
>> branching factor. I'm very bad in writing out mathematical proofs, but it's
>> obvious that the more memory we use, the more we can reduce the number of
>> legal moves in this position P as next few ply it might be in hashtable,
>> which trivially makes the time needed to search 1 ply deeper shorter.
>> 
>> Storing closer to the root (position where we started searching) is of
>> course more important than near the leafs of the search tree.
>> 
>> When for example not storing in hashtable last 10 ply near the leafs in an
>> overnight experiment the search depth dropped at 460 processors from 20 ply
>> to 13 ply.
>> 
>> Of course each processor of supercomputers is deadslow for game tree search
>> (it's branchy 100% integer work completely knocking down the caches), so
>> compared to pc's you already start at a disadvantage of a factor 16 or so
>> very quickly, before you start searching (in case of TERAS i had to fight
>> with outdated 500Mhz MIPS processors against opterons and high clocked quad
>> Xeons), so upgrading my own networkcards is more clever. 
>
>Interesting.  Of course the Origin 3800 is quite dated, not that the
>Itanium is an opteron killer, but it is much more competitive, and has
>much larger caches.

Itanium 1.3Ghz, using 24 hours of PGO and after i figured out all kinds of
compiler options to not take shortcuts by default, is the same speed as a
1.3Ghz opteron for DIEP. I understand why governments buy them. They
are good on paper and have no real weak spots. But it is a horror to program
for those itaniums. 

L3 cache sizes are not important for diep. See the extensive benchmarking of
my program at the different hardware sites, for example by Johan de Gelas or :

  Aceshardware : http://www.aceshardware.com/read.jsp?id=60000259  
  Sudhian : http://www.sudhian.com/showdocs.cfm?aid=635&pid=2403
  Soon also tested at www.anandtech.com !

>> Yet getting yourself a network even between a few nodes as quick as those
>> supercomputers is not so easy...
>
>Quadrics and Pathscale's infinipath have networks available that are in the
>same ballpark as the SGI origin.  Even dolphin although I'm not very
>familar with them.

I am very impressed by the quadrics and dolphin cards. Probably by
infinipath too once i check them out. Will do. 

I'm not so impressed yet by myrinet actually, but if cluster builders can
earn a couple of hundred dollars more on each node i'm sure they'll do it.

>> Additional your own beowulf network you can first decently test at before
>> playing at a tournament, and without good testing at the machine you play
>> at in tournaments you have a hard 0% chance that it plays well. 
>> 
>> The only thing in software that matters is testing.
>
>Indeed, good luck, thanks for the overview.  I'm planning on a cluster
>with a very fast (sub 2.5us network), but I won't have it for a few months.
>
>I had some infiniband hardware on loan, but I had to return it.
>
>-- 
>Bill Broadley
>Computational Science and Engineering
>UC Davis
>
>


From diep at xs4all.nl  Fri Feb  4 13:33:47 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 04 Feb 2005 22:33:47 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204223345.01009170@pop.xs4all.nl>

Thanks for your deep insight, this is very helpful!

Vincent
www.diep3d.com

At 12:57 4-2-2005 -0700, Josip Loncaric wrote:
>Vincent Diepeveen wrote:
>> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>>>
>>>Do you know that gigabit is too high latency?
>
>Gigabit Ethernet adapters often need tweaking to deliver reasonable 
>latency, bandwidth, and CPU utilization.
>
>For example, if your system uses the e1000 driver (Intel's gigabit 
>Ethernet), the default setting is "dynamic Interrupt Throttle Rate" -- 
>which means that the card will delay interrupting the CPU by up to about 
>130 microseconds after receiving a packet.  Moreover, the "dynamic" part 
>causes the network chip microcode to vary this delay in multiples of 
>about 16 microseconds, so that different packets will generally 
>experience different receive delays.
>
>For the e1000 driver, 
>https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html 
>recommends using "options e1000 InterruptThrottleRate=80000" (add this 
>line to /etc/modules.conf).  Users of this driver may also want to check 
>Intel's parameters for e1000 listed at 
>http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just 
>don't assume that the default values are appropriate for cluster use.
>
>Other gigabit Ethernet adapters have similar interrupt mitigation 
>strategies, all designed to gracefully cope with high packet rates at 
>high network speeds.  For cluster use, adjustments are usually advisable.
>
>The basic Rx interrupt mitigation scheme is this: the receiver's CPU 
>won't be interrupted until at least N packets have arrived or M 
>microseconds have elapsed (whichever comes first).  This clearly adds up 
>to M microseconds to network latency.  BTW, one often sees N=6 
>(otherwise NFS performance can seriously degrade) and M>=16.  Other 
>variants of this basic scheme are possible; but they all mean increased 
>latencies.
>
>Finally, don't forget the Tx side interrupt mitigation, or else the 
>sending CPU might not be told promptly that it's OK to send more.  The 
>default Tx settings are probably fine for full size packets, but if your 
>applications send lots of small packets, tweaking your network driver's 
>Tx settings may help.
>
>Sincerely,
>Josip
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>
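
For reference, the Rx interrupt mitigation rule quoted above (interrupt
after N packets or M microseconds, whichever comes first) can be sketched
in a few lines of Python. This is only a toy model, not driver code: the
function names are made up for illustration, and it assumes that
InterruptThrottleRate is a cap on interrupts per second, as Intel's e1000
documentation describes it.

```python
def rx_interrupt_delay_us(arrivals_us, n_packets, m_timeout_us):
    """Extra latency seen by the first pending packet, in microseconds.

    arrivals_us: sorted arrival times of packets since the last interrupt.
    The interrupt fires once n_packets have arrived or m_timeout_us has
    elapsed since the first arrival, whichever comes first.
    """
    first = arrivals_us[0]
    deadline = first + m_timeout_us
    if len(arrivals_us) >= n_packets:
        fire = min(arrivals_us[n_packets - 1], deadline)
    else:
        fire = deadline
    return fire - first


def min_interrupt_spacing_us(interrupts_per_second):
    """Smallest gap between two interrupts for a given throttle rate."""
    return 1_000_000 / interrupts_per_second


if __name__ == "__main__":
    # A lone small packet (typical for latency-bound traffic), N=6, M=16.
    print(rx_interrupt_delay_us([0.0], n_packets=6, m_timeout_us=16.0))
    # A burst of six packets 1 us apart triggers the interrupt early.
    print(rx_interrupt_delay_us([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 6, 16.0))
    # The setting recommended above: InterruptThrottleRate=80000.
    print(min_interrupt_spacing_us(80000))
```

In this model a lone packet with N=6 and M=16 waits the full 16 us, and
InterruptThrottleRate=80000 corresponds to at least 12.5 us between
interrupts.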


From rross at mcs.anl.gov  Fri Feb  4 10:16:17 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Fri, 4 Feb 2005 12:16:17 -0600 (CST)
Subject: [Beowulf] MPICH2: Handle Limit?
In-Reply-To: <web-21778622@uccs.edu>
References: <web-21778622@uccs.edu>
Message-ID: <Pine.LNX.4.58.0502041214470.29537@terra.mcs.anl.gov>

Hi Ron,

There should not be an 84 handle limit.

Can you tell me what version of MPICH2 this is, and what architecture and 
OS you're running on?  Do you have a simple test that exhibits the 
problem?

Thanks,

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


On Thu, 3 Feb 2005, R Hamann wrote:

> I've been having some strange problems with a program using the MPICH2 
> library.  When I added some new datatypes for ghost cell exchange, the 
> program would hang.  I figured out that any number of handles over 84 
> would cause this.  Fortunately, I could delete some handles that I no 
> longer needed, but it still seemed strange.  Are my calculations 
> correct that for each process there is an 84 handle limit? or am I 
> seeing some other problem?
> 
> Ron


From rross at mcs.anl.gov  Sat Feb  5 08:05:04 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Sat, 5 Feb 2005 10:05:04 -0600 (CST)
Subject: [Beowulf] MPICH2: Handle Limit?
In-Reply-To: <web-21848472@uccs.edu>
References: <web-21778622@uccs.edu>
	<Pine.LNX.4.58.0502041214470.29537@terra.mcs.anl.gov>
	<web-21848472@uccs.edu>
Message-ID: <Pine.LNX.4.58.0502050958200.29537@terra.mcs.anl.gov>

Hi Ron,

Well there *is* a limit, because the handles are represented by an 
integer, but from a practical perspective you should never have to worry 
about it.

I have not ever encountered this before.  I wrote most of that code, so I
would very much like to figure out what is happening in your case.  I tend
to agree that it is probably some sort of buffer overrun.  We test on IA32
with gcc as our primary environment.

What exactly is happening when it "bombs"?  Are you getting a segfault?  
Is this something where you could capture a core file and get a stack 
trace?  Are there any errors reported?

Will the problem manifest itself with a single-process run?  If so, you 
could try valgrind.

Actually, while we're discussing it, why do you need "lots" of datatypes 
to exchange ghost cells?  There might be a way to simplify that too.

Regards,

Rob

On Fri, 4 Feb 2005, R Hamann wrote:

> I thought any limit would be weird, let alone something like 84 (7 X 
> 12?)  Anyway, I thought it was based on the number of MPI variables 
> declared (data_types, windows, requests) because every time I added 
> new declarations, it would hang on Fedora core 2, but run to 
> completion on Scyld (but with erroneous results). If I deleted unused 
> MPI declarations, it would start to work again.  I counted all my 
> handles and came up with 84.
> 
> However, after deleting two 26 element arrays of handles, I thought it 
> would work.  When I added more handles, it bombed again.  I started to 
> try other things.  I added 4 junk ints.  I didn't use the variables I 
> declared, but it still bombed.  When I converted them to chars, it 
> started working again.  Very strange.
> 
> Have you ever encountered this before?   I'm doing a 3d cellular 
> automata, so I need a lot of datatypes for exchange of ghost cells. 
>  It's obviously some strange error I've made that's manifesting itself 
> in MPI instead of a runtime or syntax error.  I'm gonna try looking for 
> any buffer overruns now, but other than that I'm stumped.
> 
> GCC on Fedora Core 2 and on Scyld Beowulf
> MPICH 2 1.0
> 
> Thanks,
> 
> R


From h.jasak at wikki.co.uk  Fri Feb  4 11:13:04 2005
From: h.jasak at wikki.co.uk (Hrvoje Jasak)
Date: Fri, 04 Feb 2005 14:13:04 -0500
Subject: [Beowulf] OpenFOAM
Message-ID: <4203C940.1000402@wikki.co.uk>

Hi Mike,

I've just found your post on OpenFOAM.  I am one of the (two) main 
authors/developers of FOAM and have been using it since 1993.  Linux is 
these days the main and most important parallel platform for FOAM and 
it is regularly used for large-scale simulations (especially LES).  I am 
still developing the code and doing research/working with students etc. 
with it - if you've got any questions or would like to get involved in 
keeping FOAM alive, please feel free to contact me.

Regards,

Hrvoje Jasak

-- 
Dr. Hrvoje Jasak

Wikki Ltd.
10 Palmerston House,               Tel: +44 (0)20 7221 9815
60 Kensington Place,               E-mail: H.Jasak at wikki.co.uk
London W8 7PU, United Kingdom


From mprinkey at aeolusresearch.com  Fri Feb  4 13:00:02 2005
From: mprinkey at aeolusresearch.com (Michael T. Prinkey)
Date: Fri, 4 Feb 2005 16:00:02 -0500 (EST)
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <20050204205034.GA18717@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.44.0502041556460.23003-100000@ra.thebes>

On Fri, 4 Feb 2005, Greg Lindahl wrote:
> On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote:
> > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> > 
> > > Please note MPI is probably what i'll use, though i keep finding
> > > online information about 'gamma'. Is that faster latency than MPI
> > > implementations?
> > 
> >   http://www.disi.unige.it/project/gamma/
> 
> In addition to gamma, there's also MVAPICH from LBL, and at least two
> commercial products, one from Scali, and one from the Cluster
> Competence Center.
> 
> -- greg

Greg, I think you mean MVICH at LBL.  It and MVIA are all but dead, 
AFAICT:

http://old-www.nersc.gov/research/FTG/mvich/index.html

Mike


From fant at pobox.com  Fri Feb  4 13:27:47 2005
From: fant at pobox.com (Andrew D. Fant)
Date: Fri, 04 Feb 2005 16:27:47 -0500
Subject: [Beowulf] SGE web frontends
In-Reply-To: <4203913A.1030202@cmrl.wustl.edu>
References: <Pine.GSO.4.58.0502021252340.25804@geranium.css.sfu.ca>	<Pine.LNX.4.58.0502040655280.12407@lilith.rgb.private.net>	<420376B0.7000107@scalableinformatics.com>
	<4203913A.1030202@cmrl.wustl.edu>
Message-ID: <4203E8D3.9030509@pobox.com>

Brian Henerey wrote:
> 
> I don't mean to hijack this thread, but I'd also be interested to know 
> if there are any open source web frontends for launching jobs on 
> clusters. I've mostly written my own anyway, but if something's out 
> there I'd like to know.
> 
> Thanks,
> Brian Henerey

Most of this is admittedly not open-source, but it is what I can think 
of off the top of my head for web/gui cluster front end tools.

I think Platform explored a web front end for LSF after they killed off 
the xlsf tools.  The tool I have seen lately that I would be most 
interested in seeing more of is Auger from the Jefferson Laboratory in 
Norfolk.  Technically it's not a web front end, because it's a java 
front end tool, but it looks nice in any case.  Most of the true web 
front ends for cluster jobs that I have seen are application specific 
portals.  NCSA has some examples, and PNL has a nice distributed web 
front end for computational chemistry applications, as well.

Andy


From nix at petelancashire.com  Sat Feb  5 10:07:56 2005
From: nix at petelancashire.com (Pete Lancashire)
Date: Sat, 05 Feb 2005 10:07:56 -0800
Subject: [Beowulf] real hard drive failures
In-Reply-To: <Pine.LNX.4.44.0501251715440.11920-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0501251715440.11920-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1107626876.3794.17.camel@l1.pdxeng.com>

The nice thing, and about the only nice thing, about using
a fan in this case is that the failure of a fan is not going
to kill you. If your motherboard has an unused 3-wire fan 'port'
you can have it report the failure.

In the past I've built a simple failure detector using an
8-pin MicroChip. I would think with some imagination
you could take a 555 + transistor + piezo buzzer and
create a simple alarm.

Another thing to use, though I've not seen it as an individual
item, is a heat sink. The Sun SPUD brackets come with a
plate that attaches to the bottom of the drive; the plate
has been punched with hmmm .. louvers ?.

-pete "ah the days of so many fans you could not hear yourself talk"

On Tue, 2005-01-25 at 14:26, Mark Hahn wrote:
> > > I'm only partially interested in the thread "Cooling vs HW replacement" but 
> > > the problem with drive failures is a real pain for me. So, I thought I'd 
> > > share some of my experience.
> > 
> > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm
> > disks 
> 
> I'm pretty dubious of this: adding two 50Khour moving parts to 
> improve the airflow around a 1Mhour moving part which only dissipates
> 10W in the first place?  designing the chassis for proper airflow 
> with minimum fanage is obviously smarter and probably safer.
> 
> > 	- if downtime is important, and should be avoidable, than raid
> > 	is the worst thing, since it's 4x slower to bring back up than
> > 	a single disk failure
> 
> eh?  you have a raid which is not operational while rebuilding?
> 
> > 	- raid will NOT prevent your downtime, as that raid box
> > 	will have to be shutdown sooner or later 
> > 	( shutting down sooner ( asap )  prevents data loss )
> 
> huh?  hotspares+hotplug=zero downtime.
> 
> but yes, treating whole servers as your hotspare+hotplug element is 
> a nice optimization, since hotplug ethernet is pretty cheap vs 
> $50 hotplug caddies for each and every disk ;)
> 


From john.hearns at streamline-computing.com  Sun Feb  6 00:55:21 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun, 06 Feb 2005 08:55:21 +0000
Subject: [Beowulf] Newbie Question
In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
Message-ID: <1107680121.28574.5.camel@Vigor45>

On Sat, 2005-02-05 at 02:23 +0700, Monang Setyawan wrote:
> Hi. I'm a newbie in this parallel computing thing. 
> (sorry for my bad english, I'm Indonesian)
> 
> My current project is a software that analyze DNA/Protein sequence
> data that needs high performance aspect on it. I plan to deploy this
> software on network of workstations (mm, may be just about 10 PCs on
> the network). Am I in wrong place now?
> 
You could start by looking at the BioBrew Linux distribution.
It probably has a lot of the tools you want for this work.
http://bioinformatics.org/biobrew


From john.hearns at streamline-computing.com  Sun Feb  6 01:07:05 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun, 06 Feb 2005 09:07:05 +0000
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
Message-ID: <1107680825.28574.14.camel@Vigor45>

On Fri, 2005-02-04 at 21:39 +0100, Vincent Diepeveen wrote:

> >
> >So that operation will cost around 80us with GigE, and 10-16us with IB
> >or Myri.
> 
> 80 us is what i read elsewhere too yes for GigE. 
> 
> Is it so hard to make a card with lower latency for a few dollar?
> 
> I mean if i buy for 135 euro a cpu i can get myself an opteron 1.4Ghz or
> something. If i buy for 1000 euro i get myself say a 2.4Ghz opteron.

We supply turnkey clusters with the SCore environment, which gives
excellent latency figures using standard gigabit ethernet NICs.


If you are looking for different hardware, Google for 'TOE' - TCP
Offload Engine. These are claimed to offer lower latency than onboard
adapters. But caveats apply: I've no idea how these work with MPI type
applications, as they're probably aimed at high bandwidth applications,
and it is probably more cost effective to go Myrinet/Quadrics/IB

Actually, it would be worth having the list's opinions on TOE adapters.
My guess is that they really don't do much for the latency, but
would be very good on webservers and database servers.


From rgb at phy.duke.edu  Sun Feb  6 06:15:56 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sun, 6 Feb 2005 09:15:56 -0500 (EST)
Subject: [Beowulf] Newbie Question
In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
Message-ID: <Pine.LNX.4.58.0502060913520.3924@lilith.rgb.private.net>

On Sat, 5 Feb 2005, Monang Setyawan wrote:

> Hi. I'm a newbie in this parallel computing thing. 
> (sorry for my bad english, I'm Indonesian)
> 
> My current project is a software that analyze DNA/Protein sequence
> data that needs high performance aspect on it. I plan to deploy this
> software on network of workstations (mm, may be just about 10 PCs on
> the network). Am I in wrong place now?
> 
> I am going to use message passing paradigm (MPI) to write the
> software. I've read that there are several choice of MPI
> implementation. The problem is, I'm bad in both C or Fortran (I
> usually use Java as my favorite language). Some source said that Java
> (or it's MPI wrapper or pure MPI implementation) isn't good enough to
> implement a parallel computing solution. Is that right?
> 
> My third question is, is there any pdf/ps/one file version of
> "Engineering a Beowulf-style Compute Cluster''?

On my personal website, on brahma, both.  Follow the links for beowulf
and beowulf book on my personal page, or use google with "beowulf book
pdf" to go right there.

Also, there are images for both US letter and Euro A4 there, as you
might have either kind of printer/paper.

   rgb

> 
> Thanks in advance.
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From patrick at myri.com  Sat Feb  5 18:27:57 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Sat, 05 Feb 2005 21:27:57 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
Message-ID: <420580AD.5050003@myri.com>

Hi Vincent,

Vincent Diepeveen wrote:
>>>CPU's are 100% busy and after i know how many times a second the network
>>>can handle in theory requests i will do more probes per second to the
>>>hashtable. The more probes i can do the better for the game tree search.
>>
>>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>>it's in the 4-6us range.  If you bought dual opterons with the special
> 
> 
> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.

Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
that includes fibers and a switch in the middle:

    Length   Latency(us)    Bandwidth(MB/s)
         0       2.684          0.000
         1       2.874          0.336
         2       2.898          0.690
         4       2.978          1.343
         8       2.965          2.699
        16       2.993          5.347
        32       3.409          9.388
        64       3.563         17.960
       128       3.977         32.185
       256       5.699         44.916

Quadrics would be lower by 1.5 us. I don't know about Dolphin; I 
haven't heard about notable SCI clusters in a long time.

> I am very impressed by the quadrics and dolphin cards. Probably by
> infinipath too when i check them out. Will do. 
> 
> I'm not so impressed yet by myrinet actually, but if cluster builders can
> earn a couple of hundreds of dollars more on each node i'm sure they'll do it.

I don't think Myrinet would be the cheapest, I am sure you can get a 
better deal from desperate interconnect vendors.

What does not impress you in Myrinet ?

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From diep at xs4all.nl  Sat Feb  5 19:36:20 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Sun, 06 Feb 2005 04:36:20 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>

At 21:27 5-2-2005 -0500, Patrick Geoffray wrote:
>Hi Vincent,
>
>Vincent Diepeveen wrote:
>>>>CPU's are 100% busy and after i know how many times a second the network
>>>>can handle in theory requests i will do more probes per second to the
>>>>hashtable. The more probes i can do the better for the game tree search.
>>>
>>>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>>>it's in the 4-6us range.  If you bought dual opterons with the special
>> 
>> 
>> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
>
>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
>that includes fibers and a switch in the middle:
>
>    Length   Latency(us)    Bandwidth(MB/s)
>         0       2.684          0.000
>         1       2.874          0.336
>         2       2.898          0.690
>         4       2.978          1.343
>         8       2.965          2.699
>        16       2.993          5.347
>        32       3.409          9.388
>        64       3.563         17.960
>       128       3.977         32.185
>       256       5.699         44.916
>
>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I 
>didn't hear about noticeable SCI clusters in a long time.
>
>> I am very impressed by the quadrics and dolphin cards. Probably by
>> infinipath too when i check them out. Will do. 
>> 
>> I'm not so impressed yet by myrinet actually, but if cluster builders can
>> earn a couple of hundreds of dollars more on each node i'm sure they'll do it.
>
>I don't think Myrinet would be the cheapest, I am sure you can get a 
>better deal from desperate interconnect vendors.
>
>What does not impress you in Myrinet ?

Thanks for your kind answer Patrick,

Obviously i mentioned that number because i read it elsewhere.

Well, a number of points bother me, and most of them apply to the other
vendors as well. But first let me note that i'm not against myrinet in
general. I am just trying to solve a very specific case. For that specific
case i'm not so impressed.

Note that so far i haven't found any desperate vendor. For sure quadrics
doesn't look desperate to me: they aren't even selling old cards anymore,
though they must still have thousands of them lying around from returned
upgraded networks. Second hand high-end cards seem to be very hard to find.

First of all i'm interested in how quickly i can get 4-64 bytes from remote
memory. So not from some kind of network card cache, as myrinet doesn't
have megabytes on chip, but just a few tens of kilobytes. The data
therefore has to come from the remote node's main memory, at a random
address in that memory. No streaming happens at all. The extra 400 ns that
the TLB adds is definitely not the problem, i guess.

The problem for me is to understand: "how do you get that memory on a
cluster?"

A latency on paper of course says nothing when you can't actually get the
data within that time.

"Paper supports everything."
    Arturo Ochoa (Caracas, Venezuela)

I hope everyone realizes that an important consequence of building a
beowulf cluster is that you actually want to *use* all the cpu's you have
available.

So every cpu has a program running that eats 100% of its cpu time. If it
didn't use 100% cpu time, you wouldn't need a cluster!

From that 100% cpu time you obviously must be prepared to give some away
to serve reads from other nodes as quickly as possible.

For all the latencies i see quoted on the hardware sites, it is very hard
for me to figure out whether it's a latency that is supported only by
paper, or whether it's a practical latency i can count on as a programmer,
including the overhead of all software layers, when each cpu is 100% busy
running a program.

Secondly, and as i'm not a cluster expert i don't know how to avoid this,
it's of course a big LOSS in sequential speed if my program must check every
few instructions whether there is some MPI message to handle.
If i check often, that will slow down my program 20 times. If i don't check
often, other cpu's will have to wait longer, and that defeats the purpose of
a fast network card.
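
The trade-off described above can be put into a toy model. The sketch below
assumes a program that alternates work_ns of useful computation with one
poll costing poll_cost_ns, and a request that arrives at a uniformly random
moment and is only noticed at the next poll. All names and numbers here are
hypothetical, chosen so that the tightest loop shows the factor 20; real
MPI progress costs depend on the implementation.

```python
def polling_tradeoff(work_ns, poll_cost_ns):
    """Return (slowdown, mean_wait_ns) for a 'work, then poll' loop.

    slowdown     : run time relative to the same work without any polling
    mean_wait_ns : average time a remote request waits for the next poll
    """
    cycle_ns = work_ns + poll_cost_ns
    slowdown = cycle_ns / work_ns
    mean_wait_ns = cycle_ns / 2
    return slowdown, mean_wait_ns


if __name__ == "__main__":
    POLL_COST_NS = 1900  # hypothetical cost of one "any message?" check
    for work_ns in (100, 1_000, 10_000, 100_000):
        slowdown, wait_ns = polling_tradeoff(work_ns, POLL_COST_NS)
        print(f"work={work_ns:>7} ns  slowdown={slowdown:6.2f}x  "
              f"mean wait={wait_ns / 1000:7.2f} us")
```

Under these assumptions, polling after every 100 ns of work costs a factor
20 in speed, while polling after every 10000 ns costs about 19% but makes a
remote cpu wait several microseconds on average.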

A factor of 20 is about the slowdown of the average 'old' supercomputer
chess programs that use MPI type solutions: Zugzwang (Paderborn-Siemens),
P.Conners (Paderborn-Siemens), Cilkchess (MIT). I have played against
those programs myself in world championships. It has happened that i
played on the same hardware with a similar number of cpu's, and although
my program has a factor of 100 more chess knowledge (which slows down the
program *considerably*), the actual speed at which it searched nodes was
up to a factor of 5-10 faster.

A few years ago this was not a major problem, because for example
Cilkchess, which obviously ran a factor of 20-40 slower than it could, used
1800 processors in the 1995 world championship (Hong Kong) and 512
processors in the 1999 world championship (Paderborn). And because 1 of
those processors was really fast compared to 1 pc processor in those days,
they were in practice searching a lot deeper than pc programs (and both
played excellently for their day; Don Dailey especially deserves a big
compliment for that).

However if i show up with 2 pc's and 2 network cards, then it sure matters
when i lose a lot of speed. 

Obviously for embarrassingly parallel software this is no issue, but usually
for embarrassingly parallel software all you need is gigabit ethernet. 

There are so many MPI applications that are not exactly embarrassingly
parallel where you can see that a decent programmer would make them run
20 times faster on a single cpu. Or to quote someone who has been doing such
rewriting work for some physical applications that run here and there: "I
didn't blink when i managed to speed up an application by a factor of 1000".

So it is very interesting for us all, and for me especially, to understand
how *fast* you can get that memory when all the logical cpu's are under
full load.

Third, each pc has 2 cheapo k7 processors, which are a lot slower than
opterons.

A further point is that i can easily get dual k7 pc's from chess players,
and they can still be bought cheaply. For DIEP a dual k7 is practically the
same speed as a dual xeon 3.06GHz Northwood with all memory slots filled
with 2-2-2 DIMMs. So just compare the price of such a system with a
cheapo dual k7 with registered cas3 RAM. 

Those dual k7's have 64 bit 66MHz slots, not pci-x as far as i know, and
those who do have A64's or P4's usually don't have pci-x onboard
either. Sure, there are boards that have them, and i'm sure that if you
make a network

Dolphin say on their homepage that they can deliver 'bytes' in 3.3 us on
MPX mainboards, and somewhere they claim a paper latency of 1.x us. 

What remote memory read time does myrinet actually achieve at 64 bits /
66MHz in software, so with the 4-64 bytes ready for the application to use? 

I'm not asking for it to be accurate to within 400ns, as that's the delay
you'll have from TLB thrashing on the remote node. But accuracy to within
1.5 us would be quite nice.

First of all, for the integer intensive applications i'm running, the
fastest processor is the opteron, k7 comes second and P4 comes third. The
exception is a P4 machine equipped with the most expensive stuff (2-2-2 ram
with all banks filled, a good mainboard, and northwoods overclocked on that
mainboard). However, for that price a dual opteron can be bought, and it
just blows away that P4 big time.

With every year of new software releases that P4 of course gets slower,
because newer software only gets more complex, with more options, and
will fit less well in the P4's tiny caches, let alone when we get a lot
of 64 bit programs. They won't fit at all in those tiny slow caches.

So until the dual core opterons arrive at low cost, you can obviously build
dual k7 nodes for just a few hundred dollars a node. 

When adding new nodes, which in the future will no doubt be dual opteron,
you still keep running those dual k7 nodes and obviously want to mix them
with the dual opterons. Is that possible?


>Patrick
>-- 
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>
>


From diep at xs4all.nl  Sun Feb  6 07:10:39 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Sun, 06 Feb 2005 16:10:39 +0100
Subject: [Beowulf] Newbie Question
Message-ID: <3.0.32.20050206161035.01013bd0@pop.xs4all.nl>

At 02:23 5-2-2005 +0700, Monang Setyawan wrote:
>Hi. I'm a newbie in this parallel computing thing. 
>(sorry for my bad english, I'm Indonesian)
>
>My current project is a software that analyze DNA/Protein sequence
>data that needs high performance aspect on it. I plan to deploy this
>software on network of workstations (mm, may be just about 10 PCs on
>the network). Am I in wrong place now?

>I am going to use message passing paradigm (MPI) to write the
>software. I've read that there are several choice of MPI
>implementation. The problem is, I'm bad in both C or Fortran (I
>usually use Java as my favorite language). Some source said that Java
>(or it's MPI wrapper or pure MPI implementation) isn't good enough to
>implement a parallel computing solution. Is that right?

You definitely want to write it in C. 

Basically protein research is usually heavily floating point oriented. (It
might touch a field which is forbidden to research in EU countries but not
in the USA or Israel, and i must admit i'm amazed that it's legal in
Indonesia.) It is mostly calculating what i would classify as matrix
invariants to determine the origins and consequences of modifications.

In C there are superb libraries you will want to consider. Certain
calculations can be sped up big time by FFT, but not always, as sometimes
you just want accurate results and not approximations. C is ideal because
it makes it easier to use SSE2, which is of course what you need. Please
note that both P4 and A64/Opteron have that functionality and Opteron is 2
times faster than P4 there, but perhaps you can get the P4 hardware a
factor of 2 cheaper, which would make it very attractive for such a cluster.

In all cases such software is embarrassingly parallel. Gigabit ethernet is
more than sufficient. Yet making sure the pc's have relatively fast
floating point performance is very relevant. The cheapest gflop per dollar
might well turn out to be surprising hardware. 

A beowulf definitely is ideal for this type of software.

>My third question is, is there any pdf/ps/one file version of
>"Engineering a Beowulf-style Compute Cluster''?
>
>Thanks in advance.
>
>-- 
>For the sake of time..
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From wytsang at clustertech.com  Sun Feb  6 19:24:43 2005
From: wytsang at clustertech.com (Clotho)
Date: Mon, 07 Feb 2005 11:24:43 +0800
Subject: [Beowulf] ifort MPI_FILE_OPEN err with romio testsuite
Message-ID: <4206DF7B.3040705@clustertech.com>

In MPICH-1.2.6, romio directory, there is a test program called 
"fcoll_test.f".
The test program runs successfully with the gcc compiler.
However, with the ifort (8.0/8.1) compiler, the program fails.

After debugging, I found that the call to MPI_FILE_OPEN fails (ierr is 
non-zero).
Changing the size of the character array from 1024 to 200 solves the 
problem.


I have found another person with a similar experience (in Chinese):
http://www.lasg.ac.cn/cgi-bin/forum/view.cgi?forum=4&topic=2519

Here is the full program :
http://clustertech.com/~wytsang/fcoll_test.f

Here is a simpler version of the program:

      program main
      implicit none

      include 'mpif.h'

      integer nprocs
      integer mynod
      integer fh, ierr
      character*1024 str    ! used to store the filename
c     character*200 str    ! this will work

      integer writebuf(1)


      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mynod, ierr)


      str = 'test'
c     writebuf has bounds 1:1, so index 1 (not 0) is the valid element
      writebuf(1) = 0

      call MPI_FILE_OPEN(MPI_COMM_WORLD, str,                           &
     &     MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)
      print *,ierr

      call MPI_FINALIZE(ierr)

      stop
      end


From patrick at myri.com  Mon Feb  7 00:11:58 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 07 Feb 2005 03:11:58 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
Message-ID: <420722CE.5010408@myri.com>

Vincent,

Vincent Diepeveen wrote:
> Thanks for your kind answer Patrick,
> 
> Obviously i mentioned that number because i read it elsewhere.

I know, I have seen worse.

> Note that so far i didn't find any desperate vendor. For sure quadrics
> doesn't look desperate to me, they aren't even selling old cards anymore
> though they must have still thousands of them lying at home from returned
> upgraded networks. Finding second hand highend cards seems to be very seldom.

Tip: desperate companies are usually young and spend a lot of VC money 
on marketing. Quadrics does not fit, I am afraid, they have been around 
too long :-) Furthermore, selling old hardware is not very cost 
effective for a vendor: compatibility troubles with newer machines, 
the need to support old hardware in new drivers and new middleware, 
tapping into inventory reserved for replacement parts, etc.

> First of all i'm interested in how quick i can get 4-64 bytes from remote
> memory. So not from some kind of network card cache, as myrinet doesn't
> have some megabytes on chip, but just a few tens of kilobytes. The memory
> has to come therefore from the remote nodes main memory, at a random address
> in the main memory. No streaming at all happens. that 400 ns extra that the
> TLB gives is definitely not the problem i guess.

Myrinet has 2 MB of SRAM as standard, used by firmware code, data and 
buffers.

What you want to do basically is a Get. In practice, the origin of the 
Get will send a small packet with a virtual address or a RDMA handle and 
an offset, the NIC on the target side converts it into a physical address, 
fetches the data by DMA and sends it back to the origin side.

> All latencies i see quoted at all hardware sites, it is very hard to figure
> for me out whether that's a latency that is supported by paper, or whether
> it's a practical latency i can take into account as a programmer with all
> software layers overhead when each cpu is 100% running a program.

No, it's not likely to fit your usage. Vendors quote MPI latency on 
pingpong. That's pretty much the cost of sending/receiving an MPI 
message from user space to user space. Often, this is also with only 2 
nodes, optimal conditions and everybody holding their breath.

You want RMA Get. The latency for a Get is larger than for an MPI send. 
For 64 bytes, it is basically the MPI latency for 0 bytes (for the Get 
request) + the latency for 64 bytes (for the reply). Assuming that you 
don't Get all over the host memory, the virtual/physical translation 
will be hot in the target NIC so the translation cost will be very 
small. You want less than 3us per Get of 64 Bytes ? I don't know if even 
Quadrics can do it. The good news is that you can pipeline it very well. 
So it may cost more than 3 us for one Get, but you may complete a Get 
every 0.5 us if you post a bunch of them.
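
As a rough sketch of that arithmetic, assuming the illustrative figures 
above (about 3 us for one blocking Get and one completion every 0.5 us 
once the pipeline is full; both are ballpark numbers from this mail, not 
measurements), the cost of N Gets could be modelled like this. The 
function names are made up for the example.

```python
def blocking_total_us(n_gets, get_latency_us):
    """Total time when each Get waits for the previous one to complete."""
    return n_gets * get_latency_us


def pipelined_total_us(n_gets, get_latency_us, completion_interval_us):
    """Total time when all Gets are posted up front.

    The first Get pays the full latency; every later one is assumed to
    complete one completion interval after its predecessor.
    """
    if n_gets <= 0:
        return 0.0
    return get_latency_us + (n_gets - 1) * completion_interval_us


if __name__ == "__main__":
    # 3.0 us and 0.5 us are the illustrative figures, not measured values.
    for n in (1, 10, 100):
        print(n, blocking_total_us(n, 3.0), pipelined_total_us(n, 3.0, 0.5))
```

Under these assumptions 100 Gets cost 300 us when issued one at a time, 
but about 52.5 us when they are all posted before waiting.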

> Secondly, but as i'm not a cluster expert i don't know how to avoid that,
> it's of course a big LOSS in sequential speed if my program each few
> instructions must check whether there is some MPI message to get handled.

If you want perfect overlap and if you are ready to go as low level as 
possible, one-sided communications are for you (no host CPU involved on 
the target side). All low level communication interfaces support 
one-sided communications (not yet released for MX on Myrinet, but GM has 
it).

> However if i show up with 2 pc's and 2 network cards, then it sure matters
> when i lose a lot of speed. 
> 
> Obviously for embarrassingly parallel software this is no issue, but usually
> for embarrassingly parallel software all you need is gigabit ethernet. 

If you can and know how to overlap, latency is irrelevant. It's hard to 
do on complex irregular codes, but you can usually do it if you can use 
one-sided communications. Don't put your communications in the critical 
path. Post them early and post many of them concurrently, pipelining 
will hide the latency of the critical path.
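
The "post early, wait late" pattern can be sketched as follows. This is 
only a host-side simulation: threads and a sleep stand in for the NIC and 
the network, and remote_get, REMOTE_MEMORY and overlapped_sum are 
hypothetical names, not part of GM, MX or MPI.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for memory on a remote node.
REMOTE_MEMORY = {addr: addr * 2 for addr in range(16)}


def remote_get(addr, latency_s=0.001):
    """Pretend to fetch one word from a remote node (sleep = network latency)."""
    time.sleep(latency_s)
    return REMOTE_MEMORY[addr]


def overlapped_sum(addrs, local_work):
    """Post every Get first, do local work, and only then wait for replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(addrs))) as pool:
        pending = [pool.submit(remote_get, a) for a in addrs]  # post early
        local = local_work()                                   # overlap
        remote = sum(f.result() for f in pending)              # wait late
    return local + remote


if __name__ == "__main__":
    print(overlapped_sum([1, 2, 3], lambda: 10))
```

The point of the structure is that nothing waits on a reply until the 
local work is done, so the Gets stay off the critical path.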

That's why desperate vendors use pipelined pingpong to get better curves.

> There are so many MPI applications which are not exactly embarrassingly
> parallel, where you can see that a decent programmer on a single cpu would be
> doing that 20 times faster. Or to quote someone who has been doing such

Most of the time, you go parallel to go bigger, not faster. If the 
problem size fits in one node, don't use a cluster, use a 
multi-processor node. You will get more bang for your buck.

> So it is very interesting for us all and me especially to understand how
> *fast* you can get that memory under full load of all the logical cpu's.

Using one-sided communications, there is little difference if the CPUs 
are loaded or not on the target side.

> Third each pc has 2 cheapo k7 processors which are a lot slower than opterons.

IO bus is more important for the communications part. I don't know of 
cheapo k7 machines with a decent PCI bus. However, for 64 bytes, even a 
cheesy PCI will not slow things down that much.

> Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX
> mainboards and claim somewhere a paper latency of 1.x us.

How long can you hold your breath ?

> What is the achieved read speed to remote memory myrinet gets at 64 bits /
> 66Mhz in software, so ready to use 4-64 bytes for applications? 

I have no idea, I am not even sure that I have a 64-bit/66 MHz machine 
around to measure it. With GM, I would say at least 10 us. Certainly more.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Mon Feb  7 01:48:22 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 07 Feb 2005 04:48:22 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>
Message-ID: <42073966.7090009@myri.com>

Hi Duncan,

duncan.roweth at quadrics.com wrote:
> This example reports the average time for 1000
> blocking get calls. Patrick's description of the
> mechanism is essentially correct, apart from the
> detail that we have a fast path for short operations
> that avoids the need to set up a DMA. 

How can you do one-sided operations without a DMA on the target side ?!?

The only way that I can think of is to map the host virtual memory into 
the NIC memory space and let all memory writes generate PIO writes to 
actually modify the NIC memory. Surely, you must be talking about 
another DMA.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From john.hearns at streamline-computing.com  Mon Feb  7 01:50:58 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Mon, 07 Feb 2005 09:50:58 +0000
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420722CE.5010408@myri.com>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
	<420722CE.5010408@myri.com>
Message-ID: <1107769858.12606.57.camel@Vigor45>

On Mon, 2005-02-07 at 03:11 -0500, Patrick Geoffray wrote:
> Vincent,
> 

> Tip: desperate companies are usually young and spend a lot of VC money 
> on marketing. Quadrics does not fit, I am afraid, they have been around 
> too long :-) Furthermore, selling old hardware is not very cost 
> effective for a vendor: compatibility troubles with newer machines, 
> the need to support old hardware in new drivers and new middleware, 
> dipping into inventory reserved for replacement parts, etc.

Why would Quadrics have old/second hand hardware to sell anyway?
If they have older model cards unsold they would be holding them as
spares for customers who are still running those models, as Patrick
says.

Clusters which have been upgraded or scrapped are unlikely to be
returned to Quadrics/Myricom.
Clusters are usually bought as completely integrated systems, from
companies such as ourselves. We install and configure the Myrinet
networking for customers - they don't buy direct from Myricom.
And, like many companies on this list, we provide continuing support
and advice.


So I'd say there is no conspiracy against you - if you are seeking
second hand high performance networking gear, look on eBay or ask nicely
on this list.
I was surprised recently to see small fibre channel switches go very
cheaply on eBay - not so long ago you would pay $$$$ for them.


From jcownie at etnus.com  Mon Feb  7 08:26:29 2005
From: jcownie at etnus.com (James Cownie)
Date: Mon, 07 Feb 2005 16:26:29 +0000
Subject: [Beowulf] Home beowulf - NIC latencies 
In-Reply-To: Message from Patrick Geoffray <patrick@myri.com> 
	of "Mon, 07 Feb 2005 04:48:22 EST." <42073966.7090009@myri.com> 
References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>
	<42073966.7090009@myri.com> 
Message-ID: <20050207162629.C478D1C826@amd64.cownie.net>


> Patrick Geoffray <patrick at myri.com> wrote:
> duncan.roweth at quadrics.com wrote:
> > This example reports the average time for 1000
> > blocking get calls. Patrick's description of the
> > mechanism is essentially correct, apart from the
> > detail that we have a fast path for short operations
> > that avoids the need to set up a DMA.
> 
> How can you do one-sided operations without a DMA on the target side ?!?
> 
> The only way that I can think of is to map the host virtual memory
> into the NIC memory space and let all memory writes generate PIO
> writes to actually modify the NIC memory. Surely, you must be talking
> about another DMA.

I think you're talking at cross-purposes.

Patrick is right that in the target machine there is a DMA operation
initiated by the NIC.

However Duncan is saying that Quadrics don't send a DMA request packet
over their network, but have a more optimised less general request that
they can issue without having to build a full DMA descriptor in the host
machine and transfer it to the target. 

Therefore in Quadrics' terms no DMA operation is sent over the net,
whereas from Patrick's viewpoint a DMA operation _does_ occur.

-- 
-- Jim
--
James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From duncan.roweth at quadrics.com  Mon Feb  7 01:29:15 2005
From: duncan.roweth at quadrics.com (duncan.roweth at quadrics.com)
Date: Mon, 7 Feb 2005 09:29:15 -0000
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>

Patrick, Vincent

Some input into your discussion. Here is the data
on get latency for Elan4 in an Opteron cluster

quorumi: prun -N2 pgping -f get 0 64
  1:        4 bytes      2.36 uSec     1.69 MB/s
  1:        8 bytes      2.35 uSec     3.40 MB/s
  1:       16 bytes      2.38 uSec     6.73 MB/s
  1:       32 bytes      2.37 uSec    13.50 MB/s
  1:       64 bytes      2.43 uSec    26.30 MB/s

This example reports the average time for 1000
blocking get calls. Patrick's description of the
mechanism is essentially correct, apart from the
detail that we have a fast path for short operations
that avoids the need to set up a DMA. 

You can probably do a bit better on the very fastest
nodes, but this is what I see on the system we have 
in the office. 
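
The MB/s column in the pgping output above looks like message size 
divided by average latency. A quick check, assuming 1 MB means 10**6 
bytes (pgping's exact definition is not stated here), reproduces the 
column to within the rounding of the printed latencies:

```python
def bandwidth_mb_per_s(nbytes, latency_us):
    """Bytes per microsecond equals MB/s when 1 MB = 10**6 bytes."""
    return nbytes / latency_us


# (bytes, average latency in us) copied from the pgping output above
SAMPLES = [(4, 2.36), (8, 2.35), (16, 2.38), (32, 2.37), (64, 2.43)]

if __name__ == "__main__":
    for nbytes, latency_us in SAMPLES:
        mb_s = bandwidth_mb_per_s(nbytes, latency_us)
        print(f"{nbytes:3d} bytes  {latency_us:.2f} uSec  {mb_s:6.2f} MB/s")
```

For these short gets the bandwidth figure carries no extra information; 
it is the latency restated.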

> You want less than 3us per Get of 64 Bytes ? I don't 
> know if even Quadrics can do it. 

Yes we can!

> The good news is that you can pipeline it very well. 

Indeed. There is lots of parallelism in the hardware
so you can be processing multiple requests at the same
time. In this sequence of short jobs I measure the 
average time for 8 byte gets 2 at a time, 4 at a time
etc. 

quorumi: prun -N2 pgping -f get -b2 8
  1:        8 bytes      1.32 uSec     6.07 MB/s
quorumi: prun -N2 pgping -f get -b4 8
  1:        8 bytes      1.04 uSec     7.66 MB/s
quorumi: prun -N2 pgping -f get -b8 8
  1:        8 bytes      0.84 uSec     9.47 MB/s
quorumi: prun -N2 pgping -f get -b16 8
  1:        8 bytes      0.82 uSec     9.79 MB/s
quorumi: prun -N2 pgping -f get -b32 8
  1:        8 bytes      0.79 uSec    10.18 MB/s

The limiting factor is the rate at which the remote NIC
can read data over the PCI bus. 

Best Wishes
Duncan Roweth
Quadrics Limited


P.S. Clearly our sales people focus on the current product 
(Elan4 NICs) but we will be supporting the installed base
of Elan3 systems for some years yet. Most of the big systems
have extended warranties, so we keep a stock of spares, but 
there are a few hundred adapters and associated switches. 
Drop us some mail if you are interested.  


From duncan.roweth at quadrics.com  Mon Feb  7 02:01:26 2005
From: duncan.roweth at quadrics.com (duncan.roweth at quadrics.com)
Date: Mon, 7 Feb 2005 10:01:26 -0000
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505E@exch01.quadrics.com>

Patrick

Thanks for your mail. 

> How can you do one-sided operations without a DMA on the 
> target side ?!?

Gets are done by telling the remote adapter to perform 
a put back to the source. This can be a request to start
a DMA (for large transfers) or it can be a request to 
the Short Transaction ENgine (STEN).

The STEN is a fast path for short puts that can be used 
from either the main CPU or from the adapter. It can 
generate network packets from a stream of commands and
data written either by the main CPU (as PIO writes) or
directly by the adapter. 
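
A toy model of that flow, for illustration only: the class, the method 
names and the 64-byte cut-over are all invented for this sketch and say 
nothing about how Elan4 is actually programmed. It shows a get expressed 
as a request for the remote side to put the data back, with the engine 
chosen by transfer size.

```python
SHORT_LIMIT = 64  # hypothetical cut-over size in bytes; not a Quadrics figure


class Adapter:
    """Toy model of a NIC attached to one node's memory."""

    def __init__(self, size):
        self.memory = bytearray(size)

    def put(self, dest, dest_off, src_off, nbytes):
        """Copy local bytes into dest and report which engine is modelled."""
        engine = "sten" if nbytes <= SHORT_LIMIT else "dma"
        dest.memory[dest_off:dest_off + nbytes] = \
            self.memory[src_off:src_off + nbytes]
        return engine

    def get(self, target, remote_off, local_off, nbytes):
        """A get is a request asking the target to put the data back to us."""
        return target.put(self, local_off, remote_off, nbytes)


if __name__ == "__main__":
    origin, target = Adapter(256), Adapter(256)
    target.memory[8:12] = b"abcd"
    print(origin.get(target, 8, 0, 4), bytes(origin.memory[0:4]))
```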

There are more details in the "Hot Chips" paper that 
we wrote with Fabrizio Petrini of Los Alamos. 

http://www.c3.lanl.gov/~fabrizio/papers/hot03.pdf


Best Wishes
Duncan Roweth
Quadrics Limited


From rcmanglekar at rediffmail.com  Mon Feb  7 06:17:05 2005
From: rcmanglekar at rediffmail.com (Rahul Manglekar)
Date: 7 Feb 2005 14:17:05 -0000
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
Message-ID: <20050207141705.26471.qmail@webmail29.rediffmail.com>

  
hi all..,

i have set up a LAM-MPI cluster on 3 machines for testing.

i want to put mysql on the cluster,
such that if mysql needs more processor power,
it can use the processor power of all nodes that are present in the cluster.

i am using MySQL-4.0.

can u guide me please..

thank you in advance..


-- Rahul..

From mark.westwood at ohmsurveys.com  Mon Feb  7 06:39:42 2005
From: mark.westwood at ohmsurveys.com (Mark Westwood)
Date: Mon, 07 Feb 2005 14:39:42 +0000
Subject: [Beowulf] Newbie Question
In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
Message-ID: <42077DAE.8020806@ohmsurveys.com>

Hi Monang

Here's my contribution to your decision about which language you program 
in for your cluster:

Suppose that you know Java well, but not C.  Suppose that it will take 
you 6 months to learn C well enough to be able to write your programs in 
it.  In those 6 months you can do an awful lot of computing in Java.  If 
your project is intended to last, say, 9 months, then you might decide 
that you will program in Java because you will get more computing done 
that way than by learning a new language.

If your project will last much longer then you might decide that 
learning C will be of benefit, because each program will be faster in C 
than in Java.  If you're doing some calculations then I'd suggest that 
you allow C to be 5 times faster than Java on average for cluster-type 
computing.  Some will tell you that it is more than 10 times as fast 
(and it is for some types of computation), others that it is no faster 
(which is true for some types of computation).

Another issue (or problem, if you look at things that way) with Java is 
that the implementations of MPI for Java are non-standard and not as 
widely used as the implementations for C.  You might find it difficult, 
therefore, to get good support from groups such as this one, for a Java 
/ MPI program.


To sum up:

If you can write good Java programs to solve your problems on your 
cluster then you should prefer that to writing bad C (or Fortran) 
programs.  If you find that your Java program is not fast enough then 
you might think about rewriting parts of it in C (or another compiled 
language) to achieve specific performance improvements.


Hope this helps

Mark


Monang Setyawan wrote:
> Hi. I'm a newbie in this parallel computing thing. 
> (sorry for my bad english, I'm Indonesian)
> 
> My current project is software that analyzes DNA/Protein sequence
> data and needs high performance. I plan to deploy this
> software on a network of workstations (mm, maybe just about 10 PCs on
> the network). Am I in the wrong place now?
> 
> I am going to use the message passing paradigm (MPI) to write the
> software. I've read that there are several choices of MPI
> implementation. The problem is, I'm bad at both C and Fortran (I
> usually use Java as my favorite language). Some sources said that Java
> (or its MPI wrapper or pure MPI implementation) isn't good enough to
> implement a parallel computing solution. Is that right?
> 
> My third question is, is there any pdf/ps/one file version of
> "Engineering a Beowulf-style Compute Cluster"?
> 
> Thanks in advance.
> 

-- 
Mark Westwood
Parallel Programmer
OHM Ltd
The Technology Centre
Offshore Technology Park
Claymore Drive
Aberdeen
AB23 8GD
United Kingdom

+44 (0)870 429 6586
www.ohmsurveys.com


From deadline at clusterworld.com  Mon Feb  7 07:34:44 2005
From: deadline at clusterworld.com (Douglas Eadline, Cluster World Magazine)
Date: Mon, 7 Feb 2005 10:34:44 -0500 (EST)
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <20050204202023.GA32459@piskorski.com>
Message-ID: <Pine.LNX.4.44.0502071023020.11845-100000@boltzmann>

On Fri, 4 Feb 2005, Andrew Piskorski wrote:

> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> 
> > Please note MPI is probably what i'll use, though i keep finding
> > online information about 'gamma'. Is that faster latency than MPI
> > implementations?
> 
>   http://www.disi.unige.it/project/gamma/
> 
> Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000
> gigabit ethernet cards, for use with MPI.  It offers much better
> latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us),
> but worse than the specialized HPC interconnects (maybe 3 us).

The "60-100 us" is incorrect. With proper tuning an e1000 can get 25 us 
latency (using netpipe). (See Jossip's post about tuning parameters.)
Oh, and by the way this was using a 32-bit PCI desktop card. 

A low latency number is not the whole story however, processor load
is another issue. The point is that tuning can make a difference. Default
values are usually set for maximum throughput and low CPU overhead.

It all depends on what your application needs. If you need GAMMA,
then that is a good choice, but many applications may work well
with proper tuning of NIC parameters.

As an aside, Netgear used to sell a low cost desktop NIC
(GA302T-tigon3/Broadcom) which had very good numbers as well.
I profiled this NIC in the first issue of ClusterWorld. 


Doug

> 
> The attraction of GAMMA, is that Intel Pro/1000 cards can be had for
> $11 to $60 or so each (depending on exact model, etc.), and gigabit
> switches are also pretty cheap, while SCI or Myrinet is somewhere in
> the $500 to $1500 per node range (I don't keep track).
> 
> So if your application can benefit from lower latency, but you want
> something really cheap, GAMMA should be well worth trying.
> 
> 

-- 
----------------------------------------------------------------
Editor-in-chief                   ClusterWorld Magazine
Desk: 610.865.6061                            
Fax:  610.865.6618                www.clusterworld.com


From rokrau at yahoo.com  Mon Feb  7 09:01:38 2005
From: rokrau at yahoo.com (Roland Krause)
Date: Mon, 7 Feb 2005 09:01:38 -0800 (PST)
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
Message-ID: <20050207170138.72473.qmail@web52907.mail.yahoo.com>

I am trying to dynamically allocate memory for a Fortran-77 code that
is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a
kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to
be allocated in one large chunk. 

The problem is that malloc returns an address that is way beyond
8 billion, which is not what I had expected.

Does anybody know why Linux gives me an address that is outside the physical
memory range? 

Does anybody know whether there are any kernel parameters that affect this
behavior? 

Any pointers to some good reading about the Linux VM would also be
appreciated. 


Regards
Roland




From kus at free.net  Mon Feb  7 09:37:17 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Mon, 07 Feb 2005 20:37:17 +0300
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
In-Reply-To: <20050207141705.26471.qmail@webmail29.rediffmail.com>
Message-ID: <web-511567@free.net>

In message from "Rahul Manglekar" <rcmanglekar at rediffmail.com> (7 Feb 
2005 14:17:05 -0000):
>  
>hi all..,
>
>i have set up a LAM-MPI cluster on 3 machines for testing.
>
>i want to put mysql on the cluster,
>such that if mysql needs more processor power,
>it can use the processor power of all nodes that are present in the cluster.
   No. Standard MySQL cannot use cluster nodes in parallel.
But you may work with special software which lets you split your 
database between cluster nodes. You can find the corresponding
information on the MySQL site, or search the Beowulf mailing list archive
(if I remember right, there was some discussion of databases on clusters
here).
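
To give an idea of what "splitting a database between nodes" means, here 
is a generic sketch of hash partitioning. It is not how any particular 
MySQL product works, and shard_for and partition are names invented for 
the example; it only shows rows being assigned to nodes by key.

```python
import zlib


def shard_for(key, n_nodes):
    """Map a row key to one of n_nodes using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def partition(keys, n_nodes):
    """Group keys by the node that would hold them."""
    shards = {node: [] for node in range(n_nodes)}
    for key in keys:
        shards[shard_for(key, n_nodes)].append(key)
    return shards


if __name__ == "__main__":
    # Three nodes, as in the test cluster described in the question.
    print(partition(["alice", "bob", "carol", "dave"], 3))
```

A query that touches one key then goes to a single node, while a query 
over all rows has to be sent to every node and the results merged.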

Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow  
>
>i am using MySQL-4.0.
>
>can u guide me please..
>
>thank you in advance..
>
>
>-- Rahul..


From James.P.Lux at jpl.nasa.gov  Mon Feb  7 09:55:34 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Mon, 07 Feb 2005 09:55:34 -0800
Subject: [Beowulf] Newbie Question
In-Reply-To: <42077DAE.8020806@ohmsurveys.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
	<42077DAE.8020806@ohmsurveys.com>
Message-ID: <6.1.1.1.2.20050207093914.027efd38@mail.jpl.nasa.gov>

At 06:39 AM 2/7/2005, Mark Westwood wrote:
>Hi Monang
>
>Here's my contribution to your decision about which language you program 
>in for your cluster:
>
>Suppose that you know Java well, but not C.  Suppose that it will take you 
>6 months to learn C well enough to be able to write your programs in 
>it.  In those 6 months you can do an awful lot of computing in Java.  If 
>your project is intended to last, say, 9 months, then you might decide 
>that you will program in Java because you will get more computing done 
>that way than by learning a new language.
>
>If your project will last much longer then you might decide that learning 
>C will be of benefit, because each program will be faster in C than in 
>Java.  If you're doing some calculations then I'd suggest that you allow C 
>to be 5 times faster than Java on average for cluster-type 
>computing.  Some will tell you that it is more than 10 times as fast (and 
>it is for some types of computation), others that it is no faster (which 
>is true for some types of computation).

I would agree with Mark.  I've been faced with a similar decision: do we do 
the calculations in Excel using Visual Basic for Applications (VBA), Visual 
Basic, or C++, or Matlab, or something else.  Various pieces of the puzzle 
exist in all of these, so the problem is do we translate (for example) the 
VB into C, or, glue it all together with scripts, or rewrite from 
scratch.  Complicating this is that the people available to work on it have 
various skill sets which don't map well to any of the approaches (how many 
people do YOU know who are equally facile in VBA, C++, and Matlab??).

In our case, the goal was to demonstrate that a particular capability can 
exist at all, versus making it really fly, so we went with the cobbled 
together scripts.  It might turn out, after all, that the speed of the 
software isn't the "rate determining" factor, but that availability of 
staff is.


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From rene at renestorm.de  Mon Feb  7 09:53:02 2005
From: rene at renestorm.de (rene)
Date: Mon, 7 Feb 2005 18:53:02 +0100
Subject: [Beowulf] mpich future
Message-ID: <200502071853.02126.rene@renestorm.de>

Hi folks,

there are many MPI implementations out there, but which one is "the best"?
As far as I know, there are commercial products which support different 
hardware in one library (e.g. Myrinet + Ethernet), which is a nice feature.

Is there a working mpich which unites the common channels?
Score did that once, but it's been a year since I've worked with it.

In addition to that I've run into trouble with the different standards (1.2, 
2.0). 
It seems to me that Openmpi is gaining influence. Is that right?

I don't feel like putting 20 different preprocessor checks in my applications, 
like
#if MPI_VERSION > 1
for each of those implementations.

So my question is:
In which direction is MPI heading?

Cu

-- 
Rene Storm
@Cluster


From daniel.kidger at quadrics.com  Mon Feb  7 10:18:33 2005
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 7 Feb 2005 18:18:33 -0000
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>

Roland,

Sigh!  :-)

malloc can return any address it so wishes. 
Don't forget that this is a *virtual* address and so is not bounded by physical memory.

A 64-bit O/S with, say, 8GB RAM can easily have stack addresses in the window 1TB - 2TB, and heap addresses even higher (!)

I guess your real problem is that you are porting a (Fortran) program whose authors did not anticipate that it might ever run on a 64-bit machine. Your code does a malloc and then tries to store the result in an I4 Fortran integer. This is only guaranteed to work on a 32-bit architecture such as a Pentium.

So Solutions?
  1. Since this is x86_64, simply compile your program with a 32-bit compiler.
     You can still run it under the 64-bit O/S.
  2. Mend your application to store addresses in I8 variables, but keep I4 for other stuff if you wish. 
  3. (dubious) Only save the lower 32 bits of the addresses in your I4 variables and then, when they are used, add the known offset to yield the original 64-bit address. The offset is likely to be constant for all variables in your program, but ymmv.
  4. Port your code away from using malloc() altogether. Recently (well, make that 15 years ago), Fortran gained its own dynamic memory allocation: the ALLOCATE statement.
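
To make the I4 problem and option 3 concrete, here is a small sketch. The 
address is a made-up example of a heap address above 4GB, and 
store_in_i4 / restore_from_i4 are hypothetical helper names; the point is 
only what survives a trip through a signed 32-bit integer.

```python
import ctypes

LOW_MASK = 0xFFFFFFFF


def store_in_i4(address):
    """What is left when a 64-bit address is squeezed into a signed 32-bit integer."""
    return ctypes.c_int32(address & LOW_MASK).value


def restore_from_i4(i4_value, high_offset):
    """Option 3: add the known upper 32 bits back to the saved lower 32 bits."""
    return high_offset + (i4_value & LOW_MASK)


if __name__ == "__main__":
    address = 0x2AAAAAC00010           # made-up heap address above 4GB
    high_offset = address & ~LOW_MASK  # upper bits, assumed constant
    i4 = store_in_i4(address)
    print(hex(address), i4, hex(restore_from_i4(i4, high_offset)))
```

The stored I4 value on its own is useless as a pointer; it only becomes 
an address again if every allocation really does share the same upper 
32 bits, which is why option 3 is marked dubious.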
 

Hope this helps,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


> -----Original Message-----
> From: Roland Krause [mailto:rokrau at yahoo.com]
> Sent: 07 February 2005 17:02
> To: beowulf at beowulf.org
> Subject: [Beowulf] memory allocation on x86_64 returning huge 
> addresses
> 
> 
> I am trying to dynamically allocate memory for a Fortran-77 code that
> is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a
> kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to
> be allocated in one large chunk. 
> 
> The problem is that malloc returns an address that is way beyond
> 8 billion, which is not what I had expected.
> 
> Does anybody know why Linux gives me an address that is outside 
> the physical
> memory range? 
> 
> Does anybody know whether there are any kernel parameters that affect this
> behavior? 
> 
> Any pointers to some good reading about the Linux VM would also be
> appreciated. 
> 
> 
> Regards
> Roland
> 


From daniel.kidger at quadrics.com  Mon Feb  7 10:34:28 2005
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 7 Feb 2005 18:34:28 -0000
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812105@exch01.quadrics.com>

Duncan wrote (in reply to Patrick)
> > The good news is that you can pipeline it very well. 
> 
> Indeed. There is lots of parallelism in the hardware
> so you can be processing multiple requests at the same
> time. In this sequence of short jobs I measure the 
> average time for 8 byte gets 2 at a time, 4 at a time
> etc. 
> 
> quorumi: prun -N2 pgping -f get -b2 8
>   1:        8 bytes      1.32 uSec     6.07 MB/s
> quorumi: prun -N2 pgping -f get -b4 8
>   1:        8 bytes      1.04 uSec     7.66 MB/s
> quorumi: prun -N2 pgping -f get -b8 8
>   1:        8 bytes      0.84 uSec     9.47 MB/s
> quorumi: prun -N2 pgping -f get -b16 8
>   1:        8 bytes      0.82 uSec     9.79 MB/s
> quorumi: prun -N2 pgping -f get -b32 8
>   1:        8 bytes      0.79 uSec    10.18 MB/s


Or for those that distrust quoting pure powers of two in benchmarks and/or know too much bash:

[dan at quorumi]$ for ((i=1,j=1;$i<999;i=$i+$j,j=$i)) ;do echo -ne "pipelining \t$i:\t"; prun -N2 pgping -f get -b$i 64|cut -c20-35; done
pipelining      1:            2.39 uSec
pipelining      2:            1.38 uSec
pipelining      3:            1.24 uSec
pipelining      5:            1.05 uSec
pipelining      8:            0.92 uSec
pipelining      13:           0.91 uSec
pipelining      21:           0.86 uSec
pipelining      34:           0.80 uSec
pipelining      55:           0.78 uSec
pipelining      89:           0.79 uSec
pipelining      144:          0.77 uSec
pipelining      233:          0.73 uSec
pipelining      377:          0.77 uSec
pipelining      610:          0.78 uSec
pipelining      987:          0.77 uSec

Note that the above is for 64 *byte* reads which iirc is what Vincent was targeting.
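
For anyone puzzling over the one-liner: judging by the depths it prints, 
the $i and $j in the update step are expanded before the arithmetic is 
evaluated, so j picks up the previous value of i and the -b values walk 
the Fibonacci sequence. A sketch of the same stepping (pipeline_depths is 
just a name for the example):

```python
def pipeline_depths(limit=999):
    """Reproduce the -b values the bash loop steps through."""
    i, j = 1, 1
    depths = []
    while i < limit:
        depths.append(i)
        i, j = i + j, i  # j takes the previous value of i
    return depths


if __name__ == "__main__":
    print(pipeline_depths())
```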

Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From lindahl at pathscale.com  Mon Feb  7 10:41:08 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 7 Feb 2005 10:41:08 -0800
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
Message-ID: <20050207184108.GA1364@greglaptop.internal.keyresearch.com>

> The problem is that malloc returns an address that is way beyond
> 8 billion, which is not what I had expected.

This e-vile hack makes it produce something lower in memory. What it does
is turn off the feature of glibc's malloc algorithm that has it mmap() large
malloc()s. Stuff it into a .c file and link the .o into your application.

-- greg

#include <malloc.h>

/* glibc calls the function this hook points to once, before the
   first malloc(), so the option is set before any allocation. */
static void mem_init_hook(void);
void (*__malloc_initialize_hook)(void) = mem_init_hook;

static void mem_init_hook(void)
{
  /* Ask glibc not to service large requests with mmap(). */
  mallopt (M_MMAP_MAX, 0);
}


From rross at mcs.anl.gov  Mon Feb  7 11:07:43 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 7 Feb 2005 13:07:43 -0600 (CST)
Subject: [Beowulf] mpich future
In-Reply-To: <200502071853.02126.rene@renestorm.de>
References: <200502071853.02126.rene@renestorm.de>
Message-ID: <Pine.LNX.4.58.0502071240400.29537@terra.mcs.anl.gov>

Hi Rene,

You are right that there are a decent number of MPI implementations out 
there, all with their pros and cons.  There is no "best" implementation, 
and in fact I would say that the existence of multiple implementations is 
helpful to the community by providing (a) multiple takes on how to build 
these libraries, and (b) competition between the implementations to be the 
"best" at what they think is most important.

I'm not sure what you mean by "trouble with the different standards"?  All 
implementations should at this point be striving for complete 2.0 
compliance, and there are very few things from 1.x that won't work in a 
2.0 compliant system (the group defining the standard went to great pains, 
as do the developers, to maintain this compatibility).  So you shouldn't 
need those preprocessor variables.  What functionality are you finding 
that you need to test for?

I would say that at this time MPICH2 has as much influence as any 
implementation, because it is being used as the basis for multiple Cray 
platform implementations, the IBM BG/L implementation, the OSU IB 
implementation, and of course as-is on Windows, OS X, and Linux clusters.  

Of course I am part of the MPICH2 team, so I am biased :).

OpenMPI will undoubtedly be an influential member of the MPI community 
once the software is made widely available.  That group also has a 
collection of developers with very good track records in this area, and I 
look forward to being able to compare and contrast the designs and 
resulting performance.

The big buzz in the MPI world right now is fault tolerance.  I think this 
topic is going to be a hot one for some time, and there are definitely 
differences of opinion on how the MPI implementation should deal with 
faults and to what degree and how users should be made aware of failures, 
both transient and catastrophic.

Less visible, but at least as important, is figuring out how best to 
implement the one-sided (RMA) operations that are part of MPI 2.0.  My 
colleague Rajeev Thakur has (in my opinion) done an excellent job of 
these, building in part on concepts from the BSP system of old.

Figuring out how to make collectives as efficient as possible on new, very 
large machines is also extremely important for those that have access to 
these new machines.  Gheorghe Almasi from IBM had an excellent paper 
discussing collectives on the BG/L machine in last year's EuroPVM/MPI 
conference.

Rolf Rabenseifner and Jesper Traff both presented improvements to
collective algorithms as well.  These two were iterative improvements I'd
say, so less exciting in some sense, but it is critical that we make these
algorithms as efficient as possible, given the scale of upcoming systems.

If you are really interested in what is happening in MPI, the best place
by far to look is the EuroPVM/MPI series of conferences and their
proceedings.  This is where everyone that is serious about MPI
implementations is publishing and going to talk with colleagues, and every 
year the conference attendee list is literally a list of the most 
knowledgeable MPI developers in the world (and hangers-on such as myself).

Regards,

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


On Mon, 7 Feb 2005, rene wrote:

> there are many mpi implementations out there, but which one is "the best"?
> As far as I know, there are commercial products which support different 
> hardware in one library (eg myrinet + ethernet).  Which is a nice feature.
> 
> Is there a working mpich which unites the common channels?
> Score did that once, but it's been a year since I've worked with it.
> 
> In addition to that I've run into trouble with the different standards (1.2, 
> 2.0). 
> It seems to me that Openmpi is gaining influence. Is that right?
> 
> I don't feel like putting 20 different preprocessor variables in my applications, 
> like
> #if MPI_VERSION > 1
> for each of those implementations.
> 
> So my question is:
> In which direction is mpi going tomorrow?
> 
> Cu
> 
> -- 
> Rene Storm
> @Cluster


From mwill at penguincomputing.com  Mon Feb  7 11:11:46 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Mon, 07 Feb 2005 11:11:46 -0800
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
In-Reply-To: <web-511567@free.net>
References: <web-511567@free.net>
Message-ID: <4207BD72.1070505@penguincomputing.com>

MySQL-4.1 has cluster support according to 
http://dev.mysql.com/downloads/cluster/
but I have not checked out how and what. In any case I would expect it
to NOT use MPI for anything.

Michael

Mikhail Kuzminsky wrote:

> In message from "Rahul Manglekar" <rcmanglekar at rediffmail.com> (7 Feb 
> 2005 14:17:05 -0000):
>
>>  
>> hi all..,
>>
>> i have set up a LAM-MPI cluster on 3 machines for testing.
>>
>> i want to put mysql on the cluster, such that if mysql needs more 
>> processor power, it can use the processor power of all nodes that are 
>> present in the cluster.
>
>   No. Standard MySQL isn't capable of using cluster nodes in parallel.
> But you may work with special software which allows you to split your 
> database between cluster nodes. You may find the corresponding
> information at the mysql site or search the Beowulf mailing list archive
> (if I remember right, there was some discussion of databases on clusters
> here).
>
> Yours
> Mikhail Kuzminsky
> Zelinsky Institute of Organic Chemistry
> Moscow 
>
>>
>> i am using MySQL-4.0.
>>
>> can u guide me please..
>>
>> thank you in advance..
>>
>>
>> -- Rahul..
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf


From rokrau at yahoo.com  Mon Feb  7 12:18:00 2005
From: rokrau at yahoo.com (Roland Krause)
Date: Mon, 7 Feb 2005 12:18:00 -0800 (PST)
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
In-Reply-To: <20050207184108.GA1364@greglaptop.internal.keyresearch.com>
Message-ID: <20050207201800.97502.qmail@web52909.mail.yahoo.com>

Greg,
thanks a lot for this hint. I will try it. 

Quick question: So this will let me sbrk all the available memory then?
Is there a way to tell it to allocate all available memory with mmap? I
used to hack the kernel and change TASK_UNMAPPED_BASE in the kernel in
order to get all memory from the box in one large chunk. I guess I
should have raised the value instead of lowering it.

I really would like to actually find some docs about this...

Again thanks!
Roland

--- Greg Lindahl <lindahl at pathscale.com> wrote:

> > The problem is that malloc returns an address that is way beyond
> > 8billion which is not what I had expected.
> 
> This e-vile hack makes it produce something lower in memory. What it does
> is turn off the feature of glibc's malloc algorithm that has it mmap() large
> malloc()s. Stuff into a .c, link the .o into your application.
> 
> -- greg
> 
> #include <stdio.h>
> #include <malloc.h>
> 
> static void mem_init_hook(void);
> static void *mem_malloc_hook(size_t, const void *);
> static void *(*glibc_malloc)(size_t, const void *);
> /* glibc calls this hook once, when malloc is first initialized. */
> void (*__malloc_initialize_hook)(void) = mem_init_hook;
> 
> static void mem_init_hook(void)
> {
>   /* Allow zero mmap()ed chunks, so large requests come from the heap. */
>   mallopt (M_MMAP_MAX, 0);
> }
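
A note on the quoted hack for readers on newer systems: the mallopt() call
can also be made directly, before the first large allocation, without going
through the initialization hook. This matters because, per the
malloc_hook(3) manual page, the __malloc_initialize_hook variable was
removed from the glibc API in version 2.24. The sketch below assumes glibc
on Linux; the function names disable_malloc_mmap and alloc_from_heap are
made up for illustration and are not from Greg's original file.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* Ask glibc never to satisfy malloc() with mmap(), so that large requests
 * are served from the brk heap instead. Returns 1 on success and 0 on
 * error, which is mallopt()'s own convention. */
int disable_malloc_mmap(void)
{
    return mallopt(M_MMAP_MAX, 0);
}

/* Hypothetical helper: apply the setting, then allocate as usual. */
void *alloc_from_heap(size_t nbytes)
{
    int ok = disable_malloc_mmap();
    assert(ok == 1);
    return malloc(nbytes);
}
```

Calling disable_malloc_mmap() early in main() should have the same effect
as the hook for every later malloc() in the process, since mallopt()
settings apply process-wide.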




From rene at renestorm.de  Mon Feb  7 13:56:27 2005
From: rene at renestorm.de (rene)
Date: Mon, 7 Feb 2005 22:56:27 +0100
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
In-Reply-To: <4207BD72.1070505@penguincomputing.com>
References: <web-511567@free.net> <4207BD72.1070505@penguincomputing.com>
Message-ID: <200502072256.27614.rene@renestorm.de>

HI,
> MySQL-4.1 has cluster support according to
> http://dev.mysql.com/downloads/cluster/
As far as I know they use the ndbd daemon to create the db-nodes, but
every node has full database access. It isn't shared over the disks.
Just in case you have a really huge db.

http://www.emicnetworks.com/
has its own implementation too

-- 
Rene Storm
@Cluster


From rene at renestorm.de  Mon Feb  7 15:37:05 2005
From: rene at renestorm.de (rene)
Date: Tue, 8 Feb 2005 00:37:05 +0100
Subject: [Beowulf] mpich future
In-Reply-To: <Pine.LNX.4.58.0502071605380.31726@terra.mcs.anl.gov>
References: <200502071853.02126.rene@renestorm.de>
	<200502072246.02830.rene@renestorm.de>
	<Pine.LNX.4.58.0502071605380.31726@terra.mcs.anl.gov>
Message-ID: <200502080037.05409.rene@renestorm.de>

Hi Rob,
> The MQbench project does look interesting.  Sort of a GUI version of
> SkaMPI?
It's something like the Pallas benchmark. But not all MPI calls are 
implemented yet.
But it's nice to choose a bunch of nodes and then a second one in the same 
application and see the differences.
It's ordinary C MPI surrounded by a C++ Qt gui.

> If you write an MPI program, it should work with all MPI implementations
> (modulo missing MPI-2 features).  It will not necessarily cleanly link
> with any arbitrary library, in the same way that a C program will not
> dynamically link with any arbitrary C library.
>
> So there is always going to be an issue of recompilation; is that your
> second concern?
Yes it is.
It's probably possible to make software packages available for common linux 
distributions. But if you have to consider several mpi implementations as well,
that could be a lot of packages.

If you have written a major application like ls-dyna you can say:
Take this linux, this compiler and this mpi and you get our compiled 
version, but nobody will alter their cluster for an add-on program like an 
mpi-copy tool.
So the only choice is to go opensource.

But in some areas it is important that (small or large) companies make 
professional, supported software available. But this isn't easy with mpi.

Regards,
Rene


From hasan at grant.phys.subr.edu  Mon Feb  7 18:15:25 2005
From: hasan at grant.phys.subr.edu (Saleem Hasan)
Date: Mon, 7 Feb 2005 20:15:25 -0600 (CST)
Subject: [Beowulf] Newbie question on mpich2 installation
Message-ID: <Pine.SGI.3.96.1050207201438.48884G-100000@grant.phys.subr.edu>


Hello all,

I apologise for what may be a very simple issue but is giving me trouble.
I would really appreciate some advice.

For learning the setup of a cluster, I have installed mpich2 on a linux
machine with Red Hat 8.0. I have a second machine RH 8.0. w2 is the master
and w1 is the slave. I have installed mpich2 on w2 (/home/mpi) and used
nfs to share /home with w1. I have also setup passwordless ssh between w1
and w2. 

I am able to bring up mpd on the local machine (w2) and do mpdtrace and
mpdallexit. I am following the installation procedure from the MPICH2
home.

I am unable to boot mpd on the slave. The first time I ran
mpdboot -n 2 -f /home/mpi/mpd.hosts, 
I got the message that there was no mpd.conf file on w1 and that this could be
a reason for the mpd not coming up on the slave. I added an mpd.conf
(secretword) to /etc on the slave also. Now I get a different message

[root at w2 mpich2-1.0]# mpdboot -n 2 -f /home/mpi/mpd.hosts
mpdboot_w2.maverick.net_0 (mpdboot 357): error trying to start mpd(boot)
at 1 w1.maverick.net; output:
mpdboot_w1_1 (err_exit 379): mpd failed to start correctly on w1
  reason: 1: invalid msg from mpd :{}:
mpdboot_w1_1 (err_exit 385):   contents of mpd logfile in /tmp:
     logfile for mpd with pid 1654
mpdboot_w2.maverick.net_0 (err_exit 379): mpd failed to start correctly on
w2.maverick.net

Even though the message says mpd failed to start correctly on w2 (last
line), mpdtrace gives w2. 

The log file in w1 (slave) states the following
logfile for mpd with pid 1654
w1_1060 failed ; cause: unable to obtain socket for rhs in ring
    traceback: [('/home/mpi/mpich2-install/bin/mpd.py', '1192',
'_enter_existing_ring'), ('/home/mpi/mpich2-install/bin/mpd.py', '173',
'_mpd_init'), ('/home/mpi/mpich2-install/bin/mpd.py', '1374', '?')]

Thank you very much.

Saleem Hasan


From list-beowulf at onerussian.com  Tue Feb  8 06:16:19 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Tue, 8 Feb 2005 09:16:19 -0500
Subject: [Beowulf] cheap 48 port gigabit ethernet switch w/ jumbo frames?
In-Reply-To: <41EFF701.60905@pa.msu.edu>
References: <200501201700.j0KH0PfQ032360@bluewest.scyld.com>
	<41EFF701.60905@pa.msu.edu>
Message-ID: <20050208141619.GM2996@washoe.rutgers.edu>

In my latest research on switches I've found the SMC8648T (48 ports), which
supports 9K jumbo frames, costs $2400, and is managed.

Does anyone have experience with it, or should I also check out
Nortel switches, which are approx 50% more expensive?

-- 
Yarik


On Thu, Jan 20, 2005 at 01:22:57PM -0500, Tom Rockwell wrote:
> Hi,

> I'm looking for a switch that will be used for NFS traffic on a cluster 
> of about 40 nodes.  The nodes will have Broadcom 5704 ethernet.  From 
> what I've read, jumbo frames is important for getting the best NFS 
> performance over gigabit ethernet.

> D-link and Netgear have newer 48 port switches priced below managed 
> switches.  The D-link is model DGS-1248T 
> http://dlink.com/products/?sec=2&pid=367 and the Netgear is model GS748T 
> http://netgear.com/products/details/GS748T.php.  Each are about $1200 or 
> so.  I'm unable to find info on their websites specifying whether these 
> switches support jumbo frames.  Anyone know?

> Thanks,
> Tom Rockwell
> Michigan State University

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
             Key  http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint   3BB6 E124 0643 A615 6F00  6854 8D11 4563 75C0 24C8


From rossen at VerariSoft.Com  Tue Feb  8 06:23:57 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Tue, 08 Feb 2005 09:23:57 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
Message-ID: <4208CB7D.6070309@verarisoft.com>

Vincent,

Your questions related to the actual cost (in terms of processor 
overhead) of achieving the latency numbers that are posted by the 
network vendors are very interesting and have important aspects, which 
are often overlooked or paid little attention to.

Warning: This posting is long and may be boring.

The ping-pong tests that are often used for measuring the communication 
latency (from user level) are an extreme and often unrealistic mode of 
operation of the parallel system. Sending bytes across the software 
layers and over the network is a fundamental factor for contributing to 
fast computation but without looking at the cost and the likelihood (as 
Patrick mentioned "crossing the fingers") of producing the best quoted 
latencies, you don't usually get the whole picture.

Besides the network hardware/firmware, the implementation (and use) of 
the low-level network messaging layer (GM, ELAN, VERBS, etc) and the MPI 
library are also of great importance. The design space of parallel 
applications (size of messages, frequency of messages, regularity in 
space and time, synchrony, communication pattern, etc) is too large to 
hope that any single mode of the entire system would always be 
optimal. In this regard, the ping-pong latency test, exercising only one 
of these modes, obviously gives you insufficient information on how to 
predict the behavior of the communication sub-system in realistic scenarios.
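
For reference, the arithmetic behind a ping-pong number can be sketched as
follows. This is a minimal illustration, assuming the usual convention that
the quoted latency is half the averaged round-trip time and that 1 MB means
10^6 bytes; the function names are hypothetical and no particular benchmark
is implied.

```c
#include <assert.h>
#include <stddef.h>

/* One-way latency in microseconds from a ping-pong run: each iteration is
 * a full round trip, so the elapsed time is divided by 2 * iterations. */
double one_way_latency_us(double elapsed_us, size_t iterations)
{
    assert(iterations > 0);
    return elapsed_us / (2.0 * (double)iterations);
}

/* Bandwidth implied by a one-way latency. Bytes per microsecond is
 * numerically equal to MB/s when 1 MB = 10^6 bytes. */
double bandwidth_mb_per_s(size_t message_bytes, double latency_us)
{
    assert(latency_us > 0.0);
    return (double)message_bytes / latency_us;
}
```

Applied to the 128-byte row of the MX table quoted further down (3.977 us),
bandwidth_mb_per_s(128, 3.977) gives about 32.185 MB/s, matching the
bandwidth column. Both columns therefore carry the same single measurement,
which is part of why one ping-pong figure says so little about the other
modes of operation.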

In order to address this issue, our MPI/Pro implementation (plug!) has 
long had different modes of using the network and the low-level 
messaging layer for all major high-speed networks as well as for TCP/IP 
communication. We usually support at least 2 modes - one that optimizes 
short message latency (as many of the other MPI implementations do), at 
the expense of increased CPU overhead, and one that trades some latency 
(communication overhead) for low CPU overhead, higher predictability, 
and much better opportunity for overlapping and pipelining. We have 
carried out studies for quantifying the degree of overlapping that these 
different modes can achieve (using only our MPI implementation, i.e., 
comparing apples to apples) and we have obtained some interesting results.

When you combine all of the complexities of the communication sub-system 
(network hardware/firmware, messaging layer, MPI library), the 
application, and the OS (let's only take the virtual memory system, 
process/thread scheduling, and interrupt/signal handling) you get a 
highly probabilistic system, which is hard to quantify and predict by a 
single ping-pong latency number.

Our experiments have shown that using a different MPI/Pro mode on the 
same application code, executed on the same parallel system, can yield 
sometimes substantially different performance results. This shows that 
the implementation and the use of the middleware alone can have a 
substantial impact on your performance and scalability. Further, the 
application code can be written (not always but often) to take advantage 
of asynchrony, pipelining, and overlapping. Implementing these 
mechanisms in your code (using MPI) often doesn't cost much, but can 
speed up your application quite a bit on many parallel systems (running 
middleware with the right design) and in the worst case give you no 
benefit (on systems that don't provide adequate support for these 
mechanisms).
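
A back-of-envelope model of why overlap can pay off, offered as an
illustration only and not as a measurement of MPI/Pro or any other library.
It assumes the transfer can proceed entirely in the background once posted
(for example with MPI_Isend/MPI_Irecv followed later by MPI_Wait), and it
lumps the extra cost of the overlapping mode into one assumed parameter,
overhead_us. All names and numbers are hypothetical.

```c
#include <assert.h>

/* Per-step time when communication blocks the CPU: the two costs add up. */
double step_time_blocking(double compute_us, double comm_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0);
    return compute_us + comm_us;
}

/* Per-step time when the transfer runs fully in the background: the longer
 * of the two activities hides the shorter one, plus an assumed fixed
 * overhead for posting and completing the nonblocking operation. */
double step_time_overlapped(double compute_us, double comm_us,
                            double overhead_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0 && overhead_us >= 0.0);
    double longer = compute_us > comm_us ? compute_us : comm_us;
    return longer + overhead_us;
}
```

With 100 us of computation and 40 us of communication per step, the
blocking estimate is 140 us and the overlapped estimate is 105 us for a
5 us overhead. On a system that cannot make progress in the background,
the overlapped case degenerates to the blocking one, which is the "no
benefit" worst case mentioned above.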

So, if you really want to optimize the use of your cluster resources, in 
addition to the network and compute nodes, you will need to also 
consider the communication middleware and the design of your application 
and how they all work together.

-- 
Rossen Dimitrov
Verari Systems Software, Inc.
http://www.verarisoft.com

Vincent Diepeveen wrote:
> At 21:27 5-2-2005 -0500, Patrick Geoffray wrote:
> 
>>Hi Vincent,
>>
>>Vincent Diepeveen wrote:
>>
>>>>>CPU's are 100% busy and after i know how many times a second the network
>>>>>can handle in theory requests i will do more probes per second to the
>>>>>hashtable. The more probes i can do the better for the game tree search.
>>>>
>>>>With a gigE network that sounds like 40us or so.  With Myrinet or IB
>>>>it's in the 4-6us range.  If you bought dual opterons with the special
>>>
>>>
>>>At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
>>
>>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
>>that includes fibers and a switch in the middle:
>>
>>   Length   Latency(us)    Bandwidth(MB/s)
>>        0       2.684          0.000
>>        1       2.874          0.336
>>        2       2.898          0.690
>>        4       2.978          1.343
>>        8       2.965          2.699
>>       16       2.993          5.347
>>       32       3.409          9.388
>>       64       3.563         17.960
>>      128       3.977         32.185
>>      256       5.699         44.916
>>
>>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I 
>>didn't hear about noticeable SCI clusters in a long time.
>>
>>
>>>I am very impressed by the quadrics and dolphin cards. Probably by
>>>infinipath too when i check them out. Will do. 
>>>
>>>I'm not so impressed yet by myrinet actually, but if cluster builders can
>>>earn a couple of hundreds of dollars more on each node i'm sure they'll
>>>do it.
> 
>>I don't think Myrinet would be the cheapest, I am sure you can get a 
>>better deal from desperate interconnect vendors.
>>
>>What does not impress you in Myrinet ?
> 
> 
> Thanks for your kind answer Patrick,
> 
> Obviously i mentioned that number because i read it elsewhere.
> 
> Well a number of points bother my mind from which majority is true for
> others as well. But first let me note that i'm not against myrinet in
> general. I am just trying to solve a very specific case. For that specific
> case i'm not so impressed.
> 
> Note that so far i didn't find any desperate vendor. For sure quadrics
> doesn't look desperate to me, they aren't even selling old cards anymore
> though they must have still thousands of them lying at home from returned
> upgraded networks. Finding second hand highend cards seems to be very seldom.
> 
> First of all i'm interested in how quick i can get 4-64 bytes from remote
> memory. So not from some kind of network card cache, as myrinet doesn't
> have some megabytes on chip, but just a few tens of kilobytes. The memory
> has to come therefore from the remote node's main memory, at a random address
> in the main memory. No streaming at all happens. that 400 ns extra that the
> TLB gives is definitely not the problem i guess. 
> 
> The problem for me is to understand: "how do you get that memory at a
> cluster?"
> 
> A latency on paper says of course nothing when you can't actually get it
> within that time.
> 
> "Paper supports everything."
>     Arturo Ochoa (Caracas, Venezuela)
> 
> I hope everyone realizes that an important consequence from beowulf
> clusters is that you actually want to *use* all those cpu's you have to
> your avail.
> 
> So every cpu has a program running that eats 100% system time. Because if
> it wouldn't use 100% system time, you wouldn't need a cluster!
> 
> From that 100% system time obviously you must be prepared to give away some
> to serve other nodes as quickly as possible doing a read. 
> 
> All latencies i see quoted at all hardware sites, it is very hard to figure
> for me out whether that's a latency that is supported by paper, or whether
> it's a practical latency i can take into account as a programmer with all
> software layers overhead when each cpu is 100% running a program.
> 
> Secondly, but as i'm not a cluster expert i don't know how to avoid that,
> it's of course a big LOSS in sequential speed if my program each few
> instructions must check whether there is some MPI message to get handled.
> If i check a lot that will slow down my program 20 times. If i don't check
> a lot, other cpu's will have to wait longer and that defeats the purpose of
> a fast network card.
> 
> Factor 20 is about the slowdown of the average 'old' supercomputer
> chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens),
> P.Conners (Paderborn-Siemens), cilkchess (MIT). I've been playing with my
> own eyes against those programs in world champs and despite that it has
> happened that i played at the same hardware with a similar amount of cpu's
> and a program having factor 100 more chessknowledge (which slows down the
> program *considerable*), the actual speed at which the program searches
> nodes was up to factor 5-10 faster. 
> 
> Now a few years ago this was not a major problem because for example
> Cilkchess which obviously ran factor 20-40 times slower than it could, used
> 1800 processors for example in world champs 1995 (Hong kong) and 512
> processors in world champs 1999 (Paderborn). Of course because 1 processor
> was real real fast compared to the speed of 1 pc processor in those days,
> they practical were searching a lot deeper than pc programs (and both
> played excellent for its days, especially Don Dailey needs to get a big
> compliment for that). 
> 
> However if i show up with 2 pc's and 2 network cards, then it sure matters
> when i lose a lot of speed. 
> 
> Obviously for embarrassingly parallel software this is no issue, but usually
> for embarrassingly parallel software all you need is gigabit ethernet. 
> 
> There are so many MPI applications which are not exactly embarrassingly
> parallel where you see that a decent programmer would be
> doing that 20 times faster on a single cpu. Or to quote someone who has been doing such
> rewriting work for some physical applications that run here and there: "I
> didn't blink my eyes when i managed to speedup an application factor 1000".
> 
> So it is very interesting for us all and me especially to understand how
> *fast* you can get that memory under full load of all the logical cpu's.
> 
> Third each pc has 2 cheapo k7 processors which are a lot slower than opterons.
> 
> Second problem i have is that i can get easily dual k7 pc's from
> chessplayers and they can get bought cheap still. Dual k7 is practical same
> speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with
> 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a
> cheapo dual k7 with registered cas3 RAM. 
> 
> Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and
> also those who do have A64's or P4's usually don't have pci-x onboard
> either. Sure there is boards that have them and i'm sure that if you make a
> network
> 
> Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX
> mainboards and claim somewhere a paper latency of 1.x us. 
> 
> What is the achieved read speed to remote memory myrinet gets at 64 bits /
> 66Mhz in software, so ready to use 4-64 bytes for applications? 
> 
> I'm not asking it to be accurate within 400ns, as that's the delay you'll
> have from TLB thrashing on the remote node. But accuracy within 1.5 us would be
> quite nice.
> 
> First of all for integer intensive applications i'm doing fastest processor
> is opteron, k7 comes second and P4 comes third. Exception is a P4 machine
> equipped with the most expensive stuff (2-2-2 ram and all banks filled)
> good mainboard and northwoods and overclocked at the mainboard. However for
> that price a dual opteron can get bought and it just blows away that P4
> bigtime.
> 
> Every year that new software gets released of course that P4 gets slower,
> because newer software only gets more and more complex with more options
> and will fit less perfectly in P4's small tiny caches, let alone when we
> get a lot of 64 bits programs. They won't fit at all in those tiny slow
> caches.
> 
> So until the dual core opterons arrive at low cost, obviously you can make
> dual k7 nodes for just a few hundreds of dollar a node. 
> 
> When adding new nodes which in the future no doubt are dual opteron, you
> still run further with those dual k7 nodes and want to mix them obviously
> with dual opterons. Is that possible?
> 
> 
> 
> 
>>Patrick
>>-- 
>>
>>Patrick Geoffray
>>Myricom, Inc.
>>http://www.myri.com
>>
>>
> 


From josip at lanl.gov  Tue Feb  8 08:54:07 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Tue, 08 Feb 2005 09:54:07 -0700
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <4208CB7D.6070309@verarisoft.com>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
	<4208CB7D.6070309@verarisoft.com>
Message-ID: <4208EEAF.105@lanl.gov>

Rossen Dimitrov wrote:
> 
> So, if you really want to optimize the use of your cluster resources, in 
> addition to the network and compute nodes, you will need to also 
> consider the communication middleware and the design of your application 
> and how they all work together.

Are there any projects that would expand the ability of MPI application 
programmers to provide performance hints to the MPI library?  For 
example, hints indicating that certain messages are latency sensitive 
whereas others need optimal bandwidth and low CPU overhead?

One can already obtain some MPI performance data through the PMPI 
mechanism, and Rossen is helping develop MPI PERUSE 
(http://www.mpi-peruse.org/) intended to provide even more detail.  I'm 
asking about the other direction of information flow, i.e. performance 
hints from the application to the MPI layer...

Ideally, such hints would be propagated fairly close to the actual 
hardware, e.g. application hints would guide the MPI library in 
selecting improved interrupt mitigation strategies used by the network 
interfaces (assuming that a suitable API exists for the underlying 
hardware).

Sincerely,
Josip


From twilcox at terrascale.com  Tue Feb  8 09:35:27 2005
From: twilcox at terrascale.com (Tim Wilcox)
Date: Tue, 8 Feb 2005 12:35:27 -0500
Subject: [Beowulf] Call for participation StorCloud at SC2005
Message-ID: <011001c50e04$9757f830$a201a8c0@deepthoughthp>

Hi all,

The StorCloud applications and Challenge submission form is now online at 
http://www.sc-submissions.org and we are currently accepting submissions. The 
instructions are posted at 
http://www.vtksolutions.com/StorCloud/2005/StorCloudAppFormHelp.html.

The deadline is March 31st.

Tim Wilcox
Applications Challenge Committee


From natorro at fisica.unam.mx  Tue Feb  8 10:27:39 2005
From: natorro at fisica.unam.mx (Carlos Lopez Nataren)
Date: Tue, 08 Feb 2005 12:27:39 -0600
Subject: [Beowulf] G5 beowulf cluster
Message-ID: <1107887259.2124.16.camel@natorro>

Hello, we at the physics institute in Mexico have got four Xserve G5s
and we would like to use them as a beowulf. My first question is about what
operating system to use. We've been using linux for our other clusters,
even a G4 one, but I haven't seen anything about G5. Is there a linux
distribution that runs well on this type of machine? Or had I better use
the operating system they came with? Or is there any documentation out
there outlining the way they should be configured to be used as a
beowulf?

Thank you very much for any help.
natorro

-- 
Carlos Lopez Nataren <natorro at fisica.unam.mx>
Instituto de Fisica, UNAM


From dag at sonsorol.org  Tue Feb  8 10:51:25 2005
From: dag at sonsorol.org (Chris Dagdigian)
Date: Tue, 08 Feb 2005 13:51:25 -0500
Subject: [Beowulf] G5 beowulf cluster
In-Reply-To: <1107887259.2124.16.camel@natorro>
References: <1107887259.2124.16.camel@natorro>
Message-ID: <42090A2D.4090204@sonsorol.org>


The Mac OS X server OS that came with your Xserve G5s is quite good and 
you'll find all the developer tools, compilers, cluster scheduler 
systems like Grid Engine or LSF are all working and well supported. The 
scientific community of G5/Xserve users is growing quite rapidly.

If you are familiar with Linux the learning curve for OS X is not all 
that bad.

http://www.apple.com/science/  -- may help
http://www.apple.com/server/macosx/ -- also

You may want to make the OS decision based on what physics apps you need 
to run and how *they* are supported on OS X vs Linux. I can't help you 
there as I'm a life sciences / biology person.

In our lab we've installed both Gentoo Linux as well as Yellow Dog on 
Xserve G5s. Both seemed to install and run smoothly but for production 
clustering work we still use the OS X Server OS.

-Chris


Carlos Lopez Nataren wrote:

> Hello, we at the physics institute in Mexico have got four Xserve G5s
> and we would like to use them as a beowulf. My first question is about what
> operating system to use. We've been using linux for our other clusters,
> even a G4 one, but I haven't seen anything about G5. Is there a linux
> distribution that runs well on this type of machine? Or had I better use
> the operating system they came with? Or is there any documentation out
> there outlining the way they should be configured to be used as a
> beowulf?
> 
> Thank you very much for any help.
> natorro
> 

-- 
Chris Dagdigian, <dag at sonsorol.org>
BioTeam  - Independent life science IT & informatics consulting
Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net


From hahn at physics.mcmaster.ca  Tue Feb  8 11:26:32 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 8 Feb 2005 14:26:32 -0500 (EST)
Subject: [Beowulf] G5 beowulf cluster
In-Reply-To: <1107887259.2124.16.camel@natorro>
Message-ID: <Pine.LNX.4.44.0502081415060.14421-100000@coffee.psychology.mcmaster.ca>

> Hello, we at the physics institute in Mexico have got four Xserve G5s
> and we would like to use them as a beowulf. My first question is about what
> operating system to use. We've been using linux for our other clusters,

I would strongly encourage you to try both Linux and Mac OS X
because doing so would permit a VERY interesting and useful comparison.
the comparison is interesting because Mac OS X is tuned very differently
from Linux - even more than the differences it inherits from its *BSD 
heritage.  for instance, if you run LMBench on the two machines, you'll
see that certain syscalls are drastically different in speed.

obviously, Macophiliacs and Apple sales reps would be scandalized
at this idea.  but the truth is that the Xserve hardware is reasonably
competitive with dual-xeon alternatives, and in a cluster, no one really
cares about PDF imaging models or other traditional Apple qualities.
what matters is things like TCP stack efficiency, syscall overheads, etc.

> even a G4 one, but I haven't seen anything about G5. Is there a linux
> distribution that runs well on this type of machine? Or had I better use

I've heard of yellowdog linux; there are probably many other flavors
(perhaps even a fedora version?).  ultimately, the distro is almost 
irrelevant to a cluster, since it's the kernel, booting and FS that 
matter, not .999 of userspace.

incidentally, I've measured the power consumption of a ppc970fx (90nm, 2.0
GHz) system, under load, and found it to be marginally cooler than, say
a similar-speed HP DL145 (dual-opteron).  we're talking 200 vs 220W.
this is old news; what's new is that up-coming 90nm Opterons appear to 
change the picture fairly dramatically, since they drop the TDP from 
89 to 65W.  and of course, for those of you who are cache-friendly,
dual-core opterons at 95W TDP are rather attractive.  (ie, dual ppc970/2.0's 
with 3.2 GB/s apiece vs four DC opteron/2.2's with 3.2 GB/s apiece.
100W/p vs maybe 60W/p, hmmm.)

regards, mark hahn.


From dtj at uberh4x0r.org  Tue Feb  8 11:35:49 2005
From: dtj at uberh4x0r.org (Dean Johnson)
Date: Tue, 08 Feb 2005 13:35:49 -0600
Subject: [Beowulf] G5 beowulf cluster
In-Reply-To: <1107887259.2124.16.camel@natorro>
References: <1107887259.2124.16.camel@natorro>
Message-ID: <1107891349.5042.10.camel@terra>

On Tue, 2005-02-08 at 12:27 -0600, Carlos Lopez Nataren wrote:
> Hello, we, at the physics institute in Mexico have got four Xserve G5
> and we would like to use them as a beowulf, my first doubt is about what
> operating system to use, we've been using linux for our other clusters,
> even a G4 one, but I haven't seen anything about G5, is there a linux
> distribution that runs well on this type of machines? or do I better use
> the operating system they came with? or are there any documentation out
> there outlining the way they should be configured to be used as a
> beowulf?
> 

The native OSX should be fine. It sort of depends on the applications that
you intend on using. Lots of the major apps seem to have efforts to make
them work well on the altivec machines. There was a problem, something
about semaphores I believe, that caused problems with MPI apps. I ran
into it trying to get benchmark numbers for Amber and Gromacs on 4 G5
towers that I was playing with.

	-Dean


From idooley at isaacdooley.com  Tue Feb  8 13:55:12 2005
From: idooley at isaacdooley.com (Isaac Dooley)
Date: Tue, 08 Feb 2005 15:55:12 -0600
Subject: [Beowulf] G5 beowulf cluster
In-Reply-To: <200502082000.j18K09iS022564@bluewest.scyld.com>
References: <200502082000.j18K09iS022564@bluewest.scyld.com>
Message-ID: <42093540.6010801@isaacdooley.com>

I've used the new ~600 node G5 Xserve cluster named turing: 
http://www.cse.uiuc.edu/turing/

It works, and is using OSX 10.3. I've used YellowDog Linux personally on 
a few G3 and G4 machines, and have had good experiences. If you want to 
do very fine grained parallel computation, one important thing to do is 
to disable all unneeded system daemons. There are a bunch of these in 
YDL and OSX. Also, depending on your needs for 64-bit addressing, you 
may need YDL until OSX 10.4 is released (it is available to developers 
now if you really want it). I'm not sure if you can disable the GUI for 
OSX, which may be a minor resource waster. Also you may want to consider 
Darwin without OSX. Darwin is the open source kernel used by OSX.

One thing we've noticed with our OSX is that connect() sometimes takes 
too long to complete. Hopefully I can figure out why this is.

Isaac Dooley

>Hello, we, at the physics institute in Mexico have got four Xserve G5
>and we would like to use them as a beowulf, my first doubt is about what
>operating system to use, we've been using linux for our other clusters,
>even a G4 one, but I haven't seen anything about G5, is there a linux
>distribution that runs well on this type of machines? or do I better use
>the operating system they came with? or are there any documentation out
>there outlining the way they should be configured to be used as a
>beowulf?
>  
>


From ole at scali.com  Wed Feb  9 04:24:15 2005
From: ole at scali.com (Ole W. Saastad)
Date: Wed, 09 Feb 2005 13:24:15 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <200502070951.j179oSDB010742@bluewest.scyld.com>
References: <200502070951.j179oSDB010742@bluewest.scyld.com>
Message-ID: <1107951855.5682.31.camel@pc-2.office.scali.no>

Dear all,
this thread reminded us that we promised to post HPCC numbers 
depicting differences between interconnects, not interconnects 
and software stacks in combination.  The numbers below stem 
from a fairly old system (400MHz FSB, PCI-X, etc.) and do 
not reflect the absolute performance achievable on modern hardware.
Similarly, the NICs used are _not_ the latest and greatest.

The intent is simply to show the effect of different interconnects, 
on the four simple (excluding PTRANS etc) communication metrics
measured by HPCC. (see web page http://icl.cs.utk.edu/hpcc/)


                       Gigabit Eth.   SCI    Myrinet  InfiniBand      
Max Ping Pong Latency :   36.32       4.44     8.65      7.36
Min Ping Pong Bandw.  :  117.01     121.31   245.31    359.21 
Random Ring Bandw.    :   37.59      47.70    69.30     18.02
Random Ring Latency   :   42.17       8.91    19.02      9.94


Latency in microseconds and bandwidth in MBytes/s (1e6 bytes/s). 
The HPCC version is 0.8 and the very same binary (and Scali MPI
Connect library) is used for all interconnects (change of 
interconnect is done by -net tcp|sci|gm0|ib0 on the command 
line).


Cluster information :

16 x Dell PowerEdge 2650 2.4 GHz
Dell PowerConnect 5224 GBE switch.
Mellanox HCA
Infinicon InfiniIO 3000
Myrinet 2000 
Dolphin SCI 4x4 Torus

Scali MPI Connect version : scampi-3.3.7-2.rhel3 
Mellanox IB driver version : thca-linux-3.2-build-024
GM version : 2.0.14


-- 
Ole W. Saastad, Dr.Scient. 
Manager Cluster Expert Center
dir. +47 22 62 89 68
fax. +47 22 62 89 51
mob. +47 93 05 74 87
ole at scali.com

Scali - www.scali.com
High Performance Clustering


From joachim at ccrl-nece.de  Thu Feb 10 01:15:54 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 10 Feb 2005 10:15:54 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420580AD.5050003@myri.com>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
	<420580AD.5050003@myri.com>
Message-ID: <420B264A.7050004@ccrl-nece.de>

Patrick Geoffray wrote:
> Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
> that includes fibers and a switch in the middle:
> 
>    Length   Latency(us)    Bandwidth(MB/s)
>         0       2.684          0.000
[...]

Nice work, Patrick - but such numbers are of little value if the 
benchmark used to get them is not stated. I'd recommend mpptest (from 
MPICH). Plus, the compiler etc. is also of interest when it comes to 
latencies.

   Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From joachim at ccrl-nece.de  Thu Feb 10 01:28:27 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 10 Feb 2005 10:28:27 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <4208EEAF.105@lanl.gov>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>	<4208CB7D.6070309@verarisoft.com>
	<4208EEAF.105@lanl.gov>
Message-ID: <420B293B.9060604@ccrl-nece.de>

Josip Loncaric wrote:
> Are there any projects that would expand the ability of MPI application 
> programmers to provide performance hints to the MPI library?  For 
> example, hints indicating that certain messages are latency sensitive 
> whereas others need optimal bandwidth and low CPU overhead?

MPI offers a lot of different send modes already. If you use a ready 
send, the MPI library can assume that you are interested in low-latency 
delivery; if you use a non-blocking send, it should be o.k. for the 
library to assume that you are interested in overlapping computation and 
communication and so on. On the receiving side, a hybrid 
polling-blocking approach for receiving can be applied.
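
The hybrid polling-blocking receive mentioned above can be sketched outside
of MPI. The following is only an illustration, assuming a plain Python queue
stands in for the incoming message channel; hybrid_receive, spin_limit and
block_timeout are hypothetical names, not part of any MPI library. The
receiver spins for a bounded number of polls (low latency if the message is
nearly there), then gives up the CPU and blocks.

```python
import queue
import threading

def hybrid_receive(inbox, spin_limit=1000, block_timeout=1.0):
    """Poll briefly for a message, then fall back to a blocking wait.

    Returns (message, mode) where mode is "polled" or "blocked".
    Raises queue.Empty if nothing arrives within block_timeout.
    """
    # Phase 1: busy-poll. Cheap when the message is already queued
    # or arrives almost immediately; costs CPU while spinning.
    for _ in range(spin_limit):
        try:
            return inbox.get_nowait(), "polled"
        except queue.Empty:
            pass
    # Phase 2: stop burning cycles and block until a message arrives
    # or the timeout expires.
    return inbox.get(timeout=block_timeout), "blocked"

if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put("early")
    print(hybrid_receive(inbox))  # message already queued: found by polling

    # A late sender: the message shows up long after the short spin phase.
    threading.Timer(0.2, inbox.put, args=("late",)).start()
    print(hybrid_receive(inbox, spin_limit=10))
```

The spin limit is the tuning knob: a larger value favours latency, a smaller
one favours low CPU use, which is the trade-off discussed in this thread.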

I do not think that there is serious demand for more explicit "steering" 
of the MPI library. Users make much too little use of the existing ways 
(that I described above). But, if you really want to do such stuff, you 
could use (implementation-specific) attributes which you assign to 
different communicators, one for "low-latency" delivery and one for 
"low-cpu", or whatever. But this has more effect on the sending side 
than on the receiving side. I wouldn't invest work into this unless you 
have very good reasons. Esp. as this would be non-portable, few users 
would ever take notice.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From landman at scalableinformatics.com  Thu Feb 10 13:40:52 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 10 Feb 2005 16:40:52 -0500
Subject: [Beowulf] A thread-safe PRNG for an OpenMP progra
Message-ID: <420BD4E4.4080401@scalableinformatics.com>

Hi folks:

    I need to get a thread-safe pseudo-random number generator.  All I 
have found online was SPRNG which is set up for MPI.  Anyone have a 
quick pointer to their favorite thread safe PRNG that works well in OpenMP?

    Thanks.

Joe

-- 
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com


From maurice at harddata.com  Thu Feb 10 12:35:06 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Thu, 10 Feb 2005 13:35:06 -0700
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>
References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>
Message-ID: <420BC57A.5060007@harddata.com>

----------------------------------------------------------------------

>Message: 1
>Date: Thu, 10 Feb 2005 10:15:54 +0100
>From: Joachim Worringen <joachim at ccrl-nece.de>
>Subject: Re: [Beowulf] Home beowulf - NIC latencies
>
>Patrick Geoffray wrote:
>  
>
>>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), 
>>that includes fibers and a switch in the middle:
>>
>>   Length   Latency(us)    Bandwidth(MB/s)
>>        0       2.684          0.000
>>    
>>
>[...]
>
>Nice work, Patrick - but such numbers are of little value if the 
>benchmark used to get them is not stated. I'd recommend mpptest (from 
>MPICH). Plus, the compiler etc. is also of interest when it comes to 
>latencies.
>
>   Joachim
>
>  
>
True, but it does not change the facts.
Further, all of these lovely benchmarks lack one really important detail:
Comparisons between different interfaces and drivers MUST show CPU usage 
while running them.
If I have a fantastic device that uses infinitely small time (latency) 
and moves huge amounts of data (bandwidth) but in doing so it takes 80% 
of a CPU, we do not have a useful solution.
That is where Myrinet and Quadrics shine, and also this is the detail 
that the various IB vendors carefully dance around.
All the communications performance in the world does not matter if it 
consumes a large amount of CPU cycles.

A further test that some vendors artfully avoid is the actual latency of 
all nodes in a cluster across the switching device.
I have seen a number of "benchmarks" showing great numbers, but on 
looking closer a great number of them are either on two computers, 
directly connected, or are on switching networks that use a number of 
small switches, and they do not show the worst case latency across all 
the switches, on the greater number of hops.

So, your points are excellent, Joachim,  but I have to say that even 
greater degrees of information are needed before any meaningful 
conclusions may be drawn.

What we all need is some form of useful standardized benchmarks that 
looks like real world code from a number of different disciplines, that 
we can use to test the hardware, so we may compare results in a 
meaningful manner.


With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.  FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3


From lindahl at pathscale.com  Thu Feb 10 18:36:20 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 10 Feb 2005 18:36:20 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420BC57A.5060007@harddata.com>
References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>
	<420BC57A.5060007@harddata.com>
Message-ID: <20050211023619.GB5174@greglaptop.internal.keyresearch.com>

On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote:

> Further, all of these lovely benchmarks lack one really important detail:
> Comparisons between different interfaces and drivers MUST show CPU usage 
> while running them.

No. If you want to look at that, run a real application and watch the
wall time. It's extremely hard to get a good estimate of cpu usage out
of a microbenchmark, and running "top" or /bin/time to do it is
definitely bogus.

> If I have a fantastic device that uses infinitely small time (latency) 
> and moves huge amounts of data (bandwidth) but in doing so it takes 80% 
> of a CPU, we do not have a useful solution..

If large cpu usage is a problem, it will show up nicely in real
application benchmarks.

> What we all need is some form of useful standardized benchmarks that 
> looks like real world code from a number of different disciplines, that 
> we can use to test the hardware, so we may compare results in a 
> meaningful manner.

Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned
benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel
Benchmarks are also good, they are much closer to real apps than
microbenchmarks.

-- greg


From eugen at leitl.org  Fri Feb 11 06:06:43 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Fri, 11 Feb 2005 15:06:43 +0100
Subject: [Beowulf] more details on Cell emerge
Message-ID: <20050211140642.GV1404@leitl.org>


http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT021005084318&mode=print

By: David T. Wang (dwang at realworldtech.com)
	
Updated: 02-10-2005
Back to Basics

The fundamental task of a processor is to manage the flow of data through its
computational units. However in the past two decades, each successive
generation of processors for personal computers has added more transistors
dedicated to increasing the performance of spaghetti-like integer code. For
example, it is well known that typical integer codes are branchy and that
branch mispredict penalties are expensive; in an effort to minimize the
impact of branch instructions, transistors were used to develop highly
accurate branch predictors. Aside from branch predictors, sophisticated cache
hierarchies with large tag arrays and predictive cache prefetch units attempt
to hide the complexity of data movement from the software, and further
increase the performance of single threaded applications. The pursuit of
single threaded performance can be observed in recent years in the proposal
of extraordinarily deeply pipelined processors designed primarily to increase
the performance of single threaded applications, at the cost of higher power
consumption and larger transistor budgets.

The fundamental idea of the CELL processor project is to reverse this trend
and give up the pursuit of single threaded performance, in favor of
allocating additional hardware resources to perform parallel computations.
That is, minimal resources are devoted toward the execution of single
threaded workloads, so that multiple DSP-like processing elements can be
added to perform more parallelizable multimedia-type computations. In the
examination of the first implementation of the CELL processor, the theme of
the shift in focus from the pursuit of single threaded integer performance to
the pursuit of multiply threaded, easily parallelizable multimedia-type
performance is repeated throughout.

CELL Basics

The CELL processor is a collaboration between IBM, Sony and Toshiba. The CELL
processor is expected by this consortium to provide computing power an order
of magnitude above and beyond what is currently available to its competitors.
The International Solid-State Circuits Conference (ISSCC) 2005 was chosen by
the group as the location to describe the basic hardware architecture of the
processor and announce the first incarnation of the CELL processor family.

Members of the CELL processor family share basic building blocks, and
depending on the requirement of the application, specific versions of the
CELL processor can be quickly configured and manufactured to meet that need.
The basic building blocks shared by members of the CELL family of processors
are the following:

    * The PowerPC Processing Element (PPE)
    * The Synergistic Processing Element (SPE)
    * The L2 Cache
    * The internal Element Interconnect Bus (EIB)
    * The shared Memory Interface Controller (MIC) and
    * The FlexIO interface

Each SPE is in essence a private system-on-chip (SoC), with the processing
unit connected directly to 256KB of private Local Store (LS) memory. The PPE
is a dual threaded (SMT) PowerPC processor connected to the SPE's through the
EIB. The PPE and SPE processing elements access system memory through the
MIC, which is connected to two independent channels of Rambus XDR memory,
providing 25 GB/s of memory bandwidth. The connection to I/O is done through
the FlexIO interface, also provided by Rambus, providing 44.8 GB/s of raw
outbound BW and 32 GB/s of raw inbound bandwidth for total I/O bandwidth of
76.8 GB/s. At ISSCC 2005, IBM announced that the first implementation of the
CELL processor has been tested to operate at frequencies above 4 GHz. In the
CELL processor, each SPE is capable of sustaining 4 FMADD operations per
cycle. At an operating frequency of 4 GHz, the CELL processor is thus capable
of achieving a peak throughput rate of 256 GFlops from the 8 SPE's. Moreover,
the PPE can contribute some amount of additional compute power with its own
FP and VMX units.
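
The peak figures quoted above follow from simple arithmetic. The sketch below
reproduces them, assuming (as is conventional, but not stated in the article)
that one fused multiply-add counts as two floating point operations. The
constant and function names are made up for illustration.

```python
# Figures quoted in the article text
SPE_COUNT = 8
FMADD_PER_CYCLE = 4      # per SPE, single precision
CLOCK_GHZ = 4.0
FLEXIO_OUT_GBS = 44.8    # raw outbound bandwidth, GB/s
FLEXIO_IN_GBS = 32.0     # raw inbound bandwidth, GB/s

# Assumption: a fused multiply-add is counted as 2 flops
FLOPS_PER_FMADD = 2

def peak_gflops(spes=SPE_COUNT, clock_ghz=CLOCK_GHZ):
    """Peak single precision GFlops from the SPEs alone (PPE excluded)."""
    return spes * FMADD_PER_CYCLE * FLOPS_PER_FMADD * clock_ghz

if __name__ == "__main__":
    print(peak_gflops())          # 256.0
    print(peak_gflops(spes=1))    # 32.0 per SPE
    # Total raw FlexIO bandwidth, rounded to hide float noise
    print(round(FLEXIO_OUT_GBS + FLEXIO_IN_GBS, 1))
```

The same function gives the peak for a hypothetical 4 SPE part at the same
clock by calling peak_gflops(spes=4).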

Processor Overview

Figure 1 - Die photo of CELL processor with block diagram overlay

Figure 1 shows the die photo of the first CELL processor implementation with
8 SPE's. The sample processor tested was able to operate at a frequency of 4
GHz with Vdd of 1.1V. The power consumption characteristics of the processor
were not disclosed by IBM. However, estimates in the range of 50 to 80 Watts
@ 4 GHz and 1.1 V were given. One unconfirmed report claims that at the
extreme end of the frequency/voltage/power spectrum, one sample CELL
processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180
W of power.

As described previously, the CELL processor with 8 SPE's operating at 4 GHz
has a peak throughput rate of over 256 GFlops. To provide the proper balance
between processing power and data bandwidth, an enormously capable system
interconnect and memory system interface are required for the CELL processor.
For that task, the CELL processor was designed as a Rambus Sandwich, with
Redwood Rambus Asic Cell (RRAC) acting as the system interface on one end of
the CELL processor, and the XDR (formerly Yellowstone) high bandwidth DRAM
memory system interface on the other end of the CELL processor. Finally, the
CELL processor has 2954 C4 contacts to the 3-2-3 organic package, and the BGA
package is 42.5 mm by 42.5 mm in size. The BGA package contains 1236
contacts, 506 of which are signal interconnects and the remainder are devoted
to power and ground interconnects.

Logic Depth, Circuit Design, Die Size and Process Shrink

Figure 2 - Per stage circuit delay depth of 11 FO4 often left only 5~8 FO4
for logic flow

The first incarnation of the CELL processor is implemented in a 90nm SOI
process. IBM claims that while the logic complexity of each pipeline stage is
roughly comparable to other processors with a per stage logic depth of 20
FO4, aggressive circuit design, efficient layout and logic simplification
enabled the circuit designers of the CELL processor to reduce the per stage
circuit delay to 11 FO4 throughout the entire design. The design methodology
deployed for the CELL processor project provides an interesting contrast to
that of other IBM processor projects in that the first incarnation of the
CELL processor makes use of fully custom design. Moreover, the full custom
design includes the use of dynamic logic circuits in critical data paths. In
the first implementation of the CELL processor, dynamic logic was deployed
for both area minimization as well as performance enhancement to reach the
aggressive goal of 11 FO4 circuit delay per stage. Figure 2 shows that with
the circuit delay depth of 11 FO4, oftentimes only 5~8 FO4 are left for
inter-latch logic flow.
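
A rough feel for these FO4 numbers can be had by combining them with the 4
GHz clock quoted elsewhere in the article. This is a back-of-the-envelope
sketch only: it assumes one pipeline stage fills exactly one clock cycle, and
the implied FO4 delay is derived from that assumption rather than taken from
any IBM data.

```python
CLOCK_GHZ = 4.0     # operating frequency quoted in the article
STAGE_FO4 = 11      # per stage circuit delay quoted in the article

# Assumption: one pipeline stage occupies exactly one clock cycle
cycle_ps = 1000.0 / CLOCK_GHZ       # 250 ps per cycle
fo4_ps = cycle_ps / STAGE_FO4       # implied delay of a single FO4

def budget_ps(fo4_count):
    """Time available for a given number of FO4 delays, in picoseconds."""
    return fo4_count * fo4_ps

def equivalent_ghz(stage_fo4):
    """Clock a design with this per stage depth would reach at the same FO4 delay."""
    return 1000.0 / (stage_fo4 * fo4_ps)

if __name__ == "__main__":
    print(round(fo4_ps, 1))                  # implied ps per FO4
    print(round(budget_ps(5)), round(budget_ps(8)))   # logic budget range, ps
    print(round(equivalent_ghz(20), 2))      # a 20 FO4 design, same process
```

Under these assumptions the 5~8 FO4 left for logic is roughly 114 to 182 ps
out of a 250 ps cycle, and a 20 FO4 per stage design would clock at about
2.2 GHz.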

The use of dynamic logic presents itself as an interesting issue in that
dynamic logic circuits rely on the capability of logic transistors to retain
a capacitive load as temporary storage. The decreasing capacitance and
increasing leakage of each successive process generation means that dynamic
logic design becomes more challenging with each successive process
generation. In addition, dynamic circuits are reportedly even more
challenging on SOI based process technologies. However, circuit design
engineers from IBM believe that the use of dynamic logic will not present
itself as an issue in the scalability of the CELL processor down to 65 nm and
below. The argument was put forth that since the CELL processor is a full
custom design, the task of process porting with dynamic circuits is no more
and no less challenging than the task of process porting on a design without
dynamic circuits. That is, since the full custom design requires the
re-examination and re-optimization of transistor and circuit characteristics
for each process generation, if a given set of dynamic logic circuits become
impractical for specific functions at a given process node, that set of
circuits can be replaced with static circuits as needed.

The process portability of the CELL processor design is an interesting topic
due to the fact that the prototype CELL processor is a large device that
occupies 221 mm2 of silicon area on the 90 nm process. Comparatively, the IBM
PPC970FX processor has a die size of 62 mm2 on the 90 nm process. The natural
question then arises as to whether Sony will choose to reduce the number of
SPE's to 4 for the version of the CELL processor to appear in the next
generation Playstation, or keep the 8 SPE's and wait for the 65 nm process
before it ramps up the production of the next generation Playstation.
Although no announcements or hints have been given, IBM's belief in regards
to the process portability of the CEL
Figure 6 - SPE pipeline diagram


Table 1 - Unit latencies for SPE instructions.

Figure 6 shows the pipeline diagram of the SPE and Table 1 shows the unit
latency of the SPE. Figure 6 shows that the SPE pipeline makes heavy use of
the forward-and-delay concept to avoid the access latency of a register file
access in the case of dependent instructions that flow through the pipeline
in rapid succession.

One interesting aspect of the floating point pipeline is that the same arrays
are used for floating point computation as well as integer multiplication. As
a result, integer multiplies are sent to the floating point pipeline, and the
floating point pipeline bypasses the FP handling and computes the integer
multiply.

SPE Schmoo Plot

Figure 7 - Schmoo plot for the SPE

Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the
SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V,
consuming approximately 4 W. The schmoo plot also reveals that due to the
careful segmentation of signal path lengths, the design is far from being
wire delay limited. Frequency scaling relative to voltage continues past 1.3
V. This schmoo plot also contributes to the plausibility of the unconfirmed
report that the CELL processor could operate at upwards of 5.6 GHz.

"Unknown" Functional Units: ATO and RTB

Oftentimes when a paper relating to a complex project is written
collaboratively by a group of people, details are lost. Still, it seemed
rather humorous that of the six design engineers and architects from the CELL
processor project present at Tuesday evening's chat session, no one could
recall what the acronyms ATO and RTB stood for. ATO and RTB are functional
blocks labeled in the floorplan of the SPE. However, the functionality of
these functional blocks or the meaning of the acronym were neither noted on
the floorplan, nor explained in the paper, nor mentioned in the technical
presentation. In an effort to cover all the corners, this author placed the
question on a list of questions to be asked of the CELL project team members.
Hilarity thus ensued as slightly embarrassed CELL project members stared
blankly at each other in an attempt to recall the functionality or definition
of the acronyms.

In all fairness, since the SPE was presented on Monday and the CELL processor
itself was presented on Tuesday, CELL project members responsible for the SPE
were not present for Tuesday evening's chat sessions. As a result, the team
members responsible for the overall CELL processor and internal system
interconnects were asked to recall the meaning of acronyms of internal
functional units within the SPE. Hence, the task was unnecessarily
complicated by the absence of key personnel that would have been able to
provide the answer faster than the CELL processor can rotate a million
triangles by 12 degrees about the Z axis.

After some discussion (and more wine), it was determined that the ATO unit is
most likely the Atomic (memory) unit responsible for coherency
observation/interaction with dataflow on the EIB. Then, after the injection
of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most
likely stood for some sort of Register Translation Block whose precise
functionality was unknown to those outside of the SPE. However, this theory
would turn out to be incorrect.

Finally, after sufficient numbers of hydrocarbon bonds had been broken down
into H-OH on Wednesday, a member of the CELL processor team tracked
down the relevant information and wrote:

The R in RTB is an internal 1 character identifier that denotes that the RTB
block is a unit in the SPE. The TB in RTB stands for "Test Block". It
contains the ABIST (Array Built In Self Test) engines for the Local Store and
other arrays in the SPE, as well as other test related control functions for
the SPE.

Element Interconnect Bus

The element interconnect bus is the on chip interconnect that ties together
all of the processing, memory, and I/O elements on the CELL processor. The
EIB is implemented as a set of four concentric rings that are routed through
portions of the SPE, where each ring is a 128 bit wide interconnect. To
reduce coupling noise, the wires are arranged in groups of four and
interleaved with ground and power shields. To further reduce coupling noise,
the direction of data flow alternates between each adjacent ring pair. Data
travels on the EIB through staged buffer/repeaters at the boundaries of each
SPE. That is, data is driven by one set of staged buffers and latched by the
buffer at the next stage every clock cycle. Data moving from one SPE through
other SPE's requires the use of repeaters in the intermediary SPE's for the
duration of the transfer. Independently from the buffer/repeater elements,
separate data on/off ramps exist in the BIU of the SPE, as data targeted for
the LS unit of a given SPE can be off-loaded at the BIU. Similarly, outgoing
data can be placed onto the EIB by the BIU.

Figure 8 - Counter rotational rings of the EIB - 4 SPE's shown

The design of the EIB is specifically geared toward the scalability of the
CELL processor. That is, signal path lengths on the EIB do not change
regardless of the number of SPE's in a given CELL processor configuration.
Since the data travels no more than the width of one SPE, more SPE's on a
given CELL processor simply means that the data transport latency increases
by the number of additional hops through those SPE's. Data transfer through
the EIB is controlled by the EIB controller, and the EIB controller works
with the DMA engine and the channel controllers to reserve the buffer
drivers for a certain number of cycles for each data transfer request. The data
transfer algorithm works by reserving channel capacity for each data
transfer, thus providing support for real time applications. Finally, the
design and implementation of the EIB has a curious side effect in that it
limits the current version of the CELL processor to expand only along the
horizontal axis. Thus, the EIB enables the CELL processor to be highly
configurable and SPE's can be quickly and easily added or removed along the
horizontal axis, and the maximum number of SPE's that can be added is set by
the maximum width of the chip allowable by the reticle size of the
fabrication equipment.
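
The effect of counter rotating rings on hop count can be illustrated with a
toy model. The sketch below assumes a plain ring with one stop per SPE and
one hop per clock cycle; the real EIB also carries the PPE, memory and I/O
traffic and is arbitrated by the EIB controller, none of which is modelled
here. All function names are hypothetical.

```python
def ring_hops(src, dst, stops, clockwise=True):
    """Hops from src to dst on a one-way ring with `stops` positions."""
    forward = (dst - src) % stops
    return forward if clockwise else (stops - forward) % stops

def best_hops(src, dst, stops):
    """Fewest hops when rings running in both directions are available."""
    return min(ring_hops(src, dst, stops, clockwise=True),
               ring_hops(src, dst, stops, clockwise=False))

def worst_case(stops):
    """Largest best-case hop count between any two stops."""
    return max(best_hops(0, dst, stops) for dst in range(stops))

if __name__ == "__main__":
    print(ring_hops(0, 6, 8))   # 6 hops going one way round
    print(best_hops(0, 6, 8))   # 2 hops using the opposite direction
    print(worst_case(4), worst_case(8))   # 2 4
```

In this model, doubling the number of stops from 4 to 8 doubles the worst
case hop count from 2 to 4, which matches the article's point that latency
grows with the number of hops while the wire length per hop stays fixed.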

The POWERPC Processing Element

Neither microarchitectural details nor the performance characteristics of the
POWERPC Processing Element were disclosed by IBM during ISSCC 2005. However,
what is known is that the PPE processor core is a new core that is fully
compliant with the POWERPC instruction set, the VMX instruction set extension
inclusive. Additionally, the PPE core is described as a two issue, in-order,
64 bit processor that supports 2 way SMT. The L1 caches of the PPE are
reported to be 32KB each, and the unified L2 cache is 512 KB in size.
Furthermore, the lineage of the PPE can be traced to a research project
commissioned by IBM to examine high speed processor design with aggressive
circuit implementations. The results of this research project were published
by IBM first in the Journal of Solid State Circuits (JSSC) in 1998, then
again in ISSCC 2000.

The paper published in JSSC in 1998 described a processor implementation that
supported a subset of the POWERPC instruction set, and the paper published in
ISSCC 2000 described a processor that supported the complete POWERPC
instruction set and operated at 1 GHz on a 0.25 µm process technology. The
microarchitecture of the research processor was disclosed in some detail in
the ISSCC 2000 paper. However, that processor was a single issue processor
whose design goal was to reach high operating frequency by limiting pipestage
delay to 13 FO4, and power consumption limitations were not considered. For
the PPE, several major changes in the design goal dictated changes in the
microarchitecture from the research processor disclosed at ISSCC in 2000.
Firstly, to further increase frequency, the per stage circuit delay design
target was lowered from 13 FO4 to 11 FO4. Secondly, limiting power
consumption and minimizing leakage current were added as high priority design
goals for the PPE. Collectively, these changes limited the per stage logic
depth, and the pipeline was lengthened as a result. The addition of SMT and
the two issue design goal completed the metamorphosis of the research
processor to the PPE. The result is a processing core that operates at a high
frequency with relatively low power consumption, and perhaps relatively
poorer scalar performance compared to the beefy POWER5 processor core.

Rambus XDR Memory System

Figure 9 - The two channel XDR Memory System

To provide machine balance and support the peak rating of more than 256 SP
GFlops (or 25-30 DP GFlops), the CELL processor requires an enormously
capable memory system. For that reason, two channels of Rambus XDR memory are
used to obtain 25.6 GB/s of memory bandwidth. In the XDR memory system, each
channel can support a maximum of thirty-six devices connected to the same
command and address bus. The data bus for each device connects to the memory
controller through a set of bi-directional point-to-point connections. In the
XDR memory system, addresses and commands are sent on the address and command
bus at a rate of 800 Mbits per second (Mbps), and the point to point
interface operates at a datarate of 3.2 Gbps. Using DRAM devices with 16 bit
wide data busses, each channel of XDR memory can sustain a maximum bandwidth
of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL processor can thus
achieve a maximum bandwidth of 25.6 GB/s with a 2 channel, 4 device
configuration.
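
The bandwidth arithmetic above can be checked with a short Python sketch. The function and variable names are illustrative only, and the device counts and pin rates are simply the figures quoted in the text, not values taken from any Rambus or IBM specification.

```python
# Peak XDR bandwidth from the figures quoted in the article.
# All names are illustrative; nothing here comes from a vendor API.

def channel_gbits_per_sec(devices, bits_per_device, gbps_per_pin):
    """Peak data rate of one channel in Gbit/s."""
    return devices * bits_per_device * gbps_per_pin

def gbits_to_gbytes(gbits):
    """Convert Gbit/s to GByte/s (8 bits per byte)."""
    return gbits / 8.0

# Two 16 bit wide devices per channel, 3.2 Gbit/s per data pin.
per_channel_gbits = channel_gbits_per_sec(devices=2, bits_per_device=16,
                                          gbps_per_pin=3.2)
per_channel_gbytes = gbits_to_gbytes(per_channel_gbits)
total_gbytes = 2 * per_channel_gbytes  # two channels

print(f"per channel:  {per_channel_gbits:.1f} Gbit/s = {per_channel_gbytes:.1f} GB/s")
print(f"two channels: {total_gbytes:.1f} GB/s")
```

Dividing 102.4 Gbit/s by 8 gives 12.8 GB/s per channel, so two channels peak at 25.6 GB/s.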

The obvious advantage of the XDR memory system is the bandwidth that it
provides to the CELL processor. However, in the configuration illustrated in
figure 9, the maximum of 4 DRAM devices means that the CELL processor is
limited to 256 MB of memory, given that the highest capacity XDR DRAM device
is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be
reconfigured in such a way so that more than 36 XDR devices can be connected
to the same 36 bit wide channel and provide 1 bit wide data bus each to the
36 bit wide point-to-point interconnect. In such a configuration, a two
channel XDR memory can support upwards of 16 GB of ECC protected memory with
256 Mbit DRAM devices or 32 GB of ECC protected memory with 512 Mbit DRAM
devices. As a result, the CELL processor could in theory address a large
amount of memory if the price premium of XDR DRAM devices could be minimized.
IBM did not release detailed information about the configuration of the XDR
memory system. One feature to watch for in the future is ECC support in the
DRAM memory system. Since ECC support is clearly not a requirement of a
processor to be used in a game machine, the presence of ECC support would
likely indicate IBM's ambition to promote the use of CELL processors in
applications that require superior reliability, availability and
serviceability, such as HPC, workstation or server systems.
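
The 256 MB limit for the four device configuration follows directly from the device density. A minimal sketch, with a hypothetical helper name and only the device counts and densities mentioned in the text:

```python
# Capacity of a small XDR configuration, using the article's device sizes.
# capacity_mbytes is an illustrative helper, not part of any library.

def capacity_mbytes(devices, mbits_per_device):
    """Total capacity in MBytes for a number of equal-sized DRAM devices."""
    return devices * mbits_per_device / 8

four_by_512 = capacity_mbytes(devices=4, mbits_per_device=512)
four_by_256 = capacity_mbytes(devices=4, mbits_per_device=256)

print(f"4 x 512 Mbit devices: {four_by_512:.0f} MB")
print(f"4 x 256 Mbit devices: {four_by_256:.0f} MB")
```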

Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it
brought the XDR memory controller and memory system design expertise to the
table, and could ramp up production of XDR DRAM devices as needed.

FlexIO System Interface

At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the
CELL processor. However, the presentation was limited to describing the
physical layer interconnect. Specifically, the difficulties of implementing
the Redwood Rambus ASIC Cell on IBM's 90nm SOI process were examined in some
detail. While circuit level issues regarding the challenges of designing high
speed I/O interfaces on an SOI based process are in their own right extremely
intriguing topics, the focus of this article is geared toward the
architectural implications of the high bandwidth interface. As a result, the
circuit level details will not be covered here. Interested readers are
encouraged to seek out details on Rambus's Redwood technology separately.

What is known about the system interface of the CELL processor is that the
FlexIO consists of 12 byte lanes. Each byte lane is a set of 8 bit wide,
source synchronous, unidirectional, point-to-point interconnects. The FlexIO
makes use of differential signaling to achieve the data rate of 6.4 Gb per
second per signal pair, and that data rate in turn translates to 6.4 GB/s per
byte lane. The 12 byte lanes are asymmetric in configuration. That is, 7 byte
lanes are outbound from the CELL processor, while 5 byte lanes are inbound to
the CELL processor. The 12 byte lanes thus provide 44.8 GB/s of raw outbound
bandwidth and 32 GB/s of raw inbound bandwidth for total I/O bandwidth of
76.8 GB/s. Furthermore, the byte lanes are arranged into two groups of ports:
one group of ports is dedicated to non-coherent off-chip traffic, while the
other group of ports is usable for coherent off-chip traffic. It seems clear
that Sony itself is unlikely to make use of a coherent, multiple CELL
processor configuration for Playstation 3. However, the fact that the PPE and
the SPEs can snoop traffic transported through the EIB, and that coherency
traffic can be sent to other CELL processors via a coherent interface, means
that the CELL processor can indeed be an interesting processor. If nothing
else, the CELL processor should enable startups that propose to build FlexIO
based coherency switches to garner immediate interest from venture
capitalists.
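
The FlexIO bandwidth figures quoted above can be reproduced from the per-pair data rate and the lane split. This is only a sketch of the arithmetic; the constant names are made up for illustration and the numbers are those given in the text.

```python
# Raw FlexIO bandwidth from the figures quoted in the article.
# Constant names are illustrative only.

GBITS_PER_SIGNAL_PAIR = 6.4   # Gbit/s per differential pair
PAIRS_PER_BYTE_LANE = 8       # each byte lane is 8 bits wide
OUTBOUND_LANES = 7
INBOUND_LANES = 5

def lane_gbytes_per_sec():
    """Raw bandwidth of one byte lane in GB/s."""
    return GBITS_PER_SIGNAL_PAIR * PAIRS_PER_BYTE_LANE / 8.0

outbound = OUTBOUND_LANES * lane_gbytes_per_sec()
inbound = INBOUND_LANES * lane_gbytes_per_sec()
total = outbound + inbound

print(f"outbound: {outbound:.1f} GB/s")
print(f"inbound:  {inbound:.1f} GB/s")
print(f"total:    {total:.1f} GB/s")
```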

Summary

The CELL processor presents an intriguing alternative in its pursuit of
performance. It seems to be a foregone conclusion that the CELL processor will
be an enormously successful product, and that millions of CELL processors
will be sold as the processors that power the next generation Sony
Playstation. However, IBM has designed some features into the CELL processor
that clearly reveal its ambition in seeking new applications for the CELL
processor. At ISSCC 2005, much fanfare has been generated by the rating of
256 GFlops @ 4 GHz for the CELL processor. However, it is the little
mentioned double precision capability and the yet undisclosed system level
coherency mechanism that appear to be the most intriguing aspects that could
enable the CELL processor to find success not just inside the Playstation,
but outside of it as well.

References

[1] J. Silberman et al., "A 1.0-GHz Single-Issue 64-Bit PowerPC Integer
Processor," IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998.
[2] P. Hofstee et al., "A 1 GHz Single-Issue 64b PowerPC Processor,"
International Solid-State Circuits Conference Technical Digest, Feb. 2000.
[3] N. Rohrer et al., "PowerPC in 130nm and 90nm Technologies," International
Solid-State Circuits Conference Technical Digest, Feb. 2004.
[4] B. Flachs et al., "A Streaming Processing Unit for A CELL Processor,"
International Solid-State Circuits Conference Technical Digest, Feb. 2005.
[5] D. Pham et al., "The Design and Implementation of a First-Generation CELL
Processor," International Solid-State Circuits Conference Technical Digest,
Feb. 2005.
[6] J. Kuang et al., "A Double-Precision Multiplier with Fine-Grained
Clock-Gating Support for a First-Generation CELL Processor," International
Solid-State Circuits Conference Technical Digest, Feb. 2005.
[7] S. Dhong et al., "A 4.8 GHz Fully Pipelined Embedded SRAM in the
Streaming Processor of a CELL Processor," International Solid-State Circuits
Conference Technical Digest, Feb. 2005.
[8] K. Chang et al., "Clocking and Circuit Design for a Parallel I/O on a
First-Generation CELL Processor," International Solid-State Circuits
Conference Technical Digest, Feb. 2005.

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From mathog at mendel.bio.caltech.edu  Fri Feb 11 08:17:34 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 11 Feb 2005 08:17:34 -0800
Subject: [Beowulf] cooling question:  cfm per rack?
Message-ID: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>

In designing a computer room two key factors are:

1.  Power in   (electricity)
2.  Power out  (A/C)

The second term really has two parts: 

  A.  the amount of air moved
  B.  the reduction in temperature of that air across the A/C unit

The latter part is specified in tons.  The A/C guys I've spoken
with recently utilize some more or less standard relationship
between cubic feet per minute (cfm) and A/C tons for the units they
maintain.  These run off the campus cold water supply, so
it makes sense that heat out is proportional to flow across, assuming
that the cold water has a very large heat capacity.

However, in terms of cooling the units themselves, the amount of
air flow through the racks is also important.  That flow is
also in cfm.  Ideally cfm through the racks would be equal to cfm
through the A/C, ie, all air goes once through the racks and then
directly through the A/C.  Even more ideally cfm through _each_ rack
could be modulated somehow, since some racks move much more 
air than others and putting a low flow rack next to a high flow rack
might drive the air the wrong way through the low flow unit.

How does one calculate an optimal cfm through a rack? 

For a specific example with round numbers, let's say it's a
25U rack, dissipates 10kW, and has a single 50 cfm output
fan per 1U node.  (Ie, all air out must go through that path.)

There seem to be a bunch of variables that are hard to deal with. 
For instance, adding the exhaust fans would be 50*25 = 1250 cfm.
Is that all there is to it? But that type of fan only runs at
the stated flow rate if the pressures are exactly as specified. 
Without incredibly careful balancing of the pressure across the
rack it won't generally run at 50 cfm.

Is cfm the key unit here or should one think in terms of pressure
at various points in the room?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From joachim at ccrl-nece.de  Fri Feb 11 10:49:48 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Fri, 11 Feb 2005 19:49:48 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050211023619.GB5174@greglaptop.internal.keyresearch.com>
References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>	<420BC57A.5060007@harddata.com>
	<20050211023619.GB5174@greglaptop.internal.keyresearch.com>
Message-ID: <420CFE4C.6050003@ccrl-nece.de>

Greg Lindahl wrote:
> On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote:
>>If I have a fantastic device that uses infinitely small time (latency) 
>>and moves huge amounts of data (bandwidth) but in doing so it takes 80% 
>>of a CPU, we do not have a useful solution..
> 
> If large cpu usage is a problem, it will show up nicely in real
> application benchmarks.

True. I always wonder what the low-CPU-usage-advocates want the MPI 
process to do while e.g. an MPI_Send() is executed. For small messages 
(which are critical for many applications), it's somewhat like 
requesting that a local memory-write has to show low CPU usage.

Of course, I can think of scenarios in which data transfers w/o CPU 
usage do promise advantages, and I have implemented and evaluated such 
techniques myself. But in the end (for the application), it always 
boiled down to latency and bandwidth as most applications don't honor 
"true" asynchronous communication.

The latest unsuccessful case of uncoupling computation and MPI 
communication I read about was BG/L when using the second CPU as a 
message processor. Maybe Myrinet MX will behave differently by making 
the MPI itself more concurrent on hardware level (is this a correct 
description, Patrick?) - but it will need matching applications, too.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From rgb at phy.duke.edu  Fri Feb 11 11:02:23 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 11 Feb 2005 14:02:23 -0500 (EST)
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>
References: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502111350350.3583@ganesh.phy.duke.edu>

On Fri, 11 Feb 2005, David Mathog wrote:

> In designing a computer room two key factors are:
> 
> 1.  Power in   (electricity)
> 2.  Power out  (A/C)
> 
> The second term really has two parts: 
> 
>   A.  the amount of air moved
>   B.  the reduction in temperature of that air across the A/C unit
> 
> The latter part is specified in tons.  The A/C guys I've spoken
> with recently utilize some more or less standard relationship
> between cubic feet per minute (cfm) and A/C tons for the units they
> maintain.  These run off the campus cold water supply, so
> it makes sense that heat out is proportional to flow across, assuming
> that the cold water has a very large heat capacity.
> 
> However, in terms of cooling the units themselves, the amount of
> air flow through the racks is also important.  That flow is
> also in cfm.  Ideally cfm through the racks would be equal to cfm
> through the A/C, ie, all air goes once through the racks and then
> directly through the A/C.  Even more ideally cfm through _each_ rack
> could be modulated somehow, since some racks move much more 
> air than others and putting a low flow rack next to a high flow rack
> might drive the air the wrong way through the low flow unit.
> 
> How does one calculate an optimal cfm through a rack? 
> 
> For a specific example with round numbers, let's say it's a
> 25U rack, dissipates 10kW, and has a single 50 cfm output
> fan per 1U node.  (Ie, all air out must go through that path.)
> 
> There seem to be a bunch of variables that are hard to deal with. 
> For instance, adding the exhaust fans would be 50*25 = 1250 cfm.
> Is that all there is to it? But that type of fan only runs at
> the stated flow rate if the pressures are exactly as specified. 
> Without incredibly careful balancing of the pressure across the
> rack it won't generally run at 50 cfm.
> 
> Is cfm the key unit here or should one think in terms of pressure
> at various points in the room?

I can't answer all your questions here, but you've pointed out a lot of
the problems.  You have to arrange for the blower to deliver chilled air
to the right places in the room, and you ALSO have to arrange for a warm
air return that picks up the warmed air (after it has passed through the
systems and cooled them) and returns it to be cooled and cycled again.

The overall airflow is determined by those two things -- cool air being
delivered at an overpressure, warm air being returned at an
underpressure, and the intermediate pressure gradient (interacting with
intervening obstacles such as the racks full of equipment) determining
the flow pattern.  That flow pattern needs to avoid things like "hot
spots" that are isolated from the overall cooling flow, especially hot
spots that ultimately feed rack intake, and flow that feeds the warmed
exhaust from one or more units back into the cool air intake of others.
Ultimately, this is a nonlinear problem with turbulence and other
factors and hence difficult to make pronouncements on without knowing
the geometry of your rack layout and other stuff.

One reason that raised floor designs are popular is that it makes
establishing a clean circulation pattern a bit simpler -- feed cold air
from the bottom right into the rack intakes, vent the warmed air from
their outflow directly into a warm air return.  The cooling air mixes
minimally with ambient room air and is relatively easy to balance.

In a simpler overhead cold air delivery, warm air return system you'll
need to be able to balance the cold air delivery at several points in
the room, perhaps blowing it down directly into the front (intake) faces
of opposing racks, while letting the warm air get pulled along the
ceiling to one or more major return vents.  That way you can get a
delivered cold-air-down, in-through-rack, out-from-rack, warm-air-up,
warm-air-along-ceiling and returned sort of pattern established that is
consistent and balanceable among the delivery registers throughout the
room.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From idooley at isaacdooley.com  Fri Feb 11 12:39:29 2005
From: idooley at isaacdooley.com (Isaac Dooley)
Date: Fri, 11 Feb 2005 14:39:29 -0600
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
Message-ID: <420D1801.9090206@isaacdooley.com>


>True. I always wonder what the low-CPU-usage-advocates want the MPI 
>process to do while e.g. an MPI_Send() is executed. 
>
They don't want the process to do anything when they call MPI_Send; 
however, carefully using asynchronous or non-blocking messaging ideally 
would not use the CPU. Using MPI_Isend() allows programs to not waste 
CPU cycles waiting on the completion of a message transaction. This is 
critical for some tightly coupled fine grained applications. Also it 
allows for overlapping computation and communication, which is beneficial.

Isaac Dooley


From lindahl at pathscale.com  Fri Feb 11 13:03:35 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 11 Feb 2005 13:03:35 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420CFE4C.6050003@ccrl-nece.de>
References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>
	<420BC57A.5060007@harddata.com>
	<20050211023619.GB5174@greglaptop.internal.keyresearch.com>
	<420CFE4C.6050003@ccrl-nece.de>
Message-ID: <20050211210335.GE1256@greglaptop.internal.keyresearch.com>

On Fri, Feb 11, 2005 at 07:49:48PM +0100, Joachim Worringen wrote:

> The latest unsuccessful case of uncoupling computation and MPI 
> communication I read about was BG/L when using the second CPU as a 
> message processor.

Yep, "offload" that improves performance is more complicated than it
seems. The new InfiniPath adapter aims at raw latency and bandwidth
excellence, because this is always helpful. It's also frequently
helpful to be able to send directly out of cache, for medium-sized
packets, instead of using send dma, which has to flush cache to main
memory. Memory bandwidth isn't free.

Getting more concurrency, by the way, is as much a hardware issue as a
software issue. InfiniPath's hardware is dumb, but highly pipelined.
Most offload engines seem to have less pipelining. And cpu software
overhead generally scales nicely with additional cpus...

-- greg


From lindahl at pathscale.com  Fri Feb 11 13:21:38 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 11 Feb 2005 13:21:38 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420D1801.9090206@isaacdooley.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
Message-ID: <20050211212137.GA2278@greglaptop.internal.keyresearch.com>

On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote:

> Using MPI_Isend() allows programs to not waste 
> CPU cycles waiting on the completion of a message transaction. This is 
> critical for some tightly coupled fine grained applications.

We do pretty much the same thing for MPI_Send and MPI_Isend for small
packets: they're nearly on the wire when the routine returns, and
the subsequent MPI_Wait is a no-op. This is actually pretty common
among MPI implementations.

The problem with trying to generalize about what MPI calls do is that
different implementations do different things with them. Reading
the standard won't teach you much about implementations.

-- greg


From rross at mcs.anl.gov  Fri Feb 11 13:47:39 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Fri, 11 Feb 2005 15:47:39 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420D1801.9090206@isaacdooley.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
Message-ID: <Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>

On Fri, 11 Feb 2005, Isaac Dooley wrote:

> >True. I always wonder what the low-CPU-usage-advocates want the MPI 
> >process to do while e.g. an MPI_Send() is executed. 
> >
> They don't want the process to do anything when they call MPI_Send; 
> however, carefully using asynchronous or non-blocking messaging ideally 
> would not use the CPU.

Unless your code is multi-threaded, why do you care what the CPU 
utilization is during MPI_Send()?  Saving on the power bill?

When you call MPI_Send() semantically you've said "Hey, send this, and 
btw I can't do anything else until you are done."  Likewise for 
MPI_Recv().  So the implementation will be built to get things done as 
quickly as possible.

Often the path to lowest latency leads to polling, which leads to the high
CPU utilization.  Same issue with interrupt mitigation, as mentioned
earlier in the thread; you can save CPU by coalescing, or you can get 
better performance.

> Using MPI_Isend() allows programs to not waste CPU cycles waiting on the
> completion of a message transaction.

No, it allows the programmer to express that it wants to send a message 
but not wait for it to complete right now.  The API doesn't specify the 
semantics of CPU utilization.  It cannot, because the API doesn't have 
knowledge of the hardware that will be used in the implementation.

> This is critical for some tightly coupled fine grained applications.

What exactly is critical for tightly coupled, fine grained applications?  

I would think that extremely low latency communication would be the most 
important factor, not whether or not we crank on the CPU to get that.

> Also it allows for overlapping computation and communication, which is
> beneficial.

Sure!

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


From rbw at ahpcrc.org  Fri Feb 11 14:11:14 2005
From: rbw at ahpcrc.org (Richard Walsh)
Date: Fri, 11 Feb 2005 16:11:14 -0600
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050211212137.GA2278@greglaptop.internal.keyresearch.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>
	<20050211212137.GA2278@greglaptop.internal.keyresearch.com>
Message-ID: <420D2D82.5050609@ahpcrc.org>

Greg Lindahl wrote:

>On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote:
>
>>Using MPI_Isend() allows programs to not waste 
>>CPU cycles waiting on the completion of a message transaction. This is 
>>critical for some tightly coupled fine grained applications.
>
>We do pretty much the same thing for MPI_Send and MPI_Isend for small
>packets: they're nearly on the wire when the routine returns, and
>the subsequent MPI_Wait is a no-op. This is actually pretty common
>among MPI implementations.
>
>The problem with trying to generalize about what MPI calls do is that
>different implementations do different things with them. Reading
>the standard won't teach you much about implementations.
>
>-- greg
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>  
>
Right. Small messages are where latency matters anyway. As the message size
dwindles, the remaining overhead is mostly intrinsic to the subroutine call
and unavoidable. What is to be done?  The only choice is to squeeze out the
subroutine call itself with a different programming model (say UPC) and a
memory and instruction set architecture that supports single instruction
(preferably pipelined, with a block/vector length and stride option to hide 
latency) remote memory addressing. Additions like the STEN on the Quadrics Elan4
and HyperTransport directly from remote processor cache are cluster hardware
morphs taking things in the direction of GAS systems like the Cray X1 and SGI
Altix.  

rbw


From mathog at mendel.bio.caltech.edu  Fri Feb 11 14:59:55 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 11 Feb 2005 14:59:55 -0800
Subject: Fw: Re: [Beowulf] cooling question:  cfm per rack?
Message-ID: <E1CzjlT-000305-00@mendel.bio.caltech.edu>

Mike,

I've been trying to pick the brains of other folks on the
beowulf list who have computer rooms with modern equipment.

One problem with the existing air, with regards to future
expansion, is apparently the total amount of air that the
current A/C can move.  This is all horrendously complicated
and needs to be looked at carefully by an HVAC consultant.
Pretty sure we have enough tons and flow for now, meaning
my rack and Deshaies and everything else I know is going in there
in a couple of months.  More and more convinced that we don't
have enough to handle multiple full racks of the next generation
of computers.

Jim Lux from JPL answered my questions as attached after
my signature.  His back of the envelope
calculations for a 5kW rack (roughly 
equal to what I have now) give a requirement for 1800 cfm flow through
the rack.  The current A/C is, according to the A/C guy who was
here, good for only 5500 cfm.  However, since I don't know what the
inlet or outlet temperatures on the rack are going to be (ie,
the temperature of the air the A/C returns to the room and how
hot the air is coming out the back) the required cfm may be
quite different than this.  Hmm, let me go borrow a thermometer
and measure it, 22 C in, 32 C out the back, on the node
in the middle of the  rack.  So there's 10 degrees across my rack
and he assumes 15.  Anyway, a safer estimate for the next generation
is 10kW, and there are people who predict 20kW, so total
airflow through the A/C seems unlikely to be sufficient a few
years down the road.  Assuming that people put this equipment 
in the room.  
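
Because the required airflow scales linearly with power and inversely with the temperature rise across the rack, the 1800 cfm figure (5kW, 15 degree rise) can be rescaled for the measured 10 degree rise and for denser racks. The sketch below does only that rescaling; the function name is hypothetical, and the 5500 cfm A/C capacity is the number the A/C technician gave.

```python
# Rescale the back-of-the-envelope airflow figure for other rack powers
# and temperature rises. Baseline values come from the message above;
# the helper name is illustrative.

BASE_CFM = 1800.0        # estimate for the baseline rack, margin included
BASE_POWER_KW = 5.0      # baseline rack dissipation
BASE_DELTA_T_C = 15.0    # assumed inlet-to-outlet temperature rise
AC_CAPACITY_CFM = 5500.0 # quoted capacity of the existing A/C unit

def scaled_cfm(power_kw, delta_t_c):
    """Airflow needed, scaled linearly in power and inversely in delta T."""
    return BASE_CFM * (power_kw / BASE_POWER_KW) * (BASE_DELTA_T_C / delta_t_c)

for power_kw in (5.0, 10.0, 20.0):
    cfm = scaled_cfm(power_kw, delta_t_c=10.0)
    share = cfm / AC_CAPACITY_CFM
    print(f"{power_kw:4.0f} kW rack at 10 C rise: {cfm:6.0f} cfm "
          f"({share:.0%} of A/C airflow)")
```

On these assumptions a single 10kW rack with a 10 degree rise would already need close to the whole airflow of the current unit.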

Sorry to be vague; there are just so many unknowns.

I also talked to Darryl Willick, who runs a bunch of machine rooms
on campus for Chemistry and some of Rees, Bjorkman and Mayo's
stuff.  His main room is about at capacity now with
6 full racks and a few odds and ends.  He has 2 x 250A panels
in there and apparently only a 45kW A/C unit.  That second
number is really odd because they aren't usually rated that
way, but that's the number he remembered. If he's right that's
45000/3500 = about 13 tons, roughly the same as the unit currently
in the Rees area.  He said his had to be serviced
recently because they were having overheating problems, but only
a belt was changed.  Unknown how many cfm it is.  He has a small
workstation area that is somehow or other connected to his machine
room ventilation wise, and apparently when they prop the door open
in the workstation area it causes problems in the machine room.
So maybe it would make sense to put a small separate A/C unit
in the proposed classroom to avoid those sorts of complications
in the future.  Or maybe it can tap off building air.
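
The kW to tons conversion used above can be written out explicitly. One ton of refrigeration is 12000 BTU/h, or about 3517 W; the 45kW rating is the remembered figure from the message and the helper name is just for illustration.

```python
# Convert an A/C rating in kW to tons of refrigeration.
# 1 ton = 12000 BTU/h, and 1 BTU/h is about 0.29307 W.

WATTS_PER_TON = 12000 * 0.29307107  # about 3516.85 W

def kw_to_tons(kilowatts):
    """Cooling capacity in tons of refrigeration for a rating in kW."""
    return kilowatts * 1000.0 / WATTS_PER_TON

tons_45kw = kw_to_tons(45.0)
print(f"45 kW is about {tons_45kw:.1f} tons")
```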

Darryl did say something interesting though, he said that for
some units the A/C people can increase the capacity by changing
the pulleys around.  Apparently this blows more air, and the
cold water isn't limiting, so it effectively upgrades the unit
without changing very much.  Darryl said that this was done
at some point for Mayo's computer room in the subbasement
of the BI.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

------------- Forwarded message follows -------------

At 08:17 AM 2/11/2005, you wrote:
>In designing a computer room two key factors are:
>
>1.  Power in   (electricity)
>2.  Power out  (A/C)
>
>The second term really has two parts:
>
>   A.  the amount of air moved
>   B.  the reduction in temperature of that air across the A/C unit
>
>The latter part is specified in tons.  The A/C guys I've spoken
>with recently utilize some more or less standard relationship
>between cubic feet per minute (cfm) and A/C tons for the units they
>maintain.  These run off the campus cold water supply, so
>it makes sense that heat out is proportional to flow across, assuming
>that the cold water has a very large heat capacity.
>
>However, in terms of cooling the units themselves, the amount of
>air flow through the racks is also important.  That flow is
>also in cfm.  Ideally cfm through the racks would be equal to cfm
>through the A/C, ie, all air goes once through the racks and then
>directly through the A/C.  Even more ideally cfm through _each_ rack
>could be modulated somehow, since some racks move much more
>air than others and putting a low flow rack next to a high flow rack
>might drive the air the wrong way through the low flow unit.
>
>How does one calculate an optimal cfm through a rack?

Decide on a maximum outlet temperature (say, 30C)
Find your inlet air temperature (say, 15C)
You know your dissipation.. (say, 5kW)

Calculate how much air you need to move using the specific heat of air.
(about 1 kJ/(kg K))

5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15 
degree rise, you can get by with .33 kg/sec.  Turn the kg/sec into cfm... 
.33 kg * 1.3 m3/kg = .43
cubic meters/sec.  There's about 35 cubic feet in a cubic meter, so we need 
about 15 cubic feet per second.  Multiply by 60 and you get a bit more than 
900 cfm.

Now.. that's idealized, so double it.  1800 cfm or so.


Step 2: How big is the duct?  Generally, you don't want to go any faster 
than 1000 linear feet per minute, so your duct will need to be about 2 
square feet.  (you begin to see why you don't want some little 6" diameter 
blower...)
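
The sizing steps above can be sketched in Python as follows. The function names are made up for illustration, and the air properties are assumptions: a specific heat of about 1005 J/(kg K) and a density of about 1.2 kg/m^3 (roughly 0.83 m^3/kg) for room-temperature air near sea level. With that density the idealized figure comes out nearer 590 cfm than 900 cfm, so the numbers in the message err on the safe side.

```python
# Rough rack airflow and duct sizing, following the steps in the message.
# Air properties are assumed values for room-temperature air near sea level.

CUBIC_FEET_PER_CUBIC_METER = 35.3147

def required_airflow_cfm(power_w, delta_t_k,
                         cp_j_per_kg_k=1005.0, density_kg_per_m3=1.2):
    """Idealized airflow (cfm) to carry power_w away with a delta_t_k rise."""
    mass_flow_kg_s = power_w / (cp_j_per_kg_k * delta_t_k)
    volume_flow_m3_s = mass_flow_kg_s / density_kg_per_m3
    return volume_flow_m3_s * CUBIC_FEET_PER_CUBIC_METER * 60.0

def duct_area_sqft(airflow_cfm, max_velocity_fpm=1000.0):
    """Duct cross-section (square feet) that keeps air speed under the limit."""
    return airflow_cfm / max_velocity_fpm

ideal_cfm = required_airflow_cfm(power_w=5000.0, delta_t_k=15.0)
with_margin_cfm = 2.0 * ideal_cfm  # "that's idealized, so double it"

print(f"idealized airflow: {ideal_cfm:.0f} cfm")
print(f"with 2x margin:    {with_margin_cfm:.0f} cfm")
print(f"duct for 1800 cfm: {duct_area_sqft(1800.0):.1f} sq ft")
```

Halving the allowed temperature rise doubles the airflow, which is why measuring the actual inlet and outlet temperatures matters more than the exact constants.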


>For a specific example with round numbers, let's say it's a
>25U rack, dissipates 10kW, and has a single 50 cfm output
>fan per 1U node.  (Ie, all air out must go through that path.)
>
>There seem to be a bunch of variables that are hard to deal with.
>For instance, adding the exhaust fans would be 50*25 = 1250 cfm.
>Is that all there is to it? But that type of fan only runs at
>the stated flow rate if the pressures are exactly as specified.
>Without incredibly careful balancing of the pressure across the
>rack it won't generally run at 50 cfm.


This is precisely the case.  And, of course, the actual circumstances will 
be nothing like what the design specs are.


>Is cfm the key unit here or should one think in terms of pressure
>at various points in the room?

Trying to come up with an accurate aerodynamic model is a worthy challenge 
for a very large cluster (computational challenge, not thermal).

It's all done by rules of thumb and adding lots of margin.

Use the rough sizing technique to get an approximate air flow.  Use 
reasonable sized ducts and air speeds. Measure the actual outlet
temperatures.

Actually, what most people do is a rough sizing, then call in someone who 
actually does this for a living (an HVAC contractor) and use their rough 
sizing to validate what the contractor tells you you should have.


>Thanks,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From mathog at mendel.bio.caltech.edu  Fri Feb 11 15:06:49 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 11 Feb 2005 15:06:49 -0800
Subject: [Beowulf] Oops
Message-ID: <E1Czjs9-00030u-00@mendel.bio.caltech.edu>

Sorry about that message addressed to "Mike", it wasn't
supposed to go to the list.

Please cancel it if that's possible.

Apologies otherwise.

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From rross at mcs.anl.gov  Fri Feb 11 18:47:22 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Fri, 11 Feb 2005 20:47:22 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420D54DA.8000904@uiuc.edu>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
Message-ID: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>

Hi Isaac,

On Fri, 11 Feb 2005, Isaac Dooley wrote:

> >>Using MPI_Isend() allows programs to not waste CPU cycles waiting on the
> >>completion of a message transaction.
> >
> >No, it allows the programmer to express that it wants to send a message 
> >but not wait for it to complete right now.  The API doesn't specify the 
> >semantics of CPU utilization.  It cannot, because the API doesn't have 
> >knowledge of the hardware that will be used in the implementation.
> >
> That is partially true.  The context for my comment was under your 
> assumption that everyone uses MPI_Send(). These people, as I stated 
> before, do not care about what the CPU does during their blocking calls.

I think that it is completely true.  I made no assumption about everyone 
using MPI_Send(); I'm a late-comer to the conversation. 

I was not trying to say anything about what people making the calls care
about; I was trying to clarify what the standard does and does not say.  
However, I agree with you that it is unlikely that someone calling
MPI_Send() is too worried about what the CPU utilization is during the
call.

> I was trying to point out that programs utilizing non-blocking IO may 
> have work that will be adversely impacted by CPU utilization for 
> messaging. These are the people who care about CPU utilization for 
> messaging. This, I hope, answers your prior question, at least partially.

I agree that people using MPI_Isend() and related non-blocking operations 
are sometimes doing so because they would like to perform some 
computation while the communication progresses.  People also use these 
calls to initiate a collection of point-to-point operations before 
waiting, so that multiple communications may proceed in parallel.  The 
implementation has no way of really knowing which of these is the case.

Greg just pointed out that for small messages most implementations will do
the exact same thing as in the MPI_Send() case anyway.  For large messages
I suppose that something different could be done.  In our implementation
(MPICH2), to my knowledge we do not differentiate.

You should understand that the way MPI implementations are measured is by 
their performance, not CPU utilization, so there is pressure to push the 
former as much as possible at the expense of the latter.

> Perhaps your applications demand low latency with no concern for the CPU 
> during the time spent blocking. That is fine. But some applications 
> benefit from overlapping computation and communication, and the cycles 
> not wasted by the CPU on communication can be used productively.

I wouldn't categorize the cycles spent on communication as "wasted"; it's 
not like we code in extraneous math just to keep the CPU pegged :).

Regards,

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


From hahn at physics.mcmaster.ca  Fri Feb 11 21:41:41 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sat, 12 Feb 2005 00:41:41 -0500 (EST)
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>

> The second term really has two parts: 
>   A.  the amount of air moved
>   B.  the reduction in temperature of that air across the A/C unit
> 
> The latter part is specified in tons.  The A/C guys I've spoken

well, I usually think of temperature as a side-effect of the more 
direct measure, movement of energy.  hence, I always think of the 
tidy relation of 3.517 KW = 1 ton.  I usually skip any BTUs...
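
For anyone who does get handed BTUs, the unit juggling can be sketched as
below. The only constants are the 3.517 kW per ton relation and the usual
12000 BTU/hr per ton; the function names are hypothetical.

```c
#include <assert.h>

#define KW_PER_TON      3.517    /* 1 ton of cooling = 3.517 kW */
#define BTU_HR_PER_TON  12000.0  /* 1 ton of cooling = 12000 BTU/hr */

/* Tons of cooling expressed as kW of heat moved. */
double tons_to_kw(double tons)
{
    assert(tons >= 0.0);
    return tons * KW_PER_TON;
}

/* kW of heat load expressed as tons of cooling. */
double kw_to_tons(double kw)
{
    assert(kw >= 0.0);
    return kw / KW_PER_TON;
}

/* kW of heat load expressed as BTU/hr, going via tons. */
double kw_to_btu_per_hr(double kw)
{
    return kw_to_tons(kw) * BTU_HR_PER_TON;
}
```

As a check, tons_to_kw(16.0) is about 56 kW and kw_to_tons(10.0) is a bit
under 3 tons.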

> with recently utilize some more or less standard relationship
> between cubic feet per minute (cfm) and A/C tons for the units they
> maintain.

CFM and delta-t across the machine-to-be-cooled are convolved to give 
you how much heat you're extracting.  no doubt both pressure and humidity
are involved to some degree as well, and I don't have a good equation
for this.  the good thing is that turning down the temperature can partly
mitigate minor airflow problems.

some reasonable discussion from Intel, (a bit axe-grinding, though):
http://www.7x24nw.org/Presentations-folder/Air%20Cooling%20in%20Servers%20and%20IT%20Facilities.pdf

a dell 1855 blade chassis spec's 400 CFM for ~4KW.  they're talking 6 of 
those chassis in a rack (24 KW!).  then again, that's assuming an unrealistic
power-per blade (>400W), which sounds like corporate CYA to me:
http://www.dell.com/downloads/global/products/pedge/en/PowerEdge%201855%20DC%20Whitepaper.pdf

this is a good overall discussion, though perhaps a bit pessimistic
about "typical" machinerooms:
http://www.chatsworth.com/uploads/pdf/best_practices_cooling_wp.pdf
http://www.chatsworth.com/uploads/pdf/increase_computerrm_cooling_wp.pdf

sun recommends 21-23C, 45-50%.  35% min, ESD critical at 30%:
http://www.sun.com/products-n-solutions/hardware/docs/html/817-4137-10/2__EnvReq.html

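to complicate matters, HVAC folk always bring up the issue of "sensible
load" (the part of the cooling that actually lowers air temperature, as
opposed to the "latent" part that goes into condensing moisture).  as near
as I can tell, the practical point is that if you try to impose too much
delta-T on humid air, you wind up wasting a lot of energy dehumidifying it...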

tiles between 500-2000 CFM:
http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724.pdf
that also gives:
	CFM = btu/hr / (1.08 * dT)
so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling.
HVAC folk also tend to say 1 tile/ton, which seems about right.
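
the CFM formula is easy to keep around as a helper.  a minimal sketch,
assuming the 1.08 factor applies as given (it is a sea-level, dry-ish air
approximation) and with dT in degrees F; the function names are made up:

```c
#include <assert.h>

#define SENSIBLE_FACTOR  1.08     /* BTU/hr per (CFM * degree F), from the formula */
#define BTU_HR_PER_TON   12000.0  /* 1 ton of cooling = 12000 BTU/hr */

/* Airflow (CFM) needed to carry btu_per_hr at a delta_t_f degree F rise. */
double cfm_for_load(double btu_per_hr, double delta_t_f)
{
    assert(btu_per_hr >= 0.0 && delta_t_f > 0.0);
    return btu_per_hr / (SENSIBLE_FACTOR * delta_t_f);
}

/* Airflow (CFM) needed per ton of cooling at a given rise. */
double cfm_per_ton(double delta_t_f)
{
    return cfm_for_load(BTU_HR_PER_TON, delta_t_f);
}
```

cfm_per_ton(20.0) reproduces the roughly 555 CFM per ton figure for 70->90.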

> These run off the campus cold water supply, so
> it makes sense that heat out is proportional to flow across, assuming
> that the cold water has a very large heat capacity.

our experience with CW has been disastrous, but we made the huge mistake
of not using precision/machineroom chillers (fancoils, actually).
our old/existing machineroom, for instance, is supposed to have 2x8ton
fancoils, but combined they never moved more than about 20 KW (should be 56).

unless you have pretty extreme assurances about CW quality (flow, temp),
I would only consider using dual-cool machineroom chillers (DX + CW, usually
adds about 15% to price.)

> directly through the A/C.  Even more ideally cfm through _each_ rack
> could be modulated somehow, since some racks move much more 
> air than others and putting a low flow rack next to a high flow rack
> might drive the air the wrong way through the low flow unit.

well, the stuff in racks does probably have quite a few fans,
which could ideally modulate themselves.  my current-gen clusters 
certainly don't do that, but I'd be quite happy if next-gen did...

> How does one calculate an optimal cfm through a rack? 
> 
> For a specific example with round numbers, let's say it's a
> 25U rack, dissipates 10kW, and has a single 50 cfm output
> fan per 1U node.  (Ie, all air out must go through that path.)

that sounds reasonable to me - 10KW is ~3 tons, and the formula above
relates your 1250 CFM total to about 3 tons as well.  for my 10KW racks,
I'm hoping to push the temperature down a bit (60-65), keep the humidity
low to avoid "sensible" wastage, and hope for the best with our tiles.
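
turning the same 1.08 formula around shows what the 25U example implies.
this is a sketch with hypothetical function names, using 3.517 kW and
12000 BTU/hr per ton, and it ignores humidity entirely:

```c
#include <assert.h>

#define KW_PER_TON       3.517    /* 1 ton of cooling = 3.517 kW */
#define BTU_HR_PER_TON   12000.0  /* 1 ton of cooling = 12000 BTU/hr */
#define SENSIBLE_FACTOR  1.08     /* BTU/hr per (CFM * degree F) */

/* Temperature rise (degrees F) when cfm of air carries away load_kw. */
double rise_f_for_flow(double load_kw, double cfm)
{
    assert(load_kw >= 0.0 && cfm > 0.0);
    double btu_per_hr = load_kw / KW_PER_TON * BTU_HR_PER_TON;
    return btu_per_hr / (SENSIBLE_FACTOR * cfm);
}

/* Tons of cooling that cfm of air represents at a delta_t_f degree F rise. */
double tons_for_flow(double cfm, double delta_t_f)
{
    assert(cfm >= 0.0 && delta_t_f >= 0.0);
    return SENSIBLE_FACTOR * cfm * delta_t_f / BTU_HR_PER_TON;
}
```

rise_f_for_flow(10.0, 1250.0) is about 25 F, and tons_for_flow(1250.0, 20.0)
is 2.25 tons, so 1250 CFM at a 20 F rise falls somewhat short of 10 KW and
the exhaust simply runs a few degrees warmer.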

> Is cfm the key unit here or should one think in terms of pressure
> at various points in the room?

I think the answer is yes.  with a good raised floor, you seem to be 
able to expect fairly even pressure distribution.  we turned on our 
new machineroom yesterday, and the pressure feels similar everywhere 
(16" raised floor, though with some conduits down there, and 3x30T
Liebert deluxe system 3's.)

if your pressure is reasonably even, the same tiles should flow the 
same CFM.  I'd LOVE to find some way to measure airflow, since I'd 
actually consider doing things like adding patches of duct tape to 
the underside of too-high-flow tiles.  I suppose that the empiricist
approach is just to sample all your system temperatures, and if some 
are too high, reduce the airflow to racks which are "too cool".


From james.p.lux at jpl.nasa.gov  Sat Feb 12 05:45:02 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sat, 12 Feb 2005 05:45:02 -0800
Subject: [Beowulf] cooling question:  cfm per rack?
References: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>
Message-ID: <000401c51109$0df9a6d0$19f29580@LAPTOP152422>


> CFM and delta-t across the machine-to-be-cooled are convolved to give
> you how much heat you're extracting.  no doubt both pressure and humidity
> are involved to some degree as well, and I don't have a good equation
> for this.

Indeed.. there is no "nice simple" equation for the general case, because of
the problem with humidity.  You really need to be worrying about enthalpy,
etc., and with any sort of significant temperature change, it's neither
constant pressure, nor constant volume, not to mention mechanical
turbulence, etc..  All that icky thermodynamics stuff.  I once spent several
weeks trying to figure out if one could make theatrical fog without using
liquid nitrogen. They do it by having a big tank of water about half full at
around 160-180F, and then they inject liquid nitrogen into the headspace
above the water.  Turns out that the heat of vaporization of the LN2 is
almost exactly balanced by the heat of condensation of the saturated water
vapor, and that the volume of nitrogen gas produced, etc, works out to the
outlet stream being around 38F, with the water droplets at the same
temperature.  Very, very tough to do this with mechanical refrigeration for
a variety of reasons.

So, as you say, unless you're airconditioning a huge building (where the
cost of excess capacity is significant, and where there are all those hot,
water-exhaling people inside), you can just do some quasi-worst case
approximating.

> the good thing is that turning down the temperature can partly
> mitigate minor airflow problems.
>

> to complicate matters, HVAC folk always bring up the issue of "sensible
> load".  as near as I can tell, this is just a way of saying that if you
> try to impose too much delta-T on humid air, you wind up wasting a lot of
> energy dehumidifying it...

Yes.. this is especially true if you're not recirculating, but chilling
fresh air from "outside".  If you've got a reasonably closed system and
there's no people inside, it's less of an issue.

>
> tiles between 500-2000 CFM:
> http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724.pdf
> that also gives:
> CFM = btu/hr / (1.08 * dT)
> so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling.
> HVAC folk also tend to say 1 tile/ton, which seems about right.
>
> > These run off the campus cold water supply, so
> > it makes sense that heat out is proportional to flow across, assuming
> > that the cold water has a very large heat capacity.

Yes, in a theoretical sense.  However, there are two factors to be aware of:
1) run the air too fast past the coils and it doesn't have time to exchange
the heat; 2) run the air too fast and you consume power (and make heat) in
compressing it to overcome the pressure drop.  There's also a practical
limit on just how much delta T you can get in one pass through the chiller
coils.

>
> our experience with CW has been disastrous, but we made the huge mistake
> of not using precision/machineroom chillers (fancoils, actually).
> our old/existing machineroom, for instance, is supposed to have 2x8ton
> fancoils, but combined they never moved more than about 20 KW (should be
> 56).
>
> unless you have pretty extreme assurances about CW quality (flow, temp),
> I would only consider using dual-cool machineroom chillers (DX + CW,
> usually adds about 15% to price.)
>
> > directly through the A/C.  Even more ideally cfm through _each_ rack
> > could be modulated somehow, since some racks move much more
> > air than others and putting a low flow rack next to a high flow rack
> > might drive the air the wrong way through the low flow unit.
>
>
> > Is cfm the key unit here or should one think in terms of pressure
> > at various points in the room?
>
>
> if your pressure is reasonably even, the same tiles should flow the
> same CFM.  I'd LOVE to find some way to measure airflow, since I'd
> actually consider doing things like adding patches of duct tape to
> the underside of too-high-flow tiles.  I suppose that the empiricist
> approach is just to sample all your system temperatures, and if some
> are too high, reduce the airflow to racks which are "too cool".

Hie thee to a company called Dwyer, who make equipment specifically designed
to measure airflow.  There are several approaches..
One is using a pitot tube with a Magnehelic differential pressure gauge.
Another is to measure the pressure drop across a calibrated orifice (again,
using a sensitive pressure gauge). http://www.dwyer-inst.com/   Another is
to use an airspeed probe (looks like a wand with a little fan in a hole on
the end).  The fancy ones will average a bunch of readings over an opening
and do the calculation to turn area*average speed into CFM.
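
The area*average speed calculation those fancy probes do is simple enough to
do by hand with a cheap probe. A minimal sketch, assuming readings in feet
per minute taken at evenly spread points over the opening and the free area
of the opening in square feet (for a perforated tile that is less than the
tile's face area); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Average the airspeed readings (feet/min) and scale by the opening's
   free area (square feet) to estimate flow in CFM. */
double cfm_from_traverse(const double *speed_fpm, size_t n, double area_sqft)
{
    assert(speed_fpm != NULL && n > 0 && area_sqft > 0.0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += speed_fpm[i];
    return (sum / (double)n) * area_sqft;
}
```

Three readings of 400, 500 and 600 feet per minute over a 4 square foot
opening work out to 2000 CFM.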


You can find Magnehelic gauges surplus all the time.. keep your eyes open
and when one turns up for $15-20, grab it.  They're handy devices that can
measure fairly small pressures (few inches of water column), and come with
all sorts of weird scales (including some already calibrated in feet per
minute or m/sec, all ready for use with a pitot tube).  Interesting to
measure the pressure in a room (or your house) and see what happens when the
heater turns on, or the kids open and close the doors, etc.


Some time spent with the McMaster-Carr catalog (http://www.mcmaster.com/)
or the Grainger catalog (http://www.grainger.com/) is also worthwhile (both
are large suppliers of stuff mechanical, materials, etc.. everyone should
have a copy of the several thousand page yellow McMaster-Carr catalog on
their desk...).
Omega (usually associated with temperature measuring) has a fair number of
airspeed and volume measuring devices. http://www.omega.com

However, your empirical approach of reducing the flow through the coldest
racks is probably as good as anything.

Jim Lux


From rgb at phy.duke.edu  Sat Feb 12 06:36:17 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 12 Feb 2005 09:36:17 -0500 (EST)
Subject: Fw: Re: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <E1CzjlT-000305-00@mendel.bio.caltech.edu>
References: <E1CzjlT-000305-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502120905410.7797@lilith.rgb.private.net>

On Fri, 11 Feb 2005, David Mathog wrote:

> Sorry to be vague, there are just so many unknowns.

Always.:-)

> 
> I also talked to Darryl Willick, who runs a bunch of machine rooms
> on campus for Chemistry and some of Rees, Bjorkman and Mayo's
> stuff.  His main room is about at capacity now with
> 6 full racks and a few odds and ends.  He has 2 x 250A panels
> in there and apparently only a 45kW A/C unit.  That second
> number is really odd because they aren't usually rated that
> way, but that's the number he remembered. If he's right that's
> 45000/3500=12 tons, roughly the same as the unit currently
> in the Rees area.  He said his had to be serviced
> recently because they were having overheating problems, but only
> a belt was changed.  Unknown how many cfm it is.  He has a small
> workstation area that is somehow or other connected to his machine
> room ventilation wise, and apparently when they prop the door open
> in the workstation area it causes problems in the machine room.
> So maybe it would make sense to put a small separate A/C unit
> in the proposed classroom to avoid those sorts of complications
> in the future.  Or maybe it can tap off building air.
> 
> Darryl did say something interesting though, he said that for
> some units the A/C people can increase the capacity by changing
> the pulleys around.  Apparently this blows more air, and the
> cold water isn't limiting, so it effectively upgrades the unit
> without changing very much.  Darryl said that this was done
> at some point for Mayo's computer room in the subbasement
> of the BI.

I'm sure you probably remember this from my posts on this topic before,
but there are lots of bad experiences we and others on the list have had
with AC that you can profit from.  Don't forget things like:

  * Kill switch for room for the day the AC fails altogether at 2:30
a.m.

  * Automated monitoring and (if you've got one) a call cycle so that
maybe somebody can get there in time to shut things down before the kill
switch kicks in EVEN at 2:30 a.m.

  * The fact that at many places, the physical plant people have this
annoying tendency to try to save energy by throttling down the A/C to a
standby mode (where the chilled water is allowed to warm up to maybe
18C) in the winter because hey, it's cold outside, right?  Often this is
done automatically, without human thought or control.  Often this
triggers events for which the first two interventions are required when
it does.  This may not apply to you in your generally warm clime
(compared to here, anyway) but is worth checking, for sure.

  * When computing the cost/benefit of power vs AC, be aware (to put
into words what you're working toward anyway) that the true optimum is
going to be biased towards an excess of AC capacity.  This is for
several reasons, once you think about it.  The most important one is
that adding new/additional power is relatively cheap whenever you do it;
adding new/additional AC capacity later can be VERY expensive -- as
expensive as adding AC at all in the first place.

  * Surplus capacity can also keep room ambient colder (generally
better) while operating in the normal load range and may be cheaper in
terms of operating efficiency, as AC COP depends on temperature
differentials between delivery and returned chiller water (although the
blowers and pumps draw too -- don't know how this all works out in the
wash).

  * Redundancy is good, if you've got the space.  If one blower out of
three goes, the remaining two may be able to keep the space operational
while service is performed, or at least keep it cool enough to avoid an
involuntary kill or midnight call.

  * As you note -- it really helps to get professional advice on this
from an engineer or architect who specializes in server room
infrastructure design and support.  Not that you shouldn't educate
yourself in it too -- it's just that they SHOULD have a broad base of
personal professional experience to draw on as well as some classroom
education on the issues to be faced.  Worth paying for.

As you note, it is very difficult to know exactly where future power
requirements and node densities will go per rack.  Maybe blades will
take over the universe, and racks will suddenly become very hot indeed.
Some non-blade racks can achieve close to double the standard node/CPU
densities in terms of floorspace footprint (e.g. Rackable, IIRC).
Multiple core CPUs are at the threshold of appearing, and although they
also look like they might be power/clock limited BECAUSE of the heat
problem, there is still going to be some sort of scaling of power per
compute capacity per cubic foot of rack space as the latter goes up.
Alternatively, some room designs might install the DUCTWORK now that can
support (say) a doubling of AC capacity in the future and reserve
space for the local units to drive this capacity in the facility but
leave that space empty.  Then you can (eventually) add the units without
having to necessarily rip everything apart.  This probably works best
with raised floor designs (where you just duct per rack location) but
one would expect that they could manage it for other kinds of ducted
delivery and return if they try.

In any infrastructure project, it really pays to think about this stuff
ahead of time, as you are.

  rgb

> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> ------------- Forwarded message follows -------------
> 
> At 08:17 AM 2/11/2005, you wrote:
> >In designing a computer room two key factors are:
> >
> >1.  Power in   (electricity)
> >2.  Power out  (A/C)
> >
> >The second term really has two parts:
> >
> >   A.  the amount of air moved
> >   B.  the reduction in temperature of that air across the A/C unit
> >
> >The latter part is specified in tons.  The A/C guys I've spoken
> >with recently utilize some more or less standard relationship
> >between cubic feet per minute (cfm) and A/C tons for the units they
> >maintain.  These run off the campus cold water supply, so
> >it makes sense that heat out is proportional to flow across, assuming
> >that the cold water has a very large heat capacity.
> >
> >However, in terms of cooling the units themselves, the amount of
> >air flow through the racks is also important.  That flow is
> >also in cfm.  Ideally cfm through the racks would be equal to cfm
> >through the A/C, ie, all air goes once through the racks and then
> >directly through the A/C.  Even more ideally cfm through _each_ rack
> >could be modulated somehow, since some racks move much more
> >air than others and putting a low flow rack next to a high flow rack
> >might drive the air the wrong way through the low flow unit.
> >
> >How does one calculate an optimal cfm through a rack?
> 
> Decide on a maximum outlet temperature (say, 30C)
> Find your inlet air temperature (say, 15C)
> You know your dissipation.. (say, 5kW)
> 
> Calculate how much air you need to move using the specific heat of air.
> (about 1 kJ/(kg K))
> 
> 5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15 
> degree rise, you can get by with .33 kg/sec.  Turn the kg/sec into cfm... 
> air is about 1.2 kg/m3, so .33 kg/sec / 1.2 kg/m3 = .28
> cubic meters/sec.  There's about 35 cubic feet in a cubic meter, so we need 
> about 10 cubic feet per second.  Multiply by 60 and you get a bit less than 
> 600 cfm.
> 
> Now.. that's idealized, so double it.  1200 cfm or so.
> 
> 
> Step 2: How big is the duct?  Generally, you don't want to go any faster 
> than 1000 linear feet per minute, so your duct will need to be a bit over 1 
> square foot.  (you begin to see why you don't want some little 6" diameter 
> blower...)
> 
> 
> 
> >For a specific example with round numbers, let's say it's a
> >25U rack, dissipates 10kW, and has a single 50 cfm output
> >fan per 1U node.  (Ie, all air out must go through that path.)
> >
> >There seem to be a bunch of variables that are hard to deal with.
> >For instance, adding the exhaust fans would be 50*25 = 1250 cfm.
> >Is that all there is to it? But that type of fan only runs at
> >the stated flow rate if the pressures are exactly as specified.
> >Without incredibly careful balancing of the pressure across the
> >rack it won't generally run at 50 cfm.
> 
> 
> This is precisely the case.  And, of course, the actual circumstances will 
> be nothing like what the design specs are.
> 
> 
> >Is cfm the key unit here or should one think in terms of pressure
> >at various points in the room?
> 
> Trying to come up with an accurate aerodynamic model is a worthy challenge 
> for a very large cluster (computational challenge, not thermal).
> 
> It's all done by rules of thumb and adding lots of margin.
> 
> Use the rough sizing technique to get an approximate air flow.  Use 
> reasonable sized ducts and air speeds. Measure the actual outlet
> temperatures.
> 
> Actually, what most people do is a rough sizing, then call in someone who 
> actually does this for a living (an HVAC contractor) and use their own rough 
> sizing to validate what the contractor tells you you should have.
> 
> 
> 
> >Thanks,
> >
> >David Mathog
> >mathog at caltech.edu
> >Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
> 
> 
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Sat Feb 12 06:49:03 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 12 Feb 2005 09:49:03 -0500 (EST)
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>
Message-ID: <Pine.LNX.4.58.0502120945380.7797@lilith.rgb.private.net>

> if your pressure is reasonably even, the same tiles should flow the 
> same CFM.  I'd LOVE to find some way to measure airflow, since I'd 
> actually consider doing things like adding patches of duct tape to 
> the underside of too-high-flow tiles.  I suppose that the empiricist
> approach is just to sample all your system temperatures, and if some 
> are too high, reduce the airflow to racks which are "too cool".

Relative airflow can probably be measured with a kid's toy -- one of the
little pinwheels -- and counting revolutions with a stopwatch.
Normalizing that to absolute airflow in CFM is a bit tricky (since the
result depends to some extent on the resistance imposed by the measuring
apparatus) but somebody out there may have designed a version of this
with a real fan and magnets set so that the counting is done
electronically.  In fact, I could build something to do this out of OTC
parts if I had any way to normalize the count.
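
The normalization could be sketched like this, under a strong assumption:
that flow is simply proportional to revolutions per second, which a real
pinwheel only approximates (it has a stall threshold and some drag), and
that one reading is taken where the flow is known from a calibrated
instrument.  All names here are hypothetical.

```c
#include <assert.h>

/* Calibration constant: CFM per (revolution per second), assumed linear. */
typedef struct {
    double cfm_per_rev_per_s;
} pinwheel_cal;

/* Revolutions per second from a stopwatch count. */
double rev_rate(unsigned revs, double seconds)
{
    assert(seconds > 0.0);
    return (double)revs / seconds;
}

/* Derive the constant from one reading taken at a known reference flow. */
pinwheel_cal pinwheel_calibrate(double ref_cfm, unsigned ref_revs, double ref_seconds)
{
    assert(ref_cfm > 0.0 && ref_revs > 0);
    pinwheel_cal cal;
    cal.cfm_per_rev_per_s = ref_cfm / rev_rate(ref_revs, ref_seconds);
    return cal;
}

/* Estimate absolute flow at another location from its revolution count. */
double pinwheel_cfm(pinwheel_cal cal, unsigned revs, double seconds)
{
    return cal.cfm_per_rev_per_s * rev_rate(revs, seconds);
}
```

Without the reference reading, rev_rate() alone still ranks the tiles
against each other.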

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From james.p.lux at jpl.nasa.gov  Sat Feb 12 07:47:17 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sat, 12 Feb 2005 07:47:17 -0800
Subject: [Beowulf] cooling question:  cfm per rack?
References: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>
	<Pine.LNX.4.58.0502120945380.7797@lilith.rgb.private.net>
Message-ID: <001401c5111a$3c447b30$32a8a8c0@LAPTOP152422>

Sure, one could build it.. but one can probably buy it cheaper/easier
Omega: http://www.omega.com/ppt/pptsc.asp?ref=HHF82&Nav=grec06  $89
They have others.
Similar devices abound: http://www.nkhome.com/ww/1000/1000.html
http://www.windandweather.com/store/Weather_Instruments___Wind_Gauges?Args=&page_number=1
(check out the first one.. $49)

Your local sporting goods place (REI, Sport Chalet, Big 5) might have
something like this too.  So might Sharper Image or Brookstone, or one of
those gadget stores

Heck, Harbor Freight Salvage, a big retailer of inexpensive moderate quality
imported stuff might have them..next time you're down buying cheap imported
Chinese machine tools...check that bargain bin next to the register.

Other approaches..small propeller on a small DC motor run as a generator
(only works for fairly fast flows >several m/sec) run to a DVM.
Small propeller and magnet/reedswitch driving a counter (as in your
inexpensive DMM). (this is what the commercial units are)

The challenge in home fabrication of such devices is getting it to be
reasonably orientation insensitive, which implies pretty good balance, and
to work in very low flows (<1 m/sec), which implies fairly low friction.

I imagine, if you had a LOT of time on your hands, you could probably modify
the heated film/wire sensor from an automotive mass air flow sensor for this
purpose.

(I spent the better part of a year trying to come up with a low budget way
to measure velocity profiles across large (decameter scale) artificial
tornadoes.. We eventually settled on a pitot tube rake with water manometers
using video to do data logging.)


----- Original Message -----
From: "Robert G. Brown" <rgb at phy.duke.edu>
To: "Mark Hahn" <hahn at physics.mcmaster.ca>
Cc: "David Mathog" <mathog at mendel.bio.caltech.edu>; <beowulf at beowulf.org>
Sent: Saturday, February 12, 2005 6:49 AM
Subject: Re: [Beowulf] cooling question: cfm per rack?


> > if your pressure is reasonably even, the same tiles should flow the
> > same CFM.  I'd LOVE to find some way to measure airflow, since I'd
> > actually consider doing things like adding patches of duct tape to
> > the underside of too-high-flow tiles.  I suppose that the empiricist
> > approach is just to sample all your system temperatures, and if some
> > are too high, reduce the airflow to racks which are "too cool".
>
> Relative airflow can probably be measured with a kid's toy -- one of the
> little pinwheels -- and counting revolutions with a stopwatch.
> Normalizing that to absolute airflow in CFM is a bit tricky (since the
> result depends to some extent on the resistance imposed by the measuring
> apparatus) but somebody out there may have designed a version of this
> with a real fan and magnets set so that the counting is done
> electronically.  In fact, I could build something to do this out of OTC
> parts if I had any way to normalize the count.
>
>    rgb


From Toufeeq_Hussain at infosys.com  Thu Feb 10 20:01:55 2005
From: Toufeeq_Hussain at infosys.com (Toufeeq Hussain)
Date: Fri, 11 Feb 2005 09:31:55 +0530
Subject: [Beowulf] Porting lam-7.1 to Cygwin (Win 2K)
Message-ID: <557E17BE74D22143B7BE70EB60E33E9915BD81F8@shlmsg01.ad.infosys.com>

Hi,

Trying to compile lam-7.1 on Cygwin.
Make fails at this point:

make[2]: Entering directory `/lam-7.1.1/otb/lamgrow'
/bin/bash ../../libtool --mode=link gcc  -O3     -o lamgrow.exe lamgrow.o ../../share/liblam/liblam.la ../../share/libltdl/libltdlc.la -lutil
gcc -O3 -o lamgrow.exe lamgrow.o  ../../share/liblam/.libs/liblam.a ../../share/libltdl/.libs/libltdlc.a -lutil
../../share/liblam/.libs/liblam.a(ssi_boot_slurm.o)(.text+0x3c8):ssi_boot_slurm.c: undefined reference to `_inet_ntop'
collect2: ld returned 1 exit status
make[2]: *** [lamgrow.exe] Error 1
make[2]: Leaving directory `/lam-7.1.1/otb/lamgrow'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/lam-7.1.1/otb'
make: *** [all-recursive] Error 1

Is there a Cygwin port available?

Any suggestions for the above problem?
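
The link error suggests there is no inet_ntop() in the libraries this build
links against. One possible workaround, offered only as a sketch and not as
a tested LAM patch, is a small IPv4-only stand-in built on snprintf(). The
name ipv4_ntop and its signature are made up for illustration; it is not the
real inet_ntop() interface and does nothing for IPv6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format four address bytes (network byte order)
   as a dotted quad. Returns dst on success, NULL if dst is too small. */
const char *ipv4_ntop(const unsigned char addr[4], char *dst, size_t size)
{
    assert(addr != NULL && dst != NULL);
    int n = snprintf(dst, size, "%u.%u.%u.%u",
                     (unsigned)addr[0], (unsigned)addr[1],
                     (unsigned)addr[2], (unsigned)addr[3]);
    if (n < 0 || (size_t)n >= size)
        return NULL;  /* output error or truncated */
    return dst;
}
```

A 16-byte buffer is enough for any dotted quad plus the terminating NUL.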

Regards,
Toufeeq Hussain


From rwm at absoft.com  Fri Feb 11 07:00:00 2005
From: rwm at absoft.com (Rodney Mach)
Date: Fri, 11 Feb 2005 10:00:00 -0500
Subject: [Beowulf] Re: thread safe PRNG
In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
References: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
Message-ID: <420CC870.6020307@absoft.com>


 > Hi folks:
 >
 >     I need to get a thread-safe pseudo-random number generator.  All I
 > have found online was SPRNG which is set up for MPI.  Anyone have a
 > quick pointer to their favorite thread safe PRNG that works well in
 > OpenMP?
 >
 >     Thanks.
 >
 > Joe
 >

Hey Joe,

Intel MKL has various thread-safe prng that will work with OpenMP. IMSL 
also has thread-safe prng, as does IBM ESSL, ditto for AMD ACML.

-Rod


From henry.gabb at intel.com  Fri Feb 11 07:04:31 2005
From: henry.gabb at intel.com (Gabb, Henry)
Date: Fri, 11 Feb 2005 07:04:31 -0800
Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program
Message-ID: <CBEB78A9FCF04346A0F0A34F53AD230A083A62F2@fmsmsx403.amr.corp.intel.com>

Hi Joe,
The Intel Math Kernel Library (specifically the Vector Statistical
Library within MKL) contains threadsafe random number functions. The
following web site has a full description:
http://www.intel.com/software/products/mkl/features/vsl.htm. There's an
article "Making the Monte Carlo Approach Even Easier and Faster" on
Intel Developer Services that describes how to use VSL functions with
OpenMP. It's available here:
http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm.

Best regards,

Henry Gabb
Intel Parallel Applications Center


> Hi folks:
> 
>     I need to get a thread-safe pseudo-random number generator.  All I
> have found online was SPRNG which is set up for MPI.  Anyone have a 
> quick pointer to their favorite thread safe PRNG that works well in
> OpenMP?
> 
>     Thanks.
> 
> Joe
> 
> -- 
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com


From diep at xs4all.nl  Fri Feb 11 08:59:56 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 11 Feb 2005 17:59:56 +0100
Subject: [Beowulf] A thread-safe PRNG for an OpenMP progra
Message-ID: <3.0.32.20050211175956.0102c6c0@pop.xs4all.nl>

Perhaps use a local PRNG, as that can serve a number roughly every 2
nanoseconds to each cpu.

Here is one that I modified to 64 bits. It's really fast on processors that
are 64 bits and have a rotate instruction (Itanium doesn't have one, but it
is still faster than K7 here because it's 64 bits). Even on Itanium you can
consider this a fast PRNG.

/* define parameters (R1 and R2 must be smaller than the integer size): */
#define UNIX 1   // otherwise windows

#if UNIX
  #include <time.h>
  #define FORCEINLINE       __inline
  /* UNIX and such this is 64 bits unsigned variable: */
  #define BITBOARD                     unsigned long long
#else
  #define FORCEINLINE       __forceinline
  /* in WINDOWS we also want to be 64 bits: */
  #define BITBOARD                     unsigned _int64
#endif

#define KK  17
#define JJ  10
#define R1   5
#define R2   3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
  0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,
  0x195e36fe715fad23,0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,
  0x6de386f196bc1b7b,0x5db2d651a7bdf825,0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,
  0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,0xd9e323088121da71,
  0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
};
int r_p1, r_p2;          /* indexes into history buffer */

 /******************************************************** AgF 1999-03-03 *
 *  Random Number generator 'RANROT' type B                               *
 *  by Agner Fog                                                          *
 *                                                                        *
 *  This is a lagged-Fibonacci type of random number generator with       *
 *  rotation of bits.  The algorithm is:                                  *
 *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b               *
 *                                                                        *
 *  The last k values of X are stored in a circular buffer named          *
 *  randbuffer.                                                           *
 *                                                                        *
 *  This version works with any integer size: 16, 32, 64 bits etc.        *
 *  The integers must be unsigned. The resolution depends on the integer  *
 *  size.                                                                 *
 *                                                                        *
 *  Note that the function RanrotAInit must be called before the first    *
 *  call to RanrotA or iRanrotA                                           *
 *                                                                        *
 *  The theory of the RANROT type of generators is described at           *
 *  www.agner.org/random/ranrot.htm                                       *
 *                                                                        *
 * Optimized for 64 bits usage by Vincent Diepeveen                       *
 * diep at xs4all.nl                                                         *
 *************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) +
                                  rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0)
    r_p1 = KK - 1;
  if( --r_p2 < 0 )
    r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator.      */
void RanrotAInit(void) {
  int i;

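  /* one can fill the randbuffer with other values here; ProcessNumber is
     not declared in this snippet and is assumed to be an integer id that
     differs per process, so that each process gets its own sequence */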
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;

  /* randomize */
  for( i = 0; i < 3000; i++ )
    (void)RanrotA();
}
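
The code above keeps its state in globals (randbuffer, r_p1, r_p2), which
suits one generator per process. For OpenMP threads each thread would need
its own copy of that state. A minimal sketch of that idea, assuming C99 and
using hypothetical names (ranrot_state, ranrot_next, ranrot_init) that are
not in the original; the seeding step is also an assumption for
illustration, and distinct seeds are not a proof of independent streams:

```c
#include <assert.h>
#include <stdint.h>

enum { RR_KK = 17, RR_JJ = 10, RR_R1 = 5, RR_R2 = 3 };

/* Hypothetical per-thread state: each thread owns one of these, so no
   generator state is shared and no locking is needed. */
typedef struct {
  uint64_t buf[RR_KK]; /* circular history buffer */
  int p1, p2;          /* indexes into the buffer */
} ranrot_state;

uint64_t rr_rotl(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

/* Same recurrence as RanrotA, but acting on caller-owned state. */
uint64_t ranrot_next(ranrot_state *s) {
  uint64_t x = s->buf[s->p1] =
      rr_rotl(s->buf[s->p2], RR_R1) + rr_rotl(s->buf[s->p1], RR_R2);
  if (--s->p1 < 0) s->p1 = RR_KK - 1;
  if (--s->p2 < 0) s->p2 = RR_KK - 1;
  return x;
}

/* Fill the buffer from a seed with a 64-bit linear congruential step
   (an assumption for illustration; other seeding schemes would do),
   then discard some outputs as the original init does. */
void ranrot_init(ranrot_state *s, uint64_t seed) {
  uint64_t z = seed;
  int i;
  assert(s != 0);
  for (i = 0; i < RR_KK; i++) {
    z = z * 6364136223846793005ULL + 1442695040888963407ULL;
    s->buf[i] = z;
  }
  s->p1 = 0;
  s->p2 = RR_JJ;
  for (i = 0; i < 3000; i++)
    (void)ranrot_next(s);
}
```

In an OpenMP program one would declare a ranrot_state inside the parallel
region and seed it with something like base_seed + omp_get_thread_num(), so
that every thread draws from its own buffer.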


At 16:40 10-2-2005 -0500, Joe Landman wrote:
>Hi folks:
>
>    I need to get a thread-safe pseudo-random number generator.  All I 
>have found online was SPRNG which is set up for MPI.  Anyone have a 
>quick pointer to their favorite thread safe PRNG that works well in OpenMP?
>
>    Thanks.
>
>Joe
>
>-- 
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web  : http://www.scalableinformatics.com
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From wrankin at ee.duke.edu  Fri Feb 11 09:51:42 2005
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Fri, 11 Feb 2005 12:51:42 -0500
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>
References: <E1CzdU6-0002Gr-00@mendel.bio.caltech.edu>
Message-ID: <1108144302.3042.27.camel@localhost.localdomain>

 
> Is cfm the key unit here or should one think in terms of pressure
> at various points in the room?

The other factor in heat removal (both within the rack as well as within
the air chiller) is the intake air temps.   The larger the temperature
difference the more efficient the heat transfer becomes.

Essentially, 50cfm of 20C air cools a lot better than 50cfm of 30C air. 
Also (as we are currently experiencing) the air handlers are much more
efficient at cooling really HOT air, versus warm air.
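
To put rough numbers on the intake-temperature point: the heat an air
stream carries away is mass flow times specific heat times temperature
rise. A quick sketch in C, where the air density, the specific heat and the
40C exhaust limit in the comments are assumed round values for
illustration, not measurements from any particular machine room:

```c
#include <assert.h>

/* Heat carried away by an air stream, in watts.
   cfm       : volume flow in cubic feet per minute
   delta_t_c : temperature rise of the air across the rack, in degrees C */
double air_cooling_watts(double cfm, double delta_t_c) {
  const double m3s_per_cfm = 0.000471947; /* 1 ft^3/min in m^3/s */
  const double rho = 1.2;    /* kg/m^3, air near room temperature (assumed) */
  const double cp = 1005.0;  /* J/(kg K), specific heat of air (assumed) */
  assert(cfm >= 0.0);
  return cfm * m3s_per_cfm * rho * cp * delta_t_c;
}

/* With a hypothetical 40C exhaust limit:
   air_cooling_watts(50.0, 40.0 - 20.0) is about 569 W,
   air_cooling_watts(50.0, 40.0 - 30.0) is about 285 W,
   so the same 50cfm removes roughly twice the heat with 20C intake air. */
```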

-bill


-- 
bill rankin, ph.d. ........ director, cluster and grid technology group
wrankin at ee.duke.edu .......................... center for computational
duke university ...................... science engineering and medicine
http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu


From maurice at harddata.com  Fri Feb 11 11:15:05 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 11 Feb 2005 12:15:05 -0700
Subject: [Beowulf] Re: Re: Re: Re: Home beowulf - NIC latencies (Greg
	Lindahl)
In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
References: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
Message-ID: <420D0439.3000304@harddata.com>

Greg Lindahl wrote:

>Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned
>benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel
>Benchmarks are also good, they are much closer to real apps than
>microbenchmarks.
>
>-- greg
>
Double Amen. ( is that a long Amen??)
;-)

Now if we could only get all those benchmarks to agree with each other a 
bit!
It's classic. Pick your arch, chipset, amount of RAM, clockspeed, NIC, 
switch, and so on, and you can make a selective case for almost anything..

Although on SMP the Opterons are mainly kicking butt lately due to the 
fact that their SMP performance is so superior..

And that brings up another can o' worms:
SMP or uni ?
One can make a great performance case for either/both depending on your 
goals.
<sigh...>


With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.  FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3


This email, message, and content, should be considered confidential,
and is the copyrighted property of Hard Data Ltd., unless stated otherwise.


From rbbrigh at valeria.mp.sandia.gov  Fri Feb 11 12:14:11 2005
From: rbbrigh at valeria.mp.sandia.gov (Ron Brightwell)
Date: Fri, 11 Feb 2005 13:14:11 -0700
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420CFE4C.6050003@ccrl-nece.de>
References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com>
	<420BC57A.5060007@harddata.com>
	<20050211023619.GB5174@greglaptop.internal.keyresearch.com>
	<420CFE4C.6050003@ccrl-nece.de>
Message-ID: <20050211201411.GA10732@ratbert.mp.sandia.gov>

On Fri Feb 11, 2005 11:49:48... Joachim Worringen wrote
> Greg Lindahl wrote:
> >On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote:
> >>If I have a fantastic device that uses infinitely small time (latency) 
> >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% 
> >>of a CPU, we do not have a useful solution..
> >
> >If large cpu usage is a problem, it will show up nicely in real
> >application benchmarks.
> 
> True. I always wonder what the low-CPU-usage-advocates want the MPI 
> process to do while i.e. an MPI_Send() is executed. For small messages 
> (which are critical for many applications), it's somewhat like 
> requesting that a local memory-write has to show low CPU usage.

For blocking operations with short messages, low CPU usage shouldn't be the
main concern.  Measuring latency relative to CPU usage doesn't make much sense.

> 
> Of course, I can think of scenarios in which data transfers w/o CPU 
> usage do promise advantages, and I have implemented and evaluated such 
> techniques myself. But in the end (for the application), it always 
> boiled down to latency and bandwidth as most applications don't honor 
> "true" asynchronous communication.

Yep.  We seem to have several micro-benchmarks that determine what the overlap
potential of the network is, but I've never seen anything that determines
what the overlap potential of an application is.  It would be interesting
to see what the overlap potential of real applications is.

> 
> The latest unsuccessful case of uncoupling computation and MPI 
> communication I read about was BG/L when using the second CPU as a 
> message processor. Maybe Myrinet MX will behave differently by making 
> the MPI itself more concurrent on hardware level (is this a correct 
> description, Patrick?) - but it will need matching applications, too.
> 

BG/L is unique in many ways.  For example, using the second processor for
communications doesn't actually help with progress -- the application still has
to make MPI library calls to make progress on outstanding posted operations.
So, even if the application was coded to take advantage of overlap, it
probably wouldn't gain much by using the second processor.

MX should be able to provide overlap and progress, like Quadrics and a few
other technologies do.

-Ron


From bushnell at ultra.chem.ucsb.edu  Fri Feb 11 15:57:27 2005
From: bushnell at ultra.chem.ucsb.edu (John Bushnell)
Date: Fri, 11 Feb 2005 15:57:27 -0800 (PST)
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <E1CzjlT-000305-00@mendel.bio.caltech.edu>
Message-ID: <Pine.GSO.4.10.10502111533250.23231-100000@ultra.chem.ucsb.edu>


A few comments below...

On Fri, 11 Feb 2005, David Mathog wrote:

> Mike,
> 
> I've been trying to pick the brains of other folks on the
> beowulf list who have computer rooms with modern equipment.
> 
> One problem with the existing air, with regards to future
> expansion, is apparently the total amount of air that the
> current A/C can move.  This is all horrendously complicated
> and needs to be looked at carefully by a HVAC consultant.
> Pretty sure we have enough tons and flow for now, meaning
> my rack and Deshaies and everything else I know is going in there
> in a couple of months.  More and more convinced that we don't
> have enough to handle multiple full racks of the next generation
> of computers.

  We learned about this after putting in a big new A/C (adding to an
old but still functioning one) in our server room.  The problem was
mitigated by having the vents on the old AC replaced with flanges
attached to large flexible vents.  They hang near the top/front of
two racks, and this has helped quite a bit.  Air flow is important!
 
> Jim Lux from JPL answered my questions as attached after
> my signature.

  Thanks go out to Jim for the useful numbers.

> Darryl did say something interesting though, he said that for
> some units the A/C people can increase the capacity by changing
> the pulleys around.  Apparently this blows more air, and the
> cold water isn't limiting, so it effectively upgrades the unit
> without changing very much.  Darryl said that this was done
> at some point for Mayo's computer room in the subbasement
> of the BI.

  Sounds like a pretty cheap upgrade.  It would certainly be nice
if we could do that here, as we've been running on the edge in terms
of cooling for some time now.

  Our industrial chilled water loop runs at around 16C, so obviously
the chilled water is simply acting as a reservoir for dumping heat
from a compressor rather than being the direct source of cooling.
So the limiting factor is likely the compressor/fluid/heat exchanger
with the chilled water, rather than the chilled water itself.  I
wonder what "changing pulleys around" is really doing?

    Stay cool  -  John


From idooley2 at uiuc.edu  Fri Feb 11 16:59:06 2005
From: idooley2 at uiuc.edu (Isaac Dooley)
Date: Fri, 11 Feb 2005 18:59:06 -0600
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
Message-ID: <420D54DA.8000904@uiuc.edu>

>
>
>>Using MPI_Isend() allows programs to not waste CPU cycles waiting on the
>>completion of a message transaction.
>>    
>>
>
>No, it allows the programmer to express that it wants to send a message 
>but not wait for it to complete right now.  The API doesn't specify the 
>semantics of CPU utilization.  It cannot, because the API doesn't have 
>knowledge of the hardware that will be used in the implementation.
>  
>
That is partially true. The context for my comment was under your 
assumption that everyone uses MPI_Send(). These people, as I stated 
before, do not care about what the CPU does during their blocking calls. 
I was trying to point out that programs utilizing non-blocking IO may 
have work that will be adversely impacted by CPU utilization for 
messaging. These are the people who care about CPU utilization for 
messaging. This, I hope, answers your prior question, at least partially.

Perhaps your applications demand low latency with no concern for the CPU 
during the time spent blocking. That is fine. But some applications 
benefit from overlapping computation and communication, and the cycles 
not wasted by the CPU on communication can be used productively.
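
A toy model of that trade-off, as a sketch only: it assumes the network
makes progress on its own once a non-blocking send is posted, and that
some fraction of the communication time is CPU work taken away from the
computation. The function names and the cpu_frac parameter are
hypothetical, and real applications will not behave this cleanly:

```c
#include <assert.h>

/* One iteration with a blocking send: compute, then communicate. */
double t_blocking(double t_comp, double t_comm) {
  return t_comp + t_comm;
}

/* One iteration with a non-blocking send posted before the computation.
   cpu_frac is the fraction of t_comm that the CPU itself has to spend on
   messaging (0 = fully offloaded, 1 = the CPU does all of it). */
double t_overlapped(double t_comp, double t_comm, double cpu_frac) {
  double cpu_busy;
  assert(cpu_frac >= 0.0 && cpu_frac <= 1.0);
  cpu_busy = t_comp + cpu_frac * t_comm; /* compute plus messaging work */
  return cpu_busy > t_comm ? cpu_busy : t_comm; /* whichever ends last */
}
```

With t_comp = 10 and t_comm = 4, the blocking version takes 14 time units;
the overlapped version takes 10 when the CPU cost is zero and is back to 14
when the CPU does all the messaging work, which is the case where overlap
buys nothing.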

Isaac Dooley


From rossen at VerariSoft.Com  Fri Feb 11 22:52:03 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Sat, 12 Feb 2005 01:52:03 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
Message-ID: <420DA793.4000909@verarisoft.com>

I think that the mere definition of the term "MPI performance" and
focusing too much on it can potentially have a negative impact on the
overall discussion of parallel performance. Accepting the premise that
all MPI can do is push individual messages between user processes as
fast as possible, (as measured by ping pong) regardless of how this is
achieved, unnecessarily and, I'd say, unjustifiably restricts the field
of discussion. I agree that today MPI libraries are commonly measured by
their ping-pong "performance" and not by their CPU utilization or other
factors, but it does not necessarily make this form of performance
evaluation right.

I would support the idea of discussing isolated "MPI performance" but
only in the context of a broader performance parameter space, including
at least communication overhead, communication bandwidth, processor
overhead, and the ability to perform asynchronous communication (i.e.,
compliance with the MPI Progress Rule). Only in such a broader evaluation
space can one hope to fit the large number of combinations of
processor/memory/peripheral_fabric architectures, network interconnects,
system software/middleware, and application algorithms.

Of course, there is always the case of running the actual application
code and then evaluating the MPI performance by seeing which MPI library
(or library mode) makes the application run faster. Unfortunately, this
method for evaluating MPI often suffers from various inefficiencies, some
of which originate from the parallel algorithm developers, who throughout
the years have sometimes adopted the most trivial ways of using MPI.

Here are a couple of arguments for why it is important to look at MPI (and 
the whole communication system) from different angles. If certain MPI
optimizations are achieved at the cost of excessive use of resources
that otherwise could be used for computation or enabling the overall
"application_progress", the actual application performance may be below
its potential or even degrade. Here are some "application progress"
activities that can benefit from having these resources at their disposal:
OS/kernel processing, other communication, I/O operations, memory
operations (prefetching, etc.), peripheral bus/fabric operations. All of
these in one way or another depend on CPU processing. Also, today's 
processor architectures have many independent processing units and 
complex memory hierarchies. When the MPI library polls for completion of 
a communication request, most of this specialized hardware is virtually 
unused (wasted). The processor architecture trends indicate that this 
kind of internal CPU concurrency will continue to increase, thus making 
the cost of MPI polling even higher.

In this regard, a parallel application developer might actually very
much care what is actually happening in the MPI library even when he 
makes a call to MPI_Send. If he doesn't, he probably should.

Some related topics (not covered here because of bloviating) are:

- How an MPI library that maximizes MPI's ping-pong performance alone 
can cause unexpected behavior and a fully functional parallel system to 
work far below its realistic efficiency.

- What application algorithm developers experience when they attempt to
use the ever so nebulous "overlapping" with a polling MPI library and
how this experience has contributed to the overwhelming use of
MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
(even better) persistent MPI calls, thus killing any hope that these
codes can run faster on systems that actually facilitate overlapping.

Rossen

Rob Ross wrote:
> Hi Isaac,
> 
> On Fri, 11 Feb 2005, Isaac Dooley wrote:
> 
> 
>>>>Using MPI_Isend() allows programs to not waste CPU cycles waiting on the
>>>>completion of a message transaction.
>>>
>>>No, it allows the programmer to express that it wants to send a message 
>>>but not wait for it to complete right now.  The API doesn't specify the 
>>>semantics of CPU utilization.  It cannot, because the API doesn't have 
>>>knowledge of the hardware that will be used in the implementation.
>>>
>>
>>That is partially true.  The context for my comment was under your 
>>assumption that everyone uses MPI_Send(). These people, as I stated 
>>before, do not care about what the CPU does during their blocking calls.
> 
> 
> I think that it is completely true.  I made no assumption about everyone 
> using MPI_Send(); I'm a late-comer to the conversation. 
> 
> I was not trying to say anything about what people making the calls care
> about; I was trying to clarify what the standard does and does not say.  
> However, I agree with you that it is unlikely that someone calling
> MPI_Send() is too worried about what the CPU utilization is during the
> call.
> 
> 
>>I was trying to point out that programs utilizing non-blocking IO may 
>>have work that will be adversely impacted by CPU utilization for 
>>messaging. These are the people who care about CPU utilization for 
>>messaging. This I hopes answers your prior question, at least partially.
> 
> 
> I agree that people using MPI_Isend() and related non-blocking operations 
> are sometimes doing so because they would like to perform some 
> computation while the communication progresses.  People also use these 
> calls to initiate a collection of point-to-point operations before 
> waiting, so that multiple communications may proceed in parallel.  The 
> implementation has no way of really knowing which of these is the case.
> 
> Greg just pointed out that for small messages most implementations will do
> the exact same thing as in the MPI_Send() case anyway.  For large messages
> I suppose that something different could be done.  In our implementation
> (MPICH2), to my knowledge we do not differentiate.
> 
> You should understand that the way MPI implementations are measured is by 
> their performance, not CPU utilization, so there is pressure to push the 
> former as much as possible at the expense of the latter.
> 
> 
>>Perhaps your applications demand low latency with no concern for the CPU 
>>during the time spent blocking. That is fine. But some applications 
>>benefit from overlapping computation and communication, and the cycles 
>>not wasted by the CPU on communication can be used productively.
> 
> 
> I wouldn't categorize the cycles spent on communication as "wasted"; it's 
> not like we code in extraneous math just to keep the CPU pegged :).
> 
> Regards,
> 
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
> 


From sadat_vit at yahoo.co.in  Fri Feb 11 23:38:32 2005
From: sadat_vit at yahoo.co.in (sadat khan)
Date: Sat, 12 Feb 2005 07:38:32 +0000 (GMT)
Subject: [Beowulf] BEOWULF vs NORMAL CLUSTER
Message-ID: <20050212073832.28306.qmail@web8310.mail.in.yahoo.com>


I am a new addition to this mailing list. I recently got interested in the field of high performance computing. We recently had Mr. Anand Babu at our college (the creator of THUNDER, the 5th fastest supercomputer in the world).

And he gave a really good talk on clustering....

First up, I would like to enquire whether there is any difference between a Beowulf and a normal cluster. Or is it just another name for a cluster?

Another thing: what exactly do packages like MPI and PVM do?

I would be highly grateful for the help.



From topa_007 at yahoo.com  Sat Feb 12 05:40:49 2005
From: topa_007 at yahoo.com (Toufeeq Hussain)
Date: Sat, 12 Feb 2005 05:40:49 -0800 (PST)
Subject: [Beowulf] Problem executing programs on lam-mpi
Message-ID: <20050212134049.70098.qmail@web30209.mail.mud.yahoo.com>

Hi,

I get the following message while running a MPI
program on a 2 node cluster*

mpirun: cannot start ./a.out on n0: No such file or
directory

I'm running mpirun as such : $ mpirun C ./a.out
compiled lam as such : ./configure --without-romio
--with-rsh="ssh -x"

*recon/lamboot execute successfully.
topa at debian:~$ lamboot -v hosts

LAM 7.1.1/MPI 2 C++ - Indiana University

n-1<32615> ssi:boot:base:linear: booting n0 (devian)
n-1<32615> ssi:boot:base:linear: booting n1 (debian)
n-1<32615> ssi:boot:base:linear: finished

*lamnodes gives the following output:

topa at debian:~/mpi_progs$ lamnodes
n0      devian:1:
n1      debian:1:origin,this_node

The MPI program is a simple one.
#include <stdio.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Hello there");
  printf("Hello world! I am %d of %d\n", rank, size);

  MPI_Finalize();

  return 0;
}


Please help,
Toufeeq

=====
############################################
# ring me @ 98401-96690                    #
# mail me @ toufeeq at computer dot org    #
# Debian Sarge \w 2.6.10-ck5               #
############################################


From landman at scalableinformatics.com  Sat Feb 12 08:52:16 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat, 12 Feb 2005 11:52:16 -0500
Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program
In-Reply-To: <CBEB78A9FCF04346A0F0A34F53AD230A083A62F2@fmsmsx403.amr.corp.intel.com>
References: <CBEB78A9FCF04346A0F0A34F53AD230A083A62F2@fmsmsx403.amr.corp.intel.com>
Message-ID: <420E3440.3080108@scalableinformatics.com>

Hi Henry:

   This is for two platforms that are not targets for Intel compilers.

   I have solved the problem by reworking tt800 a bit, and have that 
working nicely in OpenMP.  Thanks though.

Joe

Gabb, Henry wrote:
> Hi Joe,
> The Intel Math Kernel Library (specifically the Vector Statistical
> Library within MKL) contains threadsafe random number functions. The
> following web site has a full description:
> http://www.intel.com/software/products/mkl/features/vsl.htm. There's an
> article "Making the Monte Carlo Approach Even Easier and Faster" on
> Intel Developer Services that describes how to use VSL functions with
> OpenMP. It's available here:
> http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm.
> 
> Best regards,
> 
> Henry Gabb
> Intel Parallel Applications Center
> 
> 
> 
>>Hi folks:
>>
>>    I need to get a thread-safe pseudo-random number generator.  All I
>>have found online was SPRNG which is set up for MPI.  Anyone have a 
>>quick pointer to their favorite thread safe PRNG that works well in
>>OpenMP?
>>
>>    Thanks.
>>
>>Joe
>>
>>-- 
>>Scalable Informatics LLC,
>>email: landman at scalableinformatics.com
>>web  : http://www.scalableinformatics.com
> 
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


From dtj at uberh4x0r.org  Sat Feb 12 08:53:47 2005
From: dtj at uberh4x0r.org (Dean Johnson)
Date: Sat, 12 Feb 2005 10:53:47 -0600
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <Pine.LNX.4.58.0502120945380.7797@lilith.rgb.private.net>
References: <Pine.LNX.4.44.0502111239430.30468-100000@coffee.psychology.mcmaster.ca>
	<Pine.LNX.4.58.0502120945380.7797@lilith.rgb.private.net>
Message-ID: <1108227228.3853.8.camel@terra>

On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
> 
> Relative airflow can probably be measured with a kid's toy -- one of the
> little pinwheels -- and counting revolutions with a stopwatch.
> Normalizing that to absolute airflow in CFM is a bit tricky (since the
> result depends to some extent on the resistance imposed by the measuring
> apparatus) but somebody out there may have designed a version of this
> with a real fan and magnets set so that the counting is done
> electronically.  In fact, I could build something to do this out of OTC
> parts if I had any way to normalize the count.
> 

Could you not use one of those cheapish wind speed devices that amateur
weather folks use? That would give you a rating, presumably in miles per
hour, and then figure backward based upon the area of the little fan
thingy. That would likely be not too expensive and a great deal easier,
and more accurate, to deal with than counting a pinwheel. ;-)
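
The "figure backward" step is just air speed times the cross-sectional
area the air passes through. A small sketch, assuming the reading is taken
across a duct or vent of known area and that the speed is roughly uniform
over it (function names are made up for illustration):

```c
#include <assert.h>

/* Volume flow in cubic feet per minute from an air speed in feet per
   minute and a cross-sectional area in square feet. */
double cfm_from_fpm(double feet_per_min, double area_sq_ft) {
  assert(area_sq_ft >= 0.0);
  return feet_per_min * area_sq_ft;
}

/* Same, for a meter that reads miles per hour:
   1 mph = 5280 ft / 60 min = 88 ft/min. */
double cfm_from_mph(double mph, double area_sq_ft) {
  return cfm_from_fpm(mph * 88.0, area_sq_ft);
}
```

For example, 200 feet per minute through a 4 square foot opening works out
to 800 cfm.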

	-Dean


From atp at piskorski.com  Sat Feb 12 13:02:54 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Sat, 12 Feb 2005 16:02:54 -0500
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <1108227228.3853.8.camel@terra>
References: <1108227228.3853.8.camel@terra>
Message-ID: <20050212210254.GA66503@piskorski.com>

On Sat, Feb 12, 2005 at 10:53:47AM -0600, Dean Johnson wrote:
> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
> > 
> > Relative airflow can probably be measured with a kid's toy -- one of the

> Could you not use one of those cheapish wind speed devices that amateur
> weather folks use? That would give you a rating, presumably in miles per

When I asked Jack Wathey (architect of the Ammonite cluster) about the
small hand-held anemometers intended for hikers and such, what he said
was:

On Wed, Nov 10, 2004 at 10:58:16AM -0800, Jack Wathey wrote:

> What I found most useful was the Kestrel 2000, which measures wind speed 
> and temperature.  The Kestrel 1000 is cheaper ($80 vs $100) and just 
> measures windspeed.  The Kestrel was the only windmeter I could find that 
> was sensitive and accurate enough for measuring the flowrate at ammonite's 
> filters (typically in the 120 to 200 feet per minute range).  They are 
> EXTREMELY DELICATE though! You can wreck the sapphire bearing just by 
> blowing on it hard (yes, I discovered this the hard way).  But the bearing 
> and impeller are replaceable for about $15, so it's not a disaster.
> 
> http://www.kestrelmeters.com

A while back, I purchased one here:

  http://store.botachtactical.com/ke20pothwime.html

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From emac at cybergps.net  Sat Feb 12 11:30:17 2005
From: emac at cybergps.net (Eric Machala)
Date: Sat, 12 Feb 2005 14:30:17 -0500
Subject: [Beowulf] Home Beowulf Intial Startup Question
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu><Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com>
Message-ID: <003501c51139$496fa9f0$6e45a8c0@masstivy>

    Hi all, I'm semi-new to Beowulfs but very knowledgeable about computer and 
network technologies. I am building a cluster of 20 Dell OptiPlex nodes 
(1.9 GHz, 256 MB RAM, blah blah). First off, I'm wondering whether the master 
node is recommended to be the same as or better than the nodes, and which 
Linux OS is recommended: Red Hat, Mandrake, etc. Anyone's recommendations 
are welcome.
    I'm also looking for some links or resources for tools, i.e. software for 
things like parallel kernel upgrades and monitoring, or anything else for 
setting up a Linux Beowulf, to make this go smoothly.


Eric M
Network Admin/CF
Emac at cybergps.net


From steve_heaton at ozemail.com.au  Sat Feb 12 15:41:22 2005
From: steve_heaton at ozemail.com.au (Fringe Dweller)
Date: Sun, 13 Feb 2005 10:41:22 +1100
Subject: [Beowulf] cooling question - dedicated infrastructure
In-Reply-To: <200502122000.j1CK096k019160@bluewest.scyld.com>
References: <200502122000.j1CK096k019160@bluewest.scyld.com>
Message-ID: <420E9422.7080002@ozemail.com.au>

An enlightening discussion re aircon peoples. Thanks.

A couple of "war stories" :)

I think RGB touched on problems with "defaults" on aircon behaviour in 
cooler climes.

We have similar problems in warmer parts of this blue marble. Your 
typical default behaviour is to put aircon into standby overnight. Even 
in the middle of summer. I mean, nobody's there and it's cooler 
overnight anyway, right? Well yes, but if you're pushing your IT hard 
overnight... you can see the consequences. Make sure you "own" your 
aircon :)

Another reason to ensure independence from anything related to the 
"building" is power. I had a customer in a very large building whose UPS 
would always trip every weekday morning at 6am and 6:30. Why? 6am => 
aircon up! 6:30 => lift motors up! The current draw for those two events 
is staggering.

That's why you spend big bucks on the supporting infrastructure =)

Stevo


From hahn at physics.mcmaster.ca  Sun Feb 13 12:41:51 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sun, 13 Feb 2005 15:41:51 -0500 (EST)
Subject: [Beowulf] Home Beowulf Intial Startup Question
In-Reply-To: <003501c51139$496fa9f0$6e45a8c0@masstivy>
Message-ID: <Pine.LNX.4.44.0502131521220.16537-100000@coffee.psychology.mcmaster.ca>

> netowrk technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram 

kinda low on ram there, but for a learning cluster, that's plenty.
(actually 20 is kinda big for such a cluster...)

> blah blah nodes wondering first off if master control is recommened to be 
> same or better than nodes and what is recommened Linux O/s redhat or 
> mandrake etc... or anyones recommendations

distros don't matter - none of them are significantly different,
and they all work.  people who care about distros are more interested
in desktop decor than getting work done ;)

admittedly, I am not a never-reinvent-the-wheel person.  
NRTW is worse than NIH, IMO.  (some wheels desperately need reinvention,
all progress comes from reinvention, etc).

>     Im also looking for some links or resources for tools aka software like 
> parallel kernel upgrades moniter tools anything  for setting up Linux 
> beowulf to make  this go smoothly

to me, "smooth" means "no extra load per node".  I strongly prefer
net-booting, or at least net-root setups.  people will tell you that 
using NFS for this is horribly inefficient, dangerous and causes warts.
but it works extremely well, at least for clusters of <= 96 nodes,
based on my experience so far.  things might be different if you're 
doing retrocomputing based on a half-duplex 10mbps network or have large
IO loads.  

the benefit is that your cluster acts like you have just one slave node.
the cost is that you have to do a pretty minor amount of work to hack 
something like Fedora to boot diskless (small changes to the initrd.)
and of course, it does mean that "incidental" file IO will cause network
traffic.  it's not clear to me that this is a problem, though, since:

	- nodes are normally configured to be fairly minimal - 
	you don't have 30 user logins on each one, with people running
	ls/bash/netscape/gcc all the time.

	- NFS is not that bad at caching, and you can help this out by
	upping the per-mount cache parameters a bit.

	- it's awfully nice to have a nearly fully functional node
	even after its disk dies.

	- my "diskless" nodes actually do have local swap and /tmp.
	disks are cheap and handy, just don't *depend* on them.

	- you can easily imagine a hybrid system that boots somehow
	(PXE or from disk), and does an rsync or rpm/yum/systemimager
	equivalent.  I don't really see the point though.

	- having your root FS exported read-only is also kind of nice:
	good security is layered security...


From mathog at mendel.bio.caltech.edu  Sun Feb 13 13:50:34 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Sun, 13 Feb 2005 13:50:34 -0800
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
Message-ID: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>

There are a series of white papers by APC here:

  http://www.apc.com/tools/mytools/index.cfm?action=wp

where they discuss various power and cooling factors.  They note
a disconnect between the higher densities achieved by blades and
similar high density racks and the practicality of actually
cooling these beasts.  Basically it comes down to you save space
on the rack and then give it all back on the cooling system.  Think
of it minimally in these terms - to move enough cfm at less than 30
feet per minute starts to require a duct larger than the rack itself!

In terms of TCO, at the moment, APC rejects the notion that
these ultra high density machines are cost effective because they
are so very difficult to cool.

It seems to me that at a certain power point the racks are going to
have to resort to water cooling.  Long ago the ECL mainframes were
cooled this way, but it's been a long time since most of us have
seen water pipes running into the computers in a machine room. 

Cooling a 10 kW rack well looks to be extremely tough with air,
and going much above that would seem to require something approaching
a dedicated wind tunnel.  Any opinions on how high the power
dissipation in racks will go  before the manufacturers throw
in the air cooling towel and start shipping them with water
connections?

If you were designing a computer room today (which I am) what would
you allow for the maximum power dissipation per rack _to_be_handled_
by_the_room_A/C.  The assumption being that in 8 years if somebody
buys a 40kW (heaven forbid) rack it will dump its heat through
a separate water cooling system. 

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From rgb at phy.duke.edu  Sun Feb 13 15:47:22 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sun, 13 Feb 2005 18:47:22 -0500 (EST)
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>
References: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502131745001.6364@lilith.rgb.private.net>

On Sun, 13 Feb 2005, David Mathog wrote:

> There are a series of white papers by APC here:
> 
>   http://www.apc.com/tools/mytools/index.cfm?action=wp

That link doesn't work for me (apc's website barfs on it) but I googled
and worked through their gatekeeper to get access.  After "logging in"
(yuck) I'm going to try to download:

  WP-5 Cooling Imperatives for Data Centers and Network Rooms Effective
  next generation data centers and network rooms must address the known
  needs and problems relating to current and past designs. This paper
  presents a categorized and prioritized collection of cooling needs and
  problems as obtained through systematic user interviews.

which I'm hoping is the one you are referring to above.

> where they discuss various power and cooling factors.  They note
> a disconnect between the higher densities achieved by blades and
> similar high density racks and the practicality of actually
> cooling these beasts.  Basically it comes down to you save space
> on the rack and then give it all back on the cooling system.  Think
> of it minimally in these terms - to move enough cfm at less than 30
> feet per minute starts to require a duct larger than the rack itself!
> 
> In terms of TCO, at the moment, APC rejects the notion that
> these ultra high density machines are cost effective because they
> are so very difficult to cool.

From what I learned of bladed systems back when I reviewed them for my
own purposes, this isn't terribly surprising, but it is really valuable
to have a well-researched document that explains how and why. 10 KW
(think 100 100W light bulbs) in what, 2 m^3 -- that's a lot of energy to
get rid of, and almost by definition you're removing it from components
that are packed as tightly as possible.

> It seems to me that at a certain power point the racks are going to
> have to resort to water cooling.  Long ago the ECL mainframes were
> cooled this way, but it's been a long time since most of us have
> seen water pipes running into the computers in a machine room. 
> 
> Cooling a 10 kW rack well looks to be extremely tough with air,
> and going much above that would seem to require something approaching
> a dedicated wind tunnel.  Any opinions on how high the power
> dissipation in racks will go  before the manufacturers throw
> in the air cooling towel and start shipping them with water
> connections?

I think you're within a factor of 2 or so of the SANE threshold at 10KW.
A rack full of 220 W Opterons is there already (~40 1U enclosures).  I'd
"believe" that you could double that with a clever rack design, e.g.
Rackable's, but somewhere in this ballpark...it stops being sane.
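
The rack figure above is simple multiplication, and a small sketch makes
it easy to redo for other node counts. The 40 enclosures and 220 W come
from the text; the doubled node count is only an assumed stand-in for a
denser rack design, not a vendor figure.

```python
def rack_power_kw(nodes_per_rack, watts_per_node):
    """Total electrical load of one rack in kW (all of it ends up as heat)."""
    return nodes_per_rack * watts_per_node / 1000.0

# About 40 1U enclosures at roughly 220 W each, as quoted in the message.
standard = rack_power_kw(40, 220)

# Assumption: a denser rack design roughly doubles the node count.
dense = rack_power_kw(80, 220)

print(f"standard rack: {standard:.1f} kW, doubled density: {dense:.1f} kW")
```

Under these assumptions a standard rack lands a little under 10 kW and
a doubled one a little under 20 kW.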

> If you were designing a computer room today (which I am) what would
> you allow for the maximum power dissipation per rack _to_be_handled_
> by_the_room_A/C.  The assumption being that in 8 years if somebody
> buys a 40kW (heaven forbid) rack it will dump its heat through
> a separate water cooling system.

This is a tough one.  For a standard rack, ballpark of 10 KW is
accessible today.  For a Rackable rack, I think that they can not quite
double this (but this is strictly from memory -- something like 4 CPUs
per U, but they use a custom power distribution which cuts power and a
specially designed airflow which avoids recycling used cooling air).  I
don't know what bladed racks achieve in power density -- the earlier
blades I looked at had throttled back CPUs but I imagine that they've
cranked them up at this point (and cranked up the heat along with them).

Ya pays your money and ya takes your choice.  An absolute limit of 25
(or even 30) KW/rack seems more than reasonable to me, but then, I'd
"just say no" to rack/serverroom designs that pack more power than I
think can sanely be dissipated in any given volume. Note that I consider
water cooled systems to be insane a priori for all but a small fraction
of server room or cluster operations, "space" generally being cheaper
than the expense associated with achieving the highest possible spatial
density of heat dissipating CPUs.  I mean, why stop at water?  Liquid
Nitrogen.  Liquid Helium.  If money is no object, why not?  OTOH, when
money matters, at some point it (usually) gets to be cheaper to just
build another cluster/server room, right?

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From james.p.lux at jpl.nasa.gov  Sun Feb 13 16:06:06 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sun, 13 Feb 2005 16:06:06 -0800
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
References: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>
Message-ID: <001a01c51228$fb48ef20$1af69580@LAPTOP152422>


----- Original Message -----
From: "David Mathog" <mathog at mendel.bio.caltech.edu>
To: <beowulf at beowulf.org>
Sent: Sunday, February 13, 2005 1:50 PM
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?


> There are a series of white papers by APC here:
>
>   http://www.apc.com/tools/mytools/index.cfm?action=wp
>
> where they discuss various power and cooling factors.  They note
> a disconnect between the higher densities achieved by blades and
> similar high density racks and the practicality of actually
> cooling these beasts.  Basically it comes down to you save space
> on the rack and then give it all back on the cooling system.  Think
> of it minimally in these terms - to move enough cfm at less than 30
> feet per minute starts to require a duct larger than the rack itself!

I think that's 30 ft/second.. 1800 lfpm would be a reasonable duct speed...
30 lfpm is really really slow (that's 1/2 ft/sec, which is a pretty darn
gentle breeze)
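
To see how much the choice between 30 ft/min and 30 ft/sec matters, here
is a rough sketch of the airflow and duct cross-section for a 10 kW rack.
The air properties and the 10 K temperature rise across the rack are
assumptions picked for illustration, not numbers from the APC papers.

```python
AIR_DENSITY = 1.2          # kg/m^3, assumed (near sea level, about 20 C)
AIR_SPECIFIC_HEAT = 1005   # J/(kg K), assumed
M3S_TO_CFM = 2118.88       # cubic feet per minute in one m^3/s
FT_TO_M = 0.3048

def airflow_m3s(heat_watts, delta_t_kelvin):
    """Volume flow of air needed to carry heat_watts at a given temperature rise."""
    return heat_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_kelvin)

def duct_area_ft2(flow_m3s, velocity_ft_per_min):
    """Duct cross-section in square feet for a given flow and air speed."""
    velocity_m_s = velocity_ft_per_min * FT_TO_M / 60.0
    area_m2 = flow_m3s / velocity_m_s
    return area_m2 / FT_TO_M ** 2

flow = airflow_m3s(10_000, 10.0)   # 10 kW rack, assumed 10 K air temperature rise
print(f"required airflow: {flow * M3S_TO_CFM:.0f} cfm")
print(f"duct area at 1800 ft/min: {duct_area_ft2(flow, 1800):.2f} ft^2")
print(f"duct area at   30 ft/min: {duct_area_ft2(flow, 30):.1f} ft^2")
```

With these assumptions the duct is about one square foot at 1800 ft/min
and sixty times that at 30 ft/min, which is where the "duct larger than
the rack" picture comes from.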

>
> In terms of TCO, at the moment, APC rejects the notion that
> these ultra high density machines are cost effective because they
> are so very difficult to cool.
>
> It seems to me that at a certain power point the racks are going to
> have to resort to water cooling.  Long ago the ECL mainframes were
> cooled this way, but it's been a long time since most of us have
> seen water pipes running into the computers in a machine room.

High power density devices (like power electronics or high power vacuum
tubes) have always resorted to liquid cooling.  It's so much more efficient
than trying to cool with air, for a variety of reasons, but primarily
because it separates the problem of physical device and radiator surface.
Consider liquid vs air cooled internal combustion engines.  Really high
power density often uses some sort of phase change (ebullient) cooling,
although the design challenges are significant.  Even some laptops have used
liquid or phase change cooling (heat pipes) to move the heat from the CPU to
the case.  An interesting exception to liquid cooling for high power devices
is big generators, which are cooled with hydrogen gas (low viscosity and
density, so low aerodynamic drag)

But liquid cooling, per se, isn't a crippling thing to work with.  And, it
actually allows certain design economies: no more do you have to constrain
the design for air flow, or conduction through the boards, nor do you have
to fool with an array of CPU fans, video card fans, etc.

>
> Cooling a 10 kW rack well looks to be extremely tough with air,
> and going much above that would seem to require something approaching
> a dedicated wind tunnel.  Any opinions on how high the power
> dissipation in racks will go  before the manufacturers throw
> in the air cooling towel and start shipping them with water
> connections?
Consider that 10kW is 5-10 times the power dissipation of a hair dryer.

Another solution that might turn up is an internal cooling loop to move
heat from inside to a big heatsink on the surface. Modern rack mounted PCs
aren't particularly designed for efficient thermal transfer with minimal air
flow. (there's no economic incentive for it)

There are economies of scale to a common chiller, though, because when you
get to large HVAC, cold water is what you get, rather than cold air, because
moving cold air is a LOT more expensive than moving cold water.
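
One way to put a number on "a LOT" is to compare how much heat a cubic
metre of each fluid carries per degree of temperature rise. The property
values below are textbook approximations chosen for this sketch; pumping
and fan power are left out entirely.

```python
# Approximate room-temperature properties (assumed values).
WATER = {"density": 998.0, "specific_heat": 4182.0}   # kg/m^3, J/(kg K)
AIR = {"density": 1.2, "specific_heat": 1005.0}       # kg/m^3, J/(kg K)

def volumetric_heat_capacity(fluid):
    """Heat carried per cubic metre per kelvin, in J/(m^3 K)."""
    return fluid["density"] * fluid["specific_heat"]

ratio = volumetric_heat_capacity(WATER) / volumetric_heat_capacity(AIR)
print(f"water carries roughly {ratio:.0f}x more heat per unit volume than air")
```

So for the same temperature rise, a given heat load needs on the order
of a few thousand times less water volume than air volume.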

>
> If you were designing a computer room today (which I am) what would
> you allow for the maximum power dissipation per rack _to_be_handled_
> by_the_room_A/C.  The assumption being that in 8 years if somebody
> buys a 40kW (heaven forbid) rack it will dump its heat through
> a separate water cooling system.

There are such things as individual rack chillers, which you would bolt to a
rack and then hook up to a centralized cold water source.

>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


From james.p.lux at jpl.nasa.gov  Sun Feb 13 19:33:50 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sun, 13 Feb 2005 19:33:50 -0800
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
References: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>
	<Pine.LNX.4.58.0502131745001.6364@lilith.rgb.private.net>
Message-ID: <000401c51246$00751150$26f49580@LAPTOP152422>

>
> >
> I think you're within a factor of 2 or so of the SANE threshold at 10KW.
> A rack full of 220 W Opterons is there already (~40 1U enclosures).  I'd
> "believe" that you could double that with a clever rack design, e.g.
> Rackable's, but somewhere in this ballpark...it stops being sane.
>
> > If you were designing a computer room today (which I am) what would
> > you allow for the maximum power dissipation per rack _to_be_handled_
> > by_the_room_A/C.  The assumption being that in 8 years if somebody
> > buys a 40kW (heaven forbid) rack it will dump its heat through
> > a separate water cooling system.
>
> This is a tough one.  For a standard rack, ballpark of 10 KW is
> accessible today.  For a Rackable rack, I think that they can not quite
> double this (but this is strictly from memory -- something like 4 CPUs
> per U, but they use a custom power distribution which cuts power and a
> specially designed airflow which avoids recycling used cooling air).  I
> don't know what bladed racks achieve in power density -- the earlier
> blades I looked at had throttled back CPUs but I imagine that they've
> cranked them up at this point (and cranked up the heat along with them).
>
> Ya pays your money and ya takes your choice.  An absolute limit of 25
> (or even 30) KW/rack seems more than reasonable to me, but then, I'd
> "just say no" to rack/serverroom designs that pack more power than I
> think can sanely be dissipated in any given volume. Note that I consider
> water cooled systems to be insane a priori for all but a small fraction
> of server room or cluster operations, "space" generally being cheaper
> than the expense associated with achieving the highest possible spatial
> density of heat dissipating CPUs.  I mean, why stop at water?  Liquid
> Nitrogen.  Liquid Helium.  If money is no option, why not?  OTOH, when
> money matters, at some point it (usually) gets to be cheaper to just
> build another cluster/server room, right?

The speed of light starts to set another limit for the physical size, if you
want real speed.  There's a reason why the old Crays are compact and liquid
cooled.  It's that nanosecond-or-more per foot propagation delay.  Once you
get past a certain threshold, you're actually better off going to very dense
form factors and liquid cooling, in many areas.  I think that most clusters
haven't reached the performance point where it's worth liquid cooling the
processors, but it's probably pretty close to the threshold. Adding machine
room space is expensive for other reasons.  You've already got to have the
water chillers for any sort of major sized cluster (to cool the air), so the
incremental cost to providing an appropriate interface to the racks and
starting to build racks in liquid cooled configurations can't be far away.
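
The per-foot delay behind the compactness argument can be estimated
directly. The sketch below uses the exact speed of light in vacuum; the
0.66 velocity factor is an assumed, typical value for copper cable and
is not taken from the message.

```python
SPEED_OF_LIGHT = 299_792_458.0   # m/s in vacuum (exact by definition)
FT_TO_M = 0.3048

def delay_ns_per_foot(velocity_factor=1.0):
    """One-way propagation delay per foot, in nanoseconds."""
    return FT_TO_M / (SPEED_OF_LIGHT * velocity_factor) * 1e9

def one_way_delay_ns(distance_ft, velocity_factor=1.0):
    """One-way delay over distance_ft, in nanoseconds."""
    return distance_ft * delay_ns_per_foot(velocity_factor)

print(f"vacuum: {delay_ns_per_foot():.2f} ns/ft")
print(f"cable (assumed velocity factor 0.66): {delay_ns_per_foot(0.66):.2f} ns/ft")
print(f"30 ft of such cable: {one_way_delay_ns(30, 0.66):.0f} ns one way")
```

That works out to about 1 ns per foot in vacuum and about 1.5 ns per
foot in cable under the assumed velocity factor, before any switch or
interface latency is added.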

Liquid cooling is MUCH more efficient than air cooling: better heat
transfer, better life (more even temperatures), less real estate required,
etc.  The hangup now is that nobody makes liquid cooled PCs as a commodity,
mass production item.  What you'll find is liquid cooling retrofits that
don't take advantage of what liquid cooling can get you. If you look at high
performance radar or sonar processors and such that use liquid cooling, the
layout and physical configuration is MUCH different (partly driven by the
fact that the viscosity of liquid is higher than air).

Wouldn't YOU like to have, say, 1000 processors in one rack, with a  2-3"
flexible pipe to somewhere else?  Especially if it was perfectly quiet? And
could sit next to your desk?  (1000 processors*100W each is 100kW).
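
As a sanity check on the 2-3" pipe, here is a rough estimate of the water
flow needed to carry away 100 kW. The 10 K water temperature rise, the
water properties, and the 2.5" inner diameter are assumptions made for
the sketch.

```python
import math

WATER_DENSITY = 998.0          # kg/m^3, assumed
WATER_SPECIFIC_HEAT = 4182.0   # J/(kg K), assumed
INCH_TO_M = 0.0254

def water_flow_l_per_s(heat_watts, delta_t_kelvin):
    """Litres per second of water needed to absorb heat_watts."""
    m3_per_s = heat_watts / (WATER_DENSITY * WATER_SPECIFIC_HEAT * delta_t_kelvin)
    return m3_per_s * 1000.0

def pipe_velocity_m_s(flow_l_per_s, inner_diameter_in):
    """Mean water speed in a round pipe of the given inner diameter."""
    area_m2 = math.pi * (inner_diameter_in * INCH_TO_M / 2.0) ** 2
    return (flow_l_per_s / 1000.0) / area_m2

heat = 1000 * 100                      # 1000 processors at 100 W each
flow = water_flow_l_per_s(heat, 10.0)  # assumed 10 K rise in the water
print(f"flow: {flow:.2f} L/s")
print(f"speed in a 2.5 inch pipe: {pipe_velocity_m_s(flow, 2.5):.2f} m/s")
```

Under these assumptions the flow is a couple of litres per second at
well under 1 m/s, so a pipe of that size looks plausible.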


From rene at renestorm.de  Sat Feb 12 20:29:58 2005
From: rene at renestorm.de (rene)
Date: Sun, 13 Feb 2005 05:29:58 +0100
Subject: [Beowulf] Block send mpi
Message-ID: <200502130529.58915.rene@renestorm.de>

Hi folks,

I know this isn't an MPI forum, but even so, allow me a question about
block sending.

I sometimes get nice SIGSEGVs with this code (C++ implementation).
Did I code something totally wrong?
I really don't understand this function.
// int MPI_Buffer_attach( void *buffer, int size )

 int packsize;
 MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
 int bufsize = packsize + (MPI_BSEND_OVERHEAD);
 void *buf = new (void (*[packsize]) ());
 MPI_Buffer_attach (buf, bufsize);
 ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0,  newcomm);
 MPI_Buffer_detach (&buf, &bufsize);
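
One possible reading of the allocation above, offered as a guess and not
a diagnosis: `new (void (*[packsize]) ())` allocates packsize function
pointers, so packsize * sizeof(pointer) bytes, while MPI_Buffer_attach
is told the buffer holds packsize + MPI_BSEND_OVERHEAD bytes. The sketch
compares the two sizes. MPI_BSEND_OVERHEAD is implementation-defined;
the 64 used here is purely an assumed value, as is the 4-byte pointer.

```python
def allocated_bytes(packsize, pointer_size):
    """Bytes obtained by allocating `packsize` pointers."""
    return packsize * pointer_size

def attached_bytes(packsize, bsend_overhead):
    """Bytes the code tells MPI_Buffer_attach it may use."""
    return packsize + bsend_overhead

def allocation_is_large_enough(packsize, pointer_size=4, bsend_overhead=64):
    """True when the allocation covers the size passed to MPI_Buffer_attach."""
    return allocated_bytes(packsize, pointer_size) >= attached_bytes(packsize, bsend_overhead)

# A few packed message sizes in bytes (illustrative values only).
for packsize in (4, 16, 400):
    print(packsize, allocation_is_large_enough(packsize))
```

If this reading is right, small messages would overrun the buffer and
large ones would not, which would fit the "sometimes". Allocating
exactly bufsize bytes (for example with `new char[bufsize]`) removes the
mismatch either way.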

Thanks,
-- 
Rene Storm
@Cluster


From maurice at harddata.com  Sun Feb 13 11:30:43 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Sun, 13 Feb 2005 12:30:43 -0700
Subject: [Beowulf] cooling question:  cfm per rack?
In-Reply-To: <200502122000.j1CK096j019160@bluewest.scyld.com>
References: <200502122000.j1CK096j019160@bluewest.scyld.com>
Message-ID: <420FAAE3.9070108@harddata.com>

Dean Johnson <dtj at uberh4x0r.org> wrote:

> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
>
>>> 
>>> Relative airflow can probably be measured with a kid's toy -- one of the
>>> little pinwheels -- and counting revolutions with a stopwatch.
>>> Normalizing that to absolute airflow in CFM is a bit tricky (since the
>>> result depends to some extent on the resistance imposed by the measuring
>>> apparatus) but somebody out there may have designed a version of this
>>> with a real fan and magnets set so that the counting is done
>>> electronically.  In fact, I could build something to do this out of OTC
>>> parts if I had any way to normalize the count.
>>> 
>
>
>Could you not use one of those cheapish wind speed devices that amateur
>weather folks use? That would give you a rating, presumably in miles per
>hour, and then figure backward based upon the area of the little fan
>thingy. That would likely be not too expensive and a great deal easier,
>and more accurate, to deal with than counting a pinwheel.  ;-) 
>
>	-Dean

One can also go to an auto wreckers, and from many newer models of cars get a Mass Air Flow sensor (MAF) from the throttle body.
Modern cars use these, in conjunction with an O2 sensor on the exhaust, to manage fuel injection.
The MAF returns a variable DC voltage, usually in the range of 0 to 5V (depending on air speed).
Make a tube, mount the MAF with the probe end in the tube, and attach it to the back of the device being measured.
Supply 12V DC and connect to the output for measurement.
Obviously this would have to be calibrated.
It is cheap, very accurate, and very reliable.

If you want to make it more useful, a lot of modern cars also use a barometric pressure sensor, and the calculations can be done using both outputs. This helps a lot, as things like current weather conditions and altitude have a large bearing on air pressure.
Measuring flow by speed only, and ignoring pressure, is a fairly inaccurate method.

Lastly, one can measure the humidity, as this also has a pretty large influence on the cooling capacity of the air being moved.

For around $25 one can cannibalize the parts and cabling from a modern car wreck.
All that is left is to provide a DC 12V source, a computer with a 4 channel A/D chip on a proto board, and some calibration.
The calibration will be the toughest challenge, as you will need accurate precalibrated instruments for a test session, but at least this is a one-time job, and the instruments may be borrowed.
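
The calibration step could be as simple as a lookup table with linear
interpolation between measured points. The sketch below assumes that
approach; the voltage and flow pairs are invented placeholders standing
in for readings taken against a borrowed reference instrument, not data
from any real sensor.

```python
from bisect import bisect_right

# Hypothetical calibration points: (sensor volts, reference flow in CFM).
CALIBRATION = [(0.5, 0.0), (1.5, 40.0), (2.5, 110.0), (3.5, 220.0), (4.5, 380.0)]

def volts_to_cfm(volts, table=CALIBRATION):
    """Interpolate linearly between calibration points; clamp outside the table."""
    voltages = [v for v, _ in table]
    if volts <= voltages[0]:
        return table[0][1]
    if volts >= voltages[-1]:
        return table[-1][1]
    i = bisect_right(voltages, volts)
    (v0, f0), (v1, f1) = table[i - 1], table[i]
    return f0 + (f1 - f0) * (volts - v0) / (v1 - v0)

print(volts_to_cfm(2.0))   # halfway between the 1.5 V and 2.5 V points
```

A table is used here instead of a single scale factor because the
sensor output is not assumed to be linear in flow.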



From landman at scalableinformatics.com  Sun Feb 13 21:42:21 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 14 Feb 2005 00:42:21 -0500
Subject: [Beowulf] Block send mpi
In-Reply-To: <200502130529.58915.rene@renestorm.de>
References: <200502130529.58915.rene@renestorm.de>
Message-ID: <42103A3D.8020605@scalableinformatics.com>

Rene:

   More data.  Where exactly does it SEGV?  At the void *buf line? at 
the Pack? or the Bsend?  Did you compile with -g?   Do you have a core dump?

Joe

rene wrote:
> Hi folks,
> 
> i know, this isn't a mpi forum, even so allow me a question about block 
> sending.
> 
> i got some(times) nice SIGSEGVs with that code (C++ implementation).
> Did I code something totally wrong?
> I really don't understand this function.
> // int MPI_Buffer_attach( void *buffer, int size )
> 
>  int packsize;
>  MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
>  int bufsize = packsize + (MPI_BSEND_OVERHEAD);
>  void *buf = new (void (*[packsize]) ());
>  MPI_Buffer_attach (buf, bufsize);
>  ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0,  newcomm);
>  MPI_Buffer_detach (&buf, &bufsize);
> 
> Thanks,

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


From james.p.lux at jpl.nasa.gov  Sun Feb 13 22:31:26 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sun, 13 Feb 2005 22:31:26 -0800
Subject: [Beowulf] cooling question:  cfm per rack?
References: <200502122000.j1CK096j019160@bluewest.scyld.com>
	<420FAAE3.9070108@harddata.com>
Message-ID: <002301c5125e$ede1ef40$32a8a8c0@LAPTOP152422>


----- Original Message -----
From: Maurice Hilarius

Dean Johnson <dtj at uberh4x0r.org> wrote:
On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
>
> Relative airflow can probably be measured with a kid's toy -- one of the
> little pinwheels -- and counting revolutions with a stopwatch.
> Normalizing that to absolute airflow in CFM is a bit tricky (since the
> result depends to some extent on the resistance imposed by the measuring
> apparatus) but somebody out there may have designed a version of this
> with a real fan and magnets set so that the counting is done
> electronically.  In fact, I could build something to do this out of OTC
> parts if I had any way to normalize the count.
>

Could you not use one of those cheapish wind speed devices that amateur
weather folks use? That would give you a rating, presumably in miles per
hour, and then figure backward based upon the area of the little fan
thingy. That would likely be not too expensive and a great deal easier,
and more accurate, to deal with than counting a pinwheel.  ;-)

 -Dean

One can also go to an auto wreckers, and from many newer models of
cars get a Mass Air Flow sensor (MAF) from the throttle body.
Modern cars use these, in conjunction with an O2 sensor on the exhaust, to
manage fuel injection.
The MAF returns a variable DC voltage, usually in the range of 0 to 5V
(depending on air speed).
Make a tube, mount the MAF with the probe end in the tube, attach to back of
device being measured.
Supply 12V DC, connect to output for measurement.
Obviously this would have to be calibrated.
It is cheap, and very accurate and very reliable.

If you want to make it more useful, a lot of modern cars also use a
barometric pressure sensor, and the calcs can be done using both outputs.
This helps a lot as things like current weather conditions and altitude have
a large bearing on air pressure.
Measuring flow by speed only, and ignoring pressure is a fairly inaccurate
method.

Lastly, one can measure the humidity, as this also has a pretty large
influence on the cooling capacity of the air being moved.

For around $25 one can cannibalize the parts and cabling from a modern car
wreck.
All that is left is to provide a DC 12V source, a computer with a 4 channel
A/D chip on a proto board, and some calibration.
The calibration will be the toughest challenge as you will need accurate
precalibrated instruments for a test session, but at least this is one time,
and may be borrowed..

----

The problem with automotive mass air flow sensors is sensitivity at low
flows.  Consider, for a moment, a 1.8 liter engine turning over at 1800 rpm
(call it 30 rev/sec..) That's 1.8*15 liters/sec of air (27 liters/sec),
being drawn through a tube some 5-10 cm in diameter
(call it 60 cm2).. that's 450 cm/sec or 4.5 m/sec... 885 linear ft/minute a
fairly fast airflow in HVAC terms.... And that's the bottom of the range for
the automotive sensor.
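
The arithmetic above can be reproduced in a few lines. The sketch follows
the same simplifications as the message: a four-stroke engine that takes
in its full displacement once every two revolutions, and a 60 cm^2 intake
tube.

```python
def intake_velocity_m_s(displacement_l, rpm, tube_area_cm2):
    """Mean air speed in the intake tube of a four-stroke engine."""
    rev_per_s = rpm / 60.0
    # A four-stroke engine draws in its displacement once per two revolutions.
    flow_l_per_s = displacement_l * rev_per_s / 2.0
    flow_cm3_per_s = flow_l_per_s * 1000.0
    return flow_cm3_per_s / tube_area_cm2 / 100.0   # cm/s converted to m/s

velocity = intake_velocity_m_s(1.8, 1800, 60.0)
feet_per_minute = velocity / 0.3048 * 60.0
print(f"{velocity:.1f} m/s, about {feet_per_minute:.0f} ft/min")
```

This gives the same 4.5 m/s, or a little under 900 linear ft/min, quoted
as the bottom of the sensor's range.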


From rgb at phy.duke.edu  Mon Feb 14 03:18:27 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 14 Feb 2005 06:18:27 -0500 (EST)
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: <000401c51246$00751150$26f49580@LAPTOP152422>
References: <E1D0RdS-0007hy-00@mendel.bio.caltech.edu>
	<Pine.LNX.4.58.0502131745001.6364@lilith.rgb.private.net>
	<000401c51246$00751150$26f49580@LAPTOP152422>
Message-ID: <Pine.LNX.4.58.0502140501400.6364@lilith.rgb.private.net>

On Sun, 13 Feb 2005, Jim Lux wrote:

> > I think you're within a factor of 2 or so of the SANE threshold at 10KW.
> > A rack full of 220 W Opterons is there already (~40 1U enclosures).  I'd
> > "believe" that you could double that with a clever rack design, e.g.
> > Rackable's, but somewhere in this ballpark...it stops being sane.
> >
> > > If you were designing a computer room today (which I am) what would
> > > you allow for the maximum power dissipation per rack _to_be_handled_
> > > by_the_room_A/C.  The assumption being that in 8 years if somebody
> > > buys a 40kW (heaven forbid) rack it will dump its heat through
> > > a separate water cooling system.
> >
> > This is a tough one.  For a standard rack, ballpark of 10 KW is
> > accessible today.  For a Rackable rack, I think that they can not quite
> > double this (but this is strictly from memory -- something like 4 CPUs
> > per U, but they use a custom power distribution which cuts power and a
> > specially designed airflow which avoids recycling used cooling air).  I
> > don't know what bladed racks achieve in power density -- the earlier
> > blades I looked at had throttled back CPUs but I imagine that they've
> > cranked them up at this point (and cranked up the heat along with them).
> >
> > Ya pays your money and ya takes your choice.  An absolute limit of 25
> > (or even 30) KW/rack seems more than reasonable to me, but then, I'd
> > "just say no" to rack/serverroom designs that pack more power than I
> > think can sanely be dissipated in any given volume. Note that I consider
> > water cooled systems to be insane a priori for all but a small fraction
> > of server room or cluster operations, "space" generally being cheaper
> > than the expense associated with achieving the highest possible spatial
> > density of heat dissipating CPUs.  I mean, why stop at water?  Liquid
> > Nitrogen.  Liquid Helium.  If money is no option, why not?  OTOH, when
> > money matters, at some point it (usually) gets to be cheaper to just

Keyword:                             ^^^^^^^

> > build another cluster/server room, right?

Sure, I agree with everything below, for bleeding edge work.  Or if
you're building a cluster in your Manhattan office, where for whatever
reason you have to work with a space the size of a broom closet (but
where you miraculously have access to a stream of chilled water, or
liquid nitrogen, or liquid helium).

This just (IMO) pushes you over some sort of magic threshold that (while
arbitrary and existing perhaps only in my fevered imagination) separates
"COTS clusters" from a "big iron supercomputer".  I have a hard time
seeing liquid cooled clusters as being a beowulf in the sense I have
grown to know and love.  COTS clusters have always been about being ABLE
to DIY, and while I can (if my life depends on it) do plumbing, it just
seems like there would be some highly nonlinear cost and hassle
thresholds in there.  

Also, I just cannot see COTS systems being built with copper pipes and
coupling valves where you hook them into your household or office
chilled water supply at your desk.  I suspect that COTS desktops and
even server mobos will continue to be engineered to be air cooled in the
foreseeable future.

Now your observation that racks themselves may start coming with a pair
of copper pipes and couplings for a built-in blower and heat exchanger
-- so the rack itself is in some sense "liquid cooled", while the actual
nodes within are still COTS mobos cooled by air -- I don't know what the
cost and volume trade-offs are of this solution. Cooling the air in the
rack bases (more likely at the top of each rack and ducting the cold air
down to the base) vs cooling the air in a big Liebert and piping the
cool air around to the bases in a raised floor -- hmmm.

One thing to remember (that I think was brought up one of the last times
this issue was raised on list) -- I know from bitter experience that
water couplings are a PITA to reliably get, and keep, tight under
pressure.  When they leak ("when" because of Murphy), they're going to
make God's Own mess and potentially ruin many tens of thousands of
dollars worth of hardware.  Heat exchangers at the tops of racks also
increase the probability that humidity will be a problem -- I also know
from bitter experience that overhead cold air ducting has a tendency to
sweat unless carefully insulated, and the sweat in a humid climate like
NC will inevitably drip into whatever is below.  Heat exchangers at the
bottom make it harder to move the warm air exhausted at the rack tops
back to the bottom for recooling as you're working against an air
pressure/density convective flow differential and not with it.

Finally, there are likely to be Human Resources and state regulatory
issues with liquid cooled electronics -- systems and network engineers
somehow are viewed as being competent to manage end-stage electronics
from the plug point on even by the unions in all but the most rabid of
union shops (although I have heard of places where you have to call a
union employee in to do any major plugging or unplugging of certain
kinds of hardware).  That simply won't be the case with liquid cooled
hardware.  I may be able to work on my household plumbing (and wiring),
but if I set my hand to plumbing at Duke the HR Gods and the State would
get Angry, and if anything went wrong (like a leak causing a short and a
fire) I would be Held Liable. This adds another project-staffing human
notch to the TCO -- likely a fairly significant one as the heat
exchanger/blowers in EACH rack might well need servicing and inspection
1-2x a year (as the room unit does now).

None of these things are insurmountable difficulties, and as you note
there are certain big, expensive pieces of hot hardware (big lasers,
giant magnets, automobile engines) that one DOES plug right into a
chilled water loop.  With the exception of car engines they tend to be
components with 6-8 figure price tags, though, where tacking on a full
or part time FTE for managing the plumbing etc is a small fraction of
the total marginal cost of operation.  I'd expect this to make sense
only for clusters in this same category -- really large, already
expensive clusters shooting for bleeding edge performance (top 10 of top
500) at very high density someplace where a) physical space is very
"expensive" (justifying the trade off economically); or b) speed of
light and/or interconnect lengths are indeed an issue.

Note that fixing the latter will likely rely as much on moving out
of the COTS arena for the cluster interconnect as it does on cooling
alone.  High end cluster interconnects are again almost by definition
engineered on the assumption of air-cooled node densities and internode
latencies that are specified by worst-case assumptions and protocol, not
speed of light in the sense that interconnect length is an important
parameter in the overall latency.  As in 1 usec is pretty good latency
for a modern interconnect IIRC, and a light-usecond is 3x10^8 x 10^-6 =
300 meters.  I'd guess that very little of the internode latency over
fiber is due to speed of light delays per se and nearly all of it is in
the interconnects themselves, the switches, and the node bus interface.
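
To put rough numbers on the argument above, here is a minimal sketch of
how much of a latency budget is pure propagation delay. The function
names and the 2/3 c velocity factor for fiber are illustrative
assumptions, not figures from this thread; the 300 m per usec estimate
above corresponds to a velocity factor of 1.0.

```cpp
#include <cassert>

// Speed of light in vacuum, rounded as in the text (m/s).
constexpr double kLightSpeed = 3.0e8;

// One-way propagation delay in nanoseconds over length_m meters.
// velocity_factor is the fraction of c at which the signal travels
// (1.0 for vacuum; about 2/3 is an assumed rule of thumb for fiber).
double propagation_delay_ns(double length_m, double velocity_factor = 1.0) {
    assert(length_m >= 0.0 && velocity_factor > 0.0 && velocity_factor <= 1.0);
    return length_m / (kLightSpeed * velocity_factor) * 1e9;
}

// Fraction of a latency budget (in ns) consumed by propagation alone.
double propagation_share(double length_m, double latency_ns,
                         double velocity_factor = 1.0) {
    assert(latency_ns > 0.0);
    return propagation_delay_ns(length_m, velocity_factor) / latency_ns;
}
```

Under these assumptions a 10 m cable run costs about 33 ns at c, or
about 50 ns at 2/3 c, which is only a few percent of a 1 usec latency.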

> The speed of light starts to set another limit for the physical size, if you
> want real speed.  There's a reason why the old Crays are compact and liquid
> cooled.  It's that several nanoseconds per foot propagation delay.  Once you

There's also a reason why old Crays are currently used primarily as
lobby art, wherever they haven't been disassembled and bathed in
mercury to recover all that gold.  Several reasons, actually, but liquid
cooling and the hassle and expense it entailed are a big one.  Many a
Cray was finally decommissioned when one could build and operate a true
COTS cluster with as much or more raw horsepower for what it cost for
just the infrastructure support for the Cray it supplanted.  Like it or
not, Moore's Law biases cost-benefit solutions heavily towards the COTS
and disposable, and wet-cooling requires a significant and sustained
investment in a particular technology that is likely to remain
non-mainstream, human-resource intensive, and hence nonlinearly costly
in a TCO CBA.  One needs significant benefit in order to make it
worthwhile.

> get past a certain threshold, you're actually better off going to very dense
> form factors and liquid cooling, in many areas.  I think that most clusters
> haven't reached the performance point where it's worth liquid cooling the
> processors, but it's probably pretty close to the threshold. Adding machine
> room space is expensive for other reasons.  You've already got to have the
> water chillers for any sort of major sized cluster (to cool the air), so the
> incremental cost to providing an appropriate interface to the racks and
> starting to build racks in liquid cooled configurations can't be far away.
> 
> Liquid cooling is MUCH more efficient than air cooling: better heat
> transfer, better life (more even temperatures), less real estate required,
> etc.  The hangup now is that nobody makes liquid cooled PCs as a commodity,
> mass production item.  What you'll find is liquid cooling retrofits that
> don't take advantage of what liquid cooling can get you. If you look at high
> performance radar or sonar processors and such that use liquid cooling, the
> layout and physical configuration is MUCH different (partly driven by the
> fact that the viscosity of liquid is higher than air).
> 
> Wouldn't YOU like to have, say, 1000 processors in one rack, with a  2-3"
> flexible pipe to somewhere else?  Especially if it was perfectly quiet? And
> could sit next to your desk?  (1000 processors*100W each is 100kW).

If somebody else paid for and fed the whole thing, you could multiply
the capacity by an order of magnitude and use liquid nitrogen for
cooling instead of water and I'd simply love it.  And as Austin Powers
might add, I'd like a gold-plated potty as well -- but I'm not going to
get it...;-)

Alas, in the real world it isn't about what I'd "like", it is about what
I can afford, about what I can convince a grant agency to pay for.  High
infrastructure costs come out of node count, and node count matters --
in many projects, it is the PRIMARY thing that matters.  High density
increases infrastructure costs, often nonlinearly, and hence decreases
node count at any fixed budget.  In order for liquid cooling to ever
make sense for COTS clusters, it would have to BECOME COTS -- basically,
to become cheap in both hardware and human terms.  Might happen, might
happen, but I'm not holding my breath...

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From john.hearns at streamline-computing.com  Mon Feb 14 03:32:33 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Mon, 14 Feb 2005 11:32:33 +0000
Subject: [Beowulf] Home Beowulf Intial Startup Question
In-Reply-To: <Pine.LNX.4.44.0502131521220.16537-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502131521220.16537-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1108380753.5708.0.camel@localhost.localdomain>

On Sun, 2005-02-13 at 15:41 -0500, Mark Hahn wrote:
> > netowrk technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram 

Have a look at the new O'Reilly book 'High Performance Linux Clusters
with OSCAR, Rocks, openMosix, and MPI'. Should be of help to you.
I'm doing a review for the UKUUG newsletter.


From ashley at quadrics.com  Mon Feb 14 08:23:02 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Mon, 14 Feb 2005 16:23:02 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
Message-ID: <1108398183.8243.54.camel@localhost.localdomain>

On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote:
> I agree that people using MPI_Isend() and related non-blocking operations 
> are sometimes doing so because they would like to perform some 
> computation while the communication progresses.  People also use these 
> calls to initiate a collection of point-to-point operations before 
> waiting, so that multiple communications may proceed in parallel.  The 
> implementation has no way of really knowing which of these is the case.

Either of these reasons for using non-blocking sends is valid and both
will benefit from low CPU use in the Send call.  Why would the
implementation want to know the reason for using non-blocking sends?

> You should understand that the way MPI implementations are measured is by 
> their performance, not CPU utilization, so there is pressure to push the 
> former as much as possible at the expense of the latter.

It's relatively difficult to measure the CPU overhead of calls.  Some
benchmarks work out the "issue rate" of sends (operations/second) and
some measure how much compute (spinning) can be achieved before having a
measurable effect on the latency.  Both of these are valid; however, the
results are harder for the non-technical person to comprehend.  Headline
latency/bandwidth are just that: headline figures that don't tell the
whole story.

Ashley,


From rross at mcs.anl.gov  Mon Feb 14 09:04:17 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 11:04:17 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <web-520595@free.net>
References: <web-520595@free.net>
Message-ID: <Pine.LNX.4.58.0502141057320.1141@terra.mcs.anl.gov>

Hi Mikhail,

I don't know all the implementations well enough to comment on them 
one-by-one.  I'm sure that Rossen can talk about their implementation with 
regards to (a) below, and others will fill in other gaps.

In general, to support (a) the implementation must either spawn a thread 
or have support from the NIC to make progress (this is related to the 
"Progress Rule" that people occasionally bring up).  The standard *does 
not* specify that progress must be made when not in an MPI_ call.  
MPICH/MPICH2 do not use an extra thread (for portability one cannot assume 
that threads are available!).  Thus the only overlap that occurs in MPICH2 
over TCP is through the socket buffers.

Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a 
sequence of MPI_Sends isn't hard, particularly if the messages are to 
different ranks.  I would guess that every implementation will provide 
better performance in the case where the user tells the implementation 
about all these concurrent operations and then MPI_Waits on the bunch.

Hope this helps some,

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


On Mon, 14 Feb 2005, Mikhail Kuzminsky wrote:

> Let me ask some stupid's question: which MPI implementations allow
> really
>   
> a) to overlap MPI_Isend w/computations
> and/or 
> b) to perform a set of subsequent MPI_Isend calls faster than "the 
> same" set of MPI_Send calls ?
> 
> I say only about sending of large messages.
> 
> I'm interesting (1st of all) in
> - Gigabit Ethernet w/LAM MPI or MPICH
> - Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI
> 
> Yours
> Mikhail Kuzminsky
> Zelinsky Institute of Organic Chemistry
> Moscow


From rross at mcs.anl.gov  Mon Feb 14 09:11:31 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 11:11:31 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108398183.8243.54.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> 
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
Message-ID: <Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>

On Mon, 14 Feb 2005, Ashley Pittman wrote:

> On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote:
> > I agree that people using MPI_Isend() and related non-blocking operations 
> > are sometimes doing so because they would like to perform some 
> > computation while the communication progresses.  People also use these 
> > calls to initiate a collection of point-to-point operations before 
> > waiting, so that multiple communications may proceed in parallel.  The 
> > implementation has no way of really knowing which of these is the case.
> 
> Either of these reasons for using non-blocking sends is valid and both
> will benefit from low CPU use in the Send call.  Why would the
> implementation want to know the reason for using non-blocking sends?

If you used the non-blocking send to allow for overlapped communication, 
then you would like the implementation to play nicely.  In this case the 
user will compute and eventually call MPI_Test or MPI_Wait (or a flavor 
thereof).

If you used the non-blocking sends to post a bunch of communications that
you are going to then wait to complete, you probably don't care about the
CPU -- you just want the messaging done.  In this case the user will call 
MPI_Wait after posting everything it wants done.

One way the implementation *could* behave is to assume the user is trying
to overlap comm. and comp. until it sees an MPI_Wait, at which point it
could go into this theoretical "burn CPU to make things go faster" mode.  
That mode could, for example, tweak the interrupt coalescing on an 
ethernet NIC to process packets more quickly (I don't know off the top of 
my head if that would work or not; it's just an example).

All of this is moot of course unless the implementation actually has more
than one algorithm that it could employ...

Rob


From James.P.Lux at jpl.nasa.gov  Mon Feb 14 09:17:00 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Mon, 14 Feb 2005 09:17:00 -0800
Subject: [Beowulf] some thoughts on thermal design, liquid cooling, etc.
Message-ID: <6.1.1.1.2.20050214090712.0416fd68@mail.jpl.nasa.gov>

It occurs to me that the real limiting factor in producing "cluster 
oriented thermal design" is the volume of sales.

Say you want to design a custom motherboard/package for use in 
clusters.  This is, at a guess, probably a 3-5 million dollar project 
(maybe down around a million if it's real close to an existing design).

Say the cost of a node is around a kilobuck or 2 (in plain, non-custom, 
commodity trim).

If you had a cluster with 1000 of those custom mobos, you're looking at 
adding $3K/node to the cluster.  That's a bit punitive...  You could buy a 
lot of machine room and cooling for that $3 mil.

Now, on the other hand, if you had 100 people willing to each buy a cluster 
of this scale, then it's only adding $30-50/node, which is a lot more 
reasonable.
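
The amortization above is simple enough to sketch in a few lines. The
function name and parameters are hypothetical; the dollar figures are
the guesses from this message, not real quotes.

```cpp
#include <cassert>

// Non-recurring engineering (NRE) cost spread over every board sold.
// nre_dollars: one-time design cost; clusters: number of buyers;
// nodes_per_cluster: boards bought by each of them.
double nre_per_node(double nre_dollars, long clusters, long nodes_per_cluster) {
    assert(clusters > 0 && nodes_per_cluster > 0);
    return nre_dollars / (static_cast<double>(clusters) * nodes_per_cluster);
}
```

With a $3M design, one 1000-node cluster carries $3000 per node, while
100 such clusters carry $30 per node; a $5M design spread over the same
100 clusters comes to $50 per node.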

Compare this to the consumer motherboard market (which, after all, is what 
we are really using here...) A production run of several million mobos 
isn't all that huge, so a Dell or HP can and do create customized 
motherboard designs to meet some peculiar requirement (on-board 
peripherals, etc.).  Such customization only adds a buck to the mobo cost, 
and presumably, that buck is made up in cheaper packaging, shorter cables, 
one less manufacturing step, or somewhere.

Somehow, I doubt that the total sales of ALL motherboards for clusters, of 
a given instance of motherboard design, exceeds a million units.  Cluster 
buyers tend to want different processors, different peripherals, etc., and 
each configuration change would drive a whole new design cycle.


There is hope on the horizon.  The increasing drive to "media computers" is 
creating a demand for PCs that have high performance, but are quiet and 
have good cooling.  I have a Motorola Moxi BMC9012 "set top box" at home 
from the cable company, and it is basically a Linux computer with an 80GB 
drive and some custom video hardware.  It's also hideously noisy (for 
something designed to sit in your living room) and dissipates >100W (all 
the time.. there's no on-off button).  There WILL be consumer pressure to 
make it silent and to do better thermal management.


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From steve_heaton at ozemail.com.au  Sun Feb 13 21:52:27 2005
From: steve_heaton at ozemail.com.au (steve_heaton at ozemail.com.au)
Date: Mon, 14 Feb 2005 16:52:27 +1100
Subject: [Beowulf] A home cluster of mobos
Message-ID: <20050214055227.YEMC24369.swebmail02.mail.ozemail.net@localhost>

Dear collective of great minds

I'd like to humbly introduce my little Beowulf "BORG" (Boring and Old but Real Grunt).

http://members.ozemail.com.au/~sheaton/lss/

-> Computing

The next performance consideration will be to start and work over TCP. Maybe a jump into GAMMA for a quick squizz? We'll see how it goes.

Cheers
Stevo



From rene at renestorm.de  Mon Feb 14 03:04:45 2005
From: rene at renestorm.de (rene)
Date: Mon, 14 Feb 2005 12:04:45 +0100
Subject: [Beowulf] Block send mpi
In-Reply-To: <42103A3D.8020605@scalableinformatics.com>
References: <200502130529.58915.rene@renestorm.de>
	<42103A3D.8020605@scalableinformatics.com>
Message-ID: <200502141204.45263.rene@renestorm.de>

Hi Joe,

here is some output and the changes which solve the problem.
I don't know why I created a void buffer and sent an int array.
After creating an int buffer I was also able to delete it ;o)


Tnx anyway
Rene

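int packsize;
MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
// one packed message plus MPI_BSEND_OVERHEAD bytes per pending Bsend
int bufsize = (rankcount - 1) * (packsize + MPI_BSEND_OVERHEAD);
// 		  void *buf = new (void (*[packsize]) ());
// bufsize is in bytes, so round up to a whole number of ints
int *buf = new int[(bufsize + sizeof (int) - 1) / sizeof (int)];

for (int az = 0; az < repeat + 1; az++)
{
  MPI_Buffer_attach (buf, bufsize);
  for (int node = 1; node < rankcount; node++)
    {
      bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0,  newcomm);
    }
  // detach blocks until all buffered messages have been delivered
  MPI_Buffer_detach (&buf, &bufsize);
}
delete[] buf;	// allocated with new[], so release with delete[]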


output for the old code: 
Program received signal SIGSEGV, Segmentation fault.
0:  0x40ad3860 in malloc_consolidate () from /lib/libc.so.6
0:  (gdb) kill
 rank 1 in job 4  xtrem_32898   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
rank 0 in job 4  xtrem_32898   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
1:  aborting job:
1:  Fatal error in MPI_Recv: Other MPI error, error stack:
1:  MPI_Recv(207): MPI_Recv(buf=0x8186388, count=32, MPI_INT, src=0, tag=0, 
comm=0x84000002, status=0xbfffee30) failed
1:  MPIDI_CH3_Progress_wait(207): an error occurred while handling an event 
returned by MPIDU_Sock_Wait()
1:  MPIDI_CH3I_Progress_handle_sock_event(492):
1:  connection_recv_fail(1728):
1:  MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=1)


Am Montag 14 Februar 2005 06:42 schrieb Joe Landman:
> Rene:
>
>    More data.  Where exactly does it SEGV?  At the void *buf line? at
> the Pack? or the Bsend?  Did you compile with -g?   Do you have a core
> dump?
>
> Joe
>
> rene wrote:
> > Hi folks,
> >
> > i know, this isn't a mpi forum, even so allow me a question about block
> > sending.
> >
> > i got some(times) nice SIGSEGVs with that code (C++ implementation).
> > Did I code something totally wrong?
> > I really don't understand this function.
> > // int MPI_Buffer_attach( void *buffer, int size )
> >
> >  int packsize;
> >  MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
> >  int bufsize = packsize + (MPI_BSEND_OVERHEAD);
> >  void *buf = new (void (*[packsize]) ());
> >  MPI_Buffer_attach (buf, bufsize);
> >  ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0,  newcomm);
> >  MPI_Buffer_detach (&buf, &bufsize);
> >
> > Thanks,

-- 
Rene Storm
@Cluster

Linux Cluster Consultant
Hamburgerstr. 42e
D-22952 Luetjensee
mailto:Rene at ReneStorm.de
Voice-IP: Skype.com, Rene_Storm


From kus at free.net  Mon Feb 14 07:47:15 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Mon, 14 Feb 2005 18:47:15 +0300
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
Message-ID: <web-520595@free.net>

In message from Rob Ross <rross at mcs.anl.gov> (Fri, 11 Feb 2005 
20:47:22 -0600 (CST)):
>Hi Isaac,
>On Fri, 11 Feb 2005, Isaac Dooley wrote:
>> >>Using MPI_ISend() allows programs to not waste CPU cycles waiting 
>>on the
>> >>completion of a message transaction.
>> >No, it allows the programmer to express that it wants to send a 
>>message 
>> >but not wait for it to complete right now.  The API doesn't specify 
>>the 
>> >semantics of CPU utilization.  It cannot, because the API doesn't 
>>have 
>> >knowledge of the hardware that will be used in the implementation.
>> That is partially true.  The context for my comment was under your 
>> assumption that everyone uses MPI_Send(). These people, as I stated 
>> before, do not care about what the CPU does during their blocking 
>>calls.
>I think that it is completely true.  I made no assumption about 
>everyone 
>using MPI_Send(); I'm a late-comer to the conversation. 
>I was not trying to say anything about what people making the calls 
>care
>about; I was trying to clarify what the standard does and does not 
>say.  
>However, I agree with you that it is unlikely that someone calling
>MPI_Send() is too worried about what the CPU utilization is during 
>the
>call.
>> I was trying to point out that programs utilizing non-blocking IO 
>>may 
>> have work that will be adversely impacted by CPU utilization for 
>> messaging. These are the people who care about CPU utilization for 
>> messaging. This I hopes answers your prior question, at least 
>>partially.
>I agree that people using MPI_Isend() and related non-blocking 
>operations 
>are sometimes doing so because they would like to perform some 
>computation while the communication progresses.  People also use 
>these 
>calls to initiate a collection of point-to-point operations before 
>waiting, so that multiple communications may proceed in parallel.

Let me ask a stupid question: which MPI implementations really allow

a) overlapping MPI_Isend with computation
and/or 
b) performing a set of subsequent MPI_Isend calls faster than "the 
same" set of MPI_Send calls?

I am asking only about sending large messages.

I'm interested (first of all) in
- Gigabit Ethernet w/LAM MPI or MPICH
- Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI

Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow
  
  
> The 
>implementation has no way of really knowing which of these is the 
>case.
>
>Greg just pointed out that for small messages most implementations 
>will do
>the exact same thing as in the MPI_Send() case anyway.  For large 
>messages
>I suppose that something different could be done.  In our 
>implementation
>(MPICH2), to my knowledge we do not differentiate.
>
>You should understand that the way MPI implementations are measured 
>is by 
>their performance, not CPU utilization, so there is pressure to push 
>the 
>former as much as possible at the expense of the latter.
>
>> Perhaps your applications demand low latency with no concern for the 
>>CPU 
>> during the time spent blocking. That is fine. But some applications 
>> benefit from overlapping computation and communication, and the 
>>cycles 
>> not wasted by the CPU on communication can be used productively.
>
>I wouldn't categorize the cycles spent on communication as "wasted"; 
>it's 
>not like we code in extraneous math just to keep the CPU pegged :).
>
>Regards,
>
>Rob
>---
>Rob Ross, Mathematics and Computer Science Division, Argonne National 
>Lab
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf


From mathog at mendel.bio.caltech.edu  Mon Feb 14 09:24:29 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Mon, 14 Feb 2005 09:24:29 -0800
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
Message-ID: <E1D0jxV-0001jO-00@mendel.bio.caltech.edu>

Robert G. Brown wrote:

> In order to for liquid cooling to ever
> make sense for COTS clusters, it would have to BECOME COTS -- basically,
> to become cheap in both hardware and human terms.

Shuttle's itty bitty computers have a heat pipe that goes out to
a radiator on the back of the case.  It isn't much of a step from
there to replacing the back radiator with a copper block.  That block
could in turn mate with another copper block which itself was
on a cold water line.  Ie, move the radiator even further from
the CPU and other heat generating parts of the computer.  So
a company like shuttle could relatively easily start selling
liquid cooled nodes using only minor modifications to its existing
hardware.

In this sort of a system you might have to pay to have the pros 
install (plumb) the rack itself, but you could still work on the
nodes of the rack, as is true now.  It would seem to be relatively
straightforward to have the nodes mate up copper block to copper
block when fully inserted, so that each node is not itself part
of the rack circulation system. The tricky part is
that something else would have to be attached to the copper block
on the back when the node was serviced on the bench.

On the plus side your racks could replace the building's current
hot water supply!

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From ashley at quadrics.com  Mon Feb 14 09:42:42 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Mon, 14 Feb 2005 17:42:42 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
Message-ID: <1108402962.8265.25.camel@localhost.localdomain>

On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
> If you used the non-blocking send to allow for overlapped communication, 
> then you would like the implementation to play nicely.  In this case the 
> user will compute and eventually call MPI_Test or MPI_Wait (or a flavor 
> thereof).
>
> If you used the non-blocking sends to post a bunch of communications that
> you are going to then wait to complete, you probably don't care about the
> CPU -- you just want the messaging done.  In this case the user will call 
> MPI_Wait after posting everything it wants done.
>
> One way the implementation *could* behave is to assume the user is trying
> to overlap comm. and comp. until it sees an MPI_Wait, at which point it
> could go into this theoretical "burn CPU to make things go faster" mode.  
> That mode could, for example, tweak the interrupt coalescing on an 
> ethernet NIC to process packets more quickly (I don't know off the top of 
> my head if that would work or not; it's just an example).

Maybe if you were using a channel interface (sockets) and all messages
were to the same remote process then it might make sense to coalesce all
the sends into a single transaction and just send this in the MPI_Wait
call.  The latency for a bigger network transaction *might* be lower
than the sum of the issue rates for smaller ones.

I'd hope that a well written application would itself bunch all its
sends into a single larger block if this optimisation was possible,
though.

Given any reasonably fast network, however, not doing anything until the
MPI_Wait call would destroy your latency.  It strikes me that this isn't
overlapping comms and compute but rather artificially delaying comms
to allow compute to finish, which seems rather pointless?

If you had a bunch of sends to do to N remote processes then I'd expect
you to post them in order (non-blocking) and wait for them all at the
end; the time taken to do this should be (base_latency + ( (N-1) * M ))
where M is the reciprocal of the "issue rate".  You can clearly see
here that even for a small number of batched sends (even a 2d/3d nearest
neighbour matrix) the issue rate (that is, how little CPU the send call
consumes) is at least as important as the raw latency.
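
The model above can be written down directly. This is only a sketch of
that formula; the function name, the units (microseconds) and the
example numbers below are hypothetical and not measurements of any
interconnect.

```cpp
#include <cassert>

// Estimated time to post n non-blocking sends and wait for all of them,
// following the model in the text: base_latency + (n - 1) * m, where
// m = 1 / issue_rate is the time between successive sends.
double batched_send_time_us(int n, double base_latency_us,
                            double issue_rate_per_us) {
    assert(n >= 1 && issue_rate_per_us > 0.0);
    const double m = 1.0 / issue_rate_per_us;
    return base_latency_us + (n - 1) * m;
}
```

For six neighbours and an assumed 2 usec base latency, an issue rate of
2 sends per usec gives 4.5 usec in total, while 0.5 sends per usec gives
12 usec: the issue-rate term quickly outweighs the base latency.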

> All of this is moot of course unless the implementation actually has more
> than one algorithm that it could employ...

In my experience there are often dozens of different algorithms for
every situation and each has their trade offs.  Choosing the right one
based on the parameters given is the tricky bit.

Ashley,


From rgb at phy.duke.edu  Mon Feb 14 10:12:09 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 14 Feb 2005 13:12:09 -0500 (EST)
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: <E1D0jxV-0001jO-00@mendel.bio.caltech.edu>
References: <E1D0jxV-0001jO-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502141240050.20983@ganesh.phy.duke.edu>

On Mon, 14 Feb 2005, David Mathog wrote:

> Robert G. Brown wrote:
> 
> > In order to for liquid cooling to ever
> > make sense for COTS clusters, it would have to BECOME COTS -- basically,
> > to become cheap in both hardware and human terms.
> 
> Shuttle's itty bitty computers have a heat pipe that goes out to
> a radiator on the back of the case.  It isn't much of a step from
> there to replacing the back radiator with a copper block.  That block
> could in turn mate with another copper block which itself was
> on a cold water line.  Ie, move the radiator even further from
> the CPU and other heat generating parts of the computer.  So
> a company like shuttle could relatively easily start selling
> liquid cooled nodes using only minor modifications to its existing
> hardware.
> 
> In this sort of a system you might have to pay to have the pros 
> install (plumb) the rack itself, but you could still work on the
> nodes of the rack, as is true now.  It would seem to be relatively
> straightforward to have the nodes mate up copper block to copper
> block when fully inserted, so that each node is not itself part
> of the rack circulation system. The tricky part is
> that something else would have to be attached to the copper block
> on the back when the node was serviced on the bench.

I think there are lots of tricky parts, but I agree that it can be done.
In fact, Eugen found this from Rittal:

 http://www.enclosureinfo.com/tech/rittal/lit/pdf/LV_lcs_01_01.pdf

where it IS being done, in the sense that one can get liquid cooling
adjuncts for racks that accept standard ported lq heat sinks for CPUs
and maybe a couple of other parts (disks, power supplies?).  Their
"mini-chiller" per rack is only around 1.3 tons (4500 "cooling watts")
which seems small, and running all the supply hoses around in and out of
the systems (especially MP motherboard or blade systems) inside
enclosures not really designed for them seems like it would be
"interesting".

I just don't think of this as being mainstream.  I didn't get a price
from anybody on this, but I'll bet it is an option on your newborn child
per rack.
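
As a sanity check on the 1.3 ton figure above, here is a small sketch
that converts tons of refrigeration to watts and to a node count. The
function names and the per-node wattages are assumptions for
illustration; one ton of refrigeration is 12,000 BTU/h, about 3517 W.

```cpp
#include <cassert>

// One ton of refrigeration is 12,000 BTU/h, about 3517 W.
constexpr double kWattsPerTon = 3516.85;

// Cooling capacity in watts for a chiller rated in tons.
double tons_to_watts(double tons) {
    assert(tons >= 0.0);
    return tons * kWattsPerTon;
}

// Whole number of nodes a chiller can carry if each dissipates node_watts.
int nodes_supported(double tons, double node_watts) {
    assert(node_watts > 0.0);
    return static_cast<int>(tons_to_watts(tons) / node_watts);
}
```

At 1.3 tons this is roughly 4570 W, enough for about 45 heat sources of
100 W each, or about 22 at 200 W, which is indeed small for a full rack.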

The external heat exchanger idea is also "interesting".  I agree that
better thermal management in motherboards themselves would be desirable,
but it takes a biggish chunk of copper to make a heat pipe capable of
moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end
temperature differentials in the 20-30 K range.  Maybe what, 0.5 cm in
radius?
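
A rough check of that estimate, assuming plain conduction through a
solid copper rod (Fourier's law) and ignoring contact resistance. The
function name is hypothetical, and a real heat pipe relies on phase
change, so this sketch only bounds what solid copper alone would need.

```cpp
#include <cassert>
#include <cmath>

// Thermal conductivity of copper in W/(m K), as quoted in the text.
constexpr double kKappaCu = 385.0;

// Radius (in cm) of a solid copper rod that conducts power_w watts over
// length_m meters with a temperature drop of delta_t_k kelvin, from
// Fourier's law: P = kappa * A * dT / L  =>  A = P * L / (kappa * dT).
double copper_rod_radius_cm(double power_w, double length_m, double delta_t_k) {
    assert(power_w > 0.0 && length_m > 0.0 && delta_t_k > 0.0);
    const double pi = std::acos(-1.0);
    const double area_m2 = power_w * length_m / (kKappaCu * delta_t_k);
    return std::sqrt(area_m2 / pi) * 100.0;  // m -> cm
}
```

For 100 W over 25 cm with a 25 K drop this comes out at a radius of
roughly 2.9 cm (about 2.3 to 3.5 cm across the quoted ranges), so a rod
near 0.5 cm would seem to need a true heat pipe rather than solid
copper.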

> 
> On the plus side your racks could replace the building's current
> hot water supply!

Not unless you permit the max T on the sink in contact with the water to
get dangerously high... (taking this as a serious, rather than a wry,
remark).  Ditto for numerous discussions of using server room waste heat
to help heat buildings -- good idea on paper, pretty difficult in
practice, and then there is summer.


   rgb

> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Mon Feb 14 10:41:01 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 14 Feb 2005 13:41:01 -0500 (EST)
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: <Pine.LNX.4.58.0502141240050.20983@ganesh.phy.duke.edu>
References: <E1D0jxV-0001jO-00@mendel.bio.caltech.edu>
	<Pine.LNX.4.58.0502141240050.20983@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.58.0502141340400.20983@ganesh.phy.duke.edu>

On Mon, 14 Feb 2005, Robert G. Brown wrote:

> moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end
> temperature differentials in the 20-30 K range.  Maybe what, 0.5 cm in
> radius?

I meant diameter.  Sorry.

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From lindahl at pathscale.com  Mon Feb 14 10:58:43 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 14 Feb 2005 10:58:43 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
Message-ID: <20050214185843.GA1359@greglaptop.internal.keyresearch.com>

On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote:

> If you had a bunch of sends to do to N remote processes then I'd expect
> you to post them in order (non-blocking) and wait for them all at the
> end, the time taken to do this should be (base_latency + ( (N-1) * M ))
> where M is the reciprocal of the "issue rate".  You can clearly see
> here that even for a small number of batched sends (even a 2d/3d nearest
> neighbour matrix) the issue rate (that is how little CPU the send call
> consumes) is at least as important as the raw latency.

Unless I completely misunderstand your formula, M is not only the CPU
the send call consumes. It's easy to find situations (fast cpu, slow
network) where the cpu consumed isn't a part of M at all. Even for a
modern 1 GByte/sec network, cpu consumed might not be a part of M.
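
One way to read this, as a sketch only: take the quoted formula and assume
M is set by whichever is slower per message, the CPU cost of the send call
or the time the network needs to move one message. The function and
parameter names are made up for the example, and the numbers are arbitrary:

```python
def batched_send_time(n, base_latency_us, cpu_per_send_us, wire_per_msg_us):
    """Time to post n non-blocking sends and wait for all of them.

    Follows the quoted formula base_latency + (n - 1) * M, with the
    assumption that M is the larger of the per-send CPU cost and the
    per-message network time.
    """
    m = max(cpu_per_send_us, wire_per_msg_us)
    return base_latency_us + (n - 1) * m


# Fast CPU, slow network: halving the CPU cost per send changes nothing.
print(batched_send_time(6, 5.0, 1.0, 4.0))  # 25.0
print(batched_send_time(6, 5.0, 0.5, 4.0))  # 25.0
# CPU-bound case: now the send-call cost is what sets M.
print(batched_send_time(6, 5.0, 3.0, 1.0))  # 20.0
```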

Reducing CPU consumed can't hurt. But reasoning about it seems to be
less useful than testing actual applications.

-- greg


From lindahl at pathscale.com  Mon Feb 14 11:07:37 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 14 Feb 2005 11:07:37 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <web-520595@free.net>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
Message-ID: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>

On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:

> Let me ask some stupid's question: which MPI implementations allow
> really
>  
> a) to overlap MPI_Isend w/computations
> and/or 
> b) to perform a set of subsequent MPI_Isend calls faster than "the 
> same" set of MPI_Send calls ?
> 
> I say only about sending of large messages.

For large messages, everyone does (b) at least partly right. (a) is
pretty rare. It's difficult to get (a) right without hurting short
message performance. One of the commercial MPIs, at first release, had
very slow short message performance because they thought getting (a)
right was more important. They've improved their short message
performance since, but I still haven't seen any real application
benchmarks that show benefit from their approach.

-- greg


From joachim at ccrl-nece.de  Mon Feb 14 11:18:51 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Mon, 14 Feb 2005 20:18:51 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502141057320.1141@terra.mcs.anl.gov>
References: <web-520595@free.net>
	<Pine.LNX.4.58.0502141057320.1141@terra.mcs.anl.gov>
Message-ID: <4210F99B.3040202@ccrl-nece.de>

Rob Ross wrote:
> Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a 
> sequence of MPI_Sends isn't hard, particularly if the messages are to 
> different ranks.  I would guess that every implementation will provide 
> better performance in the case where the user tells the implementation 
> about all these concurrent operations and then MPI_Waits on the bunch.

In this case, the user should think about MPI_Alltoall(v) - there are 
MPI implementations which do this in a smarter way than 
Isend/Irecv/Waitall to achieve much better performance than using the 
naive approach. Especially if you go to large process numbers, some 
coordination can help a lot, even for a full bisection network like a 
single-stage full crossbar...
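
As an illustration of what "some coordination" can mean, here is a sketch
of the classic pairwise-exchange schedule for an all-to-all. It is not
taken from any particular MPI implementation; the function name is
hypothetical. The point is that in every step each rank sends to exactly
one partner and receives from exactly one partner, so no endpoint is hit
by several senders at once:

```python
def pairwise_exchange_schedule(nprocs):
    """Return a list of steps; step k maps each rank to (send_to, recv_from).

    In step k (1 <= k < nprocs) rank r sends to (r + k) % nprocs and
    receives from (r - k) % nprocs, so every rank sends exactly one
    message and receives exactly one message per step.
    """
    steps = []
    for k in range(1, nprocs):
        steps.append({r: ((r + k) % nprocs, (r - k) % nprocs)
                      for r in range(nprocs)})
    return steps


# Print the three steps needed for four ranks.
for k, step in enumerate(pairwise_exchange_schedule(4), start=1):
    print(k, step)
```

Posting all Isend/Irecv pairs at once and calling Waitall gives the library
no such ordering to work with.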

Generally, collectives are there to let the library know what kind of 
communication is coming next. All speculations in the library based on 
monitoring and predicting non-collective communication will probably 
only do good in the matching micro-benchmark (my personal experience).

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From rross at mcs.anl.gov  Mon Feb 14 11:49:49 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 13:49:49 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> 
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain> 
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
Message-ID: <Pine.LNX.4.58.0502141340330.1141@terra.mcs.anl.gov>

On Mon, 14 Feb 2005, Ashley Pittman wrote:

> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
> > If you used the non-blocking send to allow for overlapped communication, 
> > then you would like the implementation to play nicely.  In this case the 
> > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor 
> > thereof).
> >
> > If you used the non-blocking sends to post a bunch of communications that
> > you are going to then wait to complete, you probably don't care about the
> > CPU -- you just want the messaging done.  In this case the user will call 
> > MPI_Wait after posting everything it wants done.
> >
> > One way the implementation *could* behave is to assume the user is trying
> > to overlap comm. and comp. until it sees an MPI_Wait, at which point it
> > could go into this theoretical "burn CPU to make things go faster" mode.  
> > That mode could, for example, tweak the interrupt coalescing on an 
> > ethernet NIC to process packets more quickly (I don't know off the top of 
> > my head if that would work or not; it's just an example).
> 
> Maybe if you were using a channel interface (sockets) and all messages
> were to the same remote process then it might make sense to coalesce all
> the sends into a single transaction and just send this in the MPI_Wait
> call.  The latency for a bigger network transaction *might* be lower
> than the sum of the issue rates for smaller ones.

This is exactly what MPICH2 does for the one-sided calls; see Thakur et 
al. in EuroPVM/MPI 2004.  It can be a very big win in some situations.

> I'd hope that a well written application would bunch all its sends into
> a single larger block if this optimisation was possible, though.

We would hope that too, but applications do not always adhere to best 
practice.

> Given any reasonably fast network not doing anything until the MPI_Wait
> call however would destroy your latency.  It strikes me as this isn't
> overlapping comms and compute though rather artificially delaying comms
> to allow compute to finish, seems rather pointless?

I agree that postponing progress until MPI_Wait for the purposes of 
providing lower CPU utilization would be pointless.  It can be useful for 
coalescing purposes, as mentioned above.  But certainly there will be a 
latency cost.
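
A back-of-the-envelope model of the coalescing win, as a sketch: assume
every network transaction pays a fixed overhead and the payload then moves
at a constant rate. The function names and all numbers are invented for
illustration:

```python
def separate_sends_us(n, size_bytes, per_msg_overhead_us, bytes_per_us):
    """n messages, each paying the fixed per-transaction overhead."""
    return n * (per_msg_overhead_us + size_bytes / bytes_per_us)


def coalesced_send_us(n, size_bytes, per_msg_overhead_us, bytes_per_us):
    """One transaction carrying all n payloads, overhead paid once."""
    return per_msg_overhead_us + n * size_bytes / bytes_per_us


# Illustrative numbers only: 16 messages of 200 bytes, 10 us fixed cost
# per transaction, 100 bytes/us (about 100 MB/s) of stream bandwidth.
print(separate_sends_us(16, 200, 10.0, 100.0))  # 192.0
print(coalesced_send_us(16, 200, 10.0, 100.0))  # 42.0
```

The model leaves out the time the first message spends waiting for
MPI_Wait before anything is sent, which is the latency cost.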

> If you had a bunch of sends to do to N remote processes then I'd expect
> you to post them in order (non-blocking) and wait for them all at the
> end, the time taken to do this should be (base_latency + ( (N-1) * M ))
> where M is the reciprocal of the "issue rate".  You can clearly see
> here that even for a small number of batched sends (even a 2d/3d nearest
> neighbour matrix) the issue rate (that is how little CPU the send call
> consumes) is at least as important as the raw latency.

Well I wasn't trying to start an argument about the importance of CPU
utilization as it relates to issue rate :).  The original question simply
asked if there was generally an advantage to doing what you expect people
to do anyway!  And I think that we agree the answer is yes.

> > All of this is moot of course unless the implementation actually has more
> > than one algorithm that it could employ...
> 
> In my experience there are often dozens of different algorithms for
> every situation and each has their trade offs.  Choosing the right one
> based on the parameters given is the tricky bit.

Absolutely!  And which few of those dozens are applicable to a wide-enough 
range of situations that you want to actually implement/debug them?

Rob


From rross at mcs.anl.gov  Mon Feb 14 13:09:30 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 15:09:30 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4210F99B.3040202@ccrl-nece.de>
References: <web-520595@free.net>
	<Pine.LNX.4.58.0502141057320.1141@terra.mcs.anl.gov>
	<4210F99B.3040202@ccrl-nece.de>
Message-ID: <Pine.LNX.4.58.0502141506210.1141@terra.mcs.anl.gov>

On Mon, 14 Feb 2005, Joachim Worringen wrote:

> Rob Ross wrote:
> > Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a 
> > sequence of MPI_Sends isn't hard, particularly if the messages are to 
> > different ranks.  I would guess that every implementation will provide 
> > better performance in the case where the user tells the implementation 
> > about all these concurrent operations and then MPI_Waits on the bunch.
> 
> In this case, the user should think about MPI_Alltoall(v) - there are
> MPI implementations which do this in a smarter way than
> Isend/Irecv/Waitall to achieve much better performance than using the
> naive approach. Especially if you go to large process numbers, some
> coordination can help a lot, even for a full bisection network like a
> single-stage full crossbar...

Yes!  We don't see nearly enough of this I think.

> Generally, collectives are there to let the library know what kind of 
> communication is coming next. All speculations in the library based on 
> monitoring and predicting non-collective communication will probably 
> only do good in the matching micro-benchmark (my personal experience).

I agree.  Which is why we don't tend to try to figure out what the user is 
trying to do, and instead just implement an algorithm to get things done 
as quickly as we can.

Rob


From ashley at quadrics.com  Mon Feb 14 13:22:19 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Mon, 14 Feb 2005 21:22:19 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502141340330.1141@terra.mcs.anl.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141340330.1141@terra.mcs.anl.gov>
Message-ID: <d328ab002c2e2b2183663dd69f4a190e@quadrics.com>


On 14 Feb 2005, at 19:49, Rob Ross wrote:
> On Mon, 14 Feb 2005, Ashley Pittman wrote:
>> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
>>
>> Maybe if you were using a channel interface (sockets) and all messages
>> were to the same remote process then it might make sense to coalesce all
>> the sends into a single transaction and just send this in the MPI_Wait
>> call.  The latency for a bigger network transaction *might* be lower
>> than the sum of the issue rates for smaller ones.
>
> This is exactly what MPICH2 does for the one-sided calls; see Thakur et al.
> in EuroPVM/MPI 2004.  It can be a very big win in some situations.

I'll look it up.  Presumably the win is because of higher bandwidth 
achieved by larger messages over a stream.  I guess the MPI_Win_fence 
call copies data out of a receive buffer.

>> I'd hope that a well written application would bunch all its sends into
>> a single larger block if this optimisation was possible, though.
>
> We would hope that too, but applications do not always adhere to best
> practice.

As someone who maintains an MPI library I hope people do this; it's up 
to us to provide the functionality and to application writers to actually 
make use of it.  There are often times when it may well not be worth 
doing this, either because of time-to-market demands or simply when 
experimenting with differing algorithms.

>> Given any reasonably fast network not doing anything until the MPI_Wait
>> call however would destroy your latency.  It strikes me as this isn't
>> overlapping comms and compute though rather artificially delaying comms
>> to allow compute to finish, seems rather pointless?
>
> I agree that postponing progress until MPI_Wait for the purposes of
> providing lower CPU utilization would be pointless.  It can be useful for
> coalescing purposes, as mentioned above.  But certainly there will be a
> latency cost.

So potentially there is an optimization choice to be made: do you make 
the "noddy" application run faster at the cost of real performance for 
applications tuned to the particular library?  That sounds like a whole 
can of worms.

>>> All of this is moot of course unless the implementation actually has more
>>> than one algorithm that it could employ...
>>
>> In my experience there are often dozens of different algorithms for
>> every situation and each has their trade offs.  Choosing the right one
>> based on the parameters given is the tricky bit.
>
> Absolutely!  And which few of those dozens are applicable to a wide-enough
> range of situations that you want to actually implement/debug them?

Implement? Most of them. Debug/support? No more than two or three seems 
optimal.  There are some algorithms that just don't work on a given 
network and some that will only be best in corner cases.  Then it's 
just a case of choosing the correct thresholds between the remaining 
few.  For a given call *best* is absolute; for a given application, 
however, tradeoffs have to be made.
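
As a sketch of what "choosing the correct thresholds between the remaining
few" could look like in code: the algorithm names, the threshold values and
the function itself are all hypothetical, and real values would have to be
tuned per network:

```python
# Hypothetical thresholds; real values would be measured, not guessed.
SMALL_MSG_BYTES = 1024
LARGE_MSG_BYTES = 64 * 1024


def choose_algorithm(msg_bytes, nprocs):
    """Pick one of a few supported algorithms from the call parameters."""
    if msg_bytes <= SMALL_MSG_BYTES and nprocs > 8:
        return "combine-small-messages"   # fewer, larger transfers
    if msg_bytes >= LARGE_MSG_BYTES:
        return "pairwise-exchange"        # one partner per step
    return "post-all-nonblocking"         # simple default


print(choose_algorithm(64, 128))        # combine-small-messages
print(choose_algorithm(1 << 20, 128))   # pairwise-exchange
print(choose_algorithm(8192, 4))        # post-all-nonblocking
```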

Ashley,


From ashley at quadrics.com  Mon Feb 14 13:29:09 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Mon, 14 Feb 2005 21:29:09 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050214185843.GA1359@greglaptop.internal.keyresearch.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<20050214185843.GA1359@greglaptop.internal.keyresearch.com>
Message-ID: <2799056564ab963d97483d0d1d926351@quadrics.com>


On 14 Feb 2005, at 18:58, Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote:
>> If you had a bunch of sends to do to N remote processes then I'd expect
>> you to post them in order (non-blocking) and wait for them all at the
>> end, the time taken to do this should be (base_latency + ( (N-1) * M ))
>> where M is the reciprocal of the "issue rate".  You can clearly see
>> here that even for a small number of batched sends (even a 2d/3d nearest
>> neighbour matrix) the issue rate (that is how little CPU the send call
>> consumes) is at least as important as the raw latency.
>
> Unless I completely misunderstand your formula, M is not only the CPU
> the send call consumes. It's easy to find situations (fast cpu, slow
> network) where the cpu consumed isn't a part of M at all. Even for a
> modern 1 GByte/sec network, cpu consumed might not be a part of M.

I'm talking about our (Quadrics) network here, which has CPU offload: 
a Wait call is simply a few function calls, a memory read (to test 
completion), a mutex lock/unlock cycle and a linked list insertion, 
nothing more.  Some CPU is used in the send call, as I said, but outside 
the two calls there is zero CPU usage, although potentially reduced 
memory-to-CPU bandwidth.

> Reducing CPU consumed can't hurt. But reasoning about it seems to be
> less useful than testing actual applications.

I do that as well.

Ashley,


From eugen at leitl.org  Mon Feb 14 13:31:18 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Mon, 14 Feb 2005 22:31:18 +0100
Subject: [Beowulf] new company, looking for people (fwd from treese@acm.org)
Message-ID: <20050214213117.GQ1404@leitl.org>


(I presume a single job announcement in all these years is tolerable).

----- Forwarded message from Win Treese <treese at acm.org> -----

[snip]

Last fall I joined a startup called SiCortex, where we're building a new
Linux cluster computer. We're ramping up the software team, so if you
know anyone who is really good and looking for something new, let me
know.

Here's the short blurb and some job descriptions; feel free to get in
touch with me for more details. You can pass this along (minus headers,
of course).

   - Win

SiCortex is a new computer company developing a line of Linux cluster
computers for demanding scientific and technical applications. The
company is based in Maynard, Massachusetts.

Senior Software Developers

  Software developers to work on designing, porting, and qualifying a
  Linux-based software stack for technical computing clusters.
  Responsibilities include:

   * Analyzing one or more sub-projects
   * Recommending overall approach (buy, port, build) for sub-project(s)
   * Preparing design and/or implementation plans and schedules for
     sub-project(s)
   * Executing implementation plan for sub-project(s)
   * Executing test and verification strategy for sub-project(s)

  Areas of expertise being sought include:

   * Linux porting, drivers, network stack
   * Parallel file systems
   * Cluster middleware (job scheduling, single system image)
   * Compilers and tools
   * Math and communications libraries
   * Firmware, diagnostics, and system bring-up
   * Technical application analysis and tuning

  Desired skills and experience:

   * 5+ years industry experience in software development
   * Deep knowledge of Linux (preferred) or general Unix
   * Exposure to technical computing
   * Expertise in multiple areas of software development
   * Track record of successful results in small teams

Software Director

  Team leader for group of 8-10 software developers designing, porting,
  and qualifying a Linux-based software stack for technical computing
  clusters. Responsibilities include:

   * Analyzing required work and resources
   * Recruiting and managing team members
   * Qualifying, recommending, and managing potential third-party
     vendors (companies and/or consultants)
   * Preparing and monitoring team schedules
   * Reviewing and managing team results
   * Interfacing with potential and actual customers
   * Technical design and implementation in selected areas

  Software design task encompasses:

   * Linux operating system, including kernel, drivers, and networking
   * Parallel file systems
   * Cluster middleware
   * Compilers and tools
   * Libraries
   * Applications analysis
   * Firmware, diagnostics, and other hardware-related software

  Desired skills and experience:

   * 10+ years software development, with strong background in Linux
     (preferred) or Unix
   * Broad exposure rather than specialist experience in software development
   * Prior experience in software team leadership or management
   * Track record of successful results with constrained resources

  Ideal candidate will have prior exposure to startup environments,
  pragmatic approach to make vs buy decisions, good understanding of
  Open Source environments, and excellent people and leadership skills.

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From patrick at myri.com  Mon Feb 14 13:57:03 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 14 Feb 2005 16:57:03 -0500
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420B264A.7050004@ccrl-nece.de>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>	<420580AD.5050003@myri.com>
	<420B264A.7050004@ccrl-nece.de>
Message-ID: <42111EAF.5050709@myri.com>

Hi Joachim,

Joachim Worringen wrote:
> Patrick Geoffray wrote:
> 
>> Seriously, here are MPI latencies with MX on F cards on Opteron 
>> (PCI-X), that includes fibers and a switch in the middle:
>>
>>    Length   Latency(us)    Bandwidth(MB/s)
>>         0       2.684          0.000
> 
> [...]
> 
> Nice work, Patrick - but such numbers are of little value if the
> benchmark used to get them is not stated. I'd recommend mpptest (from 
> MPICH). Plus, the compiler etc. is also of interest when it comes to 
> latencies.

Thanks. Such numbers always have little value coming from a vendor. My 
point was simply that >10us was not really today's ballpark.

For your curiosity, it was using an in-house MPI Pingpong (one message 
at a time, not a bogus pipelined pingpong used to confuse people and 
make big pipes look good). For very small messages, most of Pingpong 
codes are similar, compiler has no impact (it was using the gcc that was 
installed on the machine at that time). For asymptotic bandwidth, the 
major difference is the way you compute 1 MB, either 1024*1024 Bytes, or 
1000000 Bytes. In the networking world, it tends to be 1000000 Bytes.
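
To put a number on that difference, a small sketch (the function name and
the sample transfer are made up; only the two definitions of a megabyte
come from the text above):

```python
def bandwidth_mb_per_s(nbytes, seconds, binary=False):
    """Bandwidth in MB/s, with 1 MB = 10**6 bytes (default) or 2**20 bytes."""
    mb = 2**20 if binary else 10**6
    return nbytes / seconds / mb


# The same transfer reported both ways. The ratio is 2**20 / 10**6 = 1.048576,
# so the decimal figure is a little under 5% larger.
decimal = bandwidth_mb_per_s(250_000_000, 1.0)
binary = bandwidth_mb_per_s(250_000_000, 1.0, binary=True)
print(decimal, round(binary, 1))
print(round(decimal / binary, 6))
```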

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Mon Feb 14 15:09:27 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 14 Feb 2005 18:09:27 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
Message-ID: <42112FA7.7010900@myri.com>

Hi Greg,

Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
> 
> 
>>Let me ask some stupid's question: which MPI implementations allow
>>really
>> 
>>a) to overlap MPI_Isend w/computations
>>and/or 
>>b) to perform a set of subsequent MPI_Isend calls faster than "the 
>>same" set of MPI_Send calls ?
>>
>>I say only about sending of large messages.
> 
> 
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had

Many believe you just need RDMA support to overlap com and comp, but 
it's not enough. Zero-copy is needed because the copy is obviously a 
waste of host CPU (along with cache trashing), but the real problem is 
matching. Ron did a lot of work in Portals to offload the matching, 
because it is a big synchronization point: if you send a message and you 
need the CPU on the receive side to find the appropriate receive buffer, 
you cannot tell the user that it can have the CPU between the time he 
posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What 
will happen is the matching occurs in the MPI_Wait() and overlap goes to 
the toilettes.

There are several ways to work around it:
1) You can have a thread on the receive side and wake it up with an 
interrupt. If you do that for all receives, then you add ~10 us in the 
critical path and the small message latency goes to the same place the 
overlap went before. This was what I believe the commercial MPI was 
doing at first.
2) If you can make decisions at the NIC level, you can receive small 
messages eagerly (with a copy) and fire an interrupt only for large 
messages (you want to steal some CPU cycles for matching). This is not 
bad, you steal (~5 us + cost of matching) worth of CPU cycles for large 
messages, that's not much for most people.
3) You can have the NIC do the matching. Obviously the NIC is not as 
fast as the host CPU, so it's more expensive: you don't want to do that 
for small messages, it will hurt your latency. But you still have to do 
it for all messages to keep the matching order. One solution is to still 
receive small messages eagerly but match them in the shadow of the 
NIC->host DMA just to keep the list of posted receives consistent. For 
large messages, you match in the NIC in the critical path and you don't 
need the host CPU (assuming that the matched receive is in the small 
number that is kept on the NIC).
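
For readers who have not looked inside an MPI library, here is a toy model
of the matching step that all three options have to put somewhere. It is a
sketch, not any vendor's implementation: the class and method names are
invented, and ANY stands in for MPI_ANY_SOURCE / MPI_ANY_TAG. It shows why
every message, small or large, has to go through the same ordered list of
posted receives:

```python
ANY = -1  # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG in this sketch


class MatchQueues:
    """Toy model of MPI receive matching: posted and unexpected queues."""

    def __init__(self):
        self.posted = []      # receives posted by the application, in order
        self.unexpected = []  # messages that arrived before a matching receive

    @staticmethod
    def _matches(recv, msg):
        src_ok = recv["source"] in (ANY, msg["source"])
        tag_ok = recv["tag"] in (ANY, msg["tag"])
        return src_ok and tag_ok

    def post_recv(self, source, tag):
        """Post a receive; return an already-arrived message if one matches."""
        recv = {"source": source, "tag": tag}
        for i, msg in enumerate(self.unexpected):
            if self._matches(recv, msg):
                return self.unexpected.pop(i)
        self.posted.append(recv)
        return None

    def arrive(self, source, tag, payload):
        """Handle an incoming message; return the receive it matched, if any."""
        msg = {"source": source, "tag": tag, "payload": payload}
        for i, recv in enumerate(self.posted):  # oldest posted receive first
            if self._matches(recv, msg):
                return self.posted.pop(i)
        self.unexpected.append(msg)
        return None


q = MatchQueues()
q.post_recv(source=ANY, tag=7)
q.post_recv(source=2, tag=7)
# The wildcard receive was posted first, so it is the one that matches.
print(q.arrive(source=2, tag=7, payload=b"x"))
```

Whoever runs arrive(), host CPU or NIC, is the one paying for the matching.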

It's still not obvious whether 3) is worth it: it's much more complex to 
implement and 5 us per large receive is not that big. And you can reduce 
that overhead with MSIs (available on PCIe; on PCI-X, only the Alpha Marvel 
provided MSI, AFAIK).
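
To see why ~10 us on every receive hurts while ~5 us on large receives
barely shows, here is a rough sketch. The 10 us and 5 us figures are the
ones above; the eager limit, the 3 us base latency and the 250 bytes/us
bandwidth are assumptions picked for the example, and the cost of the
matching itself is ignored:

```python
def receive_overhead_us(msg_bytes, strategy, eager_limit=32 * 1024):
    """Extra host cost per receive under options 1 and 2.

    'interrupt-all'   : ~10 us added to every receive (option 1)
    'interrupt-large' : ~5 us, and only above the eager limit (option 2)
    The eager limit is a made-up value.
    """
    if strategy == "interrupt-all":
        return 10.0
    if strategy == "interrupt-large":
        return 5.0 if msg_bytes > eager_limit else 0.0
    raise ValueError(strategy)


def overhead_fraction(msg_bytes, strategy, latency_us=3.0, bytes_per_us=250.0):
    """Overhead relative to an assumed transfer time latency + size / bandwidth."""
    transfer_us = latency_us + msg_bytes / bytes_per_us
    return receive_overhead_us(msg_bytes, strategy) / transfer_us


for size in (8, 1_000_000):
    print(size,
          round(overhead_fraction(size, "interrupt-all"), 4),
          round(overhead_fraction(size, "interrupt-large"), 4))
```

With these numbers the 8-byte message pays more than three times its own
transfer time under option 1 and nothing under option 2, while for the
1 MB message either option adds well under 1%.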

There are more exotic work-arounds, like using 1) and polling at the 
same time, and hiding the interrupt overhead with some black magic on 
another processor. The one with the best potential would be to use 
HyperThreading on Intel chips to have a polling thread burning cycles 
continuously; it will run in-cache, won't use the FP unit or waste 
memory cycles. A perfect use for the otherwise useless HT feature. I 
wonder why nobody went that way...

> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

That's the classic chicken-and-egg problem: are people not trying to 
overlap in MPI because it is not implemented, or do MPI implementations 
not implement it because applications don't try to overlap?

I think it's the latter, too complicated for most. Do you know the 
story/joke about the Physicist and unexpected messages?

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From mprinkey at aeolusresearch.com  Mon Feb 14 12:39:06 2005
From: mprinkey at aeolusresearch.com (Michael T. Prinkey)
Date: Mon, 14 Feb 2005 15:39:06 -0500 (EST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.44.0502141538210.23003-100000@ra.thebes>


Greg, based on your evaluation of the available MPI libraries, does this
imply that overlapping communication and computation can really only be
done by explicitly building two separate threads?

Mike

On Mon, 14 Feb 2005, Greg Lindahl wrote:

> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
> 
> > Let me ask some stupid's question: which MPI implementations allow
> > really
> >  
> > a) to overlap MPI_Isend w/computations
> > and/or 
> > b) to perform a set of subsequent MPI_Isend calls faster than "the 
> > same" set of MPI_Send calls ?
> > 
> > I say only about sending of large messages.
> 
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
> very slow short message performance because they thought getting (a)
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.
> 
> -- greg


From mhyoung at valdosta.edu  Mon Feb 14 12:50:18 2005
From: mhyoung at valdosta.edu (michael young)
Date: Mon, 14 Feb 2005 15:50:18 -0500
Subject: [Beowulf] Poor man's SANS
Message-ID: <42110F0A.10408@valdosta.edu>

Hi,
Can I use beowulf or some other Linux cluster or HA Linux solution
to pool hard drive space together from different computers to make a
kind of "poor man's SAN"?

thank you
Michael


From rossen at VerariSoft.Com  Mon Feb 14 14:32:57 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Mon, 14 Feb 2005 17:32:57 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
Message-ID: <42112719.4060500@verarisoft.com>


Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
> 
> 
>>Let me ask some stupid's question: which MPI implementations allow
>>really
>> 
>>a) to overlap MPI_Isend w/computations
>>and/or 
>>b) to perform a set of subsequent MPI_Isend calls faster than "the 
>>same" set of MPI_Send calls ?
>>
>>I say only about sending of large messages.
> 
> 
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
> very slow short message performance because they thought getting (a)
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

There is quite a bit of published data showing that, for a number of real 
application codes, a modest increase in MPI latency for very short messages 
has no impact on application performance. This can also be seen by 
doing traffic characterization, weighing the relative impact of the 
increased latency, and taking into account the computation/communication 
ratio. On the other hand, what you give the application developers with 
an interrupt-driven MPI library is a higher potential for effective 
overlapping, which they can choose to utilize or not; unless they 
send only very short messages, they will not see a negative performance 
impact from using this library.

There is evidence that re-coding the MPI part of an application to take 
advantage of overlapping and asynchrony when the MPI library (and 
network) supports these well actually leads to real performance benefit.

There is also evidence that, even without changing anything in the code, 
just running the same code with an MPI library that plays nicer with 
the system leads to better application performance by improving the 
overall "application progress" - a loose term I use to describe all of 
the complex system activities that need to occur during the life cycle 
of a parallel application, not only on a single node but on all nodes 
collectively.

The question of short message latency is connected to system scalability 
in at least one important scenario - running the same problem size as 
fast as possible by adding more processors. This will lead to smaller 
messages, much more sensitive to overhead, thus negatively impacting 
scalability.

In other practical scenarios though, users increase the problem size as 
the cluster size grows, or they solve multiple instances of the same 
problem concurrently, thus keeping the message sizes away from the 
extremely small sizes resulting from maximum scale runs, thus limiting 
the impact of shortest message latency. I have seen many large clusters 
whose only job run across all nodes is HPL for the top500 number. After 
that, the system is either controlled by a job scheduler, which limits 
the size of jobs to about 30% of all processors (an empirically derived 
number that supposedly improves the overall job throughput), or it is 
physically or logically divided into smaller sub-clusters.

All this being said, there is obviously a large group of codes that use 
small messages no matter what size problem they solve or what the 
cluster size is. For these, the lowest latency will be the most 
important (if not the only) optimization parameter. For these cases, 
users can just run the MPI library in polling mode.

With regard to the assessment that every MPI library does (a) partly 
right I'd like to mention that I have seen behavior where attempting to 
overlap computation and communication can lead to no performance 
improvement at all, or even worse, to performance degradation. This is 
one example of how a particular implementation of a standard API can 
affect the way users code against it. I use a metric called "degree of 
overlapping" which for "good" systems approaches 1, for "bad" systems 
approaches 0, and for terrible systems becomes negative... Here goodness 
is measured as how well the system facilitates overlapping.
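
The metric is not defined in this message. One formulation that behaves
as described (1 for perfect overlap, 0 for none, negative when the
attempt makes things slower) is sketched below. The formula and the name
degree_of_overlap are assumptions made for illustration, not necessarily
the definition used by the author.

```python
def degree_of_overlap(t_comp, t_comm, t_total):
    """Fraction of the hideable time that was actually hidden.

    t_comp  : time of the computation run on its own
    t_comm  : time of the communication run on its own
    t_total : measured time when both are issued together

    Assumed definition (not taken from the thread):
        (t_comp + t_comm - t_total) / min(t_comp, t_comm)
    """
    hideable = min(t_comp, t_comm)
    if hideable <= 0:
        raise ValueError("both phases must take a positive amount of time")
    return (t_comp + t_comm - t_total) / hideable


if __name__ == "__main__":
    # Hypothetical timings in milliseconds.
    print(degree_of_overlap(10.0, 4.0, 10.0))  # communication fully hidden
    print(degree_of_overlap(10.0, 4.0, 14.0))  # fully serialized
    print(degree_of_overlap(10.0, 4.0, 16.0))  # slower than serialized
```

With these made-up timings the three cases give 1.0, 0.0 and -0.5.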

Rossen

> 
> -- greg


From rross at mcs.anl.gov  Mon Feb 14 20:52:51 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 22:52:51 -0600 (CST)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <42110F0A.10408@valdosta.edu>
References: <42110F0A.10408@valdosta.edu>
Message-ID: <Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>

Yes!

PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).  My
group at ANL along with Clemson University and Ohio Supercomputer Center
and others are developing this.  It's entirely open source and open
development, and is in production use at ANL, OSC, and the University of 
Utah CHPC, among other places.

GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that
RPMs are available for it now through one source or another.  This used to 
be a product of Sistina, which was subsequently bought by RedHat.  I'm sure 
this is used in production in many business environments, and we use it at 
ANL also.  Can someone provide a URL for this one?

Lustre (www.lustre.org) is another option.  This one is heavily funded by
the DOE ASC laboratories and is in use on some very large parallel
machines.  But unless you have a relationship with CFS you can only get a
crippled version of the source, so it's probably not a good option for
average joe.  If they change their policy on releasing source code, this 
would be worth reconsidering.

Regards,

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


On Mon, 14 Feb 2005, michael young wrote:

> Hi,
> Can I use Beowulf or some other Linux cluster or HA Linux solution
> to pool hard drive space together from different computers to make a
> kinda "poor man's SANS"?
> 
> thank you
> Michael


From rross at mcs.anl.gov  Mon Feb 14 21:12:36 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Mon, 14 Feb 2005 23:12:36 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <42112719.4060500@verarisoft.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
Message-ID: <Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>

Rossen,

It would be good to mention that you work for a company that sells an
implementation specifically designed for facilitating overlapping, in case
people don't know that.  Clearly you guys have thought a lot about this.

At the last two Scalable OS workshops (the only two I've had a chance to 
attend), there was a contingent of people who are certain that MPI isn't 
going to last too much longer as a programming model for very large 
systems.  The issue, as they see it, is that MPI simply imposes too much 
latency on communication, and because we (as MPI implementors) cannot 
decrease that latency fast enough to keep up with processor improvements, 
MPI will soon become too expensive to be of use on these systems.

Now, I don't personally think that this is going to happen as quickly as
some predict, but it is certainly an argument that we should be paying
very careful attention to the latency issue, because as MPI implementors 
this is an argument that never seems to end.

Also, there is additional overhead in the Isend()/Wait() pair over the
simple Send() (two function calls rather than one, allocation of a Request
structure at the least) that means that a naive attempt at overlapping
communication and computation will result in a slower application.  So
that doesn't surprise me at all.

I think that the theme from this thread should be that "it's a good thing
that we have more than one MPI implementation, because they all do
different things best."

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


On Mon, 14 Feb 2005, Rossen Dimitrov wrote:

> There is quite a bit of published data that for a number of real 
> application codes modest increase of MPI latency for very short messages 
> has no impact on the application performance. This can also be seen by 
> doing traffic characterization, weighing the relative impact of the 
> increased latency, and taking into account the computation/communication 
> ratio. On the other hand, what you give the application developers with 
> an interrupt-driven MPI library is a higher potential for effective 
> overlapping, which they could choose to utilize or not, but unless they 
> send only very short messages, they will not see a negative performance 
> impact from using this library.
> 
> There is evidence that re-coding the MPI part of an application to take 
> advantage of overlapping and asynchrony when the MPI library (and 
> network) supports these well actually leads to real performance benefit.
> 
> There is evidence that even without changing anything in the code, but 
> by just running the same code with an MPI library that plays nicer to 
> the system leads to better application performance by improving the 
> overall "application progress" - a loose term I used to describe all of 
> the complex system activities that need to occur during the life-cycle 
> of a parallel application not only on a single node, but on all nodes 
> collectively.
> 
> The question of short message latency is connected to system scalability 
> in at least one important scenario - running the same problem size as 
> fast as possible by adding more processors. This will lead to smaller 
> messages, much more sensitive to overhead, thus negatively impacting 
> scalability.
> 
> In other practical scenarios though, users increase the problem size as 
> the cluster size grows, or they solve multiple instances of the same 
> problem concurrently, thus keeping the message sizes away from the 
> extremely small sizes resulting from maximum scale runs, thus limiting 
> the impact of shortest message latency. I have seen many large clusters 
> whose only job run across all nodes is HPL for the top500 number. After 
> that, the system is either controlled by a job scheduler, which limits 
> the size of jobs to about 30% of all processors (an empirically derived 
> number that supposedly improves the overall job throughput), or it is 
> physically or logically divided into smaller sub-clusters.
> 
> All this being said, there is obviously a large group of codes that use 
> small messages no matter what size problem they solve or what the 
> cluster size is. For these, the lowest latency will be the most 
> important (if not the only) optimization parameter. For these cases, 
> users can just run the MPI library in polling mode.
> 
> With regard to the assessment that every MPI library does (a) partly 
> right I'd like to mention that I have seen behavior where attempting to 
> overlap computation and communication can lead to no performance 
> improvement at all, or even worse, to performance degradation. This is 
> one example of how a particular implementation of a standard API can 
> affect the way users code against it. I use a metric called "degree of 
> overlapping" which for "good" systems approaches 1, for "bad" systems 
> approaches 0, and for terrible systems becomes negative... Here goodness 
> is measured as how well the system facilitates overlapping.
> 
> Rossen


From patrick at myri.com  Mon Feb 14 22:20:52 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Feb 2005 01:20:52 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <420DA793.4000909@verarisoft.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com>
Message-ID: <421194C4.5050808@myri.com>

Hi Rossen,

Rossen Dimitrov wrote:
> Of course, there is always the case of running the actual application
> code and then evaluating the MPI performance by seeing which MPI library
> (or library mode) makes the application run faster. Unfortunately, this
> method for evaluating MPI often suffers from various inefficiencies some
> of which originate from the parallel algorithm developers, who throughout
> the years have sometimes adopted the most trivial ways of using MPI.

So if you run an MPI application and it sucks, this is because the 
application is poorly written ?

You don't want to benchmark an application to evaluate MPI, you want to 
benchmark an application to find the best set of resources to get the 
job done. If the code stinks, it's not an excuse. Good MPI 
implementations are good with poorly written applications, but still let 
smart people do smart things if they want.

> these in one way or another depend on CPU processing. Also, today's 
> processor architectures have many independent processing units and 
> complex memory hierarchies. When the MPI library polls for completion of 
> a communication request, most of this specialized hardware is virtually 
> unused (wasted). The processor architecture trends indicate that this 
> kind of internal CPU concurrency will continue to increase, thus making 
> the cost of MPI polling even higher.

When you poll, you have nothing else to do: you are stuck in a Wait or 
in a blocking call (collectives for example). Why do you care about the 
lost cycles ? The only way to rescue them would be to oversubscribe your 
processor, and hope that the cycles you recycle (no pun intended) are 
worth the context switches and the associated cache thrashing. I would 
argue that polling should be the cheapest MPI operation ever (if 
nothing is found). This is the case for most half-decent MPI implementations.

> In this regard, a parallel application developer might actually very
> much care what is actually happening in the MPI library even when he 
> makes a call to MPI_Send. If he doesn't, he probably should.

He absolutely should not. It's one thing to work around clueless 
developers, but it's way more difficult to work around someone who 
assumes wrong things about the MPI implementation.

> - What application algorithm developers experience when they attempt to
> use the ever so nebulous "overlapping" with a polling MPI library and

Overlapping is completely orthogonal to polling. Overlapping means that 
you split the communication initiation from the communication 
completion. Polling means that you test for completion instead of waiting 
for completion. You can perfectly well overlap and check for completion of 
the asynchronous requests by polling, nothing wrong with that.

> how this experience has contributed to the overwhelming use of
> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
> (even better) persistent MPI calls, thus killing any hope that these
> codes can run faster on systems that actually facilitate overlapping.

There are two reasons why developers use blocking operations rather than 
non-blocking ones:
1) they don't know about non-blocking operations.
2) MPI_Send is shorter than MPI_Isend().


Looking for overlapping is actually not that hard:
a) look for medium/large messages, don't waste time on small ones.
b) replace each MPI_Send() with an MPI_Isend() + MPI_Wait() pair.
c) move the MPI_Isend() as early as possible (as soon as data is ready).
d) move the MPI_Wait() as late as possible (just before the buffer is 
needed).
e) do the same for receives.

Most of the time, that would speed up things quite a bit, or not change 
anything. I am still looking for some tuning tool to do that 
automatically though.
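
The restructuring in steps b) to d) can be sketched without MPI at all.
The block below is only an analogy: a thread-pool future stands in for
the request returned by MPI_Isend(), result() stands in for MPI_Wait(),
and transfer() is a made-up placeholder for the network. It shows the
ordering of calls, not real performance (Python threads will not speed
up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(buffer):
    """Hypothetical stand-in for the network: returns the delivered copy."""
    return list(buffer)


def blocking_step(pool, outgoing, work):
    # MPI_Send() analogue: start the transfer and wait for it right away.
    delivered = pool.submit(transfer, outgoing).result()
    # The computation only starts once the send has completed.
    local = sum(x * x for x in work)
    return delivered, local


def overlapped_step(pool, outgoing, work):
    # (c) MPI_Isend() analogue: post the send as soon as the data is ready.
    request = pool.submit(transfer, outgoing)
    # Independent computation; it must not modify `outgoing` meanwhile.
    local = sum(x * x for x in work)
    # (d) MPI_Wait() analogue: complete just before the buffer is needed.
    delivered = request.result()
    return delivered, local


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(blocking_step(pool, [1, 2, 3], range(4)))
        print(overlapped_step(pool, [1, 2, 3], range(4)))
```

Both variants return the same values; only the position of the wait
relative to the computation differs, which is the whole point of c) and d).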

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From john.hearns at streamline-computing.com  Mon Feb 14 23:20:21 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Tue, 15 Feb 2005 07:20:21 -0000 (GMT)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
Message-ID: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com>

> Yes!
>
> PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).  My
> group at ANL along with Clemson University and Ohio Supercomputer Center
> and others are developing this.  It's entirely open source and open
> development, and is in production use at ANL, OSC, and the University of
> Utah CHPC, among other places.
>
> GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that
> RPMs are available for it now through one source or another.  This used to
> be Sistina's product, who was subsequently bought by RedHat.  I'm sure
> this is used in production in many business environments, and we use it at
> ANL also.  Can someone provide a URL for this one?
Source RPMs are of course available from RedHat,
and you can get support for their version.

The Scientific Linux distribution has prebuilt RPMs
ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/


From patrick at myri.com  Mon Feb 14 23:48:47 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Feb 2005 02:48:47 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
Message-ID: <4211A95F.2010709@myri.com>

Hi Rob,

Rob Ross wrote:

> The last two Scalable OS workshops (the only two I've had a chance to 
> attend), there was a contingent of people that are certain that MPI isn't 
> going to last too much longer as a programming model for very large 

Were they advocating shared memory paradigms, one sided operations, 
something more "natural" to program with ? I heard that before :-)

> systems.  The issue, as they see it, is that MPI simply imposes too much 
> latency on communication, and because we (as MPI implementors) cannot 
> decrease that latency fast enough to keep up with processor improvements, 
> MPI will soon become too expensive to be of use on these systems.

This is just wrong. How much of the latency in high speed interconnects 
is due to MPI ? Very very little. The core of it is in the hardware (IO 
bus, NICs, crossbars and wires). Doing pure RDMA in hardware is easy for 
the chip designers, but it's hell for irregular applications when you 
actually don't know where to remotely read or write.

> Also, there is additional overhead in the Isend()/Wait() pair over the
> simple Send() (two function calls rather than one, allocation of a Request
> structure at the least) that means that a naive attempt at overlapping
> communication and computation will result in a slower application.  So
> that doesn't surprise me at all.

What is the cost of one function call and an allocation in a slab ? At 
several GHz, 50 ns ? And most of the time, blocking calls are 
implemented on top of non-blocking routines, so the CPU overhead is the 
same.

> I think that the theme from this thread should be that "it's a good thing
> that we have more than one MPI implementation, because they all do
> different things best."

I would say having more than one MPI implementation is a bad thing as 
long as you cannot easily replace one with another. Let's define a 
standard MPI header and a standard API for spawning and such, and then 
having more than one implementation will actually be manageable. That 
would also remove the need for swiss-army-knife MPI implementations 
that want to support all interconnects with the same binary. These 
implementations are, IMHO, a bad thing as they work at the lowest common 
denominator and are in essence inefficient for all devices.


While we are at it, here is my wish list for the next MPI specs:

a) only non-blocking calls. If there are no blocking calls, nobody will 
use them.
b) non-blocking calls for collectives too, there is no excuse. Yes, even 
an asynchronous barrier.
c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
with this convenience.
d) throw away the user defined datatypes, or at least restrict it to 
regular strides.
e) get rid of one-sided communications: anyone who is serious about them 
uses something like ARMCI or UPC or even low level vendor interfaces.


Rob, you are politically connected, could you make it happen, please ?
:-)

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From joachim at ccrl-nece.de  Tue Feb 15 00:20:48 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Tue, 15 Feb 2005 09:20:48 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211A95F.2010709@myri.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>	<42112719.4060500@verarisoft.com>	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com>
Message-ID: <4211B0E0.6030007@ccrl-nece.de>

Patrick Geoffray wrote:
> While we are at it, here is my wish list for the next MPI specs:
> 
> a) only non-blocking calls. If there are no blocking calls, nobody will 
> use them.

While this makes sense technically, nobody will probably offer an MPI 
implementation without MPI_Send for the next 20 years for compatibility 
reasons, so we can just forget about it.

> b) non-blocking calls for collectives too, there is no excuse. Yes, even 
> an asynchronous barrier.

No problem here - barrier_enter() and barrier_leave() are not new.

> c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
> with this convenience.

I think this could best be achieved with an assertion like those for 
one-sided and I/O. There are situations where ANY_SENDER is needed, or 
at least avoids large programming overheads.

> d) throw away the user defined datatypes, or at least restrict it to 
> regular strides.

This is nonsense: user-defined datatypes do not cause any overhead if 
you don't use them, there are ways to implement them very efficiently, 
and you can't do without them in many situations (like MPI-IO).

> e) get rid of one-sided communications: if someone is serious about it, 
> it uses something like ARMCI or UPC or even low level vendor interfaces.

Instead, I propose to rework the MPI one-sided communications for 
simpler and more flexible semantics. The current definition does not match 
today's network capabilities, but was designed to allow a simple 
implementation on slow/non-RDMA networks.

> Rob, you are politically connected, could you make it happen, please ?
> :-)

One person alone can't do this. The best place to discuss such things is 
the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy).

Also, adding mpi.h to the standard to define an ABI is a good thing.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From joachim at ccrl-nece.de  Tue Feb 15 00:53:37 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Tue, 15 Feb 2005 09:53:37 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
Message-ID: <4211B891.6020406@ccrl-nece.de>

Ashley Pittman wrote:
> If you had a bunch of sends to do to N remote processes then I'd expect
> you to post them in order (non-blocking) and wait for them all at the
> end, the time taken to do this should be (base_latency + ( (N-1) * M ))
> where M is the reciprocal of the "issue rate".  You can clearly see
> here that even for a small number of batched sends (even a 2d/3d nearest
> neighbour matrix) the issue rate (that is how little CPU the send call
> consumes) is at least as important as the raw latency.
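
The formula quoted above is easy to play with. The sketch below simply
evaluates base_latency + (N-1) * M; the function name and the two sets
of numbers are hypothetical and are not measurements of any
interconnect.

```python
def batched_send_time(n_sends, base_latency_us, issue_interval_us):
    """Time to complete n_sends batched non-blocking sends.

    issue_interval_us is M, the reciprocal of the issue rate, i.e. the
    CPU time spent per posted send.
    """
    if n_sends < 1:
        raise ValueError("need at least one send")
    return base_latency_us + (n_sends - 1) * issue_interval_us


if __name__ == "__main__":
    # 6 neighbours, as in a 3D nearest-neighbour exchange (made-up figures).
    print(batched_send_time(6, 5.0, 1.0))  # higher latency, cheap issue
    print(batched_send_time(6, 3.0, 2.0))  # lower latency, costly issue
```

With these made-up figures the first case takes 10.0 us and the second
13.0 us, so the lower-latency option loses once six sends are batched.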

This is an interesting issue. If you look at what Greg mentioned about 
dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted 
for ScaMPI on different interconnects (all(?) accessed through uDAPL), 
you see that the dumb interface SCI has the lowest latency for both 
pingpong and random, with random being about twice that of pingpong. In 
contrast, the "smart" NIC Myrinet, which has much lower CPU utilization, 
has twice the pingpong latency, and a slightly worse random-to-pingpong 
ratio.

Why this? Maybe better pipelining in SCI, because it's write-and-forget 
for the CPU, with 16 outstanding transactions on the network level, 
while Myrinet obviously behaves differently here (although GM should 
also be PIO-write to the NIC memory for small messages).

Then there is Infiniband, which has a much better random-to-pingpong 
ratio, which is striking.

Would be nice to see Quadrics or InfiniPath in this context.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From gmpc at sanger.ac.uk  Tue Feb 15 01:22:21 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Tue, 15 Feb 2005 09:22:21 +0000 (GMT)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
Message-ID: <Pine.OSF.4.44.0502150916260.3077604-100000@ecs2e.internal.sanger.ac.uk>

> PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).  My

> GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that

> Lustre (www.lustre.org) is another option.  This one is heavily funded by

You missed out GPFS from IBM. It is available at no cost to academic
institutions. You can use it with or without SAN hardware.

http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199


From joachim at ccrl-nece.de  Tue Feb 15 01:47:32 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Tue, 15 Feb 2005 10:47:32 +0100
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <42111EAF.5050709@myri.com>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>	<420580AD.5050003@myri.com>
	<420B264A.7050004@ccrl-nece.de> <42111EAF.5050709@myri.com>
Message-ID: <4211C534.7070608@ccrl-nece.de>

Patrick Geoffray wrote:
> For your curiosity, it was using an in-house MPI Pingpong (one message 
> at a time, not a bogus pipelined pingpong used to confuse people and 
> make big pipes look good). For very small messages, most of Pingpong
> codes are similar, 

...but not equal and give different results. Just compare PMB and mpptest.

> compiler has no impact (it was using the gcc that was 
> installed on the machine at that time). 

I experienced differences of more than 2 us depending on whether using 
shared or static libraries, compiler version/options etc. on both scalar 
and vector machines.

> For asymptotic bandwidth, the
> major difference is the way you compute 1 MB, either 1024*1024 Bytes, or 
> 1000000 Bytes. In the networking world, it tends to be 1000000 Bytes.
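
The size of the gap between the two conventions can be checked with a
few lines of Python. The transfer figure used here (250,000,000 bytes in
one second) is invented for the example.

```python
MB = 10**6    # decimal megabyte, 1000000 bytes
MIB = 2**20   # binary mebibyte, 1024*1024 bytes


def bandwidth(bytes_moved, seconds, unit):
    """Bandwidth in `unit` per second."""
    return bytes_moved / seconds / unit


if __name__ == "__main__":
    moved, elapsed = 250_000_000, 1.0   # hypothetical measurement
    print(bandwidth(moved, elapsed, MB))    # reported in MB/s
    print(bandwidth(moved, elapsed, MIB))   # reported in MiB/s
    print((MIB / MB - 1) * 100)             # percentage gap between units
```

The same measurement reads as 250 MB/s or roughly 238.4 MiB/s, a
difference of a little under 5%.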

I tend to use MB for 10^6 and MiB for 2^20. This is a somewhat official no

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From patrick at myri.com  Tue Feb 15 01:48:09 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Feb 2005 04:48:09 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211B0E0.6030007@ccrl-nece.de>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>	<42112719.4060500@verarisoft.com>	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>	<4211A95F.2010709@myri.com>
	<4211B0E0.6030007@ccrl-nece.de>
Message-ID: <4211C559.8070100@myri.com>

Joachim,

Joachim Worringen wrote:
> Patrick Geoffray wrote:
> 
>> While we are at it, here is my wish list for the next MPI specs:
>>
>> a) only non-blocking calls. If there are no blocking calls, nobody 
>> will use them.
> 
> 
> While this makes sense technically, nobody will probably offer an MPI 
> implementation without MPI_Send for the next 20 years for compatibility 
> reasons, so we can just forget about it.

Throw away compatibility. If you keep the legacy API, you have no 
incentive for change. I don't want MPI-3, I want MPI-light. We are 
against a wall because the MPI spec was too rich and developers took the 
lazy path.

The weight of legacy will make shared memory paradigms the only proposal 
for the next step. If you believe we have to support the whole MPI 
semantics in the next message passing standards, then we are doomed.

>> c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
>> with this convenience.
> 
> 
> I think this could best be achieved with an assertion like those for 
> one-sided and I/O. There are situations where ANY_SENDER is needed, or 
> at least avoids large programming overheads.

It's used because it's there, there is no other reason. If you don't 
know who sends you what in a message passing application, then you 
cannot get either performance or robustness. If you really cannot do 
otherwise (and I don't believe that), you can always use unexpected 
messages (post the receive after Probe()ing). That's ugly, but you get 
what you deserve :-)

>> d) throw away the user defined datatypes, or at least restrict it to 
>> regular strides.
> 
> 
> This is nonsense: user-defined datatypes do not cause any overhead if 
> you don't use them, there are ways to implemenent them very efficiently, 
> and you can't do without in many situations (like MPI-IO).

I knew this item would itch, you spent a lot of time working on that.

If you don't use user-defined datatypes, then you don't need them and they 
should not be there in the first place. It's a temptation, it's too 
easy. No, there is no way to implement them efficiently unless they are 
regular, and this is what I am willing to keep: strided types with long 
segments. Everything else leads to memory copies. The developer should 
wipe his own bottom instead of asking the message passing interface to 
work around bad data layout. Sending a column of blocks, yes, that's 
regular stride and it makes a lot of sense. Sending non-contiguous 
irregular structure ? As we used to say in France, $100 and a chocolate 
bar with that ?

Oh, BTW, I would gut MPI-IO and make it a separate interface. Only a small 
subset of applications use it and the core semantics are quite different 
from pure message passing. Man, it's not MPI, it's emacs...

>> e) get rid of one-sided communications: if someone is serious about 
>> it, it uses something like ARMCI or UPC or even low level vendor 
>> interfaces.
> 
> 
> Instead, I propose to rework the MPI one-sided communications for a more 
> simple and flexible semantic. The current definition does not match 
> todays network capabilities, but was designed to allow a simple 
> implemenentation for slow/non-RDMA networks.

I don't know about that. I would just take it out of the Message Passing 
Interface because it's not message passing. There would certainly be a 
need for a pure RMA interface, and there is already a lot of existing 
work and experience to build upon.

>> Rob, you are politically connected, could you make it happen, please ?
>> :-)
> 
> 
> One person alone can't do this. The best place to discuss such things is
> the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy).

Nothing that radical would ever come out of EuroPVM/MPI (I heard that 
Capri is a really nice place, I will definitely beg my boss) or any 
other users group.

> Also, adding mpi.h to the standard to define an ABI is a good thing.

Just achieving that would be beyond my greatest expectations. It would 
certainly be fun to watch. We could organize fist fights on the beach in 
Capri...

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Tue Feb 15 02:12:35 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Feb 2005 05:12:35 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211B891.6020406@ccrl-nece.de>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de>
Message-ID: <4211CB13.3050902@myri.com>

Joachim Worringen wrote:
> This is an interesting issue. If you look at what Greg mentioned about 
> dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted 
> for ScaMPI on different interconnects (all(?) accessed through uDAPL), 
> you see that the dumb interface SCI has the lowest latency for both, 

Which is the original hardware Scali built its MPI upon, btw.

> pingpong and random, with random being about twice of pingpong. In 
> contrast, the "smart" NIC Myrinet, which has much less CPU utilization, 
> has twice the pingpong latency, and a slightly worse random-to-pingpong 
> ratio.

No, it's not Myrinet, it's GM/Myrinet. There are many things that come 
from the GM side of the equation, believe me.

> Why this? Maybe better pipelining in SCI, because it's write-and-forget 
> for the CPU, with 16 outstanding transactions on the network level, 
> while Myrinet obviously behaves differently here (although GM should 
> also be PIO-write to the NIC memory for small messages).

Nope, no PIO for small messages with GM, DMA for everything.


A last remark. I really think that the argument of using the same 
Swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even 
MPI/Pro to infer interconnect characteristics is even worse than 
looking at latency and bandwidth alone. These implementations are never 
going to be designed to use all hardware efficiently; their design is 
either historic (Scali used to provide software for SCI alone) or 
politically motivated (Intel is using uDapl, hummm, wonder why), or both. 
They are by-products of the MPI forum's failure to make the Standard 
practical (compatible ABI).

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From jcownie at etnus.com  Tue Feb 15 05:06:56 2005
From: jcownie at etnus.com (James Cownie)
Date: Tue, 15 Feb 2005 13:06:56 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies 
In-Reply-To: Message from Patrick Geoffray <patrick@myri.com> 
	of "Tue, 15 Feb 2005 05:12:35 EST." <4211CB13.3050902@myri.com> 
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> 
Message-ID: <20050215130656.8572F1C818@amd64.cownie.net>


> A last remark. I really think that the argument of using the same
> Swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or
> even MPI/Pro to infer interconnect characteristics is even worse than
> looking at latency and bandwidth alone. These implementations are
> never going to be designed to use all hardware efficiently; their
> design is either historic (Scali used to provide software for SCI
> alone) or politically motivated (Intel is using uDapl, hummm, wonder
> why), or both. They are by-products of the MPI forum's failure to make
> the Standard practical (compatible ABI).

As someone who was on the MPI Forum, and sat through an awful lot of
meetings, I'd like to provide some justification for _why_ we didn't try
to make a binary standard.

1) At the time (over ten years ago), we would have been happy to have
   _one_ MPI implementation on a given machine, and we weren't expecting
   to have multiple MPIs on the same hardware. (It was by no means a
   foregone conclusion that MPI would succeed).

2) We didn't expect MPI to move into a commercial environment in which
   the people running the code wouldn't have the sources, and wouldn't
   be optimising for _their_ machine, which obviously requires
   recompilation, making an ABI irrelevant.

3) Not having a binary interface allows optimisations in the C MPI
   interface (such as using macros rather than functions in some
   places).

4) A binary interface based on no MPI implementation experience would
   likely be worse than no binary interface.

5) MPI is supposed to be machine and architecture independent, specifying
   a binary interface under those circumstances is hard. 

   Maybe you can do it if you leverage the C ABI, however it's not clear
   that that is ideal, since that either changes with time, or suffers
   from poor vision of the future too (e.g. look at the required
   alignment of double in the x86 ABI).

6) It was a hard enough job to agree on the source level
   specification. If we'd tried to add an ABI we'd probably still be
   stuck in the Bristol Suites :-) 

You seem to think (maybe subconsciously) that the MPI forum added
features to the standard just to make life hard for implementors and
to kill performance ;-)

I can assure you that that was not the case, and that the standard was a
compromise between features which users really wanted and what the
implementors felt they could reasonably provide. If the standard had not
provided things the users wanted (like wildcard receive), then it's
quite possible that this whole discussion would be moot because MPI would
by now be of only historical interest since the user community would
have ignored it.

If you _really_ believe that there is so much performance benefit for
your customers in having an MPI-light with the restrictions you outlined
which only runs on your hardware, then no-one's stopping you from
providing it. 

The market will decide...

-- 
-- Jim
--
James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From rross at mcs.anl.gov  Tue Feb 15 07:47:11 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Tue, 15 Feb 2005 09:47:11 -0600 (CST)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
	<33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com>
Message-ID: <Pine.LNX.4.58.0502150947000.1141@terra.mcs.anl.gov>

Thanks John!

Rob

On Tue, 15 Feb 2005, John Hearns wrote:

> > Yes!
> >
> > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).  My
> > group at ANL along with Clemson University and Ohio Supercomputer Center
> > and others are developing this.  It's entirely open source and open
> > development, and is in production use at ANL, OSC, and the University of
> > Utah CHPC, among other places.
> >
> > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that
> > RPMs are available for it now through one source or another.  This used to
> > be Sistina's product, who was subsequently bought by RedHat.  I'm sure
> > this is used in production in many business environments, and we use it at
> > ANL also.  Can someone provide a URL for this one?
> Source RPMs are of course available from RedHat,
> and you can get support for their version.
> 
> The Scientific Linux distribution has prebuilt RPMs
> ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/
> 
> 


From rross at mcs.anl.gov  Tue Feb 15 07:48:17 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Tue, 15 Feb 2005 09:48:17 -0600 (CST)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <Pine.OSF.4.44.0502150916260.3077604-100000@ecs2e.internal.sanger.ac.uk>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
	<Pine.OSF.4.44.0502150916260.3077604-100000@ecs2e.internal.sanger.ac.uk>
Message-ID: <Pine.LNX.4.58.0502150947220.1141@terra.mcs.anl.gov>

Hello Guy,

I wasn't aware that IBM would give that out for use on existing systems.  
Does anyone know the constraints under which they will provide such a 
copy?

Thanks,

Rob

On Tue, 15 Feb 2005, Guy Coates wrote:

> You missed out GPFS from IBM. It is free of charge for academic
> institutions. You can use it with or without SAN hardware.
> 
> http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html
> 
> Guy


From rross at mcs.anl.gov  Tue Feb 15 08:42:56 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Tue, 15 Feb 2005 10:42:56 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211A95F.2010709@myri.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com>
Message-ID: <Pine.LNX.4.58.0502151041540.1141@terra.mcs.anl.gov>

On Tue, 15 Feb 2005, Patrick Geoffray wrote:

> Rob, you are politically connected, could you make it happen, please ?
> :-)

If I had that level of connections, I'd be a DC lobbyist :).  Maybe sell 
off some national parks to the oil industry or something.

Rob


From gmpc at sanger.ac.uk  Tue Feb 15 08:56:10 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Tue, 15 Feb 2005 16:56:10 +0000 (GMT)
Subject: [Beowulf] Poor man's SANS
In-Reply-To: <Pine.LNX.4.58.0502150947220.1141@terra.mcs.anl.gov>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
	<Pine.OSF.4.44.0502150916260.3077604-100000@ecs2e.internal.sanger.ac.uk>
	<Pine.LNX.4.58.0502150947220.1141@terra.mcs.anl.gov>
Message-ID: <Pine.OSF.4.44.0502151550330.3077604-100000@ecs2e.internal.sanger.ac.uk>

On Tue, 15 Feb 2005, Rob Ross wrote:

> Hello Guy,
>
> I wasn't aware that IBM would give that out for use on existing systems.
> Does anyone know the constraints under which they will provide such a
> copy?

As an academic, you sign up for it under the IBM "scholars program".  It
comes at no cost but unsupported (well, best-efforts support via the GPFS
mailing list).

http://www-306.ibm.com/software/info/university/members/faq.html

If you want support or want a commercial license, then you have to pay
money.

The "official" GPFS hardware support matrix is pretty tight, but if you
don't care about support, you should find that it will run on pretty much
any sort of disk hardware.

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199


From joachim at ccrl-nece.de  Tue Feb 15 10:43:18 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Tue, 15 Feb 2005 19:43:18 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211CB13.3050902@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com>
Message-ID: <421242C6.2050800@ccrl-nece.de>

Patrick Geoffray wrote:
> A last remark. I really think that the argument of using the same 
> Swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even 
> MPI/Pro to infer interconnect characteristics is even worse than 
> looking at latency and bandwidth alone. These implementations are never 
> going to be designed to use all hardware efficiently; their design is 
> either historic (Scali used to provide software for SCI alone) or 
> politically motivated (Intel is using uDapl, hummm, wonder why), or both. 

The two most important things done to optimise performance of an MPI 
implementation for a hardware platform are:
- low-level pt-2-pt communication
- collective operations

AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard 
(partly naive) collective operations of MPICH. Considering this, plus 
the fact
- that it's not all that hard to use GM for pt-2-pt efficiently. We have 
done this in our MPI, too, with the same level of performance.
- that you probably do not know anything about ScaMPI's current internal 
design (Intel is MPICH2 plus some Intel-proprietary device hacking) and 
little about its performance (if this is wrong, let us know)
- that all code apart from the device, and also the device architecture 
  of MPICH-GM are more or less 10-year-old Swiss-army-knife MPICH code 
(which is not a bad thing per se)
you should maybe think again before judging the efficiency of other 
MPI implementations.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From lindahl at pathscale.com  Wed Feb 16 00:05:25 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 16 Feb 2005 00:05:25 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211B891.6020406@ccrl-nece.de>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de>
Message-ID: <20050216080525.GA3122@greglaptop.attbi.com>

On Tue, Feb 15, 2005 at 09:53:37AM +0100, Joachim Worringen wrote:

> This is an interesting issue. If you look at what Greg mentioned about 
> dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted 
> for ScaMPI on different interconnects (all(?) accessed through uDAPL), 
> you see that the dumb interface SCI has the lowest latency for both, 
> pingpong and random, with random being about twice of pingpong. In 
> contrast, the "smart" NIC Myrinet, which has much less CPU utilization, 
> has twice the pingpong latency, and a slightly worse random-to-pingpong 
> ratio.

I would make 2 comments about this:

First, you should be using the best MPI for each piece of hardware.
Hardware architects pick their interface with a software
implementation in mind. I don't expect any 3rd party MPI to get close
to PathScale's MPI latency on PathScale's hardware, unless the 3rd
party is flexible enough to change a lot of code.

Second, you really can't generalize about dumb NICs by looking at
SCI. SCI has a unique situation: its raw latency is much lower than
the MPI latency of all MPI implementations for it. I suspect no
hardware designer would be out to imitate that property! Both
InfiniPath and the Quadrics STEN (forgive me for classing this as
dumb, I happen to think dumb is a compliment...) get this right.

Third (you knew I couldn't keep to my promise of 2), I wouldn't make
any scaling generalizations based on a test with 16 nodes.  Even at
128-256 nodes the picture is quite different, and that's the sweet
spot that lots of today's clusters are at. So, if you want to make a
scaling generalization, you should be quoting 256-512 node results.

-- greg


From patrick at myri.com  Wed Feb 16 00:17:02 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 03:17:02 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108478089.4587.118.camel@s861954.sandia.gov>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com>
	<1108478089.4587.118.camel@s861954.sandia.gov>
Message-ID: <4213017E.7060302@myri.com>

Keith D. Underwood wrote:
>>c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
>>with this convenience.
> 
> 
> Um, our apps guys say this is more than a convenience.  Apparently,
> sometimes you don't exactly know who you are going to receive from. 
> Would you rather them post receives from 4000 nodes and cancel the ones
> that don't send to that node after a while?

No, I would not post any receives and let them come unexpected, using 
MPI_Probe() to post a matching receive when something shows up. It leaves 
the MPI implementation a way to move most of the matching to the send 
side for most of the messages and, if the receive is posted early 
enough, remove the need for host CPU on the receive side when the 
application is potentially computing.
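
As a rough illustration of the probe-then-receive pattern described above, here is a toy single-process model in Python. It does not call any MPI library: the Mailbox class and its deliver/probe/recv methods are hypothetical stand-ins for an unexpected-message queue, MPI_Probe(), and a receive posted with an explicit source and tag.

```python
class Mailbox:
    """Toy stand-in for one rank's unexpected-message queue (not a real MPI API)."""

    def __init__(self):
        self._unexpected = []  # messages that arrived before any receive was posted

    def deliver(self, source, tag, payload):
        """Simulate a message arriving with no receive posted for it."""
        self._unexpected.append((source, tag, payload))

    def probe(self):
        """Return (source, tag, nbytes) of the oldest pending message, or None."""
        if not self._unexpected:
            return None
        source, tag, payload = self._unexpected[0]
        return source, tag, len(payload)

    def recv(self, source, tag, buf):
        """Copy the first message matching an explicit (source, tag) into buf."""
        for i, (s, t, payload) in enumerate(self._unexpected):
            if s == source and t == tag:
                if len(payload) > len(buf):
                    raise ValueError("receive buffer too small")
                buf[:len(payload)] = payload
                self._unexpected.pop(i)
                return len(payload)
        raise LookupError("no matching message")


def drain(mailbox):
    """Probe first, then post a receive that names the sender explicitly."""
    received = []
    while True:
        envelope = mailbox.probe()
        if envelope is None:
            break
        source, tag, nbytes = envelope
        buf = bytearray(nbytes)         # size the buffer from the probe result
        mailbox.recv(source, tag, buf)  # no wildcard: source and tag are known
        received.append((source, tag, bytes(buf)))
    return received
```

The point of the sketch is only that every receive names its source, so nothing in the matching path has to handle a wildcard.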

And you remind me, I would ban MPI_Cancel also. It should have been the 
item #1 :-)

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Wed Feb 16 00:39:17 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 03:39:17 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108479093.4587.132.camel@s861954.sandia.gov>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4211C559.8070100@myri.com>
	<1108479093.4587.132.camel@s861954.sandia.gov>
Message-ID: <421306B5.3080200@myri.com>

Hi Keith,

Keith D. Underwood wrote:
> Inertia is a powerful thing.  Billions of dollars have been invested in
> MPI codes.  Changing that will not be easy (or cheap).  This is not as
> simple as moving from vectors to distributed memory - there wasn't
> nearly as much accumulated code then (and, it hurt back then). 

I would not drop the whole MPI standard, I would define a subset that is 
the recommended API for performance. If your code is too old, link with 
a legacy MPI lib. If it's coded with the subset, link either with a 
legacy MPI lib and it works, or link with the optimized MPI lib and see 
what the MPI implementation can deliver.

>>It's used because it's there, there is no other reason. If you don't 
>>know who sends you what in a message passing application, then you 
>>cannot get either performance or robustness. If really you cannot do 
>>otherwise (and I don't believe that), you can always use unexpected 
>>messages (post the receive after Probe()ing), That's ugly, but you get 
>>what you deserved :-)
> 
> 
> That just isn't true.  If I don't know how many messages I will get, or
> from whom, but I can bound it, then I should prepost those receives. 
> This is particularly true in your standard physics code that runs for
> days and does thousands of time steps. (i.e. you can maintain a circular
> queue of these things).

A few years back, I looked at a lot of real world code to see if 
triggering the communication from the receive side could be worth it, ie 
if most of the messages did not use ANY_SENDER. I was amazed that the 
vast majority of the messages sent across many applications used the tag 
to discriminate on the sender among other things, not the source. For 
the couple of large codes I dissected (sorry, don't remember the names 
right now), there was no rationale. I guess doing bookkeeping on the 
source and the tag was too much for the developer(s).

You can still do the receive-pull optimization and fall back on 
sender-push when you see a receive with ANY_SENDER, but if ANY_SENDER is 
the common case, that's useless. The best way to force developers to 
write code that can leverage optimization in the MPI lib is to remove 
the source of the ambiguity. So ANY_SENDER in the legacy API, not in the 
subset.
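
To make the fallback concrete, a minimal sketch of the protocol choice follows. ANY_SENDER here is a hypothetical constant standing in for the wildcard (the standard spells it MPI_ANY_SOURCE), and the protocol names are just labels, not calls into any real MPI library.

```python
ANY_SENDER = -1  # hypothetical wildcard value, standing in for MPI_ANY_SOURCE


def choose_protocol(source):
    """Pick receive-pull when the sender is named, sender-push for a wildcard."""
    return "sender-push" if source == ANY_SENDER else "receive-pull"


def pull_fraction(posted_sources):
    """Fraction of posted receives that could take the receive-pull path."""
    if not posted_sources:
        return 0.0
    pulls = sum(1 for s in posted_sources if choose_protocol(s) == "receive-pull")
    return pulls / len(posted_sources)
```

Run over a trace of posted receives, pull_fraction() gives a quick estimate of how much of the traffic the optimization could cover; if wildcards dominate, the number is near zero and the optimization buys nothing.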

> The user should always expose as much opportunity for optimization as
> possible to the MPI layer.  e.g. a load-store architecture like the X1
> (not what I am advocating for MPI performance, mind you) could do
> excellent datatype processing.  You would rather the user do the
> gather/scatter themselves to prohibit the MPI from being able to do it?

In general yes, more opportunities for optimization is better. Now, 
assuming that irregular datatypes can be optimized as much as regular 
ones is wrong. The hardware can gather/scatter better than the 
application for nice long strides. However, MPI libs should print 
insults when tiny segments are used (when the scatter/gather efficiency 
collapses). The developer assumes that it's fine because he does not 
know or he does not care.
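
A back-of-the-envelope model shows why tiny segments hurt. The per-segment overhead and link bandwidth below are made-up illustrative defaults, not measurements of any NIC, and the function name is hypothetical.

```python
def gather_efficiency(segment_bytes,
                      per_segment_overhead_s=0.5e-6,   # assumed fixed cost per segment
                      bandwidth_bytes_per_s=250e6):    # assumed link bandwidth
    """Fraction of the time spent moving data rather than paying per-segment overhead."""
    transfer_s = segment_bytes / bandwidth_bytes_per_s
    return transfer_s / (per_segment_overhead_s + transfer_s)


if __name__ == "__main__":
    # Print the modelled efficiency for a range of segment sizes.
    for size in (8, 64, 1024, 65536):
        print(size, round(gather_efficiency(size), 3))
```

Under these assumptions long strides run close to full bandwidth, while 8-byte segments spend almost all of their time on overhead.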

I advocate to hide the guns instead of letting the developer shoot 
himself in the foot.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Wed Feb 16 02:07:27 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 05:07:27 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <421242C6.2050800@ccrl-nece.de>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>	<4211B891.6020406@ccrl-nece.de>
	<4211CB13.3050902@myri.com> <421242C6.2050800@ccrl-nece.de>
Message-ID: <42131B5F.8040100@myri.com>

Joachim Worringen wrote:
> AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard 
> (partly naive) collective operations of MPICH. Considering this, plus 
> the fact

Replacing the collectives from MPICH-1 was not high on the todo list 
because there were more important things to optimize, with more effect 
on applications than the scheduling of some collectives. For scaling 
real codes on large machines, your priority is not there, not enough 
bang for your time.

> - that it's not all that hard to use GM for pt-2-pt efficiently. We have 
> done this in our MPI, too, with the same level of performance.

You then have no idea how hard it is to use GM efficiently and *correctly*. 
Enough to run pingpong ? Sure, that's a piece of cake. But how to recover 
from fatal errors on the wire, from resource exhaustion, how to avoid 
spending most of your time pinning/unpinning pages, how not to thrash the 
translation cache on the NIC, etc ? Did you address all of these issues 
in your MPI ? Maybe, but it requires some design characteristics that 
would be higher than the device layer. At some point you have to make 
choices, and in a Swiss-Army-Knife (SAK) implementation, you choose the 
common ground, or the existing ground.

> - that you probably do not know anything on ScaMPI's current internal 

True, I know zip about ScaMPI design. This is exactly why I don't know 
how they use GM. Without knowing that, how can you infer hardware 
characteristics from benchmark results ?!?

> design (Intel is MPICH2 plus some Intel-proprietary device hacking) and 
> little about its performance (if this is wrong, let us know)

Intel MPI is MPICH2 plus some multi-device glue. Intel got something 
right in their design: they ask the vendor to provide the native device 
layers instead of doing everything themselves. That's how a (SAK) 
implementation could actually be decent. However, the reference 
implementation is using uDapl. That means that there is stuff above the 
device layers that is needed to make the MPI-over-uDapl performance 
decent. Some of it can be used for other devices, the rest not. The 
question is: if I need something above the device layer to make my 
stuff decent, could I have it ? I would think so. Now, if it conflicts 
with something needed for another device, what happens ? Someone makes a 
choice.

> - that all code apart from the device, and also the device architecture 
>  of MPICH-GM are more or less 10-year-old Swiss-army-knife MPICH code 
> (which is not a bad thing per se)

MPICH-1 is not a SAK. You cannot take an MPICH binary and run it on all 
of the devices on which MPICH has been ported. You can *compile* it on 
multiple targets, but nothing more.

Furthermore, many ch2 things were not used in ch_gm. If you look at it, 
most of the common code of MPICH is not performance related, with the 
exception of the collectives (and again they are not that bad). MPICH-2 
has been moving more things to the device-specific part, that's the good 
direction.

> you should maybe think again before judging on the efficiency of other 
> MPI implementations.

I could not care less about the efficiency of other MPI implementations. 
  None of my business. My point is that assuming that using a SAK MPI 
implementation factors out the software part, so that all remaining 
performance differences are hardware related, is ridiculous. As Greg pointed 
out, an interconnect is a software/hardware stack, all the way to the 
MPI lib. Throw away the native MPI lib and you have a lame duck. Compare 
lame ducks and you go nowhere.


When you have a commercial MPI, you don't have much choice but to 
support many interconnects. You cannot ask the vendors to write their 
part unless you are Intel, so you write it yourself. You do your best, 
because you need to sell your stuff, and you call it good. Is there a 
value ? Today yes, because it makes life easier to have binary 
compatibility. However, my second point is that binary compatibility 
should be addressed by the MPI community, not by commercial MPI 
implementations.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Wed Feb 16 02:14:45 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 05:14:45 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4212087F.6070809@verarisoft.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>	<4211B891.6020406@ccrl-nece.de>
	<4211CB13.3050902@myri.com> <4212087F.6070809@verarisoft.com>
Message-ID: <42131D15.4020305@myri.com>

Rossen Dimitrov wrote:

> Patrick, this is quite a broad statement. 4 years ago we had a paper 
> arguing that MPI's written to support many different interconnects and 
> messaging technologies through internal portability layers were probably 
> sub-optimal for at least some of the interconnects. Most of the reasons 

Yes, it's very logical. See my reply to Joachim: I don't criticize the 
existence of SAK implementations (actually, yes, a little); all 
commercial implementations are essentially Swiss-army-knives, they have 
to be. My problem is with using results from a single MPI implementation to 
connect dots at the hardware level. You don't know if the dots are from 
the MPI or the hardware, or both.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From eugen at leitl.org  Wed Feb 16 02:31:53 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 16 Feb 2005 11:31:53 +0100
Subject: [Beowulf] Mare Nostrum (not quite COTS)
Message-ID: <20050216103153.GH1404@leitl.org>


http://www-106.ibm.com/developerworks/library/pa-nl3-marenostrum.html

Power Architecture Community Newsletter, 15 Feb 2005: MareNostrum: A new
concept in Linux supercomputing		

Level: Introductory

developerWorks Power Architecture editors
IBM
15 Feb 2005

    The MareNostrum supercomputer at the Barcelona Supercomputing Center,
ranked number four in the world in speed in November 2004, is constructed of
such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit
970FX PowerPC processors, TotalStorage DS4100 storage servers, and Linux 2.6.
This is its story.

IBM has long been a supercomputing leader -- its heritage of innovation
currently and spectacularly manifested in its most powerful supercomputer,
Blue Gene/L. The MareNostrum project is the latest bold experiment in
supercomputing by IBM -- a small but powerful, rapidly deployed and built
system that comes entirely from commercially available components. The Latin
term mare nostrum means "our sea" (which to the Romans meant the
Mediterranean, as familiar and available to the Italici as the air they
breathed, but also the critical key to their success).

MareNostrum is one of the world's most powerful supercomputers, ranked among
the top five in the prestigious TOP500 (see Resources), yet it is constructed
from products available for sale to any business, lives within a relatively
small footprint, and was built on a tight schedule using blade servers, a
Linux operating environment, and other cost-efficient technologies.
MareNostrum represents a new way of thinking about high-performance
computing.

Blade servers, some of the most thin and dense machines that can be slid into
chassis with the ability to share resources such as power and network switches,
became the base components of this supercomputer design. Those familiar with
the IBM BladeCenter JS20 servers' shared-resources architecture will
recognize how these servers cost-effectively minimize power consumption and
heat output. Running the Linux operating system, the servers exploit the
capabilities of the 2.6 kernel on 64-bit PowerPC processors.

MareNostrum also demonstrates something very unique in its project timeline:
Part of its mission was to prove the speed at which IBM Linux clusters could
be implemented and unleashed. According to the IBM MareNostrum e-Science
Lead, Dr. Juan Jose Porta (Open Systems Design and Development, IBM
Boeblingen Laboratory):

    This is all about timely and focused execution. The speed at which this
project was realized is important. Consider: from the initial concept in late
December of 2003 to assembling the computer in Madrid took less than a year.
Normally, supercomputer projects of this kind take years. 

To make a remarkable saga short, MareNostrum is here and will soon be put
into operation by the Barcelona Supercomputing Center (BSC), a public
consortium created by the Spanish Government, the Catalonian Government, and
the Technical University of Catalonia (UPC), the hosts of the MareNostrum
supercomputer. The Barcelona Supercomputing Center is located on the
Polytechnic University of Catalonia (UPC) campus in Barcelona.

Dr. Porta added, "The supercomputer is based upon commodity technology
already developed and available. We were also playing with another piece of
magic -- an open environment. This has been a collaborative community effort,
where we closely worked with our partners."

The name and the history
Why "MareNostrum?" In the words of Dr. Porta:

    MareNostrum means literally "our sea," which is also the Latin name for
the Mediterranean Sea on which Barcelona is a port. It carries other apt
connotations. "Our sea" refers to a sea of processors and professors who are
flocking to the MareNostrum project with a deep commitment to breakthrough
science. MareNostrum also refers to the fact that our supercomputer is on the
shores of the Mediterranean which, in the days of old Rome, was the middle of
the world. This was the center of the Roman Empire, now to become the center
of European e-Science on the shores of the nice Mediterranean Sea! Thus, we
are talking about an ocean of many professors and a major hub around which
such facilitation will grow and thrive to empower a new generation of
scientists.

    Another significant aspect of the name is that, being Latin, it is more
culturally inclusive. Not everyone is aware that Spain actually has four
official languages, and we did not want to slight anyone. Latin was a safe
choice. Spain now understandably becomes the proud home to the most powerful
supercomputer in Europe. We see references to its having been assembled in
Madrid, but also references to its permanent home as being in Barcelona. 

MareNostrum is a result of the burgeoning partnership between IBM and the
Spanish Government, which has also led to the creation of the Barcelona
Supercomputing Center (BSC). BSC is a public consortium created by the
Spanish Government, the Catalonian Government, and the Technical University
of Catalonia (UPC), which will host the MareNostrum supercomputer.

Housed in a majestic 1920s chapel on the university grounds, MareNostrum
has a dual purpose: to serve as a primary high-performance computing
resource for the European e-science community and to demonstrate the many
benefits of Linux on POWER at scale.

Meet MareNostrum
With peak system performance of 40 teraflops for the final system
configuration, and a number four spot on the TOP500 list, MareNostrum
continues the IBM tradition of high-performance computing breakthroughs in
the service of scientific advancement with a twist: MareNostrum is built
entirely of commercially available components, including:

    * 2,282 IBM eServer BladeCenter JS20 blade servers housed in 163
      BladeCenter chassis
    * 4,564 64-bit IBM PowerPC 970FX processors
    * 140 TB of IBM TotalStorage DS4100 storage servers

The thinking behind MareNostrum's construction represents a new way of
looking at these and other compute-intensive areas. Today's typical
high-performance computing installation runs a large, parallel RISC-based
UNIX system with performance instead of reliability being of utmost
importance. MareNostrum, however, is a small-footprint Linux cluster made up
entirely of off-the-shelf components. With the extreme density of IBM eServer
BladeCenter JS20 servers, diskless nodes, and an open system environment,
MareNostrum offers superior price/performance; greater reliability,
availability, and serviceability; and significant cost efficiencies --
factors that are endearing Linux-based cluster servers to more and more
businesses all the time.

Distinguishing technologies
The next sections explain the hardware and software technologies that
distinguish the high-performance computing strategy behind MareNostrum.

Hardware: Servers
There are 2,282 IBM eServer BladeCenter JS20 servers housed in 163
BladeCenter chassis. Each blade server has two PowerPC 970 processors
running at 2.20GHz, providing superior performance for several varieties of
Linux. The BladeCenter technology offers the highest commercially available
computer density in the industry, which results in high performance with a
small footprint. The BladeCenter technology allows for 84 dual processor
servers in a single 42 U rack, giving more than 1.4 teraflops of compute
power in a single rack.
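
As a rough cross-check of the figures above, the sketch below recomputes the per-rack and full-system peak performance. It assumes 4 double-precision floating-point operations per cycle per processor (two floating-point units, each issuing a fused multiply-add) and 14 blades per BladeCenter chassis; neither number is stated in the article, and the names are illustrative.

```python
# Back-of-the-envelope peak-performance check for the figures quoted above.
# ASSUMPTIONS (not from the article): 4 flops/cycle per processor and
# 14 blades per BladeCenter chassis.

CLOCK_GHZ = 2.2            # from the article
FLOPS_PER_CYCLE = 4        # assumption: 2 FPUs x fused multiply-add
CPUS_PER_BLADE = 2         # from the article
BLADES_PER_RACK = 84       # from the article
BLADES_PER_CHASSIS = 14    # assumption
TOTAL_BLADES = 2282        # from the article


def peak_tflops(blades: int) -> float:
    """Theoretical peak in teraflops for a given number of blades."""
    gflops_per_cpu = CLOCK_GHZ * FLOPS_PER_CYCLE
    return blades * CPUS_PER_BLADE * gflops_per_cpu / 1000.0


rack_tflops = peak_tflops(BLADES_PER_RACK)
system_tflops = peak_tflops(TOTAL_BLADES)
chassis_needed = TOTAL_BLADES / BLADES_PER_CHASSIS

print(f"per rack:    {rack_tflops:.2f} TFlops")
print(f"full system: {system_tflops:.1f} TFlops")
print(f"chassis:     {chassis_needed:.0f}")
```

Under those assumptions the results are consistent with the article's "more than 1.4 teraflops" per rack, 40 teraflops for the final system, and 163 chassis.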

Hot-swappable JS20 servers also allow administrators to change servers
without disrupting applications, maximizing availability. The BladeCenter
shared-resources architecture helps to minimize power consumption and heat
output as well.

Hardware: Storage
MareNostrum's storage subsystem consists of 20 storage server nodes with 7
terabytes of capacity each or 140 terabytes of total capacity. Its backbone
is the IBM TotalStorage DS4100 storage server which, like the BladeCenter
JS20, uses redundant hot-swappable components for high availability. IBM
TotalStorage DS4100 technology enables tremendous scalability and a wide
range of RAID data protection options.

Hardware: Switching
Four Myrinet switch frames, comprising 10 CLOS 256+256 switches and 2
Spine 1280s, together with densely bundled Myrinet cabling, enable faster
parallel processing with less switching hardware. The redundant
hot-swappable power supply ensures greater availability. The complete
switch, with 12 chassis, provides 2,560 uniform ports. This uniformity
simplifies the programming model so researchers can focus on their programs
and not the system interconnect architecture.
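
The port count can be sanity-checked with a few lines of arithmetic. This sketch assumes that "256+256" means 256 host-facing ports plus 256 uplink ports per CLOS switch and that each Spine 1280 terminates 1,280 uplinks; the article does not spell this out, so treat the interpretation as a guess.

```python
# Port arithmetic for the Myrinet fabric described above.
# ASSUMPTION: "256+256" = 256 host ports + 256 uplinks per CLOS switch,
# and each "Spine 1280" terminates 1,280 uplinks.

CLOS_SWITCHES = 10
HOST_PORTS_PER_CLOS = 256
UPLINKS_PER_CLOS = 256
SPINE_SWITCHES = 2
PORTS_PER_SPINE = 1280
BLADES = 2282  # from the article

host_ports = CLOS_SWITCHES * HOST_PORTS_PER_CLOS
uplinks = CLOS_SWITCHES * UPLINKS_PER_CLOS
spine_ports = SPINE_SWITCHES * PORTS_PER_SPINE

print(f"host-facing ports: {host_ports}")
print(f"uplinks: {uplinks}, spine ports: {spine_ports}")
print(f"ports left after one per blade: {host_ports - BLADES}")
```

With that reading, the 10 edge switches account for the 2,560 uniform ports and their uplinks exactly fill the two spines.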

Software: The power of Linux on POWER
The Linux 2.6 kernel offers an array of enterprise and performance features
that exploit the Power Architecture. The virtualization capabilities of
Linux on POWER allow for more flexible partitioning, better balancing of
workloads, and superior scalability should workloads increase. Dr. Porta
explained, "It is the Linux 2.6 kernel which offers an array of enterprise
and performance features that exploit the Power Architecture."

Software: Diskless Image Management (DIM)
DIM is a prototype utility for managing the Linux distribution for the
compute nodes on the storage servers so that the compute node does not have
to manage the root file system. All the files for operation are obtained
through the cluster network. Because of this, blades can operate immediately
without Linux installation. This is on-demand operation. The blades do have a
disk drive but that is reserved for future application use such as
checkpointing. DIM also supports the network boot environment in a highly
distributed fashion.

Software: IBM Linux on POWER clustering technologies
The goal is to endow MareNostrum with the same benefits businesses in many
industries derive from IBM Linux clusters, albeit on a larger scale. These
benefits include:

    * Superior density and improved operating efficiency, including smaller
      space, power, and cooling requirements and related costs -- thanks to
      the BladeCenter JS20 architecture
    * Record price/performance and system throughput for high-performance
      computing workloads thanks to innovative POWER semiconductor
      technology, specifically the eight-way superscalar design of the
      PowerPC 970FX processor which fully supports symmetric multi-processing
      (SMP)
    * The leading IBM 64-bit POWER microprocessors are capable of addressing
      four billion times the amount of physical memory as traditional 32-bit
      processors without resorting to complex memory-extension techniques.
    * Better systems management control thanks to embedded service processors
      and software image management
    * Increased reliability, availability, and serviceability, as well as
      lower installation and maintenance costs -- provided by diskless
      compute nodes
    * Improved functionality and performance thanks to the Linux 2.6 kernel
    * Reduced switching hardware requirements and faster parallel processing
      provided by Myrinet switch cabling
    * Improved storage subsystem costs and reliability thanks to TotalStorage
      DS4100 storage technology
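
The "four billion times" figure in the list above is simply the ratio of a 64-bit address space to a 32-bit one. The snippet below only compares address widths; how much physical memory a particular processor or board can actually address is typically less and is not stated in the article.

```python
# Ratio of a 64-bit address space to a 32-bit address space.
BYTES_64BIT = 2**64
BYTES_32BIT = 2**32

ratio = BYTES_64BIT // BYTES_32BIT
print(f"{ratio:,}")  # 4,294,967,296, i.e. roughly four billion
```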

View from the crow's nest
When the power of MareNostrum is unleashed later this year, it will be at the
service of scientific, engineering, and medical researchers in the Spanish
and international scientific communities. Its to-do list includes issues that
are familiar in the supercomputing world, such as protein folding, in silico
(computer generated) drug screening and enzymatic reactions. MareNostrum will
be used to support basic and applied research in areas that include biology,
chemistry, physics, and information-based medicine.

As Dr. Porta summed up:

    ...[T]he very thinking that drove MareNostrum's construction is a new way
of looking at compute-intensive areas, particularly in the life sciences, as
we prepare new work to resolve challenging problems in information based
medicine -- including improvements in diagnostic and therapeutic treatments
in hospitals. In the EU context, many of the projects will be conducted in
collaboration with other leading European research institutions. We are
building collaborative efforts across geographic borders and disciplines. And
remember -- the name of the supercomputer is MareNostrum. Traditionally, it
was the Mediterranean Sea which allowed commerce and communication to
flourish in Europe and beyond. 

Resources

    * Visit the Project MareNostrum site, demonstrating the value of Linux
      clustering for science, for business, for life itself.

    * MareNostrum is now at home at the Barcelona Supercomputing Center (BSC)
      on the Polytechnic University of Catalonia (UPC) campus in Barcelona, a
      prestigious public institution focused on higher education, research,
      and technology transfer.

    * The TOP500 Supercomputer Sites project was started in 1993 to provide a
      reliable basis for tracking and detecting trends in high-performance
      computing -- twice a year, the project releases a list of the 500 sites
      operating the most powerful computer systems.

    * See this chart for the Linpack benchmark for MareNostrum and others.

    * This news article examines MareNostrum, IBM's top-ranked,
      off-the-shelf, blade-based supercomputer.

    * Connecting two or more IBM eServer Cluster Servers can create a single,
      unified computing resource that will dramatically improve availability,
      flexibility, and adaptability for essential services.

    * The IBM BladeCenter JS20 is well suited for commercial mainstream
      applications and 64-bit high performance computing (HPC) environments.

    * The IBM Redbook, The IBM eServer BladeCenter JS20, takes an in-depth
      look at the two-way Blade eServer for applications requiring 64-bit
      computing.

    * The Linux on IBM eServer product line is Linux-enabled to deliver
      maximum performance, reliability, manageability, and price/performance
      benefits.

    * See this site for more on how IBM supercomputing solutions can help
      remove the barriers to deployment of clustered server systems.

    * IBM TotalStorage DS4000 series has been enhanced with the DS4000 Storage
      Manager V9.10, enhanced remote mirror option, DS4100 option for larger
      capacity configurations, and support for EXP100 serial ATA expansion
      units.

    * Take a look at the Myrinet switches used in MareNostrum.

About the author
The developerWorks Power Architecture editors welcome your comments on this
article. E-mail them at dwpower at us.ibm.com.


-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From patrick at myri.com  Wed Feb 16 02:53:03 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 05:53:03 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108477871.4587.115.camel@s861954.sandia.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com>
	<1108477871.4587.115.camel@s861954.sandia.gov>
Message-ID: <4213260F.5040303@myri.com>

Keith,

Keith D. Underwood wrote:
>>Looking for overlapping is actually not that hard:
>>a) look for medium/large messages, don't waste time on small ones.
> 
> 
> I contend that this particular item is bad advice.  If you send a lot of
> small messages, you should use MPI_Isend there as well to give the MPI
> implementation every opportunity to do the right thing.  As we go
> forward, end-to-end acknowledgments are going to become a reality.  The

I agree. We are strongly considering acking at the lib level instead of 
at the firmware level in MX. It has many good side effects, and a few 
evil ones.

> last thing you want is to spend a round-trip delay on every message you
> send if you send a lot of them.  Yes, the implementation can copy on the
> sending side to allow the send to complete, but that wastes memory and
> time.  

If you are reliable, you need to be able to resend the data if you don't 
receive the ack in time. If you don't want to do a copy, you have to 
wait for the ack before releasing the send buffer. For small messages, 
the copy is cheaper than the rtt, IMHO.
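
To put rough numbers on the copy-versus-round-trip argument, here is a minimal cost model. The 10 microsecond round trip and 1 GB/s copy bandwidth are made-up illustrative values, not measurements of MX or any other stack, and the function names are hypothetical.

```python
# Toy cost model: copy the send buffer now, or hold it until the ack arrives?
# ASSUMPTIONS: the RTT and memcpy bandwidth below are illustrative only.

def copy_time_us(nbytes: int, memcpy_gb_per_s: float) -> float:
    """Time to copy nbytes into a library-owned buffer, in microseconds."""
    return nbytes / (memcpy_gb_per_s * 1e9) * 1e6


def crossover_bytes(rtt_us: float, memcpy_gb_per_s: float) -> float:
    """Message size at which the copy costs as much as one round trip."""
    return rtt_us * 1e-6 * memcpy_gb_per_s * 1e9


RTT_US = 10.0          # assumed round-trip time
MEMCPY_GB_PER_S = 1.0  # assumed copy bandwidth

for size in (64, 1024, 65536):
    cost = copy_time_us(size, MEMCPY_GB_PER_S)
    verdict = "copy" if cost < RTT_US else "wait for ack"
    print(f"{size:>6} bytes: copy {cost:8.3f} us vs rtt {RTT_US} us -> {verdict}")

print(f"crossover near {crossover_bytes(RTT_US, MEMCPY_GB_PER_S):.0f} bytes")
```

With these placeholder values the copy wins for anything well under about 10 KB, which is the shape of the argument above; real crossover points depend on the hardware.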

Are you saying that if someone uses Isend to send small messages, it's a
hint that avoiding the copy is worth it because he is trying to overlap
and does not care about latency? Yes, that would be logical. But then
you need blocking Send to hint the reverse, and you have to assume that
smart people will use blocking Send because they know latency matters at
that place, whereas clueless people will use it because it's simpler
than Isend.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Wed Feb 16 03:28:00 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 06:28:00 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4212182C.60607@verarisoft.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com>
	<4212182C.60607@verarisoft.com>
Message-ID: <42132E40.1060001@myri.com>

Rossen,

Rossen Dimitrov wrote:
> 
>>
>> So if you run an MPI application and it sucks, this is because the 
>> application is poorly written ?
> 
> 
> Patrick, here the argument is about whether and how you "measure" the 
> "performance of MPI". I guess you may have missed some of the preceding 
> postings.

No, I was pulling your leg :-) The bigger picture is that MPI has no
performance in itself; it's middleware. You can only measure the way
an MPI implementation enables a specific application to perform. Only
benchmarking of applications is meaningful; you can argue that
everything else is futile and bogus.

>> You don't want to benchmark an application to evaluate MPI, you want 
>> to benchmark an application to find the best set of resources to get 
>> the job done. If the code stinks, it's not an excuse. Good MPI 
>> implementations are good with poorly written applications, but still 
>> let smart people do smart things if they want.
> 
> 
> This is exactly my point made in my previous posting - you cannot design 
> a system that is optimal in a single mode for all cases of its use when 
> there are multiple parameters defining the usage and performance 

I agree completely: being able to apply different assumptions to the
whole code and see which one best matches the application's behavior is
better than nothing. However, I believe that some tradeoffs are just too
intrusive: you should not have to choose between low latency for small
messages and progress by interrupt for large ones, especially when you
can have both at the same time.

> I think it is fairly easy to show that overlapping and polling (or any 
> kind of communication completion synchronization) are not orthogonal. If 
> this was the case, you would see codes that show perfect overlapping 
> running on any MPI implementation/network pair. I am sure there is 
> plenty of evidence this is not the case.

I can show you codes where people sprinkled some MPI_Test()s in some
loops. They don't poll to death, just a little from time to time to
improve overlap by improving progression. They poll and they overlap.
They could just as well block and not overlap. Polling versus blocking
and overlapping versus not are independent choices. Interrupts are useful
for getting overlap without help from the application, but they are not
required for overlap.
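
The "sprinkle a few tests in the loop" idea can be shown with a toy model. The class below is NOT a real MPI request; it is a hypothetical stand-in that mimics a library which only makes progress when the application calls into it, which is the situation where occasional MPI_Test() calls help.

```python
# Toy model of driving communication progress by occasional polling.
# FakeRequest is a hypothetical stand-in, not an MPI API.

class FakeRequest:
    """Pretend non-blocking send that advances one chunk per test() call."""

    def __init__(self, chunks: int) -> None:
        self.remaining = chunks

    def test(self) -> bool:
        """Make a little progress; return True once the transfer is done."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def compute_with_polling(req: FakeRequest, iterations: int, poll_every: int) -> int:
    """Run a compute loop, polling now and then; return chunks still pending."""
    for i in range(1, iterations + 1):
        _ = i * i  # placeholder for the real computation
        if i % poll_every == 0:
            req.test()
    return req.remaining


left_with_polling = compute_with_polling(FakeRequest(chunks=8), 1000, poll_every=100)
left_without = compute_with_polling(FakeRequest(chunks=8), 1000, poll_every=10**9)

print(f"pending after loop, polling every 100 iterations: {left_with_polling}")
print(f"pending after loop, never polling:                {left_without}")
```

In the polled run the whole transfer is hidden behind the computation; in the unpolled run all of it is still waiting when the loop ends and would be paid for in the final wait.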

> There is an important point here that needs to be clarified: when I say 
> "polling" library, I assume that this library does both: polling 
> completion synchronization and polling progress. There is not much room 
> to define here these but I am sure MPI developers know what they are.

I think this is where we don't understand each other. For me, polling
means no interrupts. Whether you progress in the context of MPI calls
or in the context of a progression thread, you pay for the same CPU
cycles. If the application is providing CPU cycles to the MPI lib at the
right time, you can overlap perfectly without wasting cycles.

> Here is a third one. Writing your code for overlapping with non-blocking 
> MPI calls and segmentation/pipelining, testing the code, and not seeing 
> any benefit of it.

Yes. This is very true. But if it's not worse than with blocking, they 
should stick with non-blocking, even if it's bigger and more confusing.

> stage I with communication in stage I+1. Then, there is the question how 
> many segments you use to break up the message for maximum speedup. The 
> pipelining theory says the more you can get the better, when they are 
> with equal duration, there aren't inter-stage dependencies, and the 
> stage setup time is low in proportion to the stage execution time. Also, 

The more steps, the more overhead. Small pipeline stages decrease your 
startup overhead (when the second stage is empty) but increase the 
number of segments and the total cost of the pipeline. The best is to 
find a piece of computation long enough to hide the communication. 
Pipelining would be overkill in my opinion.
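
The tradeoff between startup cost and per-segment overhead can be sketched with a simple two-stage pipeline model. It assumes equal-length compute and communication stages and a fixed setup cost per segment per stage; the numbers are arbitrary units chosen for illustration, not measurements.

```python
# Two-stage pipeline model: more segments shrink the startup bubble but
# add per-segment overhead. ASSUMPTIONS: equal stage durations, fixed
# setup cost per segment, arbitrary time units.

def pipelined_time(n_segments: int, work: float, setup: float) -> float:
    """Total time for a two-stage pipeline over n_segments segments.

    Each stage handles one segment in work / n_segments + setup, and a
    two-stage pipeline needs n_segments + 1 such steps to drain.
    """
    step = work / n_segments + setup
    return (n_segments + 1) * step


WORK = 1000.0  # assumed cost of each stage for the whole message
SETUP = 10.0   # assumed fixed overhead per segment per stage

times = {n: pipelined_time(n, WORK, SETUP) for n in range(1, 101)}
best_n = min(times, key=times.get)

for n in (1, 2, best_n, 50, 100):
    print(f"{n:>3} segments: {times[n]:7.1f}")
print(f"best segment count in this model: {best_n}")
```

In this model the total falls quickly at first, bottoms out near sqrt(WORK / SETUP) segments, and then climbs again as overhead dominates, so "more segments is better" only holds when the setup cost is negligible.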

> The metric I mentioned earlier "degree of overlapping" with some 
> additional analysis can help designers _predict_ whether the design is 
> good or not and whether it will work well or not on a particular system 
> of interest (including the MPI library).

Temporal dependency between buffers and computation is the metric for
overlapping. The longer you don't need a buffer, the better you can
overlap a communication to/from it. Compilers could know that.

> This is however too much detail for this forum though, as most of the 
> postings here discuss much more practical issues :)

I am bored with cooling questions. However, it's quite time consuming to 
argue by email. I don't know how RGB can keep the distance :-)

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From ashley at quadrics.com  Wed Feb 16 03:26:55 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed, 16 Feb 2005 11:26:55 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050216080525.GA3122@greglaptop.attbi.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de>
	<20050216080525.GA3122@greglaptop.attbi.com>
Message-ID: <1108553215.14604.9.camel@localhost.localdomain>

On Wed, 2005-02-16 at 00:05 -0800, Greg Lindahl wrote:
> Quadrics STEN (forgive me for classing this as
> dumb, I happen to think dumb is a compliment...) get this right.

In this context the STEN is used on the transmit side of the network as
a way of effectively doing PIO writes directly into the network.  On the
receive side the NIC is anything but dumb and does the MPI tag matching.
It almost entirely bypasses the CPU, leaving it free to do *whatever
the application desires*.

Interestingly enough, the STEN is a very good example of what is being
discussed here: doing a remote write (or MPI send) using the STEN is
lower latency than using a DMA but uses more CPU cycles (as the STEN
needs the data to be "pushed" from the main CPU whereas a (R)DMA only
needs the DMA descriptor to be "pushed" and the NIC then "pulls" the
actual data).

Ashley,


From patrick at myri.com  Wed Feb 16 03:53:43 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 06:53:43 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050215130656.8572F1C818@amd64.cownie.net>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<1108398183.8243.54.camel@localhost.localdomain>
	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>
	<1108402962.8265.25.camel@localhost.localdomain>
	<4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com>
	<20050215130656.8572F1C818@amd64.cownie.net>
Message-ID: <42133447.9050207@myri.com>

Hi James,

James Cownie wrote:
> As someone who was on the MPI Forum, and sat through an awful lot of
> meetings, I'd like to provide some justification for _why_ we didn't try
> to make a binary standard.

No, I imagine the context was very different 10 years ago. I just don't 
understand why dynamic spawning, one-sided communications and MPI-IO 
were added to the Standard, but nobody wanted to address the mpi.h 
header compatibility issue. By that time, people knew that it was a
problem, no?

 > You seem to think (maybe subconsciously) that the MPI forum added
 > features to the standard just to make life hard for implementors and
 > to kill performance ;-)

Well, it was the right thing to be as exhaustive as possible to ensure
the wide adoption of the standard. It was expert friendly, but easy for
the application folks to miss the points or take shortcuts. That's the
cost of success.

Now, I would hate to see a shared memory paradigm emerge to 
progressively replace MPI because existing applications don't really try 
to leverage the message passing paradigm capabilities. Some believe it 
will never happen, I am not so sure.

> If you _really_ believe that there is so much performance benefit for
> your customers in having an MPI-light with the restrictions you outlined
> which only runs on your hardware, then no-one's stopping you from
> providing it. 

This discussion is a beginning. It will only happen if all/most MPI
implementors reach a point where it's clear that to move forward, some
semantics have to be avoided and some ambiguities cleared, and that can
only be done at the API level. I would prefer that the MPI forum focus
on improving the core message passing functionalities instead of adding
yet another vertical dimension (what's left for MPI-3 ?).

The urgent thing however is the ABI. Can we do that ?

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From ashley at quadrics.com  Wed Feb 16 03:55:37 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed, 16 Feb 2005 11:55:37 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <421306B5.3080200@myri.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4211C559.8070100@myri.com>
	<1108479093.4587.132.camel@s861954.sandia.gov>
	<421306B5.3080200@myri.com>
Message-ID: <1108554937.14604.17.camel@localhost.localdomain>

On Wed, 2005-02-16 at 03:39 -0500, Patrick Geoffray wrote:


> In general yes, more opportunities for optimization is better. Now, 
> assuming that irregular datatypes can be optimized as much as regular 
> ones is wrong. The hardware can gather/scatter better than the 
> application for nice long strides.

>  However, MPI libs should print 
> insults when tiny segments are used (when the scatter/gather efficiency 
> collapse). The developer assumes that's it's fine because he does not 
> know or he does not care.

I have seen code that used a multi-megabyte array of 64-bit float/short
pairs, effectively having 10 bytes of data and 6 bytes of "space" in each
element. Changing this to a 64-bit float and two 32-bit ints removed the
void space and replaced it with deliberate zero data.  The "data
transferred" went up, application buffer sizes remained the same and
performance was a whole lot better.  The application writer had used a
short to "save space" and was somewhat stunned at the performance
improvement.
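
The padding in such a float/short pair is easy to see with Python's ctypes, which lays structures out the way the platform's C compiler would. The field names here are invented for illustration, and the exact amount of padding depends on the platform's alignment rules.

```python
# Show the padding hole in a double/short pair versus a hole-free layout.
# Field names are hypothetical; padding depends on platform alignment.
import ctypes


class FloatShort(ctypes.Structure):
    """A 64-bit float paired with a 16-bit short."""
    _fields_ = [("value", ctypes.c_double),
                ("tag", ctypes.c_short)]


class FloatTwoInts(ctypes.Structure):
    """The replacement layout: a 64-bit float and two 32-bit ints."""
    _fields_ = [("value", ctypes.c_double),
                ("tag", ctypes.c_int32),
                ("zero", ctypes.c_int32)]


payload = ctypes.sizeof(ctypes.c_double) + ctypes.sizeof(ctypes.c_short)
padded_size = ctypes.sizeof(FloatShort)
packed_size = ctypes.sizeof(FloatTwoInts)

print(f"float/short:   {payload} bytes of data in a {padded_size}-byte element")
print(f"float/int/int: {packed_size} bytes, all of it defined data")
```

An MPI datatype describing the first layout has to skip the hole in every element, whereas the second layout is one contiguous block, which is consistent with the improvement described above.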

This is a situation that would be best avoided, maybe user education is
the key but it's a common problem and there are an awful lot of users.
I'm not against complex datatypes on MPI but they are hard to deal with
and do get mis-used.

Ashley,


From patrick at myri.com  Wed Feb 16 04:04:31 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 16 Feb 2005 07:04:31 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108553215.14604.9.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>	<4211B891.6020406@ccrl-nece.de>	<20050216080525.GA3122@greglaptop.attbi.com>
	<1108553215.14604.9.camel@localhost.localdomain>
Message-ID: <421336CF.5020505@myri.com>

Ashley Pittman wrote:

> Interesting enough the STEN is a very good example of what is being
> discussed here, doing a remote write (Or MPI send) using the STEN is
> lower latency than using a DMA but uses more CPU cycles (as the STEN
> needs the data to be "pushed" from the main CPU whereas a (R)DMA only
> needs the DMA descriptor to be "pushed" and the NIC then "pulls" the
> actual data).

It seems to be common practice to use PIO for small messages on the send 
side. MX/Myrinet does that too (whereas GM/Myrinet does not), SCI does 
it, Greg's IB on HT does it. I don't know who is not burning some cycles 
to get lower latency for small messages these days.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From rgb at phy.duke.edu  Wed Feb 16 04:17:10 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 16 Feb 2005 07:17:10 -0500 (EST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <42132E40.1060001@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com>
	<4212182C.60607@verarisoft.com> <42132E40.1060001@myri.com>
Message-ID: <Pine.LNX.4.58.0502160659250.3848@lilith.rgb.private.net>

On Wed, 16 Feb 2005, Patrick Geoffray wrote:

> > This is however too much detail for this forum though, as most of the 
> > postings here discuss much more practical issues :)
> 
> I am bored with cooling questions. However, it's quite time consuming to 
> argue by email. I don't know how RGB can keep the distance :-)
> 
> Patrick
> 

I stuck a hairpin into an electrical socket at age 2 (an "enlightening"
experience I must say) and had a large rock fall on my head from a
height of almost a meter at age 8.

Since then, I hardly ever get bored with cooling questions, because I
cannot remember that they've been asked.  What were we talking about,
again?

Oh yeah, MPI and all that.

I've actually been enjoying reading the discussion and not
participating, since I'm a PVM kinda guy.  But SINCE my name was invoked
in vain, I'll make a single comment on the code quality issue, which is
that underlying the discussion of communication pattern, blocking vs
non-blocking, and directives are the fundamental scaling properties of
the code and algorithm itself.  So on the issue of whether MPI sucks
because the application sucks -- well, possibly, but it seems more
likely that the application sucks because its parallel scaling
properties (with the algorithm chosen) suck.

As to how "intelligent" the back end library should be at choosing
algorithm -- I would say the BASIC library should be atomic, elementary,
NOT algorithm level stuff.  A thin skin on top of raw networking calls
that provides the various things one always has to do oneself but not
much more.  Where one gets into trouble is where one uses a command that
has a complex structure that doesn't fit your code without realizing it,
and the reason you don't realize it is because all that detail is
hidden, and isn't even uniform in RELATIVE performance across varying
network hardware.

In other words, to make MPI do more, either make it do less (in the form
of commands that can be used to build "more" in a manner that is tuned
to application and hardware) or be prepared to REALLY make it SMART
behind the scenes.

This isn't just MPI, BTW.  PVM suffers from the same thing.  I honestly
think that both are limited tools in part BECAUSE they put too thick a
skin between the programmer and the network.  If you want real
performance and complete control over communication algorithm, you
probably have to use raw/low level networking commands, and write the
appropriate "collective" operations for your particular application and
hardware.

Of course nobody does this -- not portable and a PITA to
design/write/maintain.  Or perhaps a few people DO do this, but they're
programming gods.  And this isn't crazy, really.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From mathog at mendel.bio.caltech.edu  Wed Feb 16 08:16:25 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed, 16 Feb 2005 08:16:25 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>

In most universities services like electricity, water, and 
A/C are paid for by the school.  To do so they take "overhead"
out of every grant.  Partially as a consequence of this they
typically have a very poor ability to meter usage on a room
by room basis.

Now somewhere between the 10 node Pentium II beowulf sitting on
a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
room that takes up half the basement the cost of the electricity
(both for power and A/C) goes from a minor expense to a major
one.  Really major. For instance, in that hypothetical large machine,
at 10 cents per kilowatt hour (a round number), assuming 100 watts
per CPU (another round number) that's:

  1000  (nodes) *
     2  (cpus/node) *
     .1 (kilowatts/cpu) *
     .1 (dollars/kilowatt-hour) *
  365   (days /year) *
   24   (hours/day) =
-----------------------
  175200 dollars/year
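
The same arithmetic as a small function, so that other cluster sizes or electricity rates can be plugged in. The function and parameter names are illustrative, and cooling is not included.

```python
# Annual electricity cost for the compute nodes alone (no A/C).
# Function and parameter names are illustrative.

def annual_power_cost(nodes: int, cpus_per_node: int,
                      kw_per_cpu: float, dollars_per_kwh: float) -> float:
    """Dollars per year for CPUs running flat out all year."""
    hours_per_year = 365 * 24
    total_kw = nodes * cpus_per_node * kw_per_cpu
    return total_kw * dollars_per_kwh * hours_per_year


cost = annual_power_cost(nodes=1000, cpus_per_node=2,
                         kw_per_cpu=0.1, dollars_per_kwh=0.1)
print(f"{cost:,.0f} dollars/year")
```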

The A/C expense is going to vary tremendously depending upon
the outside temperature.  It's going to be much higher for us
in Southern California than for a site in Anchorage.

"Typical" lab usage is widely variable but I'd be amazed
if most biology or chemistry labs burn through even 1/10th this
much for the equivalent lab area.  Some physics lab running
a tokamak might come close.


Anyway, the question is, have any of the universities said "enough
is enough" and started charging these electricity costs directly?
If so, what did they use for a cutover level, where usage was
"above and beyond" overhead?

From an economic perspective having electricity and A/C come out
of overhead (without limit) grossly distorts the true cost
of the project over time and can lead to choices which increase
the total overall cost. For instance, the use of Xeons instead of
Opterons has little effect on TCO if somebody else is picking
up the electricity tab, but could change the power consumption
significantly on a large project.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From rgb at phy.duke.edu  Wed Feb 16 09:22:35 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 16 Feb 2005 12:22:35 -0500 (EST)
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>
References: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502161146060.883@ganesh.phy.duke.edu>

On Wed, 16 Feb 2005, David Mathog wrote:

> In most universities services like electricity, water, and 
> A/C are paid for by the school.  To do so they take "overhead"
> out of every grant.  Partially as a consequence of this they
> typically have a very poor ability to meter usage on a room
> by room basis.
> 
> Now somewhere between the 10 node Pentium II beowulf sitting on
> a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
> room that takes up half the basement the cost of the electricity
> (both for power and A/C) goes from  a minor expense to a major
> one.  Really major. For instance, in that hypothetical large machine,
> at 10 cents per kilowatt hour (a round number), assuming 100 watts
> per CPU (another round number) that's:
> 
>   1000  (nodes) *
>      2  (cpus/node) *
>      .1 (kilowatts/cpu) *
>      .1 (dollars/kilowatt-hour) *
>   365   (days /year) *
>    24   (hours/day) =
> -----------------------
>   175200 dollars/year

I usually assume $1/watt/year (including AC) which is likely to be good
within 20% or so depending on the actual cost of electricity in your
area and amount of AC required on a seasonally averaged basis.  That
yields an estimate of $200K in your example -- not really different,
just easier to do in your head as a round number.
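
A rough sketch of how the rule of thumb relates to the itemized figure,
assuming the 10 cents per kilowatt-hour rate from the quoted example.
The variable names are illustrative, and the "implied A/C share" is
simply whatever is left over once power alone is subtracted from
$1/watt/year; it is not a measured number.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def dollars_per_watt_year(dollars_per_kwh):
    """Cost of running one watt continuously for a year, power only."""
    return dollars_per_kwh * HOURS_PER_YEAR / 1000.0


power_only = dollars_per_watt_year(0.10)      # $/watt/year without A/C
rule_of_thumb = 1.00                          # $/watt/year, A/C included
implied_ac_share = rule_of_thumb / power_only - 1.0

watts = 1000 * 2 * 100                        # the 1000 node dual CPU example
print(f"power only:    ${power_only * watts:,.0f}/year")
print(f"rule of thumb: ${rule_of_thumb * watts:,.0f}/year")
print(f"implied A/C overhead on top of power: {implied_ac_share:.0%}")
```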

> 
> The A/C expense is going to vary tremendously depending upon
> the outside temperature.  It's going to be much higher for us
> in Southern California than for a site in Anchorage.
> 
> "Typical" lab usage is widely variable but I'd be amazed
> if most biology or chemistry labs burn through even 1/10th this
> much for the equivalent lab area.  Some physics lab running
> a tokamak might come close.
> 
> 
> Anyway, the question is, have any of the universities said "enough
> is enough" and started charging these electricity costs directly?
> If so, what did they use for a cutover level, where usage was
> "above and beyond" overhead?

This issue has most definitely come up at Duke, although we're still
seeking a formula that will permit us to deal with it equitably.  This
is only one of several pieces of overhead associated with clusters that
go above and beyond the assumptions that went in to the original
indirect costs formulas.  For example, Duke now charges grants a
"recycling fee" for certain pieces of environmentally toxic end-of-life
hardware (e.g.  monitors, with their lead-filled screens).  Then there
are the really HUGE costs for physical space renovations as valuable and
scarce campus space is converted for use in the burgeoning clusters.

As our Dean of A&S recently remarked, if there aren't any checks and
balances or cost-equity in funding and installing clusters, they may
well continue to grow nearly exponentially, without bound (Duke's
cluster population is doubling almost according to Moore's Law -- every
couple of years).  Costs associated with those clusters from the space
to hold them, the power to run them, and the people to operate them, all
grow roughly linearly with the number of nodes.  This much is known.

What isn't known is the details of the income stream.  Each cluster (or
part of a cluster) is typically connected with a specific grant-funded
project and its associated income stream.  Indirect costs >>are<<
assessed on those grants; it may be that on average, enough income comes
from those indirect costs to easily support the clusters.  This isn't
crazy -- it is really a question of just what the ratio of supported
people and other IC-producing expenses is to the number of cluster
nodes associated with the research.  I wish I knew this number -- it
would be very useful in a CWM column;-) -- but I don't, and last I heard
Duke still didn't know either, although they are perhaps moving slowly
towards expending the energy required to find out.  

Finding out isn't trivial -- it involves running down ALL the clusters
on campus, figuring out whom ALL those nodes "belong" to, determining
ALL the grant support associated with all those people and projects and
clusters (since even research done without a cluster by a person who
runs a cluster has to be considered as contributing, as the cluster may
be "essential" to retaining that person), figuring out what the sum of
the indirect costs are on all those grants, and finally connecting that
total to the estimated cost of running all the nodes.  By enabling more
research projects, postdocs, laboratory operations, and other
grant-funded activity to occur their presence on campus might MAKE the
university money, who knows?

Indirect cost formulas actually tend to EXCLUDE capital equipment such
as clusters.  If they didn't, the University would have made something on
the order of 50% indirect costs on the roughly $2M the hardware in your
example above would cost, and out of the resulting $1M (noting that the
total grant would have had to be $3M for the hardware alone) plus
overhead on the salary of the 2-3 people likely to be hired to run the
1000 node cluster, they could have easily paid for power for 3-5 years.
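
The back-of-envelope numbers in that paragraph, as a sketch.  The
figures ($2M of hardware, a 50% indirect rate, $200K/year for power and
A/C) are the rough ones used in this thread, not actual Duke rates, and
salary overhead is left out.

```python
hardware_cost = 2_000_000      # dollars, rough hardware figure from the text
indirect_rate = 0.50           # "on the order of 50%" indirect costs
power_per_year = 200_000       # dollars/year at $1/watt/year

indirect_income = hardware_cost * indirect_rate   # what the University keeps
total_grant = hardware_cost + indirect_income     # what the agency would pay
years_of_power = indirect_income / power_per_year

print(f"indirect income: ${indirect_income:,.0f}")
print(f"total grant:     ${total_grant:,.0f}")
print(f"years of power covered: {years_of_power:.1f}")
```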

So one proposal is to no longer exclude clusters from indirect cost
assessments.  Of course this "solution" creates another problem just as
big -- will granting agencies stand for this?  There is a reason
indirect costs aren't charged on capital equipment and it isn't because
Universities don't WANT to charge them, it is because many granting
agencies flatly refuse to pay them.  Some do -- IIRC, NIH is pretty
tolerant about indirect costs associated with hardware, probably because
in medical research they "expect" to have to support entire labs as
there is less likelihood of having a teaching stream of income to
partially defray the costs.  NSF does not, and I don't believe DoD
or DOE grants like to either.

Another is to just force clusters to budget and pay their own utility
bills.  I don't know how this would fly with grant agencies.  They might
be irritated if they had to pay for both the utilities and for indirect
costs on the utility money (basically paying 1.5x or so of the cost of
the power/AC used, so that the University would actually make another
$100K in overhead in your example above), but they might hold still for
the $200K/year for power alone.  They almost certainly WOULD pay for
utilities for clusters in places other than Universities, so this isn't
so big a jump.

> >From an economic perspective having electricity and A/C come out
> of overhead (without limit) grossly distorts the true cost
> of the project over time and can lead to choices which increase
> the total overall cost. For instance, the use of Xeons instead of
> Opterons has little effect on TCO if somebody else is picking
> up the electricity tab, but could change the power consumption
> significantly on a large project.

Absolutely.  Or, using shelved tower units vs 2U rackmounts vs 1U
rackmount nodes, when space is "scarce" and hence expensive.  Or
requiring each node to have remote management hardware, PXE network
cards, 3 year onsite service plans -- all of these choices will be very
differently made depending on how the chooser is constrained and who is
paying for what.

I don't have a really perfect solution to this dilemma, and indeed I
think it is a bit premature to expect one.  When SOME institution does a
real CBA on the total cash flow associated with grant-funded
cluster-based research projects, including the more esoteric benefits
such as "institutional prestige" (which is serious business, don't
forget -- a weight factor that affects ALL grants submitted from an
institution) perhaps we can start to think about which clever idea for
recovering costs is realistic and fair.  In the meantime, budgets of the
groups that actually pay these costs continue to get a wee bit strained
as the number of nodes and associated costs continue to spiral upward.

Maybe I'll do a column on this soon.  I did a whole article on
infrastructure for Linux Mag a year or two ago, but the particular
aspect of infrastructure that you raise is still unresolved.  I wonder
if I could get Duke people to expedite collecting and assembling the
data required to get the big picture on this...?

   rgb

> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From James.P.Lux at jpl.nasa.gov  Wed Feb 16 10:56:03 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed, 16 Feb 2005 10:56:03 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <Pine.LNX.4.58.0502161146060.883@ganesh.phy.duke.edu>
References: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>
	<Pine.LNX.4.58.0502161146060.883@ganesh.phy.duke.edu>
Message-ID: <6.1.1.1.2.20050216104144.07664e40@mail.jpl.nasa.gov>

At 09:22 AM 2/16/2005, Robert G. Brown wrote:
>On Wed, 16 Feb 2005, David Mathog wrote:
>
> > In most universities services like electricity, water, and
> > A/C are paid for by the school.  To do so they take "overhead"
> > out of every grant.  Partially as a consequence of this they
> > typically have a very poor ability to meter usage on a room
> > by room basis.
> >
> >
>I don't have a really perfect solution to this dilemma, and indeed I
>think it is a bit premature to expect one.  When SOME institution does a
>real CBA on the total cash flow associated with grant-funded
>cluster-based research projects, including the more esoteric benefits
>such as "institutional prestige" (which is serious business, don't
>forget -- a weight factor that affects ALL grants submitted from an
>institution) perhaps we can start to think about which clever idea for
>recovering costs is realistic and fair.  In the meantime, budgets of the
>groups that actually pay these costs continue to get a wee bit strained
>as the number of nodes and associated costs continue to spiral upward.
>


Such issues come up ALL the time in any government funded research.  And, 
the more government oversight, the more data you have to collect on such 
"burden" and "overhead".  An extreme might be a Defense Department (or 
NASA) Cost Reimbursement type contract (aka Cost Plus).  Note well: there 
are NO government contracts that are cost plus percentage of cost; they're 
illegal.  The fee amount is fixed, or based on award criteria, but does 
not depend on the amount spent, except perhaps in a negative fashion 
(bust a spending cap, and your award/incentive fee gets smaller).

In such cases, the funding source is VERY interested in just how you 
calculated "cost", and therein lies much accounting. There's a sort of 
pendulum swing back and forth for certain types of costs (and 
management philosophies).  Do you count telephone service as an overall 
burden (raising your "overhead" percentage, but reducing the project's 
"Other direct costs (ODC)") or do you charge back the project for the cost 
of the phone line, plus usage, plus some management "tax"?  The latter 
reduces your overhead percentage, but increases the "direct costs".  Same 
dollars flow either way, but in the latter case you WILL spend more time 
accounting for the other direct costs.  I suppose that in academia, the 
grantee might be sheltered a bit by the institutional processes, but in 
most other environments, it's been a reality for a long time.
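
A toy illustration of "same dollars flow either way".  Every number
here is made up for the example (labor cost, phone cost, a 50% base
overhead rate applied to labor), and the management "tax" is ignored;
real burden calculations are considerably messier.

```python
def project_bill(labor, other_direct, overhead_rate):
    """Total billed: labor, other direct costs, and overhead on labor."""
    return labor + other_direct + labor * overhead_rate


labor = 100_000.0     # hypothetical direct labor on the project
phone = 2_000.0       # hypothetical yearly phone cost for the project

# Option 1: phone service is folded into the overhead pool, so the
# overhead percentage goes up and the direct costs stay lower.
pooled_rate = 0.50 + phone / labor
bill_pooled = project_bill(labor, 0.0, pooled_rate)

# Option 2: phone service is charged back as an other direct cost (ODC),
# so the overhead percentage stays lower and the direct costs go up.
bill_direct = project_bill(labor, phone, 0.50)

print(f"pooled:  rate {pooled_rate:.0%}, bill ${bill_pooled:,.0f}")
print(f"direct:  rate {0.50:.0%}, bill ${bill_direct:,.0f}")
```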

Different companies have different philosophies on the approach, and either 
works, and will generally pass muster with the auditors.  It does make 
evaluating proposals a bit trickier.

Taken to an extreme, we have the health care industry approach of "code and 
cost every item", so that the acetaminophen they give you after delivering 
a baby or having your gall bladder removed shows up on the bill as 
"Dispense acetaminophen, 2 tablets at 100mg" and "Administer acetaminophen, 
100mg", each with separate charges near $10.  Sadly, that $10 probably is a 
realistic cost, too, considering that some non-zero amount of time was 
spent to enter the transactions into a database, requiring the use of 
trained "medical coders" who know the procedure codes for everything, as 
well as the capital and operating costs of the terminal and computer 
they're using.

I'm sure that clusters in industry face the question of Cost/Benefit 
analysis, including infrastructure impact.  Certainly this is the case for 
desktop PCs and mainframes in at least one industry where my wife is employed.

Questions such as David raised are only going to become more and more 
common as the drive for "accountability" increases.  Even within government 
agencies, such as NASA, the drive for "Full Cost Accounting" (which 
essentially imposes the same controls that have always been imposed on 
vendors on cost reimbursement contracts) is causing great pain, not because 
the costs actually change, but because it is a huge cultural and mental 
shift in how one plans one's work.


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From rgb at phy.duke.edu  Wed Feb 16 12:03:16 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 16 Feb 2005 15:03:16 -0500 (EST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
Message-ID: <Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>

On Wed, 16 Feb 2005, Vincent Diepeveen wrote:

> It is possible for algorithms to have sequential properties, in short
> making it hard to scale well. Game tree search happens to have a few of
> such algorithms, from which one is performing superior with a number of
> enhancements having the same property that a faster blocked get latency
> speeds it up exponential.
> 
> For the basic idea why there is an exponential speedup see Knuth and search
> for the algorithm 'alfabeta'.
> 
> So the assumption that an algorithm sucks because it doesn't need bandwidth
> but latency is like sticking a hairpin in an electrical socket.
> 
> If users would JUST need a little bit of bandwidth they already can get
> quite far with $40 cards. 
> 
> So optimizing MPI for low latency small messages IMHO is very relevant. 

I sort of agree, but I think you miss my point.  Or maybe we totally
agree but I misunderstand.  Some algorithms will, as you note, have
sequential communications times associated with them that scale with the
number of nodes.  BOTH bandwidth AND/OR latency can be important in
minimizing those and other times in the algorithms, and which one is
important can even change during the course of the computation.  If I
have an algorithm where lots of small messages have to go to lots of
places in some order, latency becomes important.  If I have an algorithm
where lots of BIG messages have to go lots of places in some order,
bandwidth becomes important.  There is plenty of room in between the
extremes, order might or might not matter, resource contention might or
might not be an issue, and there is plenty of opportunity for both big
messages and small messages to be sent within a single application.
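
One common way to make the latency/bandwidth distinction concrete is a
linear cost model: time = latency + size/bandwidth.  The sketch below
uses that model with a hypothetical link (50 microseconds, 100 MB/s);
neither number comes from this thread, and real networks are not this
tidy.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear model: a fixed latency plus size divided by bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def crossover_bytes(latency_s, bandwidth_bytes_per_s):
    """Message size at which the latency and bandwidth terms are equal."""
    return latency_s * bandwidth_bytes_per_s


LAT, BW = 50e-6, 100e6    # hypothetical: 50 us latency, 100 MB/s

print(f"crossover near {crossover_bytes(LAT, BW):.0f} bytes")
for size in (64, 5_000, 1_000_000):
    t = transfer_time(size, LAT, BW)
    print(f"{size:>9} bytes: {t * 1e6:10.1f} us, "
          f"latency is {LAT / t:.0%} of the total")
```

Messages well below the crossover size are latency dominated; messages
well above it are bandwidth dominated.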

Optimizing any particular MPI (or PVM) command for either extreme is
then like robbing Peter to pay Paul, when Peter and Paul are a single
bicephalic individual that has to pay protection money to the mob for
every theft transaction (oh how I just LOVE to fold, spindle and
mutilate metaphors).  

To illustrate some of the complexity issues and how interesting Life can
be, consider the notion of a "broadcast".  Node A wants to send the same
message to N other nodes.  Fine fine fine, happens all the time in
certain kinds of parallel code.  Node A therefore uses one of the
collective operations (such as pvm_bcast or pvm_mcast in PVM, which is
where I have more experience).

Now, just what happens when this code is executed?  PVM in this case
promises that it will be a non-blocking send -- execution in the current
thread on A resumes as soon as the "message is safely on its way to the
receiving processors".  It of course cannot promise that those
processors will receive them, so "safely" is a matter of opinion.  It
also completely hides the details of how this >>difficult<< task is
implemented, and whether or not it varies depending on the hardware or
code context.  To find out what it really does you must Use the (pvm)
Source, Luke! and that is Not Easy because the actual thing it does is
hidden beneath all sorts of lower level primitives.

For example, the pvmd might send the message to each node in the
receiving list one at a time, essentially serializing the entire message
transmission (without blocking the caller, sure, but serial time is
serial time).  Or is it?  If RDMA is used to deliver the message at the
hardware device level so not even the pvmd is blocked for the full time
of delivery, maybe not.  Or, if the network device supports it, it might
use an actual network broadcast or multicast.  Then how efficiently it
proceeds depends on whether and how broadcasts are supported by e.g. the
kernel and intermediate network hubs.  It could be anything from truly
parallelized, so it takes a single latency hit to put the message on N
lines (as an old ethernet repeater broadcast would probably do) to a de
facto serialized latency (possibly with a much lower latency) hit as a
store and forward switch stores and then forwards to each line in turn.
If that's what they do these days -- it might even vary ethernet switch
to switch.  Myrinet, FC, SCI all might (and probably do) do
>>different<< things when one tries to optimally implement a
"broadcast".

Or PVM might refuse to second guess the hardware at all.  Instead it
might use some sort of tree, and send the message (serially) to only
some of the hosts in the receive list who both accept the message and
forward the message to others so (perhaps) you send to four hosts, the
first of these sends to 3 more while you finish, the second of these
sends to two more while you finish, the third to one while you finish,
and where each of THESE recipients find still more to send to so that
you cover a lot more than four hosts in four latency hits (at the
expense of involving all the intermediary systems heavily in delivery,
which may or may not delay the threads running on those nodes depending
very much on the HARDWARE).
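
The forwarding pattern described above can be simulated in a few lines.
This is only a sketch of the idea (every host that already has the
message sends to one new host per round, one round per latency hit); it
is not a description of what pvmd actually does, and the function name
is made up.

```python
def tree_broadcast_rounds(n_receivers):
    """Simulate a tree broadcast; return (rounds, sends).

    Each round, every host that already holds the message sends it to
    one host that does not.  sends is a list of (round, sender, receiver).
    """
    have = [0]                                  # host 0 is the sender
    waiting = list(range(1, n_receivers + 1))   # hosts still to be reached
    sends = []
    rounds = 0
    while waiting:
        rounds += 1
        newly_reached = []
        for sender in have:
            if not waiting:
                break
            receiver = waiting.pop(0)
            sends.append((rounds, sender, receiver))
            newly_reached.append(receiver)
        have.extend(newly_reached)              # they forward next round
    return rounds, sends


rounds, sends = tree_broadcast_rounds(15)
root_sends = [s for s in sends if s[1] == 0]
first_recipient_sends = [s for s in sends if s[1] == 1]
print(f"{rounds} rounds to reach 15 hosts; a serial send needs 15")
print(f"root sent {len(root_sends)} times, "
      f"its first recipient forwarded {len(first_recipient_sends)} times")
```

In this idealized model the number of hosts holding the message doubles
each round, which is where the "lot more than four hosts in four
latency hits" comes from.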

Which of these it SHOULD do very likely depends strongly -- very
strongly -- on some arcane mix of the task itself and its organization
and communications requirements, the PARTICULAR size of the object being
sent THIS time, the networking hardware in use, the number of CPUs on
the node and number of CPUs on the node that are running a task thread,
and (often forgotten) the global communication pattern -- whether this
particular transmission is a one-way master-slave sort of thing or just
one piece of some sort of global message exchange where all the nodes in
the group are going to be adding their own messages to this stream while
avoiding collisions.

The programmer sees none of this, so they think of none of this.  They
think "Gee, a broadcast.  That's an easy way to send a message to many
hosts with one send command.  Great!"  They think "this will cost just
one latency hit to deliver the message to all N hosts" because that's
what "broadcast" >>means<<, as they well know from watching TV and
listening to the radio at the same time as a zillion other humans, in
parallel.  They may not (with PVM or MPI) even know what a "collision"
or tree >>is<<, as there is no prior assumption of knowledge about
physical networking or network algorithms in the learning or application
of the toolsets.

This is the Dark Side of APIs that hide detail and wrap it up in
powerful black-box commands.  They make the programming relatively easy,
but they also keep you from coming to grips with all those little
details upon which performance and sometimes even robustness or
functionality depend.

This is the "problem" I alluded to in the previous note.  To really do
the right thing for the user (presuming there IS such a thing as a
universally right thing to do or even hoping to do the right thing
nearly all the time) one either needs to write a large set of highly
differentiated and differently optimized commands -- not just pvm_bcast,
but pvm_bcast (presuming low latency hardware broadcast exists and is
efficient), pvm_bcast_tree (presuming that NO efficient hardware
broadcast exists, and possibly including additional
arguments to spell out some things about the tree), pvm_bcast_tree_join
(presuming that you need a tree and that each branch will both take
something off a message passing through as a leaf and add back a
message to join the transmission to the next leaf as a root),
pvm_bcast_rr (round robin broadcasts that are optimally synchronized for
no collisions between "simultaneously" broadcasting hosts), and
pvm_bcast_for_other_special_cases for the ones I've forgotten or don't
know about, and maybe double or treble the entire stack or add flags to
be optimal for latency dominated patterns, bandwidth dominated patterns,
or somewhere in between.

Alternatively, one can make just one pvm_bcast command, but put God's
Own AI into it.  Make smart decisions inside the hardware-aware daemons
that automatically switch between all of the above and more, possibly
dynamically during the course of a computation, to minimize delivery
time and the load of all systems participating in the delivery.  Hope
that you do a good enough job of it that the result is still robust and
doesn't constantly hang and crash when assumptions you built into it
turned out to be incorrect or your "AI" turns out to be rather stupid.
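
For what the "one command, decide inside" alternative might look like
in its most trivial form, here is a sketch.  Everything in it is
hypothetical: the function name, the strategy labels, the 1024 byte
threshold, and the idea that two coarse facts are enough to decide.  A
real implementation would need far more information, which is exactly
the difficulty.

```python
def choose_bcast_strategy(n_receivers, msg_bytes, hw_broadcast=False,
                          small_msg_bytes=1024):
    """Return (strategy, regime) from a few coarse facts.

    The thresholds and labels are illustrative assumptions only.
    """
    # Small messages are treated as latency dominated, large ones as
    # bandwidth dominated.
    regime = "latency" if msg_bytes <= small_msg_bytes else "bandwidth"
    if hw_broadcast:
        return ("hardware", regime)   # let the network reach everyone
    if n_receivers <= 2:
        return ("serial", regime)     # a tree saves no rounds here
    return ("tree", regime)           # cut the number of rounds


print(choose_bcast_strategy(64, 128))
print(choose_bcast_strategy(64, 1_000_000, hw_broadcast=True))
print(choose_bcast_strategy(2, 128))
```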

All of this is just the opposite from the problems you encounter if you
program at the raw socket (or other hardware interface) level.  There
you have to work very hard to achieve the simplest of tasks -- open up a
socket to a remote host, establish bidirectional ports, work out some
sort of reliable message transmission sequence (which is nontrivial to
do if you work with the lowest level and hence fastest networking
commands, because communications is fundamentally asynchronous and
unreliable and thus simple read/write commands do not suffice).

However, once you get to where you CAN talk between nodes with sockets,
you are forced to confront the communication and computational topology
questions and the various "special" capabilities of the hardware head
on.  No black boxes.  You want a tree, you get a tree, but YOU have to
program the tree and decide what to do if a leaf dies in mid-delivery,
etc.  You want to synchronize communications, you go right ahead but be
prepared to figure out how to communicate synchronization information
out of band.  You want non-blocking communications, set the appropriate
ioctl or fcntl, if you know how for your particular "file" or hardware.
Learn the select call.
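
A minimal sketch of the non-blocking and select pieces, written in
Python rather than C for brevity.  A local socketpair stands in for two
nodes, which is an assumption made only to keep the example
self-contained; the calls used (setblocking, select.select) are the
standard library wrappers around the fcntl/ioctl and select mechanisms
mentioned above.

```python
import select
import socket

# A connected pair of sockets stands in for two nodes on a network.
a, b = socket.socketpair()
b.setblocking(False)            # reads on b now fail instead of waiting

try:
    b.recv(16)                  # nothing has been sent yet
    would_block = False
except BlockingIOError:
    would_block = True          # the non-blocking read returned at once

a.sendall(b"ping")

# select() waits (here for up to one second) until b is readable.
readable, _, _ = select.select([b], [], [], 1.0)
data = b.recv(16) if b in readable else b""

print(would_block, data)

a.close()
b.close()
```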

Now things are totally controlled, but you have to be an experienced
network programmer (a.k.a. a network programming God) to do anything
complex.  Sleep with Stevens underneath your pillow, that sort of thing.
The big set, not just the single book version.  And if you're THAT good,
what are you doing working on a beowulf?  There are big money jobs out
there a-waitin', as there are for the other seventeen humans on the
planet with that kind of ability...;-)

What I think a lot of people (even experienced people) end up doing is
using PVM or MPI to mask out the annoying parts of raw networking --
maintaining a list of hosts in the cluster, dealing with the repetitive
parts of the network code that ensure reliable delivery, adding some
nice management tools to pass out-of-band information around for e.g.
process control and synchronization.  Then they use a relatively small
set of the available message passing commands, because they do not trust
the more advanced black box collectives.  

Usually this lack of trust has a basis -- they tried them in some
application or other and found that instead of speeding up, things got
slower or had unexpected side effects, and they had no way of knowing
what they actually DID, let alone how to fix them.  They have no access
to the real low level mechanism whereby those commands move data.

That's what I meant about making MPI or PVM more "atomic".  PVM has all
sorts of pack and unpack commands for a message that permit (I suppose)
typechecking and conversion to be done, where all I really want for most
communications is a single send command that takes a memory address and
a buffer length and a single receive command that takes a memory address
and a (maximum) length.  If I want to "pack" the buffer in some
particular way, that's what bcopy is for.  I don't want to "have" to
allocate and free it every time I use it, or use a command that very
likely allocates and frees a buffer I cannot see when I call it.  The
buffer might hold an int or int array, a double matrix, a struct, a
vector of structs -- who cares?  Pointers at both ends can unambiguously
manage the typing and alignment -- I'm the programmer, and this is what
I do.
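
A sketch of that kind of "address plus length" primitive, in Python for
brevity.  The names send_buf, recv_exact and recv_buf are made up, the
4-byte length prefix is one possible framing choice rather than
anything PVM or MPI does, and a local socketpair stands in for two
nodes.  The receiver reuses one preallocated buffer and no type
conversion is performed.

```python
import array
import socket
import struct


def send_buf(sock, buf):
    """Send a 4-byte length, then the raw bytes of buf, unconverted."""
    view = memoryview(buf).cast("B")
    sock.sendall(struct.pack("=I", len(view)))
    sock.sendall(view)


def recv_exact(sock, view):
    """Fill the writable byte view completely from the socket."""
    got = 0
    while got < len(view):
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("peer closed mid-message")
        got += n


def recv_buf(sock, buf):
    """Receive one message into the caller's buffer; return its length."""
    header = bytearray(4)
    recv_exact(sock, memoryview(header))
    (nbytes,) = struct.unpack("=I", header)
    view = memoryview(buf).cast("B")
    if nbytes > len(view):
        raise ValueError("message longer than receive buffer")
    recv_exact(sock, view[:nbytes])
    return nbytes


a, b = socket.socketpair()
out = array.array("d", [1.0, 2.5, -3.0])     # sender's doubles
inbox = array.array("d", [0.0] * 8)          # receiver's reusable buffer
send_buf(a, out)
n = recv_buf(b, inbox)
received = inbox[: n // inbox.itemsize].tolist()
print(n, received)
a.close()
b.close()
```

Both ends must agree on what the bytes mean, which is the programmer's
job in this scheme.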

With this much simpler structure one can at least think about
optimization as the problem is now much simpler.  A message is a block
of anonymous memory, period, with the programmer fully responsible for
aligning the send address or receive address on both ends with whatever
structure(s) or variable(s) are to be used there.  It is very definitely
less portable -- it leaves the user (or more likely a higher level
command set built on top of the primitives) with the hassle of having to
manage big-endian and little-endian issues as well as data size issues
if they use the message passer across/between incompatible systems.
These issues, however, were a lot more important in the past than they
are today, and they add a bunch of easily avoided overhead to the vast
majority of clusters where they aren't needed.

With a simple set of building blocks such as this, one could then
>>implement<< PVM or MPI or any other higher order, more portable
message passing API on top of it.  Indeed, I'd guess that this is very
much what PVM really does (don't know about MPI) -- the pvm_pack
routines very likely wrap up a malloc (of a struct), set metadata in
struct, bcopy into data region of struct, send struct, free struct, with
the inverse process on the other side driven/modified by the metadata
(such as endianness) as needed.  All sorts of tradeoffs of speed and
total memory footprint in that sequence, many of which are not always
necessary.  One could ALSO focus more energy on the higher
order/collective send routines, as one could write them INSIDE the low
level constructs provided so they become USER level software instead of
library black boxes.  With sources, modifying or tuning them would no
longer involve working with either raw sockets or a hidden set of
internals for the library itself.
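
To illustrate the layering, here is a sketch of a "pack" routine built
as ordinary user-level code on top of raw buffers.  The header layout
(a byte-order tag, a type code and a count) is invented for the example
and is not the actual PVM wire format, which I have not checked; the
point is only that metadata plus a copy is all such a layer needs.

```python
import struct
import sys

# Hypothetical header: 1 byte order tag ('<' or '>'), 1 byte type code,
# 4 byte element count, always in network byte order.
HEADER = struct.Struct("!ccI")


def pack_doubles(values, byteorder=None):
    """Wrap doubles in a self-describing message (metadata + raw data)."""
    if byteorder is None:
        byteorder = "<" if sys.byteorder == "little" else ">"
    header = HEADER.pack(byteorder.encode(), b"d", len(values))
    return header + struct.pack(f"{byteorder}{len(values)}d", *values)


def unpack_doubles(message):
    """Read the metadata, then decode data in the sender's byte order."""
    order, code, count = HEADER.unpack_from(message, 0)
    if code != b"d":
        raise ValueError("unsupported type code")
    fmt = f"{order.decode()}{count}d"
    return list(struct.unpack_from(fmt, message, HEADER.size))


big = pack_doubles([1.0, 2.0], ">")       # as a big-endian sender would
little = pack_doubles([1.0, 2.0], "<")    # as a little-endian sender would
print(unpack_doubles(big), unpack_doubles(little))
```

The resulting bytes could be handed to any raw send primitive; a
cluster of identical machines could skip this layer entirely.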

I'm not sure I'm making myself clear on all of this, and for lots of
programs I'm sure it doesn't matter, but for really bleeding edge
parallel performance and complex code I suspect that raw sockets (or the
equivalent device interface for non-IP-socketed devices) still hold a
substantial edge over the same algorithms managed through the same hardware
with the message passing libraries.  This was where I jumped in -- when
Patrick made much the same statement.  

This (if true) is a shame, and is likely due to the assumptions that
have gone into the implementation of the commands, some of which date
back to big iron supercomputer days where the hardware was VERY
DIFFERENT from today but ABSOLUTELY UNIFORM within a given hardware
platform, so that "universal" tuning was indeed possible.  Maybe it's
time to reassess these assumptions.  I am therefore trying to suggest
that instead of "fixing" the collectives to work better for optimal
latency at the expense of bw or vice versa (without even MENTIONING the
wide array of hardware the same command is supposed to "transparently"
achieve this miracle on) it might be better to work the other way -- add
some very low level primitives that do little more than encapsulate and
manage the annoying aspects of raw interfaces while still permitting
their "direct" use.

THEN implement PVM and MPI both on top of those low level primitives --
why not?  The differences are all higher order interface things --
ultimately what they do is move buffers across buses and wires, although
the process would be made a lot easier if there were a shared data
structure and primitives to describe and perform common tasks on a
"cluster" between them.  A coder could then choose to "use a compiler"
(metaphorically the encapsulated primitives) for some or all of their
code and accept the default optimizations, or "use an assembler" (the
primitives themselves) to hand-tune critical parts of their code,
without having to leave the nice safe portable bounds of their preferred
parallel library.  If done really well, it would accomplish the long
discussed merger of PVM and MPI almost as an afterthought with teeny
tweaks (perhaps) of the commands, since they would be based on the same
primitives and underlying data structures, after all.

Just dreaming, I guess. Possibly hallucinating.  That bump on the head,
y'know.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From James.P.Lux at jpl.nasa.gov  Wed Feb 16 13:52:26 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed, 16 Feb 2005 13:52:26 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
	<Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
Message-ID: <6.1.1.1.2.20050216134838.041990c8@mail.jpl.nasa.gov>

At 12:03 PM 2/16/2005, Robert G. Brown wrote:
>On Wed, 16 Feb 2005, Vincent Diepeveen wrote:
>
>
>This (if true) is a shame, and is likely due to the assumptions that
>have gone into the implementation of the commands, some of which date
>back to big iron supercomputer days where the hardware was VERY
>DIFFERENT from today but ABSOLUTELY UNIFORM within a given hardware
>platform, so that "universal" tuning was indeed possible.  Maybe it's
>time to reassess these assumptions.  I am therefore trying to suggest
>that instead of "fixing" the collectives to work better for optimal
>latency at the expense of bw or vice versa (without even MENTIONING the
>wide array of hardware the same command is supposed to "transparently"
>achieve this miracle on) it might be better to work the other way -- add
>some very low level primitives that do little more than encapsulate and
>manage the annoying aspects of raw interfaces while still permitting
>their "direct" use.


>THEN implement PVM and MPI both on top of those low level primitives --
>why not?  The differences are all higher order interface things --
>ultimately what they do is move buffers across buses and wires, although
>the process would be made a lot easier if there were a shared data
>structure and primitives to describe and perform common tasks on a
>"cluster" between them.  A coder could then choose to "use a compiler"
>(metaphorically the encapsulated primitives) for some or all of their
>code and accept the default optimizations, or "use an assembler" (the
>primitives themselves) to hand-tune critical parts of their code,
>without having to leave the nice safe portable bounds of their preferred
>parallel library.  If done really well, it would accomplish the long
>discussed merger of PVM and MPI almost as an afterthought with teeny
>tweaks (perhaps) of the commands, since they would be based on the same
>primitives and underlying data structures, after all.

Isn't this what "self tuning" kinds of packages (ATLAS?) do?  Or, at 
another level, what those horrible MAKE scripts do that attempt to address 
every possible instruction set, hardware, glibc, etc. variation in 
existence (or that some bright soul took it into his mind to come up with 
one weekend after getting home from a Grateful Dead concert).


>Just dreaming, I guess. Possibly hallucinating.  That bump on the head,
>y'know.
>
>    rgb
>
>--
>Robert G. Brown                        http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From lindahl at pathscale.com  Wed Feb 16 16:04:55 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 16 Feb 2005 16:04:55 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
	<Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
Message-ID: <20050217000455.GF2018@greglaptop.internal.keyresearch.com>

On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. Brown wrote:

> Optimizing any particular MPI (or PVM) command for either extreme is
> then like robbing Peter to pay Paul, when Peter and Paul are a single
> bicephalic individual that has to pay protection money to the mob for
> every theft transaction (oh how I just LOVE to fold, spindle and
> mutilate metaphors).  

Um, most MPI implementations have at least 3 algorithms, for short,
long, and very long messages. So are they all breaking your rule?

It's *unoptimizing* some of the cases that's at question. Most MPIs
unoptimize compute/communication overlap with long messages, because
it's hard work to get that right without hurting all short messages.
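
As an illustration of the size-based switching described above, here is a minimal C++ sketch. The protocol names and both byte cutoffs are assumptions made for the example; the thread does not name the algorithms, and real MPI libraries choose their own cutoffs per interconnect.

```cpp
#include <cstddef>

// Hypothetical labels for the three message-size regimes. "Eager" and
// "rendezvous" are common terms; the third name is invented for this sketch.
enum class Protocol { Eager, Rendezvous, PipelinedRendezvous };

// Hypothetical cutoffs in bytes (assumed values, not from any real library).
struct Thresholds {
    std::size_t eager_max = 16 * 1024;         // upper bound for "short"
    std::size_t rendezvous_max = 1024 * 1024;  // upper bound for "long"
};

// Pick a send protocol from the message size alone.
inline Protocol choose_protocol(std::size_t bytes,
                                const Thresholds& t = Thresholds{}) {
    if (bytes <= t.eager_max) return Protocol::Eager;
    if (bytes <= t.rendezvous_max) return Protocol::Rendezvous;
    return Protocol::PipelinedRendezvous;
}
```

The point of the sketch is only that the decision is made inside the library: the caller issues the same send call at every size and never sees which branch ran.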

-- greg


From rgb at phy.duke.edu  Thu Feb 17 07:14:48 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 17 Feb 2005 10:14:48 -0500 (EST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050217000455.GF2018@greglaptop.internal.keyresearch.com>
References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
	<Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
	<20050217000455.GF2018@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.58.0502170833510.3892@lilith.rgb.private.net>

On Wed, 16 Feb 2005, Greg Lindahl wrote:

> On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. Brown wrote:
> 
> > Optimizing any particular MPI (or PVM) command for either extreme is
> > then like robbing Peter to pay Paul, when Peter and Paul are a single
> > bicephalic individual that has to pay protection money to the mob for
> > every theft transaction (oh how I just LOVE to fold, spindle and
> > mutilate metaphors).  
> 
> Um, most MPI implementations have at least 3 algorithms, for short,
> long, and very long messages. So are they all breaking your rule?

No, as I noted later in the (yes, long:-) message.  That's what there
should be.  However, the fact that they do so isn't clear to the user,
and the user has no control over it (that I can see in the standard).
Users have to trust the implementation to do the right thing.

> It's *unoptimizing* some of the cases that's at question. Most MPIs
> unoptimize compute/communication overlap with long messages, because
> it's hard work to get that right without hurting all short messages.

Again, I think that we are in agreement.  All I was ultimately suggesting
is this: message passing libraries contain complex higher level commands
that make optimization decisions (including the decision not to
optimize), and the result may not be optimal for a significant number of
cases.  Such libraries might benefit from exposing the lower level
primitives from which the complex commands are built, so users can roll
their own within the library without having to resort to raw networking.

You might not agree with this suggestion, but it is as you say the point
in question.  As I also said, I'm not an MPI expert by any means and
therefore have to go look up commands beyond the MPI 1 standard (which
look not horribly unlike the PVM command set as far as communication is
concerned) and am probably shaky there, but looking them up on the
mpi-forum.org site, it looks like MPI 2 adds MPI_PUT, MPI_GET,
MPI_ACCUMULATE which are just exactly what I was suggesting and what I
would have hoped for, especially if they are indeed the primitives from
which at least some of the higher order commands are built.  If so,
users can either choose to use the optimized/unoptimized higher level
commands provided or (if they understand their problem and hardware)
roll their own.

This is the distinction I was talking about.  MPI originally passed
messages at a high level of abstraction to wrap a variety of mechanisms
in use on big supercomputers (not forgetting that it was a consortium of
the vendors of such supercomputers that wrote the standard in response
to pressure from the government and other major consumers who were tired
of rewriting code every time a new supercomputer was released with its
own internals and API for moving data between processors/processes).  It
(I think deliberately) avoided providing any sort of interface that
might be interpreted as a "thin" wrapper to those internals that were
responsible for minimal latency, maximum bandwidth movement of data.
Whether this was to make the government happy (hiding the detail) or to
make themselves happy (leaving a purchaser of a supercomputer with an
incentive to write optimizations in their native API and hence become
"hooked" on the hardware) is a moot point.

PVM has a different, but related, history.  It was built on top of
networking from the beginning, more or less, and was deliberately
designed to hide the networking primitives (specifically) from the
programmer where MPI might have been hiding shared memory primitives and
create a "virtual machine" where MPI was running on REAL machines.  It
if anything went out of its way to avoid RMA-like message passing
commands that "look" like a wrapper to shared memory following instead a
fairly simple reliable message transmission model and in the end (3.x)
had almost exactly the same range and general form of commands as MPI
1.x for the bulk of what a user was likely to do, with maybe a bit nicer
control interface over the virtual machine and a bit less control over
collective operations.

Looking over (for the first time) the MPI 2 additions, I have to say
that they look very nice, possibly nice enough to finally consider
switching to MPI from PVM.  Alternatively, it is something that should
be cloned in PVM -- PVM would really benefit from PVM_GET, PVM_PUT, and
some synchronization primitives.  Provision of what amount to wrappers
on raw RMA primitive commands (that can be/should be tuned for the
hardware) and the separation of the RMA part and any synchronization
components mean that a serious programmer has a lot of ability to
control and optimize (assuming only that these commands truly are
implemented as primitives as used to develop the higher level commands)
without leaving the library, while people are able to use the higher
level collectives when they are either a good match for their task or if
they are a beginner and not ready to tackle lower level programming.

The only thing I still don't find (on a fairly rapid lookover) is a
discussion on just what e.g. broadcast does or how to make it vary what
it does.  Part of this of course doesn't belong in a standards document
which isn't intended to describe algorithms or implementations at that
level of detail.  However, one part does.  I think it matters a great
deal to the programmer to know whether or not broadcast (and other
commands) are indeed hardware primitive or if they are implemented on
top of point-to-point communications primitives that may or may not
involve diverting intermediary processors from their running tasks (and
ditto for scatter/gather type operations).  

This seems like it might be a programming decision point for people who
really want to hand-optimize their code.  Again, this is based on my
experiences in PVM, where I've tried using broadcast several times in
master/slave contexts expecting to reduce latency and communications
times only to find that the command was de facto serialized and in fact
took as long or longer than just running a loop over point to point
communications calls.  Perhaps MPI does it better, or differently, but
it doesn't LOOK like it is anything but a black box which can swing from
being good on one network to terrible on another without warning.
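
To make the serialized-versus-tree distinction concrete, here is a rough cost model in C++. It is a sketch under stated assumptions: each message costs latency plus size over bandwidth, a "serialized" broadcast is the root sending to every other node in turn, and the alternative is a binomial tree in which nodes that already hold the data forward it. None of this describes what any particular PVM or MPI implementation actually does.

```cpp
// Time for one point-to-point message: latency + bytes / bandwidth.
inline double p2p_time(double latency_s, double bytes, double bytes_per_s) {
    return latency_s + bytes / bytes_per_s;
}

// Root sends to each of the other (nodes - 1) nodes, one after another.
// This is the same cost as a hand-written loop over point-to-point sends.
inline double serialized_bcast_time(int nodes, double latency_s,
                                    double bytes, double bytes_per_s) {
    return (nodes - 1) * p2p_time(latency_s, bytes, bytes_per_s);
}

// Number of rounds a binomial tree needs: ceil(log2(nodes)).
// The set of nodes holding the data doubles every round.
inline int tree_rounds(int nodes) {
    int rounds = 0;
    for (int reached = 1; reached < nodes; reached *= 2) ++rounds;
    return rounds;
}

// Tree broadcast: fewer rounds, but intermediate nodes must stop and forward.
inline double tree_bcast_time(int nodes, double latency_s,
                              double bytes, double bytes_per_s) {
    return tree_rounds(nodes) * p2p_time(latency_s, bytes, bytes_per_s);
}
```

Under this model a serialized broadcast can never beat the point-to-point loop, which matches the PVM experience described above; the tree wins on paper for larger node counts, at the price of interrupting the intermediate processors.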

How to implement such a thing in a standard is an open question, but
from a programmer interface point of view having a set of commands that
can query and set variables to control the back end behavior of
collectives or determine properties of the hardware in the cluster would
be very useful.  Just one creative idea might be for MPI to provide an
optional initialization command to run on a cluster that builds a table
of quiescent-state and cpu-loaded-state latencies for short, medium, and
long messages both point to point and in collective mode.  The same
table might hold some flags describing the selected hardware device such as
hw_bcast=TRUE along with the broadcast latency.
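
A sketch of what such a table could look like, in C++. No call like this exists in the MPI standard; every type, field, and function name below is hypothetical, and the size-class cutoffs are arbitrary assumptions. Only the idea (latencies by load state and message size, plus a hw_bcast flag) comes from the paragraph above.

```cpp
#include <array>
#include <cstddef>

enum class SizeClass { Short = 0, Medium = 1, Long = 2 };
enum class LoadState { Quiescent = 0, CpuLoaded = 1 };

// Hypothetical per-cluster profile, indexed as [load state][size class].
struct LinkProfile {
    std::array<std::array<double, 3>, 2> p2p_latency_us{};         // point to point
    std::array<std::array<double, 3>, 2> collective_latency_us{};  // software collective
    bool hw_bcast = false;          // does the hardware offer a real broadcast?
    double bcast_latency_us = 0.0;  // its latency, meaningful only if hw_bcast
};

// Assumed cutoffs: up to 1 KiB is short, up to 64 KiB is medium.
inline SizeClass classify(std::size_t bytes) {
    if (bytes <= 1024) return SizeClass::Short;
    if (bytes <= 64 * 1024) return SizeClass::Medium;
    return SizeClass::Long;
}

// True if the table predicts a broadcast beats a loop of point-to-point sends.
inline bool prefer_broadcast(const LinkProfile& p, LoadState load,
                             std::size_t bytes, int receivers) {
    const std::size_t l = static_cast<std::size_t>(load);
    const std::size_t s = static_cast<std::size_t>(classify(bytes));
    const double loop_cost = receivers * p.p2p_latency_us[l][s];
    const double bcast_cost =
        p.hw_bcast ? p.bcast_latency_us : p.collective_latency_us[l][s];
    return bcast_cost < loop_cost;
}
```

A program could fill the profile once at startup from microbenchmarks and then branch on prefer_broadcast() instead of hard-coding one choice per network.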

From this one might be able to build portable MPI programs that run
optimally on Myrinet while they still run optimally on gig Ethernet,
with or without e.g. a hardware RDMA command that significantly affects
and redistributes the CPU loading per message.

But maybe this is all too complicated, or doesn't belong in the standard
per se.  It is indeed like the ATLAS thing, but then, I think that ATLAS
is sheer genius although it is also cumbersome and clunky to build...;-)
I just dream of the day that ATLAS-like runtime optimization isn't so
clunky and is based on tools that create tables of microbenchmark
numbers that ARE sufficiently accurate and rich to achieve
near-optimization without running a build loop that sweeps and searches
a high-dimensional space...:-)

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rossen at VerariSoft.Com  Mon Feb 14 22:27:41 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Tue, 15 Feb 2005 01:27:41 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
Message-ID: <4211965D.5080704@verarisoft.com>

Rob, I agree that by now it is well understood that by providing a very 
flexible API with a rich set of semantics, MPI may have missed some 
opportunities for accelerating message passing in some constrained 
cases. Many of us have seen codes that not only use just the famous 6 
MPI functions, but also avoid wild cards and out-of-order messages. As a 
result, these codes pay for services they don't use.

As far as the predicted end-of-life for MPI, I wouldn't necessarily bet 
on it. As often happens, the technical reasons may have little to do 
with the issue. By now MPI has had penetration in so many long-term 
programs that it will be around for quite a while. Of course, this does 
not mean that there would not be attempts to "fix" it or replace it with 
something else. This might in fact be a good thing - natural evolution 
of technology.

Rossen
Verari Systems Software

Rob Ross wrote:
> Rossen,
> 
> It would be good to mention that you work for a company that sells an
> implementation specifically designed for facilitating overlapping, in case
> people don't know that.  Clearly you guys have thought a lot about this.
> 
> The last two Scalable OS workshops (the only two I've had a chance to 
> attend), there was a contingent of people that are certain that MPI isn't 
> going to last too much longer as a programming model for very large 
> systems.  The issue, as they see it, is that MPI simply imposes too much 
> latency on communication, and because we (as MPI implementors) cannot 
> decrease that latency fast enough to keep up with processor improvements, 
> MPI will soon become too expensive to be of use on these systems.
> 
> Now, I don't personally think that this is going to happen as quickly as
> some predict, but it is certainly an argument that we should be paying
> very careful attention to the latency issue, because as MPI implementors 
> this is an argument that never seems to end.
> 
> Also, there is additional overhead in the Isend()/Wait() pair over the
> simple Send() (two function calls rather than one, allocation of a Request
> structure at the least) that means that a naive attempt at overlapping
> communication and computation will result in a slower application.  So
> that doesn't surprise me at all.
> 
> I think that the theme from this thread should be that "it's a good thing
> that we have more than one MPI implementation, because they all do
> different things best."
> 
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
> 
> 
> On Mon, 14 Feb 2005, Rossen Dimitrov wrote:
> 
> 
>>There is quite a bit of published data that for a number of real 
>>application codes modest increase of MPI latency for very short messages 
>>has no impact on the application performance. This can also be seen by 
>>doing traffic characterization, weighing the relative impact of the 
>>increased latency, and taking into account the computation/communication 
>>ratio. On the other hand, what you give the application developers with 
>>an interrupt-driven MPI library is a higher potential for effective 
>>overlapping, which they could choose to utilize or not, but unless they 
>>send only very short messages, they will not see a negative performance 
>>impact from using this library.
>>
>>There is evidence that re-coding the MPI part of an application to take 
>>advantage of overlapping and asynchrony when the MPI library (and 
>>network) supports these well actually leads to real performance benefit.
>>
>>There is evidence that even without changing anything in the code, but 
>>by just running the same code with an MPI library that plays nicer to 
>>the system leads to better application performance by improving the 
>>overall "application progress" - a loose term I used to describe all of 
>>the complex system activities that need to occur during the life-cycle 
>>of a parallel application not only on a single node, but on all nodes 
>>collectively.
>>
>>The question of short message latency is connected to system scalability 
>>in at least one important scenario - running the same problem size as 
>>fast as possible by adding more processors. This will lead to smaller 
>>messages, much more sensitive to overhead, thus negatively impacting 
>>scalability.
>>
>>In other practical scenarios though, users increase the problem size as 
>>the cluster size grows, or they solve multiple instances of the same 
>>problem concurrently, thus keeping the message sizes away from the 
>>extremely small sizes resulting from maximum scale runs, thus limiting 
>>the impact of shortest message latency. I have seen many large clusters 
>>whose only job run across all nodes is HPL for the top500 number. After 
>>that, the system is either controlled by a job scheduler, which limits 
>>the size of jobs to about 30% of all processors (an empirically derived 
>>number that supposedly improves the overall job throughput), or it is 
>>physically or logically divided into smaller sub-clusters.
>>
>>All this being said, there is obviously a large group of codes that use 
>>small messages no matter what size problem they solve or what the 
>>cluster size is. For these, the lowest latency will be the most 
>>important (if not the only) optimization parameter. For these cases, 
>>users can just run the MPI library in polling mode.
>>
>>With regard to the assessment that every MPI library does (a) partly 
>>right I'd like to mention that I have seen behavior where attempting to 
>>overlap computation and communication can lead to no performance 
>>improvement at all, or even worse, to performance degradation. This is 
>>one example of how a particular implementation of a standard API can 
>>affect the way users code against it. I use a metric called "degree of 
>>overlapping" which for "good" systems approaches 1, for "bad" systems 
>>approaches 0, and for terrible systems becomes negative... Here goodness 
>>is measured as how well the system facilitates overlapping.
>>
>>Rossen
> 
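
The "degree of overlapping" metric mentioned in the quoted message is not defined in this thread. The C++ sketch below uses one plausible definition, chosen only because it reproduces the stated behavior (near 1 for good systems, near 0 for bad ones, negative for terrible ones); the formula and the function name are assumptions, not the author's.

```cpp
#include <algorithm>

// Hypothetical definition:
//   (t_comp + t_comm - t_total) / min(t_comp, t_comm)
// t_comp:  time for the computation alone
// t_comm:  time for the communication alone
// t_total: time when the two are issued together (e.g. Isend, compute, Wait)
// The numerator is the time saved relative to running them back to back;
// the denominator is the most that could possibly be hidden.
inline double degree_of_overlap(double t_comp, double t_comm, double t_total) {
    const double hideable = std::min(t_comp, t_comm);
    if (hideable <= 0.0) return 0.0;  // nothing to overlap
    return (t_comp + t_comm - t_total) / hideable;
}
```

With 10 units of computation and 4 of communication, a total of 10 gives 1 (fully hidden), a total of 14 gives 0 (no overlap), and a total of 16 gives -0.5 (attempting the overlap made things slower).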


From philippe.blaise at cea.fr  Tue Feb 15 00:52:53 2005
From: philippe.blaise at cea.fr (Philippe Blaise)
Date: Tue, 15 Feb 2005 09:52:53 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <web-520595@free.net>
References: <web-520595@free.net>
Message-ID: <4211B865.6050400@cea.fr>

Mikhail Kuzminsky wrote:

>
> Let me ask a stupid question: which MPI implementations really allow
>
> a) to overlap MPI_Isend w/computations
> and/or b) to perform a set of subsequent MPI_Isend calls faster than 
> "the same" set of MPI_Send calls ?
>
Dear Mikhail,

sorry if it's not a direct answer to your question, but it could help.

There is a potential difficulty when you try to overlap MPI_Isend with
some computations: generally you do it on a cluster of SMP machines, and
the performance of the overlapping depends a lot on the placement of the
processes on the SMP nodes.

On one hand, if some of the process pairs that do the MPI_Isend / Irecv
are on the same node, you won't be able to overlap communications with
computations, but of course the communications should be faster for
large messages using shared memory than using the NIC.

On the other hand, if the process pairs are on different nodes, for
large messages the communication time using the NIC is larger than the
time for doing the same communication using shared memory, but of course
if your NIC (like the Quadrics one, for example) is able to do some
overlap you will save some time.

Quadrics (again, but maybe it's true for other network technologies)
provides a way to use the NIC even for intra-node communication; but as
a consequence you will share the NIC between intra-node and inter-node
communications, and the potential benefit is not so clear.

So don't expect too much from overlapping communication with
computation: it's very hard to tune, and it depends a lot on the
placement of your program on the SMP nodes, the NIC functionalities, and
the scheme you use for the communications!

If you have enough time, you could have a look at another approach:
a mixed OpenMP/MPI programming scheme.
 
Regards,

  Phil.

 
From ole at scali.com  Tue Feb 15 01:28:10 2005
From: ole at scali.com (Ole W. Saastad)
Date: Tue, 15 Feb 2005 10:28:10 +0100
Subject: [Beowulf] Re: Re: Re: Home beowulf - NIC latencies (Patrick
	Geoffray)
In-Reply-To: <200502150010.j1F09vpb024195@bluewest.scyld.com>
References: <200502150010.j1F09vpb024195@bluewest.scyld.com>
Message-ID: <1108459690.25145.8.camel@pc-2.office.scali.no>


Patrick Geoffray wrote:
> There are more exotic work-arounds, like using 1) and polling at the 
> same time, and hiding the interrupt overhead with some black magic on 
> another processor. The one with the best potential would be to use 
> HyperThreading on Intel chips to have a polling thread burning cycles 
> continuously; it will run in-cache, won't use the FP unit or waste 
> memory cycles. A perfect use for the otherwise useless HT feature. I 
> wonder why nobody went that way...
> 


I have tried this in order to see if we could poll a memory location
for free using Intel HT. I ran a kinetics program (a small
Runge-Kutta stepping of equations) simultaneously with a small
loop checking the content of a memory location then issuing a
PAUSE instruction and repeating the loop.
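
A polling loop of the kind described can be sketched in C++ as follows. This is not the code used in the test above; the function names, the use of std::atomic, and the spin limit are assumptions added for illustration. _mm_pause() is the compiler intrinsic for the x86 PAUSE instruction.

```cpp
#include <atomic>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // x86 PAUSE: hint that this is a spin-wait
#else
inline void cpu_relax() {}                // no-op fallback on other architectures
#endif

// Spin until `flag` becomes nonzero or `max_spins` iterations have passed.
// Returns the number of PAUSE iterations executed.
inline std::uint64_t poll_flag(const std::atomic<int>& flag,
                               std::uint64_t max_spins) {
    std::uint64_t spins = 0;
    while (flag.load(std::memory_order_acquire) == 0 && spins < max_spins) {
        cpu_relax();
        ++spins;
    }
    return spins;
}
```

In an experiment like the one described, this loop would run on one hardware thread of an HT core while the compute kernel runs on the sibling thread, with some other agent eventually storing a nonzero value into the flag.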

The simple finding is that the kinetics program got somewhat more 
than 70% of the CPU cycles and that the polling wasted close to
30% of the CPU cycles. 30% is not free.


After this test I finally decided to forever leave HT off.


Ole

-- 
Ole W. Saastad, Dr.Scient. 
Manager Cluster Expert Center
dir. +47 22 62 89 68
fax. +47 22 62 89 51
mob. +47 93 05 74 87
ole at scali.com

Scali - www.scali.com
High Performance Clustering


From rene at renestorm.de  Tue Feb 15 02:37:23 2005
From: rene at renestorm.de (rene)
Date: Tue, 15 Feb 2005 11:37:23 +0100
Subject: [Beowulf] Block send mpi
In-Reply-To: <Pine.LNX.4.44.0502141335310.30236-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502141335310.30236-100000@coffee.psychology.mcmaster.ca>
Message-ID: <200502151137.23227.rene@renestorm.de>

Hi Mark,

please review this one more time.
Maybe I understood it now.

 int packsize;
 MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
- Calculates the memory demand (packsize) in bytes  needed for count (bit) of 
MPI_INTs.

 int bufsize = packsize + (MPI_BSEND_OVERHEAD);
- Adds the overhead
 char *buf = new char[bufsize];
- allocates the needed buffer in bytes.

MPI_Buffer_attach (buf, bufsize);
- attaches the buffer  
bsend->ierr =	MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0,  newcomm);
- sends the data
MPI_Buffer_detach (&buf, &bufsize);
- Detaches it

Regards,
Rene

> > int packsize;
> > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
>
> I would expect packsize to be counting bytes here.
>
> > int bufsize = packsize + (MPI_BSEND_OVERHEAD);
> > // 		  void *buf = new (void (*[packsize]) ());
> > int *buf = new (int ([packsize]));
>
> but here you have allocated an array of ints where the number
> of elements is packsize.  that means you have 4x too many bytes.
>
> >       bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, 
> > newcomm);
>
> bear in mind that &testdata[0] is legal but redundant -
> it means the same thing as bare 'testdata'.


-- 
Rene Storm
@Cluster

Linux Cluster Consultant
Hamburgerstr. 42e
D-22952 Luetjensee
mailto:Rene at ReneStorm.de
Voice-IP: Skype.com, Rene_Storm


From Dries.Kimpe at cs.kuleuven.ac.be  Tue Feb 15 05:06:40 2005
From: Dries.Kimpe at cs.kuleuven.ac.be (Dries Kimpe)
Date: Tue, 15 Feb 2005 14:06:40 +0100
Subject: [Beowulf] Block send mpi
In-Reply-To: <200502141204.45263.rene@renestorm.de>
References: <200502130529.58915.rene@renestorm.de>	<42103A3D.8020605@scalableinformatics.com>
	<200502141204.45263.rene@renestorm.de>
Message-ID: <4211F3E0.1060207@cs.kuleuven.ac.be>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

rene wrote:
| Hi Joe,
|
| here is some output and changes which solves the problem.
| I don't know why I created a void buffer and sent an int array.
| After creating an int buffer I was also able to delete it ;o)
|

There are still some mistakes left:

| int *buf = new (int ([packsize]));
|
[...]
| delete buf;
|

The last line should be:

delete[] (buf);

Also, MPI_Pack_size returns the needed buffer size in bytes.
You are requesting (sizeof(int)*packsize) bytes where packsize bytes
would suffice.

~   Greetings,
~   Dries
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCEfPfv/8puanD4GoRArI9AJ0cmdT4m+Q7e9jvYhTZbbHviUDmQACglnTS
CqBqJ/GqpaHJjM7jI0MGkJc=
=nqLt
-----END PGP SIGNATURE-----


From kdunder at sandia.gov  Tue Feb 15 06:31:12 2005
From: kdunder at sandia.gov (Keith D. Underwood)
Date: Tue, 15 Feb 2005 07:31:12 -0700
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <421194C4.5050808@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>
	<420D1801.9090206@isaacdooley.com>
	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>
	<420D54DA.8000904@uiuc.edu>
	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com>
Message-ID: <1108477871.4587.115.camel@s861954.sandia.gov>


> Looking for overlapping is actually not that hard:
> a) look for medium/large messages, don't waste time on small ones.

I contend that this particular item is bad advice.  If you send a lot of
small messages, you should use MPI_Isend there as well to give the MPI
implementation every opportunity to do the right thing.  As we go
forward, end-to-end acknowledgments are going to become a reality.  The
last thing you want is to spend a round-trip delay on every message you
send if you send a lot of them.  Yes, the implementation can copy on the
sending side to allow the send to complete, but that wastes memory and
time.  

					Keith


From kdunder at sandia.gov  Tue Feb 15 06:34:50 2005
From: kdunder at sandia.gov (Keith D. Underwood)
Date: Tue, 15 Feb 2005 07:34:50 -0700
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211A95F.2010709@myri.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com>
Message-ID: <1108478089.4587.118.camel@s861954.sandia.gov>


> c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
> with this convenience.

Um, our apps guys say this is more than a convenience.  Apparently,
sometimes you don't exactly know who you are going to receive from. 
Would you rather them post receives from 4000 nodes and cancel the ones
that don't send to that node after a while?

					Keith


From kdunder at sandia.gov  Tue Feb 15 06:51:34 2005
From: kdunder at sandia.gov (Keith D. Underwood)
Date: Tue, 15 Feb 2005 07:51:34 -0700
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211C559.8070100@myri.com>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4211C559.8070100@myri.com>
Message-ID: <1108479093.4587.132.camel@s861954.sandia.gov>


> Throw away compatibility. If you keep the legacy API, you have no 
> incentive for change. I don't want MPI-3, I want MPI-light. We are 
> against a wall because the MPI spec was too rich and developers took the 
> lazy path.

Inertia is a powerful thing.  Billions of dollars have been invested in
MPI codes.  Changing that will not be easy (or cheap).  This is not as
simple as moving from vectors to distributed memory - there wasn't
nearly as much accumulated code then (and, it hurt back then). 

> It's used because it's there, there is no other reason. If you don't 
> know who sends you what in a message passing application, then you 
> cannot get either performance or robustness. If really you cannot do 
> otherwise (and I don't believe that), you can always use unexpected 
> messages (post the receive after Probe()ing), That's ugly, but you get 
> what you deserved :-)

That just isn't true.  If I don't know how many messages I will get, or
from whom, but I can bound it, then I should prepost those receives. 
This is particularly true in your standard physics code that runs for
days and does thousands of time steps. (i.e. you can maintain a circular
queue of these things).

> If you don't use user-defined datatypes, then you don't need it and it 
> should not be there in the first place. It's a temptation, it's too 
> easy. No, there is no way to implement them efficiently unless they are 
> regular, and this is what I am willing to keep: strided types with long 
> segments. Everything else leads to memory copies. The developer should 
> wipe his own bottom instead of asking the message passing interface to 
> work around bad data layout. Sending a column of blocks, yes, that's 
> regular stride and it makes a lot of sense. Sending non-contiguous 
> irregular structure ? As we used to say in France, $100 and a chocolate 
> bar with that ?

The user should always expose as much opportunity for optimization as
possible to the MPI layer.  e.g. a load-store architecture like the X1
(not what I am advocating for MPI performance, mind you) could do
excellent datatype processing.  You would rather the user do the
gather/scatter themselves to prohibit the MPI from being able to do it?
Not that anyone uses irregular MPI datatypes because they were so bad
for so long...  but it would be nice if that were exposed to MPI.

					Keith


From mhyoung at valdosta.edu  Tue Feb 15 06:56:03 2005
From: mhyoung at valdosta.edu (michael young)
Date: Tue, 15 Feb 2005 09:56:03 -0500
Subject: [Beowulf] Poor man's SANS - THANK YOU!!!!!!
In-Reply-To: <Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
References: <42110F0A.10408@valdosta.edu>
	<Pine.LNX.4.58.0502142243530.1141@terra.mcs.anl.gov>
Message-ID: <42120D83.6080804@valdosta.edu>

Thank you so much to everyone who replied.
All the info y'all provided should keep me busy
for some time.  :)
Again, thank you all very much.

Michael


Rob Ross wrote:

>Yes!
>
>PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).  My
>group at ANL along with Clemson University and Ohio Supercomputer Center
>and others are developing this.  It's entirely open source and open
>development, and is in production use at ANL, OSC, and the University of 
>Utah CHPC, among other places.
>
>GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that
>RPMs are available for it now through one source or another.  This used to 
>be Sistina's product, who was subsequently bought by RedHat.  I'm sure 
>this is used in production in many business environments, and we use it at 
>ANL also.  Can someone provide a URL for this one?
>
>Lustre (www.lustre.org) is another option.  This one is heavily funded by
>the DOE ASC laboratories and is in use on some very large parallel
>machines.  But unless you have a relationship with CFS you can only get a
>crippled version of the source, so it's probably not a good option for
>average joe.  If they change their policy on releasing source code, this 
>would be worth reconsidering.
>
>Regards,
>
>Rob
>---
>Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
>
>
>On Mon, 14 Feb 2005, michael young wrote:
>
>  
>
>>Hi,
>>Can I use beowulf or some other Linux cluster or HA Linux solution
>>to pool hard drive space together from different computers to make a
>>kinda "poor man's SANS"?
>>
>>thank you
>>Michael
>>    
>>
>
>  
>


From rossen at VerariSoft.Com  Tue Feb 15 06:34:39 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Tue, 15 Feb 2005 09:34:39 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211CB13.3050902@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<1108398183.8243.54.camel@localhost.localdomain>	<Pine.LNX.4.58.0502141104530.1141@terra.mcs.anl.gov>	<1108402962.8265.25.camel@localhost.localdomain>	<4211B891.6020406@ccrl-nece.de>
	<4211CB13.3050902@myri.com>
Message-ID: <4212087F.6070809@verarisoft.com>

> A last remark. I really think that the argument of using the same
> swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even
> MPI/Pro to infer interconnect characteristics is even worse than
> looking at latency and bandwidth alone. These implementations are never
> going to be designed to use all hardware efficiently; their design is
> either historic (Scali used to provide software for SCI alone) or
> politically motivated (Intel is using uDAPL, hmm, wonder why), or both.
> They are by-products of the MPI Forum's failure to make the Standard
> practical (compatible ABI).
> 
> Patrick

Patrick, this is quite a broad statement. Four years ago we had a paper
arguing that MPIs written to support many different interconnects and
messaging technologies through internal portability layers were probably
sub-optimal for at least some of the interconnects. Most of the reasons
are obvious. At the time we were dealing with Portals, LAPI, and GM. You
can easily see why an internal portability layer over these
interfaces does not easily match the semantics of any one of
them. We probably did something in our design to reflect this.


From rossen at VerariSoft.Com  Tue Feb 15 06:28:28 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Tue, 15 Feb 2005 09:28:28 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4211B0E0.6030007@ccrl-nece.de>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<web-520595@free.net>	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>	<42112719.4060500@verarisoft.com>	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>	<4211A95F.2010709@myri.com>
	<4211B0E0.6030007@ccrl-nece.de>
Message-ID: <4212070C.9050207@verarisoft.com>

> 
> 
> One person alone can't do this. The best place to discuss such things is 
> the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy).
> 
> Also, adding mpi.h to the standard to define an ABI is a good thing.
> 
>  Joachim
> 

In a conversation with MPI and tool developers, I once mentioned that
not defining a standard/mandatory mpi.h was probably a missed 
opportunity for improving interoperability of MPI. I was then told by a 
member of the MPI-1 Forum that this was done on purpose. This makes me 
think that we will not see an ABI definition for MPI any time soon.

Rossen


From rossen at VerariSoft.Com  Tue Feb 15 07:41:32 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Tue, 15 Feb 2005 10:41:32 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <421194C4.5050808@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com>
Message-ID: <4212182C.60607@verarisoft.com>


> 
> So if you run an MPI application and it sucks, this is because the 
> application is poorly written ?

Patrick, here the argument is about whether and how you "measure" the 
"performance of MPI". I guess you may have missed some of the preceding 
postings.

> 
> You don't want to benchmark an application to evaluate MPI, you want to 
> benchmark an application to find the best set of resources to get the 
> job done. If the code stinks, it's not an excuse. Good MPI 
> implementations are good with poorly written applications, but still let 
> smart people do smart things if they want.

This is exactly my point made in my previous posting - you cannot design 
a system that is optimal in a single mode for all cases of its use when 
there are multiple parameters defining the usage and performance 
evaluation spaces. And this is the reason why we provide both {polling 
synchronization/polling progress} and {interrupt-driven 
synchronization/independent progress} MPI modes (we have published 
papers defining a space based on MPI design choices). With these modes 
we can at least increase the chance that the user can get a better match 
to his scenario.

>> - What application algorithm developers experience when they attempt to
>> use the ever so nebulous "overlapping" with a polling MPI library and
> 
> Overlapping is completely orthogonal to polling. Overlapping means that
> you split the communication initiation from the communication
> completion. Polling means that you test for completion instead of waiting
> for completion. You can perfectly well overlap and check for completion of
> the asynchronous requests by polling; nothing wrong with that.

Well, I would probably have to say that I don't agree with this. First,
I think it is fairly easy to show that overlapping and polling (or any
kind of communication completion synchronization) are not orthogonal. If
they were, you would see codes that show perfect overlapping
running on any MPI implementation/network pair. I am sure there is
plenty of evidence this is not the case.

There is an important point here that needs to be clarified: when I say 
"polling" library, I assume that this library does both: polling 
completion synchronization and polling progress. There is not much room
here to define these, but I am sure MPI developers know what they are.

If polling and overlapping were orthogonal, the following would all have
to be true:
1. You have a perfect network engine that takes no resources that might
be used by computation when you either push bytes out or poll for completion.
2. Once you start a request (e.g., MPI_Isend), the execution of this
communication request takes no CPU.
3. You have a very cheap, bounded-duration polling operation from
which you return immediately after it checks for your particular
communication request.
4. You have something else to do when the polling operation reports
that your request is not done.

I would argue that none of these are true in practical scenarios, even 
including very smart polling schemes or networks with DMA engines, like 
Myrinet.
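
For readers following the terminology, the split Patrick describes
(initiation, test, completion) can be sketched without MPI. This is only an
illustration under strong assumptions: a Python thread that sleeps stands in
for the network, so the "transfer" consumes no CPU, which is exactly
condition 2 above and is not something a real MPI library guarantees. The
names transfer and overlap_with_polling are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transfer(payload):
    """Stand-in for a communication request; sleeping models wire time only."""
    time.sleep(0.05)
    return len(payload)


def overlap_with_polling(payload, work_items):
    """Start a transfer, compute while polling for completion, then finish."""
    pending = list(work_items)
    total = 0
    polls = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(transfer, payload)  # initiation, cf. MPI_Isend
        while pending and not request.done():     # polling, cf. MPI_Test
            total += pending.pop()                # one unit of computation
            polls += 1
        sent = request.result()                   # completion, cf. MPI_Wait
    total += sum(pending)                         # work left after the transfer
    return sent, total, polls
```

Each poll here is a cheap flag check. Conditions 1 and 3 in the list are
about whether that remains true for a real library's progress engine.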

Here I don't even bring up the case of multithreaded applications. These
are still a fairly small minority.

> 
>> how this experience has contributed to the overwhelming use of
>> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or
>> (even better) persistent MPI calls, thus killing any hope that these
>> codes can run faster on systems that actually facilitate overlapping.
> 
> There are two reasons why developers use blocking operations rather than
> non-blocking ones:
> 1) they don't know about non-blocking operations.
> 2) MPI_Send is shorter than MPI_Isend().

Here is a third one: writing your code for overlapping with non-blocking
MPI calls and segmentation/pipelining, testing the code, and not seeing
any benefit from it.

> 
> 
> Looking for overlapping is actually not that hard:
> a) look for medium/large messages, don't waste time on small ones.
> b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait()
> c) move the MPI_Isend() as early as possible (as soon as data is ready).
> d) move the MPI_Wait() as late as possible (just before the buffer is 
> needed).
> e) do same for receive.

Not quite. Most of the time the message-passing segment of the code you
optimize for overlapping is in the innermost loop of the algorithm - the
one that is most overhead sensitive and usually most optimized. You will
not see common cases where you can "pull" MPI_Send much earlier or push
MPI_Wait much later than where MPI_Send is. So what you usually end up
doing is introducing another loop inside the innermost one, breaking up
the MPI_Send message into a number of segments and pipelining them with
MPI_Isend (or even better MPI_Start) by initiating segment I+1 while
computing with segment I, thus attempting to overlap computation in
stage I with communication in stage I+1. Then there is the question of how
many segments to break the message into for maximum speedup.
Pipelining theory says the more stages the better, provided they are
of equal duration, there are no inter-stage dependencies, and the
stage setup time is low in proportion to the stage execution time. Also,
the size of the segments should be such that the transmission time (not
the whole latency) of the segment is as close as possible to the time of the
computation performed on the segment. I can continue with other factors
that one needs to take into account in order to write a good algorithm
with overlapping.
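
The segment-count trade-off can be made concrete with a toy model. This is
a sketch under idealized assumptions, not a description of any particular
MPI library: overlap between a transfer and a computation is assumed to be
perfect, all segments are equal, and each segment pays a fixed setup cost.
The function names and the numbers in the note below are made up for
illustration.

```python
def serial_time(compute, transfer, setup):
    """Blocking version: the whole transfer and the whole computation in turn."""
    return transfer + setup + compute


def pipelined_time(compute, transfer, setup, segments):
    """Idealized pipeline: segment i+1 is in flight while segment i is computed."""
    c = compute / segments            # computation per segment
    t = transfer / segments + setup   # transmission per segment plus fixed setup
    # One transfer that cannot be hidden, (segments - 1) overlapped stages,
    # and one computation that cannot be hidden.
    return t + (segments - 1) * max(c, t) + c


def best_segment_count(compute, transfer, setup, max_segments=64):
    """Segment count in 1..max_segments with the smallest modeled time."""
    return min(range(1, max_segments + 1),
               key=lambda n: pipelined_time(compute, transfer, setup, n))
```

With 8 time units of computation, 8 of transmission, and a setup cost of
0.5 per segment, the model picks 4 segments; with zero setup cost it keeps
preferring more segments, which matches the "more is better" rule only when
setup is negligible.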

The metric I mentioned earlier "degree of overlapping" with some 
additional analysis can help designers _predict_ whether the design is 
good or not and whether it will work well or not on a particular system 
of interest (including the MPI library).

This is, however, too much detail for this forum, as most of the
postings here discuss much more practical issues :)

Rossen


From steve_heaton at ozemail.com.au  Wed Feb 16 17:10:15 2005
From: steve_heaton at ozemail.com.au (steve_heaton at ozemail.com.au)
Date: Thu, 17 Feb 2005 12:10:15 +1100
Subject: [Beowulf] Re: Academic sites: who pays for the electricity?
Message-ID: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost>

G'day all

Speaking as someone from "industry", and a Project/Programme Manager at that, I'd just like to add that I'm shocked and dismayed at the apparent lack of accountability that seems rampant in academic circles! If it was down to me I'd sack the lot of ya!! ;)

I'd strongly recommend that all good cluster folk have a good idea of their operating expenditure (opex). If you get a visit from the Meanie Beanies (auditors / cost accountants etc etc) then it'll help cover your A. Not having a firm understanding of your $'s in and out is a great way to have your gig cancelled. Happens all the time in Industry and in my job it's a sackable offence. No joke. Do some homework and you won't need to be afraid (OK, *as* afraid) of the Purple Pen People.

Some things to know. Ideally you should be able to quote these with as little as an hour's warning (shows you're on top of things):

-) The amount of floor space you consume (sq ft or m) - don't worry about the cost of this one, those asking will know ;) Becomes a hot topic if you're paying rent in some form.
-) Find out how much electricity you use per hour - chances are you're on one or more dedicated circuit(s) and probably separate metering - look at the bills. Don't worry about general lighting etc. It's often rolled into the floor space calcs.
-) Ditto aircon (include your maintenance)
-) Cluster hardware maintenance (out of warranty stuff, cost of spares) - quoting your amazing uptime can help explain this figure
-) Service contracts (you've got a Service Level Agreement right? Uptime % etc helps explain)
-) Staff / admin costs
-) The good ol' "anything else you can think of"

Now the fun part. Who used your cluster and for how long? Look at your job scheduling etc. Your department? Another department (do you cross charge somehow)? Which projects? What's their contribution to cluster opex?
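
One simple way to answer the "contribution" question is to split the total opex in proportion to usage taken from the scheduler logs. The sketch below assumes usage is already summarized as hours per department or project; the function name, the dictionary keys, and the figures in the usage note are hypothetical.

```python
def allocate_opex(total_opex, usage_hours):
    """Split a total operating cost across users in proportion to usage."""
    total_hours = sum(usage_hours.values())
    if total_hours == 0:
        # Nobody ran anything, so there is nothing to apportion by usage.
        return {name: 0.0 for name in usage_hours}
    return {name: total_opex * hours / total_hours
            for name, hours in usage_hours.items()}
```

For example, with 1000 in opex and 300 versus 100 hours of use, the split comes out at 750 and 250. Proportional sharing is only one policy; a flat fee plus a usage component is equally defensible.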

If you answer reasonably accurately then the Beanies will treat you with some respect :) >>Someone, somewhere is paying your bills already.<< Know where that money is going!

Don't say I didn't warn you ;)

Cheers
Stevo

This message was sent through MyMail http://www.mymail.com.au


From rene at renestorm.de  Tue Feb 15 09:55:49 2005
From: rene at renestorm.de (rene)
Date: Tue, 15 Feb 2005 18:55:49 +0100
Subject: [Beowulf] Block send mpi
In-Reply-To: <200502151724.44168.rene@renestorm.de>
References: <Pine.LNX.4.44.0502150831010.13546-100000@coffee.psychology.mcmaster.ca>
	<200502151724.44168.rene@renestorm.de>
Message-ID: <200502151855.49438.rene@renestorm.de>


> OK nice
>
> > void *buf = new char[bufsize];
>
> would allocate a buffer of size bufsize (sizeof(char)=1), which is
> calculated as the MPI pack size plus the overhead.
>
> Seems to be clear but won't work and results in:
> 0 - MPI_BSEND : Insufficent space available in user-defined buffer
> [0]  Aborting program !
> [0] Aborting program!
> p0_9158:  p4_error: : 321
>
>
> Cu
>Rene
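
One possible cause, offered only as a guess since the full code is not
shown: the MPI standard requires the buffer given to MPI_Buffer_attach to
hold every buffered send that has not yet been delivered, each with its own
MPI_BSEND_OVERHEAD. A buffer sized for a single message can therefore run
out once a second MPI_Bsend is issued before the first has drained. The
arithmetic can be sketched as below; the overhead value used in the note is
a made-up placeholder, because the real MPI_BSEND_OVERHEAD is
implementation-defined, and both function names are hypothetical.

```python
def required_bsend_buffer(pack_sizes, bsend_overhead):
    """Bytes needed to hold all outstanding buffered sends at once."""
    return sum(size + bsend_overhead for size in pack_sizes)


def fits(buffer_size, pack_sizes, bsend_overhead):
    """True if the attached buffer can hold every outstanding message."""
    return buffer_size >= required_bsend_buffer(pack_sizes, bsend_overhead)
```

With a placeholder overhead of 64 bytes, a 1064-byte buffer holds one
1000-byte message but not two outstanding ones.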


From diep at xs4all.nl  Wed Feb 16 02:56:18 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 16 Feb 2005 11:56:18 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216115617.01058100@pop.xs4all.nl>

At 11:07 14-2-2005 -0800, Greg Lindahl wrote:
>On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>
>> Let me ask a stupid question: which MPI implementations really allow
>>  
>> a) to overlap MPI_Isend w/computations
>> and/or 
>> b) to perform a set of subsequent MPI_Isend calls faster than "the 
>> same" set of MPI_Send calls ?
>> 
>> I am asking only about sending large messages.
>
>For large messages, everyone does (b) at least partly right. (a) is
>pretty rare. It's difficult to get (a) right without hurting short
>message performance. One of the commercial MPIs, at first release, had
>very slow short message performance because they thought getting (a)
>right was more important. They've improved their short message
>performance since, but I still haven't seen any real application
>benchmarks that show benefit from their approach.

Perhaps no one who needed low latency bought those NICs in the first place.

A large share of the jobs that the Dutch government's 1024-processor SGI used
to handle ran on 4-8 processors, simply because latency matters.

>-- greg
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From diep at xs4all.nl  Wed Feb 16 04:13:34 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 16 Feb 2005 13:13:34 +0100
Subject: [Beowulf] Mare Nostrum (not quite COTS)
Message-ID: <3.0.32.20050216131329.0106f960@pop.xs4all.nl>

That looks great,

Congratulations on the supercomputer!

Which Myrinet cards are in Mare Nostrum?
What one-way ping-pong latency can it achieve from one end of the machine to
the other?

Vincent

At 11:31 16-2-2005 +0100, Eugen Leitl wrote:
>
>http://www-106.ibm.com/developerworks/library/pa-nl3-marenostrum.html
>
>Power Architecture Community Newsletter, 15 Feb 2005: MareNostrum: A new
>concept in Linux supercomputing
>
>Level: Introductory
>
>developerWorks Power Architecture editors
>IBM
>15 Feb 2005
>
>    The MareNostrum supercomputer at the Barcelona Supercomputing Center,
>ranked number four in the world in speed in November 2004, is constructed of
>such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit
>970FX PowerPC processors, TotalStorage DS4100 storage servers, and Linux 2.6.
>This is its story.
>
>IBM has long been a supercomputing leader -- its heritage of innovation
>currently and spectacularly manifested in its most powerful supercomputer,
>Blue Gene/L. The MareNostrum project is the latest bold experiment in
>supercomputing by IBM -- a small but powerful, rapidly deployed and built
>system that comes entirely from commercially available components. The Latin
>term mare nostrum means "our sea" (which to the Romans meant the
>Mediterranean, as familiar and available to the Italici as the air they
>breathed, but also the critical key to their success).
>
>MareNostrum is one of the world's most powerful supercomputers, ranked among
>the top five in the prestigious TOP500 (see Resources), yet it is constructed
>from products available for sale to any business, lives within a relatively
>small footprint, and was built on a tight schedule using blade servers, a
>Linux operating environment, and other cost-efficient technologies.
>MareNostrum represents a new way of thinking about high-performance
>computing.
>
>Blade servers, some of the most thin and dense machines that can be slid into
>chassis with the ability to share sources such as power and network switches,
>became the base components of this supercomputer design. Those familiar with
>the IBM BladeCenter JS20 servers' shared-resources architecture will
>recognize how these servers cost-effectively minimize power consumption and
>heat output. Running the Linux operating system, the servers exploit the
>capabilities of the 2.6 kernel on 64-bit PowerPC processors.
>
>MareNostrum also demonstrates something very unique in its project timeline:
>Part of its mission was to prove the speed at which IBM Linux clusters could
>be implemented and unleashed. According to the IBM MareNostrum e-Science
>Lead, Dr. Juan Jose Porta (Open Systems Design and Development, IBM
>Boeblingen Laboratory):
>
>    This is all about timely and focused execution. The speed at which this
>project was realized is important. Consider: from the initial concept in late
>December of 2003 to assembling the computer in Madrid took less than a year.
>Normally, these kinds of supercomputer projects take years. 
>
>To make a remarkable saga short, MareNostrum is here and will soon be put
>into operation by the Barcelona Supercomputer Center (BSC), a public
>consortium created by the Spanish Government, the Catalonian Government, and
>the Technical University of Catalonia (UPC), the hosts of the MareNostrum
>supercomputer. The Barcelona Supercomputing Center is located on the
>Polytechnic University of Catalonia (UPC) campus in Barcelona.
>
>Dr. Porta added, "The supercomputer is based upon commodity technology
>already developed and available. We were also playing with another piece of
>magic -- an open environment. This has been a collaborative community effort,
>where we closely worked with our partners."
>
>The name and the history
>Why "MareNostrum?" In the words of Dr. Porta:
>
>    MareNostrum means literally "our sea," which is also the Latin name for
>the Mediterranean Sea on which Barcelona is a port. It carries other apt
>connotations. "Our sea" refers to a sea of processors and professors who are
>flocking to the MareNostrum project with a deep commitment to breakthrough
>science. MareNostrum also refers to the fact that our supercomputer is on the
>shores of the Mediterranean which, in the days of old Rome, was the middle of
>the world. This was the center of the Roman Empire, now to become the center
>of European e-Science on the shores of the nice Mediterranean Sea! Thus, we
>are talking about an ocean of many professors and a major hub around which
>such facilitation will grow and thrive to empower a new generation of
>scientists.
>
>    Another significant aspect of the name is that, being Latin, it is more
>culturally inclusive. Not everyone is aware that Spain has actually four
>official languages, and we did not want to slight anyone. Latin was a safe
>choice. Spain now understandably becomes the proud home to the most powerful
>supercomputer in Europe. We see references to its having been assembled in
>Madrid, but also references to its permanent home as being in Barcelona. 
>
>MareNostrum is a result of the burgeoning partnership between IBM and the
>Spanish Government, which has also led to the creation of the Barcelona
>Supercomputing Center (BSC). BSC is a public consortium created by the
>Spanish Government, the Catalonian Government, and the Technical University
>of Catalonia (UPC), which will host the MareNostrum supercomputer.
>
>Housed in a majestic 1920s chapel on the university grounds, MareNostrum
>serves a dual purpose: To serve as a primary high-performance computing
>resource for the European e-science community and to demonstrate the many
>benefits of Linux on POWER in scale.
>
>Meet MareNostrum
>With peak system performance of 40 teraflops for the final system
>configuration, and a number four spot on the TOP500 list, MareNostrum
>continues the IBM tradition of high-performance computing breakthroughs in
>the service of scientific advancement with a twist: MareNostrum is built
>entirely of commercially available components, including:
>
>    * 2,282 IBM eServer BladeCenter JS20 blade servers housed in 163
>      BladeCenter chassis
>    * 4,564 64-bit IBM PowerPC 970FX processors
>    * 140 TB of IBM TotalStorage DS4100 storage servers
>
>The thinking behind MareNostrum's construction represents a new way of
>looking at these and other compute-intensive areas. Today's typical
>high-performance computing installation runs a large, parallel RISC-based
>UNIX system with performance instead of reliability being of utmost
>importance. MareNostrum, however, is a small-footprint Linux cluster made up
>entirely of off-the-shelf components. With the extreme density of IBM eServer
>BladeCenter JS20 servers, diskless nodes, and an open system environment,
>MareNostrum offers superior price/performance; greater reliability,
>availability, and serviceability; and significant cost efficiencies --
>factors that are endearing Linux-based cluster servers to more and more
>businesses all the time.
>
>Distinguishing technologies
>The next sections explain the hardware and software technologies that
>distinguish the high-performance computing strategy behind MareNostrum.
>
>Hardware: Servers
>There are 2,282 IBM eServer BladeCenter JS20 servers housed in 163
>BladeCenter chassis. Each server blade has two PowerPC 970 processors
>running at 2.20GHz, providing superior performance for several varieties of
>Linux. The BladeCenter technology offers the highest commercially available
>computer density in the industry, which results in high performance with a
>small footprint. The BladeCenter technology allows for 84 dual processor
>servers in a single 42 U rack, giving more than 1.4 teraflops of compute
>power in a single rack.
>
>Hot-swappable JS20 servers also allow administrators to change servers
>without disrupting applications, maximizing availability. Its
>shared-resources architecture helps to minimize power consumption and heat
>output, as well.
>
>Hardware: Storage
>MareNostrum's storage subsystem consists of 20 storage server nodes with 7
>terabytes of capacity each or 140 terabytes of total capacity. Its backbone
>is the IBM TotalStorage DS4100 storage server which, like the BladeCenter
>JS20, uses redundant hot-swappable components for high availability. IBM
>TotalStorage DS4100 technology enables tremendous scalability and a wide
>range of RAID data protection options.
>
>Hardware: Switching
>Four switch frames with Myrinet, including 10 CLOS 256+256 switches and 2
>Spine 1280s and densely bundled Myrinet cabling enables faster parallel
>processing with less switching hardware. The redundant hot-swappable power
>supply ensures greater availability. The complete switch with 12 chassis
>provides for 2,560 uniform ports. This uniformity simplifies the programming
>model so researchers can focus on their programs and not the system
>interconnect architecture.
>
>Software: The power of Linux on POWER
>The Linux 2.6 kernel offers an array of enterprise and performance features
>that exploit the Power Architecture. The virtualization capabilities of
>Linux on POWER allow for more flexible partitioning, better balancing of
>workloads, and superior scalability should workloads increase. Dr. Porta
>explained, "It is the Linux 2.6 kernel which offers an array of enterprise
>and performance features that exploit the Power Architecture."
>
>Software: Diskless Image Management (DIM)
>DIM is a prototype utility for managing the Linux distribution for the
>compute nodes on the storage servers so that the compute node does not have
>to manage the root file system. All the files for operation are obtained
>through the cluster network. Because of this, blades can operate immediately
>without Linux installation. This is on-demand operation. The blades do have a
>disk drive but that is reserved for future application use such as
>checkpointing. DIM also supports the network boot environment in a highly
>distributed fashion.
>
>Software: IBM Linux on POWER clustering technologies
>The goal is to endow MareNostrum with the same benefits businesses in many
>industries derive from IBM Linux clusters, albeit on a larger scale. Benefits
>such as:
>
>    * Superior density and improved operating efficiency, including smaller
>      space, power, and cooling requirements and related costs -- thanks to
>      the BladeCenter JS20 architecture
>    * Record price/performance and system throughput for high-performance
>      computing workloads thanks to innovative POWER semiconductor
>      technology, specifically the eight-way superscalar design of the
>      PowerPC 970FX processor which fully supports symmetric multi-processing
>      (SMP)
>    * The leading IBM 64-bit POWER microprocessors are capable of addressing
>      four billion times the amount of physical memory as traditional 32-bit
>      processors without resorting to complex memory-extension techniques.
>    * Better systems management control thanks to embedded service processors
>      and software image management
>    * Increased reliability, availability, and serviceability, as well as
>      lower installation and maintenance costs -- provided by diskless
>      compute nodes
>    * Improved functionality and performance thanks to the Linux 2.6 kernel
>    * Reduced switching hardware requirements and faster parallel processing
>      provided by Myrinet switch cabling
>    * Improved storage subsystem costs and reliability thanks to TotalStorage
>      DS4100 storage technology
>
>View from the crow's nest
>When the power of MareNostrum is unleashed later this year, it will be at the
>service of scientific, engineering, and medical researchers in the Spanish
>and international scientific communities. Its to-do list includes issues that
>are familiar in the supercomputing world, such as protein folding, in silico
>(computer generated) drug screening and enzymatic reactions. MareNostrum will
>be used to support basic and applied research in areas that include biology,
>chemistry, physics, and information-based medicine.
>
>As Dr. Porta summed up:
>
>    ...[T]he very thinking that drove MareNostrum's construction is a new way
>of looking at compute-intensive areas, particularly in the life sciences, as
>we prepare new work to resolve challenging problems in information based
>medicine -- including improvements in diagnostic and therapeutic treatments
>in hospitals. In the EU context, many of the projects will be conducted in
>collaboration with other leading European research institutions. We are
>building collaborative efforts across geographic borders and disciplines. And
>remember -- the name of the supercomputer is MareNostrum. Traditionally, it
>was the Mediterranean Sea which allowed commerce and communication to
>flourish in Europe and beyond. 
>
>Resources
>
>    * Visit the Project MareNostrum site, demonstrating the value of Linux
>      clustering for science, for business, for life itself.
>
>    * MareNostrum is now at home at the Barcelona Supercomputing Center (BSC)
>      on the Polytechnic University of Catalonia (UPC) campus in Barcelona, a
>      prestigious public institution focused on higher education, research,
>      and technology transfer.
>
>    * The TOP500 Supercomputer Sites project was started in 1993 to provide a
>      reliable basis for tracking and detecting trends in high-performance
>      computing -- twice a year, the project releases a list of the 500 sites
>      operating the most powerful computer systems.
>
>    * See this chart for the Linpack benchmark for MareNostrum and others.
>
>    * This news article examines MareNostrum, IBM's top-ranked,
>      off-the-shelf, blade-based supercomputer.
>
>    * Connecting two or more IBM eServer Cluster Servers can create a single,
>      unified computing resource that will dramatically improve availability,
>      flexibility, and adaptability for essential services.
>
>    * The IBM BladeCenter JS20 is well-suited for commercial mainstream
>      applications and 64-bit high performance computing (HPC) environments.
>
>    * The IBM Redbook, The IBM eServer BladeCenter JS20, takes an in-depth
>      look at the two-way Blade eServer for applications requiring 64-bit
>      computing.
>
>    * The Linux on IBM eServer product line is Linux-enabled to deliver
>      maximum performance, reliability, manageability, and price/performance
>      benefits.
>
>    * See this site for more on how IBM supercomputing solutions can help
>      remove the barriers to deployment of clustered server systems.
>
>    * IBM TotalStorage DS400 series has been enhanced with the DS4000 Storage
>      Manager V9.10, enhanced remote mirror option, DS4100 option for larger
>      capacity configurations, and support for EXP100 serial ATA expansion
>      units.
>
>    * Take a look at the Myrinet switches used in MareNostrum.
>
>About the author
>The developerWorks Power Architecture editors welcome your comments on this
>article. E-mail them at dwpower at us.ibm.com.
>
>
>-- 
>Eugen* Leitl <a href="http://leitl.org">leitl</a>
>______________________________________________________________
>ICBM: 48.07078, 11.61144            http://www.leitl.org
>8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
>http://moleculardevices.org         http://nanomachines.net
>
>
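
As a sanity check on the figures quoted in the article, the per-rack and
whole-system peak numbers can be reproduced with simple arithmetic. The one
assumption not stated in the article is the figure of 4 floating-point
operations per cycle per PowerPC 970 (two FPUs, each capable of a fused
multiply-add); the function name is hypothetical.

```python
FLOPS_PER_CYCLE = 4  # assumption: two FMA-capable FPUs per PowerPC 970


def peak_gflops(processors, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPS for a number of identical processors."""
    return processors * clock_ghz * flops_per_cycle


rack = peak_gflops(84 * 2, 2.2)    # 84 dual-processor blades per 42U rack
system = peak_gflops(4564, 2.2)    # all 4,564 processors
```

Under that assumption the rack comes out at about 1478 GFLOPS and the full
system at about 40163 GFLOPS, consistent with the article's "more than 1.4
teraflops" per rack and "40 teraflops" peak.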


From diep at xs4all.nl  Wed Feb 16 05:12:23 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 16 Feb 2005 14:12:23 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216141222.00924be0@pop.xs4all.nl>

At 06:28 16-2-2005 -0500, Patrick Geoffray wrote:
>Rossen,
>
>Rossen Dimitrov wrote:
>> 
>>>
>>> So if you run an MPI application and it sucks, this is because the 
>>> application is poorly written ?
>> 
>> 
>> Patrick, here the argument is about whether and how you "measure" the 
>> "performance of MPI". I guess you may have missed some of the preceding 
>> postings.
>
>No, I was pulling your leg :-) The bigger picture is that MPI has no 
>performance in itself, it's a middleware. You can only measure the way 
>an MPI implementation enables a specific application to perform. Only 
>benchmarking of applications is meaningful, you can argue that 
>everything else is futile and bogus.

A problem of MPI compared with DSM-type forms of parallelism has been
described very well by Chrilly Donninger with respect to his chess program
Hydra, which runs MPI on a few nodes:

For every write :

MPI_Isend(....);
MPI_Test(&Reg,&flg,&Stat);
while(!flg) {
    // Important: read in messages and process them while waiting on
    // completion. Otherwise our own input buffer can overflow
    // and we get a deadlock.
    Hydra_MsgPending();
    MPI_Test(&Reg,&flg,&Stat);
}

The above is simply dead slow and delays the software.

In a DSM model like Quadrics you don't have all these delays.

Can the memory on a Myri card (4MB and 8MB in the $1500 version) be used to
write directly to the RAM on a remote network card?

If so, which library can I download for that for Myri cards?

Thanks in advance,
Vincent


>>> You don't want to benchmark an application to evaluate MPI, you want 
>>> to benchmark an application to find the best set of resources to get 
>>> the job done. If the code stinks, it's not an excuse. Good MPI 
>>> implementations are good with poorly written applications, but still 
>>> let smart people do smart things if they want.
>> 
>> 
>> This is exactly my point made in my previous posting - you cannot design 
>> a system that is optimal in a single mode for all cases of its use when 
>> there are multiple parameters defining the usage and performance 
>
>I agree completely, being able to apply different assumptions for the 
>whole code and see which one best matches the application's behavior is 
>better than nothing. However, I believe that some tradeoffs are just too 
>intrusive: you should not have to choose between low latency for small 
>messages or progress by interrupt for large ones, especially when you 
>can have both at the same time.
>
>> I think it is fairly easy to show that overlapping and polling (or any 
>> kind of communication completion synchronization) are not orthogonal. If 
>> this was the case, you would see codes that show perfect overlapping 
>> running on any MPI implementation/network pair. I am sure there is 
>> plenty of evidence this is not the case.
>
>I can show you codes where people sprinkled some MPI_Test()s in some 
>loops. They don't poll to death, just a little from time to time to 
>improve overlap by improving progression. They poll and they overlap. 
>They could as well block and not overlap. polling/blocking and 
>overlap/not are not linked. Interrupts are useful to get overlap without 
>help from the application, but it's not required to overlap.
>
>> There is an important point here that needs to be clarified: when I say 
>> "polling" library, I assume that this library does both: polling 
>> completion synchronization and polling progress. There is not much room 
>> to define here these but I am sure MPI developers know what they are.
>
>I think this is where we don't understand each other. For me, polling 
>means no interrupts. Whether you progress in the context of MPI calls 
>or in the context of a progression thread, you pay for the same CPU 
>cycles. If the application is providing CPU cycles to the MPI lib at the 
>right time, you can overlap perfectly without wasting cycles.
>
>> Here is a third one. Writing your code for overlapping with non-blocking 
>> MPI calls and segmentation/pipelining, testing the code, and not seeing 
>> any benefit of it.
>
>Yes. This is very true. But if it's not worse than with blocking, they 
>should stick with non-blocking, even if it's bigger and more confusing.
>
>> stage I with communication in stage I+1. Then, there is the question how 
>> many segments you use to break up the message for maximum speedup. The 
>> pipelining theory says the more you can get the better, when they are 
>> with equal duration, there aren't inter-stage dependencies, and the 
>> stage setup time is low in proportion to the stage execution time. Also, 
>
>The more steps, the more overhead. Small pipeline stages decrease your 
>startup overhead (when the second stage is empty) but increase the 
>number of segments and the total cost of the pipeline. The best is to 
>find a piece of computation long enough to hide the communication. 
>Pipelining would be overkill in my opinion.
>
>> The metric I mentioned earlier "degree of overlapping" with some 
>> additional analysis can help designers _predict_ whether the design is 
>> good or not and whether it will work well or not on a particular system 
>> of interest (including the MPI library).
>
>Temporal dependency between buffers and computation is the metric for 
>overlapping. The longer you don't need a buffer, the better you can 
>overlap a communication to/from it. Compilers could know that.
>
>> This is however too much detail for this forum though, as most of the 
>> postings here discuss much more practical issues :)
>
>I am bored with cooling questions. However, it's quite time consuming to 
>argue by email. I don't know how RGB can keep the distance :-)
>
>Patrick
>-- 
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com


From rossen at VerariSoft.Com  Wed Feb 16 06:34:56 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Wed, 16 Feb 2005 09:34:56 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <42132E40.1060001@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com>	<420D1801.9090206@isaacdooley.com>	<Pine.LNX.4.58.0502111528560.1141@terra.mcs.anl.gov>	<420D54DA.8000904@uiuc.edu>	<Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>	<420DA793.4000909@verarisoft.com>
	<421194C4.5050808@myri.com>	<4212182C.60607@verarisoft.com>
	<42132E40.1060001@myri.com>
Message-ID: <42135A10.3060804@verarisoft.com>

>>>
>>> So if you run an MPI application and it sucks, this is because the 
>>> application is poorly written ?
>>
>> Patrick, here the argument is about whether and how you "measure" the 
>> "performance of MPI". I guess you may have missed some of the 
>> preceding postings.
> 
> No, I was pulling your leg :-) The bigger picture is that MPI has no 
> performance in itself, it's a middleware. You can only measure the way 
> an MPI implementation enables a specific application to perform. Only 
> benchmarking of applications is meaningful, you can argue that 
> everything else is futile and bogus.

Actually, we have been arguing this exact argument for quite some time, 
which might sound odd as we are a commercial MPI vendor :) The whole 
idea is that focusing too much on microbenchmarks and then extending the 
results from these to characterize a whole parallel system does not seem 
to be the right thing to do (or at least not the only thing to do) but on 
the other hand, I often see it being done.


From diep at xs4all.nl  Wed Feb 16 08:44:55 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 16 Feb 2005 17:44:55 +0100
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>

At 07:17 16-2-2005 -0500, Robert G. Brown wrote:
>On Wed, 16 Feb 2005, Patrick Geoffray wrote:
>
>> > This is however too much detail for this forum though, as most of the 
>> > postings here discuss much more practical issues :)
>> 
>> I am bored with cooling questions. However, it's quite time consuming to 
>> argue by email. I don't know how RGB can keep the distance :-)
>> 
>> Patrick
>> 
>
>I stuck a hairpin into an electrical socket at age 2 (an "enlightening"
>experience I must say) and had a large rock fall on my head from a
>height of almost a meter at age 8.
>
>Since then, I hardly ever get bored with cooling questions, because I
>cannot remember that they've been asked.  What were we talking about,
>again?
>
>Oh yeah, MPI and all that.
>
>I've actually been enjoying reading the discussion and not
>participating, since I'm a PVM kinda guy.  But SINCE my name was invoked
>in vain, I'll make a single comment on the code quality issue, which is
>that underlying the discussion of communication pattern, blocking vs
>non-blocking, and directives is the fundamental scaling properties of
>the code and algorithm itself.  So on the issue of whether MPI sucks
>because the application sucks -- well, possibly, but it seems more
>likely that the application sucks because its parallel scaling
>properties (with the algorithm chosen) suck.

It is possible for algorithms to have sequential properties that make them
hard to scale well. Game tree search happens to have a few such algorithms.
One of them performs better than the rest, and it shares with a number of
its enhancements the property that a faster blocking get latency speeds it
up exponentially.

For the basic idea of why there is an exponential speedup, see Knuth and
search for the 'alpha-beta' algorithm.

So the assumption that an algorithm sucks because it needs low latency
rather than bandwidth is like sticking a hairpin in an electrical socket.

If users JUST needed a little bit of bandwidth, they could already get
quite far with $40 cards. 

So optimizing MPI for low latency small messages IMHO is very relevant. 

We will get many improvements in hardware in the coming years: dual core,
Cell-type streaming processors. And obviously, once more software becomes
available that runs in parallel, many will try the jump to running on
clusters too.

So anything that makes it easier, from the programmer's viewpoint, to
parallelize software makes sense to me, such as implementing short messages
(or even certain algorithms) in a kind of single system image type of
software.

The step from shared memory programming to MPI is a rather huge step
currently.

Even if all you want is a byte which sometimes is at a remote machine and
usually at your local cache, but you NEED that byte for your software, just
to know whether it's a 1 or 0, then the last thing you want to be toying with is
writing special code. You don't care how the result gets there, just as
long as it gets there.

>As to how "intelligent" the back end library should be at choosing
>algorithm -- I would say the BASIC library should be atomic, elementary,
>NOT algorithm level stuff.  A thin skin on top of raw networking calls
>that provides the various things one always has to do oneself but not
>much more.  Where one gets into trouble is where one uses a command that
>has a complex structure that doesn't fit your code without realizing it,
>and the reason you don't realize it is because all that detail is
>hidden, and isn't even uniform in RELATIVE performance across varying
>network hardware.
>
>In other words, to make MPI do more, either make it do less (in the form
>of commands that can be used to build "more" in a manner that is tuned
>to application and hardware) or be prepared to REALLY make it SMART
>behind the scenes.
>
>This isn't just MPI, BTW.  PVM suffers from the same thing.  I honestly
>think that both are limited tools in part BECAUSE they put too thick a
>skin between the programmer and the network.  If you want real
>performance and complete control over communication algorithm, you
>probably have to use raw/low level networking commands, and write the
>appropriate "collective" operations for your particular application and
>hardware.
>
>Of course nobody does this -- not portable and a PITA to
>design/write/maintain.  Or perhaps a few people DO do this, but they're
>programming gods.  And this isn't crazy, really.
>
>    rgb
>
>-- 
>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>
>


From bclem at rice.edu  Wed Feb 16 08:48:34 2005
From: bclem at rice.edu (Brent M. Clements)
Date: Wed, 16 Feb 2005 10:48:34 -0600 (CST)
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>
References: <E1D1Rqj-0006jD-00@mendel.bio.caltech.edu>
Message-ID: <Pine.GSO.4.60.0502161048010.16886@is.rice.edu>

I would suggest asking this on the Educause CIO mailing list.

Anyone can join as long as you're from an educational entity.

-Brent

> In most universities services like electricity, water, and
> A/C are paid for by the school.  To do so they take "overhead"
> out of every grant.  Partially as a consequence of this they
> typically have a very poor ability to meter usage on a room
> by room basis.
>
> Now somewhere between the 10 node Pentium II beowulf sitting on
> a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
> room that takes up half the basement the cost of the electricity
> (both for power and A/C) goes from  a minor expense to a major
> one.  Really major. For instance, in that hypothetical large machine,
> at 10 cents per kilowatt hour (a round number), assuming 100 watts
> per CPU (another round number) that's:
>
>  1000  (nodes) *
>     2  (cpus/node) *
>     .1 (kilowatts/cpu) *
>     .1 (dollars/kilowatt-hour) *
>  365   (days /year) *
>   24   (hours/day) =
> -----------------------
>  175200 dollars/year
>
> The A/C expense is going to vary tremendously depending upon
> the outside temperature.  It's going to be much higher for us
> in Southern California than for a site in Anchorage.
>
> "Typical" lab usage is widely variable but I'd be amazed
> if most biology or chemistry labs burn through even 1/10th this
> much for the equivalent lab area.  Some physics lab running
> a tokamak might come close.
>
>
> Anyway, the question is, have any of the universities said "enough
> is enough" and started charging these electricity costs directly?
> If so, what did they use for a cutover level, where usage was
> "above and beyond" overhead?
>
> From an economic perspective having electricity and A/C come out
> of overhead (without limit) grossly distorts the true cost
> of the project over time and can lead to choices which increase
> the total overall cost. For instance, the use of Xeons instead of
> Opterons has little effect on TCO if somebody else is picking
> up the electricity tab, but could change the power consumption
> significantly on a large project.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


From diep at xs4all.nl  Wed Feb 16 10:08:05 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed, 16 Feb 2005 19:08:05 +0100
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl>

At 08:16 16-2-2005 -0800, David Mathog wrote:
>In most universities services like electricity, water, and 
>A/C are paid for by the school.  To do so they take "overhead"
>out of every grant.  Partially as a consequence of this they
>typically have a very poor ability to meter usage on a room
>by room basis.
>
>Now somewhere between the 10 node Pentium II beowulf sitting on
>a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
>room that takes up half the basement the cost of the electricity
>(both for power and A/C) goes from  a minor expense to a major
>one.  Really major. For instance, in that hypothetical large machine,
>at 10 cents per kilowatt hour (a round number), assuming 100 watts
>per CPU (another round number) that's:
>
>  1000  (nodes) *
>     2  (cpus/node) *
>     .1 (kilowatts/cpu) *
>     .1 (dollars/kilowatt-hour) *
>  365   (days /year) *
>   24   (hours/day) =
>-----------------------
>  175200 dollars/year

Complete academic nonsense calculation. If you use a lot of electricity,
it gets up to a factor of 20-40 cheaper. Getting a factor of 10
reduction in the usage bill is pretty easy if you negotiate properly.

However you must avoid starting machines at peaktimes. Big fines get given
for that. So it's cheaper to let them run 24 hours a day than to start them
in the morning after say 7 AM (depending upon local habits).

Please note that nothing beats the price of nuclear power 

(as a member of the high voltage power forum I do not have an opinion on
that). 

Electricity production costs of nuclear power are hundreds of times cheaper
than producing it with oil; oil produces it for roughly 5 dollar cents a
kilowatt-hour (if memory serves me well). Coal has a CO2 problem for nations
which are in the Kyoto agreement (the USA isn't), but is also nearly as
cheap as nuclear power. 

So the actual price at which they deliver huge amounts of power to big
institutes is very easy to negotiate down by large factors.

Vincent Diepeveen
ex-member of high voltage powerline forum.

>The A/C expense is going to vary tremendously depending upon
>the outside temperature.  It's going to be much higher for us
>in Southern California than for a site in Anchorage.
>
>"Typical" lab usage is widely variable but I'd be amazed
>if most biology or chemistry labs burn through even 1/10th this
>much for the equivalent lab area.  Some physics lab running
>a tokamak might come close.
>
>
>Anyway, the question is, have any of the universities said "enough
>is enough" and started charging these electricity costs directly?
>If so, what did they use for a cutover level, where usage was
>"above and beyond" overhead?
>
>From an economic perspective having electricity and A/C come out
>of overhead (without limit) grossly distorts the true cost
>of the project over time and can lead to choices which increase
>the total overall cost. For instance, the use of Xeons instead of
>Opterons has little effect on TCO if somebody else is picking
>up the electricity tab, but could change the power consumption
>significantly on a large project.
>
>Regards,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech


From eno at dorsai.org  Wed Feb 16 17:10:10 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Wed, 16 Feb 2005 20:10:10 -0500
Subject: [Beowulf] powering up 18 motherboards
Message-ID: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>

Hello all. I have a question about powering on motherboards simultaneously.

 
I have 18 identical mobos right now with identical ram, cpu, and hard disk.
I hooked one up to a kill-a-watt and found that it draws 140-150 watts when
powering on, and stays level at about 90-100 watts afterwards. The problem
is that I am setting this up at home, where I only have 10 amp circuits (and
only a couple of them can be freed up). Correct me if I am wrong here
please.

 
1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on

1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on

 
I won't include the rest of the math, but needless to say, it'd be a pain in
the arse to turn on the room piecemeal without tripping a circuit breaker. My
question is:

 
Will a heavy duty UPS aid in getting me through powering up the room? I
don't mind splitting up the 18 machines with 6 outlet surge strips. Any
advice?
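
For reference, the rest of the arithmetic can be sketched as below. This is
only a rough sketch, assuming the upper kill-a-watt readings (150 W at
power-on, 100 W steady) and a plain 10 A limit at 115 V per circuit. It
ignores any derating for continuous loads, brief inrush spikes the meter may
not catch, and anything else plugged into the same circuit. The function
names are made up for illustration.

```python
import math

VOLTS = 115.0          # nominal line voltage used in the figures above
BREAKER_AMPS = 10.0    # rating of each household circuit (from the post)


def amps(watts, volts=VOLTS):
    """Current drawn by a load of `watts` at `volts`."""
    return watts / volts


def boards_per_circuit(watts_per_board, breaker_amps=BREAKER_AMPS, volts=VOLTS):
    """Largest whole number of boards whose combined draw fits one breaker."""
    return int((breaker_amps * volts) // watts_per_board)


def circuits_needed(n_boards, watts_per_board):
    """Circuits required if all boards draw watts_per_board at the same time."""
    return math.ceil(n_boards / boards_per_circuit(watts_per_board))


if __name__ == "__main__":
    for label, watts in (("power-on", 150.0), ("steady", 100.0)):
        print(f"{label}: {amps(watts):.2f} A per board, "
              f"{boards_per_circuit(watts)} boards per circuit, "
              f"{circuits_needed(18, watts)} circuits for 18 boards")
```

Under these assumptions, two circuits can carry all 18 boards once they are
running, but only if the boards are switched on a few at a time; powering
everything on at once would need three.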

 
Thanks.

Alpay Kasal

 

From billk01 at metrumrg.com  Wed Feb 16 04:45:48 2005
From: billk01 at metrumrg.com (billk01)
Date: Wed, 16 Feb 2005 07:45:48 -0500
Subject: [Beowulf] running non-mpi programs on beowulf cluster
Message-ID: <4213407C.7000605@metrumrg.com>

I have a program that is called from a perl batch script.  The program 
is non-MPI aware so I have been using mpprun to execute the perl 
program.  The perl program can start from 1 to x processes depending upon 
the arguments to the batch file.  I currently call the batch file as:

mpprun -no-local perl batch.p 1 2 3 &

The arguments 1 2 3 cause the perl program to start processes 1, 2 and 3
in three different directories.  (The different directories are necessary
because of the nature of the program being run.)  The result is three
processes all running on one node.  (Each node has two processors and
there are 3 nodes for now, for a total of 6 processors.)  I have tried
supplying the -np x option, but this simply starts the same three
processes over on another node once the initial three processes are
complete.  The same thing occurs if I use the -map x:x:x option.  I have
also tried batching the command via the "batch now" interactive command
line interface and the result is the same.

Is there any way to tell the cluster to load balance these processes
across the nodes?  Or do I need to start each process with a separate
mpprun command?  Also, it appears that the NO_LOCAL=1 option does not
work with the "Batch" command.  Does that seem correct?

The cluster consists of a dual processor (2 Xeons) master node with
three compute nodes each with 2 Xeon processors.  Eventually we will
have a number of additional nodes up but I am testing for now.

Any help would be greatly appreciated.

Regards,

Bill


From billk at metrumrg.com  Thu Feb 17 05:45:11 2005
From: billk at metrumrg.com (Bill Knebel)
Date: Thu, 17 Feb 2005 08:45:11 -0500
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
Message-ID: <42149FE7.5060000@metrumrg.com>

I am in the process of installing SGE on a Scyld beowulf cluster.  As 
most people are aware, the Scyld cluster runs a complete OS (linux) only 
on the master node and the compute nodes are simply for executing.  
During the SGE install, it requires adding the compute nodes as execute 
hosts.  I do not understand how to do this given the current setup of a 
scyld cluster since you can't "login" to the nodes to execute the 
install script.  The script does exist on an NFS shared directory 
(cluster wide).  Has anybody else run into this problem? 

Regards,

Bill

-- 
Bill Knebel, PharmD, Ph.D.
Principal Scientist
Metrum Research Group
15 Ensign Drive
Avon, CT 06001
email: billk at metrumrg.com


From billk01 at metrumrg.com  Thu Feb 17 05:49:25 2005
From: billk01 at metrumrg.com (billk01)
Date: Thu, 17 Feb 2005 08:49:25 -0500
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
Message-ID: <4214A0E5.3010804@metrumrg.com>

I am in the process of installing SGE on a Scyld beowulf cluster.  As
most people are aware, the Scyld cluster runs a complete OS (linux) only
on the master node and the compute nodes are simply for executing.
During the SGE install, it requires adding the compute nodes as execute
hosts.  I do not understand how to do this given the current setup of a
scyld cluster since you can't "login" to the nodes to execute the
install script.  The script does exist on an NFS shared directory
(cluster wide).  Has anybody else run into this problem?

Regards,

Bill

-------------------------
Bill Knebel, PharmD, Ph.D.
Principal Scientist
Metrum Research Group
15 Ensign Drive
Avon, CT 06001
email: billk at metrumrg.com


From dag at sonsorol.org  Thu Feb 17 14:12:07 2005
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu, 17 Feb 2005 17:12:07 -0500
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
In-Reply-To: <4214A0E5.3010804@metrumrg.com>
References: <4214A0E5.3010804@metrumrg.com>
Message-ID: <421516B7.3060906@sonsorol.org>


I know Grid Engine well but not Scyld so forgive my ignorance if I say 
something stupid and given the level of expertise on this list I'm quite 
certain I'm about to make a fool of myself :)

If Scyld is presenting you with a single system image (ie a single linux 
server that can farm out tasks to all those nodes) then you would 
install SGE in the same way that you would install it on a big SMP box:

1. Install the SGE qmaster and scheduler on the master node
2. Install the execution host on the master node as well

You will only have 1 execd per queue but each queue can be configured 
with N number of "job slots" which actually control how many jobs can 
run at the same time on the same machine.

Try setting your # of job slots within your single SGE queue to the 
number of nodes in your cluster. This is similar to what you would do on 
a big SMP machine -- small number of queues each supporting a decent 
jobslot count.

Then submit a bunch of jobs and see if SGE causes the master node to 
fall over under load. If not then Scyld is doing its thing behind the 
scenes to migrate stuff around to the other nodes.

-Chris


billk01 wrote:

> I am in the process of installing SGE on a Scyld beowulf cluster.  As
> most people are aware, the Scyld cluster runs a complete OS (linux) only
> on the master node and the compute nodes are simply for executing.
> During the SGE install, it requires adding the compute nodes as execute
> hosts.  I do not understand how to do this given the current setup of a
> scyld cluster since you can't "login" to the nodes to execute the
> install script.  The script does exist on an NFS shared directory
> (cluster wide).  Has anybody else ran into this problem?
> 


From dtj at uberh4x0r.org  Thu Feb 17 14:18:40 2005
From: dtj at uberh4x0r.org (Dean Johnson)
Date: Thu, 17 Feb 2005 16:18:40 -0600
Subject: [Beowulf] Re: Academic sites: who pays for the electricity?
In-Reply-To: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost>
References: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost>
Message-ID: <1108678721.3683.16.camel@terra>

I am going out on a limb here, but I suspect that a majority of the
people on this list lean far more toward the casual administrator types
and not professional cluster admins. Often, I suspect, the issues of
power and cooling only arise in times of scarcity and not out of some
anal-retentive need to quantify resources. They are very likely people
doing science (or other worthy endeavour) and simply use clusters as
tools. Should one quantify the ROI on each pencil and pen in one's
office? I realize that the first pen/pencil in your office is an
important writing device, but why do you have a second and third, given
that the useful life of a pen/pencil, under normal load, is quite
long? ;-)

I'll be right back as soon as I justify the dozen or so pens I have
under the seat of my car. Hold it! I don't use the company's crappy pens
and provide my own. I guess beans don't interest me.

	-Dean

On Thu, 2005-02-17 at 12:10 +1100, steve_heaton at ozemail.com.au wrote:
> G'day all
> 
> Speaking as someone from "industry", and a Project/Programme Manager at that, I'd just like to add that I'm shocked and dismayed at the apparent lack of accountability that seems rampant in academic circles! If it was down to me I'd sack the lot of ya!! ;)
> 
> I'd strongly recommend that all good cluster folk have a good idea about operation expenditure (opex). If you get a visit from the Meanie Beanies (auditors / cost accountants etc etc) then it'll help cover your A. It's a great way to have your gig cancelled because you didn't have a firm understanding of your $'s in and out. Happens all the time in Industry and in my job it's a sackable offence. No joke. Do some homework and you won't need to be afraid (OK, *as* afraid of the Purple Pen People).
> 
> Some things to know. Ideally you should be able to quote these with as little as an hour's warning (shows you're on top of things):
> 
> -) The amount of floor space you consume (sq ft or m) - don't worry about the cost of this one, those asking will know ;) Becomes a hot topic if you're paying rent in some form.
> -) Find out how much electricity you use per hour - chances are you're on one or more dedicated circuit(s) and probably separate metering - look at the bills. Don't worry about general lighting etc. It's often rolled into the floor space calcs.
> -) Ditto aircon (include your maintenance)
> -) Cluster hardware maintenance (out of warranty stuff, cost of spares) - quoting your amazing uptime can help explain this figure
> -) Service contracts (you've got a Service Level Agreement right? Uptime % etc helps explain)
> -) Staff / admin costs
> -) The good ol' "anything else you can think of"
> 
> Now the fun part. Who used your cluster and for how long? Look at your job scheduling etc. Your department? Another department (do you cross charge somehow)? Which projects? What's their contribution to cluster opex?
> 
> If you answer reasonably accurately then the Beanies will treat you with some respect :) >>Someone, somewhere is paying your bills already.<< Know where that money is going!
> 
> Don't say I didn't warn you ;)
> 
> Cheers
> Stevo
> 


From dtj at uberh4x0r.org  Thu Feb 17 14:22:12 2005
From: dtj at uberh4x0r.org (Dean Johnson)
Date: Thu, 17 Feb 2005 16:22:12 -0600
Subject: [SPAM] [Beowulf] powering up 18 motherboards
In-Reply-To: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>
References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>
Message-ID: <1108678933.3683.19.camel@terra>

Howdy,
Correct me if I am wrong. Presumably your CPUs have fan kits, and you
may have case fans (or something). Do the motors on fans not draw more
initial amperage at startup? Thus you would have an initial spike.

	-Dean

On Wed, 2005-02-16 at 20:10 -0500, Alpay Kasal wrote:
> Hello all. I have a question about powering on motherboards
> simultaneously.
> 
> I have 18 identical mobos right now with identical ram, cpu, and hard
> disk. I hooked one up to a kill-a-watt and found that it draws 140-150
> watts when powering on, and stays level at about 90-100 watts
> afterwards. The problem is that I am setting this up at home, where I
> only have 10 amp circuits (and only a couple of them can be freed up).
> Correct me if I am wrong here please.
> 
> 1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on
> 
> 1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on
> 
> I won't include the rest of the math, but needless to say, it'd be a
> pain in the arse to turn on the room piecemeal without tripping a
> circuit breaker. My question is:
> 
> Will a heavy duty UPS aid in getting me through powering up the room?
> I don't mind splitting up the 18 machines with 6 outlet surge strips.
> Any advice?
> 
>  
> 
> Thanks.
> 
> Alpay Kasal
> 
>  
> 
> 


From mathog at mendel.bio.caltech.edu  Thu Feb 17 14:54:15 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Thu, 17 Feb 2005 14:54:15 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID: <E1D1uXH-0002X1-00@mendel.bio.caltech.edu>

At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote:

> At 08:16 16-2-2005 -0800, David Mathog wrote:
> >In most universities services like electricity, water, and 
> >A/C are paid for by the school.  To do so they take "overhead"
> >out of every grant.  Partially as a consequence of this they
> >typically have a very poor ability to meter usage on a room
> >by room basis.
> >
> >Now somewhere between the 10 node Pentium II beowulf sitting on
> >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
> >room that takes up half the basement the cost of the electricity
> >(both for power and A/C) goes from  a minor expense to a major
> >one.  Really major. For instance, in that hypothetical large machine,
> >at 10 cents per kilowatt hour (a round number), assuming 100 watts
> >per CPU (another round number) that's:
> >
> >  1000  (nodes) *
> >     2  (cpus/node) *
> >     .1 (kilowatts/cpu) *
> >     .1 (dollars/kilowatt-hour) *
> >  365   (days /year) *
> >   24   (hours/day) =
> >-----------------------
> >  175200 dollars/year
> 
> Complete academic nonsense calculation. If you use quite some electricity
> the electricity gets up to factor 20-40 cheaper. Getting a factor 10
> reduction in usage bill is pretty easy if you negotiate properly.

Well, it isn't complete nonsense, unless you care to dispute the
number of days in a year, hours in a day, or cpus in a dual node
computer!
  
The only term you're complaining about is the price of
electricity.  I'm not privy to the electrical rates that our
school pays, they may well be an order of magnitude lower.  My
home rates certainly aren't, but then, I don't buy as much
power as the campus.  It's also not at all clear that the
campus would sell power to the end users at the same rate
which it pays the utility.
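
Since the price is the only disputed term, the arithmetic is easy to rerun with whatever rate one believes.  A minimal Python sketch (the function name and its parameters are made up for illustration; the inputs are the round numbers from the original estimate):

```python
def annual_energy_cost(nodes, cpus_per_node, kw_per_cpu, dollars_per_kwh,
                       days_per_year=365, hours_per_day=24):
    """Yearly electricity cost in dollars for a cluster that runs flat out."""
    total_kw = nodes * cpus_per_node * kw_per_cpu
    hours = days_per_year * hours_per_day
    return total_kw * hours * dollars_per_kwh


# 1000 dual-CPU nodes at 100 W per CPU, as in the estimate being argued about.
print(annual_energy_cost(1000, 2, 0.1, 0.10))   # at 10 cents per kWh
print(annual_energy_cost(1000, 2, 0.1, 0.01))   # if the rate really were 10x lower
```

This covers the computers only; A/C would have to be added on top.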

I don't really understand your point about keeping the units
running versus restarting them.  Sure, it would be really bad
to try to boot all 1000 nodes simultaneously, in all likelihood
it wouldn't work.  That's why they are typically started at N
second intervals, where N depends on your hardware.
Surely there is some N large enough so that the peak current
draw during the restart never exceeds the random fluctuations
observed when all units are running normally.  Or is your
point that the electricity company doesn't want the facility
to draw _less_ current than it uses normally at
steady state?
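
To put rough numbers on the "large enough N" argument, here is a small Python sketch.  It assumes a deliberately crude model that is not from any measurement of a real cluster: every node draws a fixed surge wattage for a fixed number of seconds after power-on and a fixed steady wattage afterwards.  The function name and all of the example figures are illustrative assumptions.

```python
import math


def peak_draw_watts(n_nodes, steady_w, surge_w, surge_s, interval_s):
    """Worst-case total draw when nodes are started every interval_s seconds.

    Model (an assumption): a node draws surge_w for surge_s seconds after
    power-on and steady_w from then on.
    """
    if interval_s <= 0:
        surging = n_nodes                    # everything started at once
    else:
        # Nodes still inside their surge window when the last node starts.
        surging = min(n_nodes, max(1, math.ceil(surge_s / interval_s)))
    return (n_nodes - surging) * steady_w + surging * surge_w


# Illustrative figures only: 1000 nodes, 100 W steady, 150 W for 10 s at start.
print(peak_draw_watts(1000, 100, 150, 10, 0))    # all at once
print(peak_draw_watts(1000, 100, 150, 10, 10))   # one node every 10 s
```

In this model, once the interval is at least as long as the surge, the peak is only one node's surge above the all-running steady state.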

On a somewhat related note, it would be nice if rack nodes
had some graceful way to conserve electricity.  For instance,
something along the lines of: if the CPU utilization goes
below 5% for 10 seconds, ratchet the clock down by a factor of 10.
When CPU usage goes above 90% for 2 seconds, ratchet it back
up again.  Notebooks can do this sort of thing, but it seems not
to be a "feature" of most full size motherboards.  This should
also lower the average temperature in the case, at the expense
of increased thermal cycling.  Hard to say off hand if that's
a plus or a minus as far as hardware longevity goes.  Certainly
it would be a plus in terms of energy conservation.
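
The rule is simple enough to write down as a toy model.  The Python sketch below only tracks what the clock would be; actually changing the clock is hardware specific and is not shown.  The class name, the method name and the 3000 MHz figure are assumptions for illustration, while the thresholds are the ones suggested above.

```python
class ClockGovernor:
    """Toy model of the ratchet rule: slow down when idle, speed up when busy."""

    def __init__(self, full_mhz, slow_factor=10,
                 low_pct=5.0, low_hold_s=10.0,
                 high_pct=90.0, high_hold_s=2.0):
        self.full_mhz = full_mhz
        self.slow_mhz = full_mhz / slow_factor
        self.low_pct, self.low_hold_s = low_pct, low_hold_s
        self.high_pct, self.high_hold_s = high_pct, high_hold_s
        self.slow = False     # currently ratcheted down?
        self.timer_s = 0.0    # how long the switching condition has held

    def update(self, utilization_pct, dt_s=1.0):
        """Feed one utilization sample covering dt_s seconds; return clock in MHz."""
        if self.slow:
            condition = utilization_pct > self.high_pct
            hold_s = self.high_hold_s
        else:
            condition = utilization_pct < self.low_pct
            hold_s = self.low_hold_s
        # The condition must hold continuously; any other sample resets the timer.
        self.timer_s = self.timer_s + dt_s if condition else 0.0
        if self.timer_s >= hold_s:
            self.slow = not self.slow
            self.timer_s = 0.0
        return self.slow_mhz if self.slow else self.full_mhz


governor = ClockGovernor(full_mhz=3000)
samples = [1] * 10 + [95] * 2            # ten idle seconds, then two busy ones
print([governor.update(pct) for pct in samples])
```

The two different hold times give some hysteresis, so a brief burst of activity does not flip the clock back and forth.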

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From James.P.Lux at jpl.nasa.gov  Thu Feb 17 16:17:57 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 17 Feb 2005 16:17:57 -0800
Subject: [SPAM] [Beowulf] powering up 18 motherboards
In-Reply-To: <1108678933.3683.19.camel@terra>
References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>
	<1108678933.3683.19.camel@terra>
Message-ID: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov>

At 02:22 PM 2/17/2005, Dean Johnson wrote:
>Howdy,
>Correct me if I am wrong. Presumably your cpu's have fan kits and if you
>may have case fans (or something). Do the motors on fan's not draw more
>initial amperage at startup? Thus you would have an initial spike.


The fans probably don't draw a significant amount of startup current over 
their running current.

Disk drives, on the other hand, do draw a lot more current when spinning up.


>         -Dean
>
>On Wed, 2005-02-16 at 20:10 -0500, Alpay Kasal wrote:
> > Hello all... I have a question about powering on motherboards
> > simultaneously...
> >
> >
> >
> > I have 18 identical mobo's right now with identical ram, cpu, and hard
> > disk. I hooked one up to a kill-a-watt and found that it draws 140-150
> > watts when powering on, and stays level at about 90-100 watts
> > afterwards. The problem is that I am setting this up at home, where I
> > only have 10 amp circuits (and only a couple of them can be freed up).
> > Correct me if I am wrong here please?
> >
> >
> >
> > 1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on
> >
> > 1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on


You're basically right.

There's also a very high current spike that your Kill-A-Watt won't see when 
you first turn on the supply, as the capacitors in the input section charge 
up.  This might result in a nuisance trip of your breakers.

This current spike will be somewhat dependent on the phase of the AC line 
voltage when you first close the switch.  Some power supplies have inrush 
current limiting, others don't.

A 10 amp circuit would be highly unusual in the U.S., but might be common 
practice elsewhere.  In the U.S., a 15 amp circuit is standard.


> >
> >
> >
> > I won't include the rest of the math, but needless to say, it'd be a
> > pain in arse to turn on the room in piecemeal without tripping a
> > circuit breaker. My question is:
> >
> >
> >
> > Will a heavy duty UPS aid in getting me through powering up the room?
> > I don't mind splitting up the 18 machines with 6 outlet surge strips.
> > Any advice?


No, the UPS won't help.  It might make things worse, because as you flip on 
all that load, the voltage will sag, causing the UPS to turn on, which then 
might trip from the overcurrent (assuming you're not out buying a 2kW UPS).

What you want is some way to sequence the power, conveniently.

The answer is to use some relays.  You could spend some tens of dollars 
(brand new, much less surplus) and get some time delay relays.  Or you could 
use the DC output of the first power supply to come up to charge a 
capacitor through a resistor, with the capacitor hooked to a relay (12V 
coil, 110V contacts).  Most DC relays have a much higher pull-in than hold 
current/voltage.
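
For the resistor/capacitor trick, the delay can be estimated before buying parts.  The Python sketch below assumes the simplest arrangement, with the relay coil directly across the capacitor, so the coil loads the resistor and the capacitor charges toward a divided-down voltage.  The function name and every component value in the example are assumptions for illustration, not recommendations; check the pull-in voltage on the relay's data sheet.

```python
import math


def relay_delay_s(supply_v, pull_in_v, series_ohms, coil_ohms, farads):
    """Seconds until the capacitor voltage reaches the relay's pull-in voltage.

    Circuit assumed: supply -> series resistor -> capacitor, coil across the
    capacitor.  The capacitor charges toward v_final with time constant
    r_equiv * farads, where r_equiv is the resistor in parallel with the coil.
    """
    v_final = supply_v * coil_ohms / (series_ohms + coil_ohms)
    if pull_in_v >= v_final:
        raise ValueError("capacitor never reaches the pull-in voltage")
    r_equiv = series_ohms * coil_ohms / (series_ohms + coil_ohms)
    return -r_equiv * farads * math.log(1.0 - pull_in_v / v_final)


# Made-up example: 12 V rail, 9 V pull-in, 47 ohm resistor, 400 ohm coil, 10000 uF.
print(relay_delay_s(supply_v=12.0, pull_in_v=9.0,
                    series_ohms=47.0, coil_ohms=400.0, farads=0.01))
```

Because the hold voltage is well below the pull-in voltage, the relay stays closed once it has picked up.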

You could use the X-10 type (aka Plug n Power) remote controlled relays 
(don't use Lamp modules.. you need Appliance modules, which are relays inside).

You could build a little power sequencing box that sends the appropriate 
signals to the power supplies to turn them on, one by one... I think you 
might be able to do this with the parallel printer port on the first mobo 
to fire up. I haven't looked at the control interface on an ATX power 
supply recently.
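
Whatever ends up doing the switching, the sequencing logic itself is tiny.  A Python sketch, where switch_on is a stand-in for whatever actually turns one outlet or supply on (relay board, X-10 command, printer port bit); that callable, the function name and the delay are all hypothetical and nothing here talks to real hardware:

```python
import time


def sequence_power_on(outlets, switch_on, delay_s, sleep=time.sleep):
    """Turn outlets on one at a time, pausing delay_s seconds between them.

    switch_on is a caller-supplied callable that switches one outlet on;
    it is a placeholder here, not a real driver.
    """
    started = []
    for index, outlet in enumerate(outlets):
        if index > 0:
            sleep(delay_s)        # let the previous supply get past its surge
        switch_on(outlet)
        started.append(outlet)
    return started


if __name__ == "__main__":
    # Dry run: print instead of switching anything, with a short delay.
    sequence_power_on(range(1, 4), lambda n: print("outlet", n, "on"), delay_s=0.1)
```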


> >
> >
> >
> > Thanks.
> >
> > Alpay Kasal
> >
> >
> >

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From James.P.Lux at jpl.nasa.gov  Thu Feb 17 17:01:53 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 17 Feb 2005 17:01:53 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID: <6.1.1.1.2.20050217162000.07774090@mail.jpl.nasa.gov>

At 04:19 PM 2/17/2005, Jim Lux wrote:
>At 02:54 PM 2/17/2005, David Mathog wrote:
>>At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote:
>>
>> > Date: Wed, 16 Feb 2005 19:08:05 +0100
>> > From: Vincent Diepeveen <diep at xs4all.nl>
>> > Subject: Re: [Beowulf] Academic sites: who pays for the electricity?
>> > To: "David Mathog" <mathog at mendel.bio.caltech.edu>,
>> >       beowulf at beowulf.org
>> > Message-ID: <3.0.32.20050216190804.0106fcc0 at pop.xs4all.nl>
>> > Content-Type: text/plain; charset="us-ascii"
>> >
>> > At 08:16 16-2-2005 -0800, David Mathog wrote:
>> > >In most universities services like electricity, water, and
>> > >A/C are paid for by the school.  To do so they take "overhead"
>> > >out of every grant.  Partially as a consequence of this they
>> > >typically have a very poor ability to meter usage on a room
>> > >by room basis.
>> > >
>> > >Now somewhere between the 10 node Pentium II beowulf sitting on
>> > >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
>> > >room that takes up half the basement the cost of the electricity
>> > >(both for power and A/C) goes from  a minor expense to a major
>> > >one.  Really major. For instance, in that hypothetical large machine,
>> > >at 10 cents per kilowatt hour (a round number), assuming 100 watts
>> > >per CPU (another round number) that's:
>> > >
>> > >  1000  (nodes) *
>> > >     2  (cpus/node) *
>> > >     .1 (kilowatts/cpu) *
>> > >     .1 (dollars/kilowatt-hour) *
>> > >  365   (days /year) *
>> > >   24   (hours/day) =
>> > >-----------------------
>> > >  175200 dollars/year
>> >
>> > Complete academic nonsense calculation. If you use quite some electricity
>> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10
>> > reduction in usage bill is pretty easy if you negotiate properly.

Just where do you live that such negotiations are possible?  Here are some 
real numbers from Southern California Edison. 
http://www.sce.com/CustomerService/RateInformation/BusinessRates/LargeBusiness/

First off, you're looking at a 200kW load for 1000 nodes, which is a hefty 
load, just for the computers (not counting lights, HVAC, etc.)  But, no 
matter, we'll assume your facility is sucking at least 500kW some of the 
time, so that would put you in the large business TOU-8 tariff.
http://www.sce.com/NR/sc3/tm2/pdf/ce54-12.pdf
  All the large consumer tariffs are time-of-use sensitive.  I assume you 
wouldn't want some sort of "Critical Peak Pricing Options" or "Demand 
Bidding Programs"

Let's assume you're being served at 240V (as opposed to having your own 
distribution transformers, etc., although as a 200kW consumer, that's 
something you should consider).

Looks like the rates break down as about 0.016/kWh for the delivery, and 
the actual power (generation) runs somewhere between 0.04/kWh  (off peak 
summer) to 0.12/kWh on peak summer.

There's also a raft of other charges (metering, demand (runs about $10/kW 
of instantaneous demand), power factor, etc.)

Compare this to Domestic service.. where the rates run from 0.11 to 
0.18/kWh, depending on where you sit relative to baseline, season, etc. 
(I'll also point out that I was paying SCE $0.26/kWh at home in the summer 
of 2001, but rates are lower now.) Now, you might consider Residential to 
be artificially constrained for political reasons, so we take a look at 
GS-1 (general service..)
Here, we have 0.07 for delivery and 0.085 (totalling 0.155/kWh) during the 
summer.

The point is, there isn't a 10:1 ratio... not even close. And, rgb's 
ballpark of $0.10/kWh is a perfectly reasonable estimate, if a bit low for 
Southern California. Even as far back as 1990, on peak, large customers were 
paying on the order of $0.11/kWh.

It is possible that if you were buying power directly (which is possible, 
as a large end user), you could pay "market rate" for each kWh you 
consume.  The price is quite volatile, though... At the peak of the market 
failure a few years back, a kWh on the open market was something like $20 
during peak times.  I doubt you want to schedule your cluster ops to take 
advantage of electricity rate fluctuations.


>>Well, it isn't complete nonsense, unless you care to dispute the
>>number of days in a year, hours in a day, or cpus in a dual node
>>computer!
>>
>>The only term you're complaining about is the price of
>>electricity.  I'm not privy to the electrical rates that our
>>school pays, they may well be an order of magnitude lower.  My
>>home rates certainly aren't, but then, I don't buy as much
>>power as the campus.  It's also not at all clear that the
>>campus would sell power to the end users at the same rate
>>which it pays the utility.

Caltech probably buys its power from the City of Pasadena, but the rate 
structure is probably similar to SCE's TOU-8.  Home rates are somewhat 
artificially low for political reasons.  The folks really getting the short 
end of the stick are small businesses who don't have the negotiating power 
that a large business does, nor the political clout of elderly pensioners 
dying from heat.


I'll also note that if you start paying for kWh, you're going to want to 
give serious consideration to buying more nodes than you strictly need, and 
shutting down the cluster during on-peak times. A typical pricing strategy 
might be 0.15/0.07/0.05 (on/mid/off peak): Peak lasts 6 hrs (1200-1800), 
mid is 0800-1200 and 1800-2300, and offpeak is the rest.  There are 93 hours of 
offpeak time out of 168 total in the week.

For the pricing and schedules above, it turns out that the optimum is to 
shut down only during peak, but run during midpeak, for an average energy 
cost of 0.056/kWh, but using 22% more computers.
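
The averages are easy to check.  A Python sketch, assuming peak and mid-peak apply on weekdays only (that is what makes the off-peak total come to 93 hours: 5 x 9 weekday hours plus 48 weekend hours).  The names are illustrative, and "extra nodes" means the additional machines needed to get the same node-hours per week as an always-on cluster.  Energy price only; the cost of the extra hardware is not modelled.

```python
RATES = {"peak": 0.15, "mid": 0.07, "off": 0.05}        # $/kWh
HOURS_PER_WEEK = {"peak": 30, "mid": 45, "off": 93}     # sums to 168


def evaluate(run_periods):
    """Return (average $/kWh, fractional extra nodes) when running only in run_periods."""
    hours = sum(HOURS_PER_WEEK[p] for p in run_periods)
    dollars = sum(RATES[p] * HOURS_PER_WEEK[p] for p in run_periods)
    extra_nodes = sum(HOURS_PER_WEEK.values()) / hours - 1.0
    return dollars / hours, extra_nodes


for name, periods in [("always on", ["peak", "mid", "off"]),
                      ("skip peak", ["mid", "off"]),
                      ("off-peak only", ["off"])]:
    avg, extra = evaluate(periods)
    print(f"{name:14s} {avg:.4f} $/kWh, {extra:.0%} more nodes")
```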


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From James.P.Lux at jpl.nasa.gov  Thu Feb 17 17:21:05 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 17 Feb 2005 17:21:05 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl>
References: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl>
Message-ID: <6.1.1.1.2.20050217170816.077e17a8@mail.jpl.nasa.gov>

A
>Please note that nothing beats the price of nuclear power

Nuclear power, if all the incidental costs (often absorbed into government 
budgets) for things like liability cover, waste disposal, etc. are counted, is 
not overwhelmingly competitive with other sources.   One must also consider the 
capital cost of the production equipment and its retirement (which is 
quite a bit higher than for, say, coal fired, gas fired, etc.)

>Electricity production costs of nuclear power are hundreds of times cheaper
>than producing it with oil, oil produces it roughly for 5 dollar cent a
>kilowatt (if memory serves me well).
Implying that nuclear energy generation costs are <0.0005/kWh?  I find this 
quite hard to believe.  Can you cite a reasonable source for the data? Just 
the capital cost of the generating plant is more than that. (2 GW plant, 20 
yr life, 3.5E11 kWh.  If the plant costs $1B, you're at about $0.003/kWh)
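
The back-of-envelope figure in parentheses, as a Python sketch.  The function name is made up, and the capacity_factor parameter is an addition of this sketch (left at 1.0, i.e. the plant runs flat out for its whole life, which is what the estimate above implicitly assumes and is generous to the plant):

```python
def capital_cost_per_kwh(plant_cost_dollars, capacity_gw, life_years,
                         capacity_factor=1.0):
    """Capital cost spread over every kWh the plant produces in its lifetime."""
    lifetime_kwh = capacity_gw * 1e6 * life_years * 365 * 24 * capacity_factor
    return plant_cost_dollars / lifetime_kwh


# 2 GW plant, 20 year life, $1B capital cost.
print(capital_cost_per_kwh(1e9, 2, 20))
```

Financing, fuel, operation and decommissioning all come on top of this.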

I think that nuclear power, by the time you figure in all the stuff you 
need to, is a bit cheaper than fossil fuel, but not hugely cheaper, and 
certainly not an order of magnitude.

In fact, the chart at the end of  http://www.uic.com.au/nip08.htm shows 
that they're all within a factor of 2:1, except for high cost oil fired. 
(this chart has cost for OECD 1990)


>Coals have a CO2 problem for nations
>which are in Kyoto agreement (USA isn't), but also is nearly as cheap as
>nuclear power.
>
>So the actual price they deliver huge power for to big institutes is a very
>easy negotiation to get it factors down.

Now there's an interesting prospect... buy your electricity on the open 
traded market, and schedule your cluster computation as the price goes up 
and down. Don't laugh.. it's been done in other industries.


>Vincent Diepeveen
>ex-member of high voltage powerline forum.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From rgb at phy.duke.edu  Thu Feb 17 20:31:33 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 17 Feb 2005 23:31:33 -0500 (EST)
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl>
References: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl>
Message-ID: <Pine.LNX.4.58.0502172314590.3892@lilith.rgb.private.net>

On Wed, 16 Feb 2005, Vincent Diepeveen wrote:

> >  1000  (nodes) *
> >     2  (cpus/node) *
> >     .1 (kilowatts/cpu) *
> >     .1 (dollars/kilowatt-hour) *
> >  365   (days /year) *
> >   24   (hours/day) =
> >-----------------------
> >  175200 dollars/year
> 
> Complete academic nonsense calculation. If you use quite some electricity
> the electricity gets up to factor 20-40 cheaper. Getting a factor 10
> reduction in usage bill is pretty easy if you negotiate properly.
> 
> However you must avoid starting machines at peaktimes. Big fines get given
> for that. So it's cheaper to let them run 24 hours a day than to start them
> in the morning after say 7 AM (depending upon local habits).
> 
> Please note that nothing beats the price of nuclear power 
> 
> (as a member of the high voltage power forum i do not have an opinion on
> that). 
> 
> Electricity production costs of nuclear power are hundreds of times cheaper
> than producing it with oil, oil produces it roughly for 5 dollar cent a
> kilowatt (if memory serves me well). Coals have a CO2 problem for nations
> which are in Kyoto agreement (USA isn't), but also is nearly as cheap as
> nuclear power. 
> 
> So the actual price they deliver huge power for to big institutes is a very
> easy negotiation to get it factors down.

Actually, in spite of the fact that Duke (partly) owns its own power
company, I don't think that they get all that much of a discount.  I'm
also quite certain that the nuclear plant about 20 miles from here
doesn't sell its electricity to customers across the county line (it's
not a Duke Power plant but rather CP&L IIRC) for any less than Duke and
Durham get it.

There is a nifty map here:

  http://www.coaleducation.org/Ky_Coal_Facts/electricity/average_cost.htm

that shows the average electricity costs throughout the USA as of 2001
-- they are almost certainly a half-cent/KW-hr higher across the board
if not a whole cent higher due to the war and oil price boosts.  Note
that they range from $0.04 and change in major coal states to $0.11 and
change per KW-hr (where I'm sure a chunk of the difference is taxes in
different states).

My recollection from discussions at Duke is that Duke pays around
$0.06/KW-hr, not a huge discount over what I pay at home.  Maybe major
industrial consumers of electricity do better in states like Michigan,
but I don't think that there is enough margin even in the coal states to
drop prices to $0.01/KW-hr after paying for the fuel itself for anything
but nuclear.

Where David lives, in CA, electricity is about as expensive as it is
anywhere.  At a guess currently over ten cents/KW-hr, probably even for
Universities.  David?

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jimlux at earthlink.net  Thu Feb 17 21:41:39 2005
From: jimlux at earthlink.net (Jim Lux)
Date: Thu, 17 Feb 2005 21:41:39 -0800
Subject: [SPAM] [Beowulf] powering up 18 motherboards
References: <0IC300G56DM77U@mta8.srv.hcvlny.cv.net>
Message-ID: <000601c5157c$855b5350$32a8a8c0@LAPTOP152422>


----- Original Message -----
From: "Alpay Kasal" <eno at dorsai.org>
To: "'Bari Ari'" <bari at onelabs.com>; "'Jim Lux'" <James.P.Lux at jpl.nasa.gov>
Cc: "'Dean Johnson'" <dtj at uberh4x0r.org>; <beowulf at beowulf.org>
Sent: Thursday, February 17, 2005 9:26 PM
Subject: RE: [SPAM] [Beowulf] powering up 18 motherboards


> I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I
> suppose it should be 15amp circuits but every circuit breaker in the box is
> clearly a 10. This is an old house, seems like any renovations over the
> years have been only for aesthetics. The wiring in the walls is probably
> disintegrating - that would explain why the new looking circuit breakers are
> rated for 10 amps.

Ahhh yes.. New York, where some of the (mostly undocumented) distribution
wiring dates from Edison himself, and dogs are electrocuted when urinating
on the street from stray currents in connection boxes.  The wiring is
probably knob and tube.

>
> I think I can get use of 3 circuits which gives me some room to play with
> all the nodes and hopefully the assortment of switches and power supply. I
> have to figure out what the draw will be on the rest of the equip. Now where
> the hell am I going to plug in this air conditioner????
>
> Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds
> like a job for an electrician? Thanks for the help guys.

Why gang em up?  Just run three extension cords or plug strips, one into
each circuit. 6 machines on each circuit.  If the startup surge is too much,
stagger the spin up of the disk drives (maybe some sort of BIOS power
management option?).  Seriously think about cobbling together some
collection of relays.   The W.W.Grainger catalog is your friend, even if you
don't buy your stuff from them, you'll at least have a good description to
go googling for surplus places.

If that gets too dicey.. Honda and Yamaha make very nice, quiet inverter
generators.  2kW for about $700.

Good luck...
Jim.


From lindahl at pathscale.com  Thu Feb 17 22:44:34 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 17 Feb 2005 22:44:34 -0800
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108478089.4587.118.camel@s861954.sandia.gov>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com>
	<1108478089.4587.118.camel@s861954.sandia.gov>
Message-ID: <20050218064433.GA8147@greglaptop.attbi.com>

On Tue, Feb 15, 2005 at 07:34:50AM -0700, Keith D. Underwood wrote:
> 
> > c) ban of the ANY_SENDER wildcard: a world of optimization goes away 
> > with this convenience.
> 
> Um, our apps guys say this is more than a convenience.  Apparently,
> sometimes you don't exactly know who you are going to receive from. 
> Would you rather them post receives from 4000 nodes and cancel the ones
> that don't send to that node after a while?

That's a good reason to use it. The fundamental problem is that
supporting both ANY_TAG and ANY_SENDER efficiently is really
annoying. So most implementers would prefer to support ANY_TAG
efficiently and ANY_SENDER less efficiently.

And then those pesky users (life is much simpler when you only run
benchmarks that you write yourself, you know ;-) go use ANY_SENDER,
usually because they think it's faster when say 4 nodes are sending
you something at roughly the same time. Instead, we'd prefer that they
use Irecv for that.

-- greg


From bari at onelabs.com  Thu Feb 17 16:36:07 2005
From: bari at onelabs.com (Bari Ari)
Date: Thu, 17 Feb 2005 18:36:07 -0600
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>
References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>
Message-ID: <42153877.3080101@onelabs.com>

Alpay Kasal wrote:

> Hello all... I have a question about powering on motherboards simultaneously...
> 
>  
> 
> I have 18 identical mobo's right now with identical ram, cpu, and hard 
> disk. I hooked one up to a kill-a-watt and found that it draws 140-150 
> watts when powering on, and stays level at about 90-100 watts 
> afterwards. The problem is that I am setting this up at home, where I 
> only have 10 amp circuits (and only a couple of them can be freed up). 
> Correct me if I am wrong here please?

10 amp circuits? What country is this in? What line voltage and 
frequency? What's the wire size and insulator?

> 1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on
> 
> 1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on
> 
>  
> 
> I won't include the rest of the math, but needless to say, it'd be a 
> pain in arse to turn on the room in piecemeal without tripping a circuit 
> breaker. My question is:
> 
>  
> 
> Will a heavy duty UPS aid in getting me through powering up the room? I 
> don't mind splitting up the 18 machines with 6 outlet surge strips. Any 
> advice?

Circuit breakers are designed and initially calibrated to run with a 
continuous load of around 80% of their current rating. Most electrical 
codes limit continuous loads on branch circuits to 80% of the current 
rating of the conductors and current protection. (This is an 
oversimplification. The NEC is far more complicated on this.)

Using your rough numbers of 100W/mobo, you're just under 80% with 9 
mobo's per 10A/115VAC circuit. The startup current is 17% over the 
rating of the circuit breaker. The breakers will hold for a short time 
at this load. How long is dependent on the brand and age of the circuit 
breaker.

If you stagger the startup of the mobo's on each circuit (let's say to 5 
and then a min. later the other 4) it will help to keep the breakers 
from tripping.
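
The numbers above, as a short Python sketch using the same rough figures (100 W running, 150 W while starting, 115 V, 10 A breaker).  The names are illustrative, and it says nothing about the very short inrush spike that a Kill-A-Watt won't show.

```python
def circuit_amps(watts_by_board, volts=115.0):
    """Total current on one branch circuit for the given per-board wattages."""
    return sum(watts_by_board) / volts


STEADY_W, STARTUP_W, BREAKER_A = 100.0, 150.0, 10.0

cases = {
    "9 running": [STEADY_W] * 9,
    "9 starting together": [STARTUP_W] * 9,
    "5 starting": [STARTUP_W] * 5,
    "5 running + 4 starting": [STEADY_W] * 5 + [STARTUP_W] * 4,
}
for label, loads in cases.items():
    amps = circuit_amps(loads)
    print(f"{label:24s} {amps:5.2f} A  ({amps / BREAKER_A:.0%} of breaker rating)")
```

Staggering 5 and then 4 keeps the worst case under the breaker rating, though still above the 80% continuous figure for as long as the second group is starting.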

-Bari


From bari at onelabs.com  Thu Feb 17 16:59:49 2005
From: bari at onelabs.com (Bari Ari)
Date: Thu, 17 Feb 2005 18:59:49 -0600
Subject: [SPAM] [Beowulf] powering up 18 motherboards
In-Reply-To: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov>
References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net>	<1108678933.3683.19.camel@terra>
	<6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov>
Message-ID: <42153E05.7090801@onelabs.com>

Jim Lux wrote:

> A 10 amp circuit would be highly unusual in the U.S., but might be 
> common practice elsewhere.  In the U.S., a 15 amp circuit is standard.

I thought this was odd when I first read this as well. This may be a 
case where, to save dollars or when rehabbing an old building, more than 
3 current carrying conductors share one raceway and the current 
protection has to be derated. In this case it may be that they ran 
more than three #14 current carrying conductors (as defined by the 
NEC) in the same raceway and had to derate the usual 15 amp circuit 
protection down to 10 amps.

-Bari Ari


From Glen.Gardner at verizon.net  Thu Feb 17 17:12:27 2005
From: Glen.Gardner at verizon.net (Glen Gardner)
Date: Thu, 17 Feb 2005 20:12:27 -0500
Subject: [Beowulf] Academic sites: who pays for the electricity?
References: <E1D1uXH-0002X1-00@mendel.bio.caltech.edu>
Message-ID: <421540FB.2060103@verizon.net>


David;

It sounds to me as though you are seeking a low power beowulf as a solution.
A few people have built such machines, and it is possible to 
build a fast, useful beowulf
cluster that uses very little electrical power and has sufficient muscle 
to do some serious work.

I have a small 14 node cluster which I built a year ago.  It uses very 
little power, and runs so cool that no room air conditioning is needed.
In fact, my p4 machine makes more noise and heat than the beowulf 
cluster in my apartment.

http://mini-itx.com/projects/cluster/ 

The above link shows the original 12 node configuration.

At present, there are some motherboards available which give a very nice 
combination of cost, performance and low power use.
The trick is to "right-size" everything for your needs and available 
resources. The down side is that the small low-power go-fast stuff is a 
little more pricey than the plain vanilla pc hardware a beowulf is 
usually built from, but not insanely so.

Transmeta has some nice boards, and the mini-itx boards are not bad at 
all for the cost.  Also, there are some rather nice small form factor 
motherboards that use AMD's geode cpu. When I start comparing cost, 
power use, and performance, so far the most attractive motherboards seem 
to be the mini-itx boards with the Nehemiah core cpu.
However with some low power geode boards now running at up to 1500 MHz, 
that may change.  The Transmeta boards are probably the fastest of the 
low power boards, but the power use per MIPS is not as good as other 
boards if you believe the Transmeta printed specifications.


Glen

PS:

I have also made a few comments  below.

David Mathog wrote:

>At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote:
>
>  
>
>>Date: Wed, 16 Feb 2005 19:08:05 +0100
>>From: Vincent Diepeveen <diep at xs4all.nl>
>>Subject: Re: [Beowulf] Academic sites: who pays for the electricity?
>>To: "David Mathog" <mathog at mendel.bio.caltech.edu>,
>>	beowulf at beowulf.org
>>Message-ID: <3.0.32.20050216190804.0106fcc0 at pop.xs4all.nl>
>>Content-Type: text/plain; charset="us-ascii"
>>
>>At 08:16 16-2-2005 -0800, David Mathog wrote:
>>    
>>
>>>In most universities services like electricity, water, and 
>>>A/C are paid for by the school.  To do so they take "overhead"
>>>out of every grant.  Partially as a consequence of this they
>>>typically have a very poor ability to meter usage on a room
>>>by room basis.
>>>
>>>Now somewhere between the 10 node Pentium II beowulf sitting on
>>>a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
>>>room that takes up half the basement the cost of the electricity
>>>(both for power and A/C) goes from  a minor expense to a major
>>>one.  Really major. For instance, in that hypothetical large machine,
>>>at 10 cents per kilowatt hour (a round number), assuming 100 watts
>>>per CPU (another round number) that's:
>>>      
>>>
For a dual p4 xeon machine at full throttle, it comes out to about 250 
watts per node (or a little less) including the network adapters and 
switching.

>>> 1000  (nodes) *
>>>    2  (cpus/node) *
>>>    .1 (kilowatts/cpu) *
>>>    .1 (dollars/kilowatt-hour) *
>>> 365   (days /year) *
>>>  24   (hours/day) =
>>>-----------------------
>>> 175200 dollars/year
>>>      
>>>
>>Complete academic nonsense calculation. If you use quite some electricity
>>the electricity gets up to factor 20-40 cheaper. Getting a factor 10
>>reduction in usage bill is pretty easy if you negotiate properly.
>>    
>>
>
>Well, it isn't complete nonsense, unless you care to dispute the
>number of days in a year, hours in a day, or cpus in a dual node
>computer!
>  
>The only term you're complaining about is the price of
>electricity.  I'm not privy to the electrical rates that our
>school pays, they may well be an order of magnitude lower.  My
>home rates certainly aren't, but then, I don't buy as much
>power as the campus.  It's also not at all clear that the
>campus would sell power to the end users at the same rate
>which it pays the utility.
>  
>
You are forgetting the cost of cooling the cluster. Big machines make a 
lot of heat, and need a lot of cooling.

>I don't really understand your point about keeping the units
>running versus restarting them.  Sure, it would be really bad
>to try to boot all 1000 nodes simultaneously, in all likelihood
>it wouldn't work.  That's why they are typically started at N
>second intervals, where N depends on your hardware.
>Surely there is some N large enough so that the peak current
>draw during the restart never exceeds the random fluctuations
>observed when all units are running normally.  Or is your
>point that the electricity company doesn't want the facility
>to draw _less_ current than it uses normally at
>steady state?
>  
>
It is important to keep the cluster up and running, and only cycle the 
power when you must.
The inrush currents at turn-on stress components and shorten the life of 
the nodes considerably.
Also, thermal cycling puts mechanical stresses on boards and components 
that can cause
components and connections to fail.

In a large cluster that is middle aged (about 2 years old), you can 
reasonably expect to lose a couple of nodes
every time you power down and come back up. After a while, this can be 
expensive.
Shutting down a big machine is not a trivial thing.
 

>On a somewhat related note, it would be nice if rack nodes
>had some graceful way to conserve electricity.  For instance,
>something along the lines of: if the CPU utilization goes
>below 5% for 10 seconds, ratchet the clock down by a factor of 10.
>When CPU usage goes above 90% for 2 seconds, ratchet it back
>up again.  Notebooks can do this sort of thing, but it seems not
>to be a "feature" of most full size motherboards.  This should
>also lower the average temperature in the case, at the expense
>of increased thermal cycling.  Hard to say off hand if that's
>a plus or a minus as far as hardware longevity goes.  Certainly
>it would be a plus in terms of energy conservation.
>  
>

A lot of modern CPUs have the ability to actually shut off unused
internal circuitry. VIA CPUs, AMD's Geode, Transmeta, and some Intel
CPUs have these features.

>Regards,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>  
>

-- 
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net


http://members.bellatlantic.net/~vze24qhw/index.html



From redboots at ufl.edu  Thu Feb 17 18:46:58 2005
From: redboots at ufl.edu (Paul Johnson)
Date: Thu, 17 Feb 2005 21:46:58 -0500
Subject: [Beowulf] scalability of cluster paper
Message-ID: <42155722.5060105@ufl.edu>

I am wondering if anyone could point me to a paper on the scalability
issues of NOW clusters or Beowulf clusters using MPI.  I'm curious what
kind of scalability people see for clusters of fewer than 10 nodes.
Any reference to a paper would be greatly appreciated.  I've been doing
a lot of scholar.googling but haven't found what I'm looking for yet.

Thanks,
Paul


From eno at dorsai.org  Thu Feb 17 21:43:00 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Fri, 18 Feb 2005 00:43:00 -0500
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov>
Message-ID: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net>

James Lux, thanks for the extremely useful explanation. Btw, I'm in
Brooklyn, NY. 120volts, 60cycles, regular AC power. I don't know the gauge
of the wiring in the walls but (as mentioned in another response just now) I
suspect it is old wiring and is the reason for the strange 10amp circuit
breakers.

I looked at the x10 modules. Seems like it could be very useful, just script
all of them from my headend. For now I'm going to try to handle the power-on
sequence myself. I figured I could steal 3 10amp circuits from the house.

Follow me... Turn on 4 nodes (on 1 strip), which will peak at 5.2 amps. Let
that settle down to a steady 3.48 amps and hit another strip of 4 nodes.
Total draw while the 2nd batch is starting is 8.68 amps. It should steady at
6.96. I then have room to turn 1 more node on. Then one more after that. A
4-step process to get 10 nodes powered up without going over 10 amps. Perform
the same exact steps on a 2nd circuit. Annoying but possible without
spending any more money.
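
The staging arithmetic above can be checked with a short sketch. It
assumes all nodes are identical, so the per-node figures are simply the
4-node strip figures divided by four, and that each batch has fully
settled before the next one starts. The function and constant names are
made up for illustration.

```python
# Per-node currents derived from the 4-node strip figures quoted above
# (assumption: every node behaves the same).
PEAK_A = 5.2 / 4      # start-up current per node, in amps
STEADY_A = 3.48 / 4   # settled current per node, in amps
BREAKER_A = 10.0      # breaker rating, in amps


def staged_peaks(batches, peak=PEAK_A, steady=STEADY_A):
    """Return the worst-case draw (amps) while each batch is starting.

    Assumes all earlier batches have already settled to steady state.
    """
    running = 0
    peaks = []
    for n in batches:
        peaks.append(round(running * steady + n * peak, 2))
        running += n
    return peaks


plan = [4, 4, 1, 1]                       # the 4-step sequence for 10 nodes
print(staged_peaks(plan))                 # -> [5.2, 8.68, 8.26, 9.13]
print(max(staged_peaks(plan)) < BREAKER_A)  # -> True
print(staged_peaks([10]))                 # all ten at once -> [13.0]
```

Under these assumptions the 4-step plan stays under 10 amps at every
step, while switching on all ten nodes together would not. It says
nothing about how long a node takes to settle, which has to be measured.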

I was really hoping a decent $200-300 UPS would come to the rescue here. Oh
well.

I just had a thought... I planned on making use of wake-on-lan. I can just
start sending jobs to the whole network, though if all of it is asleep I'd
still have to be careful of the power-up sequence. Grrrr. Maybe a script to
perform WOL before starting any number crunching.

Boy did I take nice big fat electrical lines for granted in the past!

Alpay


-----Original Message-----
From: Jim Lux [mailto:James.P.Lux at jpl.nasa.gov] 
Sent: Thursday, February 17, 2005 7:18 PM
To: Dean Johnson; Alpay Kasal
Cc: beowulf at beowulf.org
Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards

No, the UPS won't help.  It might make things worse, because as you flip on 
all that load, the voltage will sag, causing the UPS to turn on, which then 
might trip from the overcurrent (assuming you're not out buying a 2kW UPS).

You could use the X-10 type (aka Plug n Power) remote controlled relays 
(don't use Lamp modules.. you need Appliance modules, which are relays
inside).


From rgb at phy.duke.edu  Fri Feb 18 04:12:09 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 18 Feb 2005 07:12:09 -0500 (EST)
Subject: [SPAM] [Beowulf] powering up 18 motherboards
In-Reply-To: <000601c5157c$855b5350$32a8a8c0@LAPTOP152422>
References: <0IC300G56DM77U@mta8.srv.hcvlny.cv.net>
	<000601c5157c$855b5350$32a8a8c0@LAPTOP152422>
Message-ID: <Pine.LNX.4.58.0502180653290.15876@lilith.rgb.private.net>

On Thu, 17 Feb 2005, Jim Lux wrote:

> 
> ----- Original Message -----
> From: "Alpay Kasal" <eno at dorsai.org>
> To: "'Bari Ari'" <bari at onelabs.com>; "'Jim Lux'" <James.P.Lux at jpl.nasa.gov>
> Cc: "'Dean Johnson'" <dtj at uberh4x0r.org>; <beowulf at beowulf.org>
> Sent: Thursday, February 17, 2005 9:26 PM
> Subject: RE: [SPAM] [Beowulf] powering up 18 motherboards
> 
> 
> > I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I
> > suppose it should be 15amp circuits but every circuit breaker in the box
> is
> > clearly a 10. This is an old house, seems like any renovations over the
> > years have been only for aesthetics. The wiring in the walls is probably

Circuit breaker?  What's wrong with good old fuses?  Good enough for my
granddaddy.... actually, CB's suggest that the wiring has been messed
with sometime in the last twenty or thirty years.  That's a good thing!
In my last house, the fuses DID feed directly into 14 Ga wire pairs that
went through the house six inches apart in porcelain insulators and
through porcelain tubes -- until I got to it and replaced it all, a
circuit at a time, with 12 Gauge three wire PVC.

The fabric coating those old wires DOES tend to disintegrate after 70
years or so... if your house has it it is probably a midlevel electrical
fire trap.

> > disintegrating - that would explain why the new looking circuit breakers
> are
> > rated for 10 amps.
> 
> Ahhh yes.. New York, where some of the (mostly undocumented) distribution
> wiring dates from Edison himself, and dogs are electrocuted when urinating
> on the street from stray currents in connection boxes.  The wiring is
> probably knob and tube.
> 
> >
> > I think I can get use of 3 circuits which gives me some room to play with
> > all the nodes and hopefully the assortment of switches and power supply. I
> > have to figure out what the draw will be on the rest of the equip. Now
> where
> > the hell am I going to plug in this air conditioner????
> >
> > Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds
> > like a job for an electrician? Thanks for the help guys.
> 
> Why gang em up?  Just run three extension cords or plug strips, one into

In case Jim didn't make it clear, DO NOT GANG THEM UP.  In principle one
can do this, IF all three of the circuits have the same phase.  If they
don't have the same phase (and aren't correctly connected in phase) you
will observe a brief, interesting flash as the circuit breaker does bad
things when you power up after trying it; see "midlevel electrical fire
trap" above.

However, the "right" way to gang them up is to go to the box and run
brand new, clean, NEC-compliant wire from the box to your cluster
location.  The only thing you'll have to worry about is removing as much
total amperage from the box as you add (or at least, staying within the
distribution box's total capacity).  Here it is useful to know your
house's total service capacity and the total capacity of the main
distribution panel (hopefully they are the same, but one never knows).
Maybe the box already has spare capacity and you just don't know it.

The rule with electrical wiring is that if you don't know EXACTLY what
you're doing, hire a professional electrician.  That is, if you have to
ask how to gang up circuits (thereby demonstrating an ignorance about 2
and 3 phase delivery, hot, neutral and cold/ground wires, ground loops,
etc) you have an unpleasantly high probability of either killing
yourself or burning down your house (possibly months or years after your
renovation), and are probably breaking the law besides when you do it.

Jim's suggestions below are excellent for living with what you have.
Also consider stripping down the machines.  A cluster node these days
can be built out of a motherboard loaded with CPU and memory and with an
onboard, PXE-capable NIC (or at most a PCI PXE NIC).  No floppy, CD,
hard drive, video card, or other peripheral stuff.  A few weeks ago,
there was a lovely discussion on clusters made by mounting bare
motherboards on shelving and powering them off of shared large power
supplies strung together with simple extensions, three motherboards per
450 W supply.  This saves on both heat and power, as EACH peripheral
draws a base current when turned on, including each power supply.  There
are pictures out there if you look.

If you do this, you'll probably drop your load by as much as 30 watts
per node, and this should be enough to push you safely within the
capacity of your circuits at six nodes per circuit with room to spare.

   rgb

> each circuit. 6 machines on each circuit.  If the startup surge is too much,
> stagger the spin up of the disk drives (maybe some sort of BIOS power
> management option?).  Seriously think about cobbling together some
> collection of relays.   The W.W.Grainger catalog is your friend, even if you
> don't buy your stuff from them, you'll at least have a good description to
> go googling for surplus places.
> 
> If that gets too dicey.. Honda and Yamaha make very nice, quiet inverter
> generators.  2kW for about $700.
> 
> Good luck...
> Jim.
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From atp at piskorski.com  Fri Feb 18 04:45:28 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Fri, 18 Feb 2005 07:45:28 -0500
Subject: [Beowulf] Re: Re: Re: Home beowulf - NIC latencies (Patrick
	Geoffray)
In-Reply-To: <1108459690.25145.8.camel@pc-2.office.scali.no>
References: <1108459690.25145.8.camel@pc-2.office.scali.no>
Message-ID: <20050218124527.GA86169@piskorski.com>

On Tue, Feb 15, 2005 at 10:28:10AM +0100, Ole W. Saastad wrote:

> Patrick Geoffray wrote:
> > The one with the best potential would be to use HyperThreading on
> > Intel chips to have a polling thread burning cycles continuously;

> The simple finding is that the kinetics program got somewhat more 
> than 70% of the CPU cycles and that the polling wasted close to
> 30% of the CPU cycles, 30% is not for free.

Using what Linux kernel?  Using what feature to tell the kernel,
"Please run this polling process only on the extra HyperThreaded
virtual CPU, never on the real CPU." ?

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From rgb at phy.duke.edu  Fri Feb 18 05:11:17 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 18 Feb 2005 08:11:17 -0500 (EST)
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net>
References: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net>
Message-ID: <Pine.LNX.4.58.0502180805130.6001@ganesh.phy.duke.edu>

On Fri, 18 Feb 2005, Alpay Kasal wrote:

> James Lux, thanks for the extremely useful explanation. Btw, I'm in
> Brooklyn, NY. 120volts, 60cycles, regular AC power. I don't know the gauge
> of the wiring in the walls but (as mentioned in another response just now) I
> suspect it is old wiring and is the reason for the strange 10amp circuit
> breakers.
> 
> I looked at the x10 modules. Seems like it could be very useful, just script
> all of them from my headend. For now I'm going to try to handle the power-on
> sequence myself. I figured I could steal 3 10amp circuits from the house.
> 
> Follow me... Turn on 4 nodes (on 1 strip) which will peak at 5.2amps. let
> that settle down to a steady 3.48amps and hit another strip of 4 nodes.
> Total draw while the 2nd batch is starting is 8.68amps. It should steady at
> 6.96. I then have room to turn 1 more node on. Then one more after that. A 4
> step process to get 10 nodes powered up without going over 10amps. Perform
> the same exact steps on a 2nd circuit. Annoying but possible without
> spending anymore money.

Your problem will occur when the power goes off and comes back on when
you aren't there.  We have rather frequent 5-10 second powerouts down
here -- without UPS's I used to go nuts in my house.

> I was really hoping a decent $200-300 UPS would come to the rescue here. Oh
> well.

I don't think that putting one of these on each circuit is a bad idea;
the real problem is that a UPS might draw more than your lines' capacity
when initially charging -- I don't know for sure how much of a load they
divert to charging when passing a load through.  If you have >>a<<
bigger circuit, you might be able to charge one fully on it, move it,
plug everything in and power everything up, and use it as a line buffer
of sorts.  Even a couple of very cheap $50 ones that only will give you
a minute or two might keep you from blowing the CBs every time the power
in your area bobbles.  Assuming that it does bobble -- maybe NYC never
has power issues, even when dogs piss on transformers...;-)

> I just had a thought... I planned on making use of wake-on-lan. I can just
> start sending jobs to the whole network though if all of it is asleep, I'd
> have to still be careful of the powerup-sequence. Grrrr. Maybe a script to
> perform WOL before starting any number crunching.

Yeah, that's an alternative.  Leave one box set to power up, set the
rest to NOT power up after a power outage, if you can, and power them up
with WOL from a script.  But yes, a PITA.

   rgb

> 
> Boy did I take nice big fat electrical lines for granted in the past!
> 
> Alpay
> 
> 
> -----Original Message-----
> From: Jim Lux [mailto:James.P.Lux at jpl.nasa.gov] 
> Sent: Thursday, February 17, 2005 7:18 PM
> To: Dean Johnson; Alpay Kasal
> Cc: beowulf at beowulf.org
> Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards
> 
> No, the UPS won't help.  It might make things worse, because as you flip on 
> all that load, the voltage will sag, causing the UPS to turn on, which then 
> might trip from the overcurrent (assuming you're not out buying a 2kW UPS).
> 
> You could use the X-10 type (aka Plug n Power) remote controlled relays 
> (don't use Lamp modules.. you need Appliance modules, which are relays
> inside).
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Fri Feb 18 05:14:39 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 18 Feb 2005 08:14:39 -0500 (EST)
Subject: [Beowulf] scalability of cluster paper
In-Reply-To: <42155722.5060105@ufl.edu>
References: <42155722.5060105@ufl.edu>
Message-ID: <Pine.LNX.4.58.0502180811310.6001@ganesh.phy.duke.edu>

On Thu, 17 Feb 2005, Paul Johnson wrote:

> I am wondering if anyone could point me to a paper on the scalability
> issues of NOW clusters or Beowulf clusters using MPI.  I'm curious what
> kind of scalability people see for clusters of fewer than 10 nodes.
> Any reference to a paper would be greatly appreciated.  I've been doing
> a lot of scholar.googling but haven't found what I'm looking for yet.

There is a whole chapter on Amdahl's Law and scaling in my online
beowulf book (and several online talks and white papers ditto).  Look on

  http://www.phy.duke.edu/brahma

or

  http://www.phy.duke.edu/~rgb

under the Beowulf link.

Scalability isn't a question of number of nodes per se, it is a question
of the parallel speedup you observe as a FUNCTION of the number of nodes
participating in the problem, the problem "size" (really all other
parameters, not just size), the network characteristics, and the system
CPU/memory/bus characteristics.  The presentation I give is a simplified
one that lets you see the general idea, but the reality is much more
complex for some classes of problem.
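
As a rough illustration of speedup as a function of node count, here is
the simplest textbook form of Amdahl's Law. It is only a sketch: the 10%
serial fraction is an arbitrary example value, and the formula ignores
communication cost, problem size, and the memory and bus effects
mentioned above.

```python
def amdahl_speedup(n_nodes, serial_fraction):
    """Ideal speedup on n_nodes when serial_fraction of the run time
    cannot be parallelized (communication cost is ignored)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_nodes)


# Example: a hypothetical job that is 10% serial, on small clusters.
for n in (1, 2, 4, 8, 10):
    print(n, round(amdahl_speedup(n, 0.10), 2))
```

Even in this idealized form the speedup flattens well before 10 nodes
when the serial fraction is not small, which is why the node count alone
says little about scalability.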

HTH,

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jcownie at etnus.com  Fri Feb 18 06:38:23 2005
From: jcownie at etnus.com (James Cownie)
Date: Fri, 18 Feb 2005 14:38:23 +0000
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies 
In-Reply-To: Message from Rossen Dimitrov <rossen@verarisoft.com> 
	of "Tue, 15 Feb 2005 09:28:28 EST." <4212070C.9050207@verarisoft.com> 
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4212070C.9050207@verarisoft.com> 
Message-ID: <20050218143823.8694B1D9B3@amd64.cownie.net>


> In a conversation with MPI and tool developers I once mentioned
> that not defining a standard/mandatory mpi.h was probably a missed
> opportunity for improving interoperability of MPI. I was then told by
> a member of the MPI-1 Forum that this was done on purpose. This makes
> me think that we will not see an ABI definition for MPI any time soon.

I think this is to misunderstand the process. 

The whole MPI process was informal. No-one gave the committee any power
to create a standard, it happened because people wanted it to happen and
were prepared to use the result.

MPI followed the format and style of HPF, and can be thought of as an
Open Source standard.

It was created as a result of user demand by people who were prepared to
put in the effort to do so, and was adopted because it met a need.

If an ABI for MPI is so important to you and of such value to your (and
Patrick's) clients, then there's nothing to stop you from formulating
such a standard, or, at least starting a project to create one.

If you're right about its importance, then all the MPI implementors will
follow your lead. If you're wrong, well, you wasted your time.

The point here is that doing this can be an informal process which
doesn't require "The MPI Forum" (whatever that is now !?)  to endorse
it, any more than a project on SourceForge requires endorsement by Linus
if it runs on Linux ;-)

(Or, if you prefer, don't keep whingeing about what the MPI Forum chose
to do, but get on and fix it for yourself).

-- 
-- Jim
--
James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From rross at mcs.anl.gov  Fri Feb 18 07:56:26 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Fri, 18 Feb 2005 09:56:26 -0600 (CST)
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies 
In-Reply-To: <20050218143823.8694B1D9B3@amd64.cownie.net>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4212070C.9050207@verarisoft.com>
	<20050218143823.8694B1D9B3@amd64.cownie.net>
Message-ID: <Pine.LNX.4.58.0502180950520.1141@terra.mcs.anl.gov>

On Fri, 18 Feb 2005, James Cownie wrote:

[snip]

> If an ABI for MPI is so important to you and of such value to your (and
> Patrick's) clients, then there's nothing to stop you from formulating
> such a standard, or, at least starting a project to create one.
> 
> If you're right about its importance, then all the MPI implementors will
> follow your lead. If you're wrong, well, you wasted your time.
> 
> The point here is that doing this can be an informal process which
> doesn't require "The MPI Forum" (whatever that is now !?)  to endorse
> it, any more than a project on SourceForge requires endorsement by Linus
> if it runs on Linux ;-)
> 
> (Or, if you prefer, don't keep whingeing about what the MPI Forum chose
> to do, but get on and fix it for yourself).

But keep in mind that some implementations encode meaning into the values 
in mpi.h, and as a result the values aren't even the same between multiple 
instances of the same implementation on different platforms!

For example, MPICH2 encodes the size of basic datatypes in the value of 
the type.  Saves us looking around in some table for these cases (which 
are the only ones that Patrick wants us to support :)!).

So this is probably a larger problem than it seems.
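
To make the idea concrete, here is a minimal sketch of encoding a basic
type's size into its handle value. The bit layout, the index number, and
all names are hypothetical and chosen only for illustration; they are
not copied from any real mpi.h.

```python
# Hypothetical handle layout (illustration only, not a real mpi.h):
# bits 8..15 hold the size in bytes, bits 0..7 an index naming the type.
SIZE_SHIFT = 8
SIZE_MASK = 0xFF << SIZE_SHIFT


def make_handle(size_bytes, index):
    """Pack a type's size and index into one integer handle."""
    return (size_bytes << SIZE_SHIFT) | index


def basic_size(handle):
    """Recover the size from the handle, with no table lookup."""
    return (handle & SIZE_MASK) >> SIZE_SHIFT


# The same named type gets a different handle where its size differs.
LONG_ON_LP64 = make_handle(8, 7)    # platform where long is 8 bytes
LONG_ON_ILP32 = make_handle(4, 7)   # platform where long is 4 bytes
print(hex(LONG_ON_LP64), basic_size(LONG_ON_LP64))
print(hex(LONG_ON_ILP32), basic_size(LONG_ON_ILP32))
```

With a scheme like this, a binary compiled against one platform's
header carries constants that are wrong on the other, so an ABI would
have to fix the encoding as well as the names.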

Rob
---
Rob Ross, Mathematics and Computer Science Division, Argonne National Lab


From mathog at mendel.bio.caltech.edu  Fri Feb 18 08:29:47 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 18 Feb 2005 08:29:47 -0800
Subject: [Beowulf] powering up 18 motherboards
Message-ID: <E1D2B0l-0004MO-00@mendel.bio.caltech.edu>

>I was really hoping a decent $200-300 UPS would come to
> the rescue here. Oh well.

APC makes Power Distribution Units that can be set to 
start loads at fixed intervals.  If you had access to a
208V/3 phase line the AP7990 costs about $650, uses that
as input, and outputs 24 120V sockets.  They make quite
a few of these PDUs in different configurations
so maybe you can find one that does what you want?

Another, much cheaper option, would be to set the slave
node BIOS to use "Wake on LAN" (if it works on your systems)
and NOT to start on power up.  Then when power came up the
headnode would boot and the others would just warm up
enough to listen to their ethernet cards.  I don't
expect that the start up current going to that mostly off 
state would be very high, even for 18 computers, since neither
the disks nor fans start spinning. Once the head node
comes up you can boot the slaves using etherwake.
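
For those without etherwake handy, a staggered wake-up can be sketched
in a few lines of Python. This is a minimal sketch, assuming the NICs
accept the usual Wake-on-LAN "magic packet" (six 0xFF bytes followed by
the MAC address repeated 16 times) sent as a UDP broadcast. Port 9, the
10 second delay, and the function names are assumptions, not anything
from this thread.

```python
import socket
import time


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes: %r" % mac)
    return b"\xff" * 6 + raw * 16


def wake_staggered(macs, delay_s=10.0, broadcast="255.255.255.255", port=9):
    """Send one magic packet per node, pausing delay_s between nodes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for i, mac in enumerate(macs):
            if i:
                time.sleep(delay_s)  # let the previous node's surge settle
            sock.sendto(magic_packet(mac), (broadcast, port))
```

From the head node one would call something like
wake_staggered(["00:11:22:33:44:55", "00:11:22:33:44:56"]) with the real
MAC addresses of the slaves (the ones shown are placeholders), picking
the delay from the measured settle time of a node.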

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From diep at xs4all.nl  Fri Feb 18 08:36:44 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Fri, 18 Feb 2005 17:36:44 +0100
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID: <3.0.32.20050218173640.0109a100@pop.xs4all.nl>

At 17:01 17-2-2005 -0800, Jim Lux wrote:
>At 04:19 PM 2/17/2005, Jim Lux wrote:
>>At 02:54 PM 2/17/2005, David Mathog wrote:
>>>At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote:
>>>
>>> > Date: Wed, 16 Feb 2005 19:08:05 +0100
>>> > From: Vincent Diepeveen <diep at xs4all.nl>
>>> > Subject: Re: [Beowulf] Academic sites: who pays for the electricity?
>>> > To: "David Mathog" <mathog at mendel.bio.caltech.edu>,
>>> >       beowulf at beowulf.org
>>> > Message-ID: <3.0.32.20050216190804.0106fcc0 at pop.xs4all.nl>
>>> > Content-Type: text/plain; charset="us-ascii"
>>> >
>>> > At 08:16 16-2-2005 -0800, David Mathog wrote:
>>> > >In most universities services like electricity, water, and
>>> > >A/C are paid for by the school.  To do so they take "overhead"
>>> > >out of every grant.  Partially as a consequence of this they
>>> > >typically have a very poor ability to meter usage on a room
>>> > >by room basis.
>>> > >
>>> > >Now somewhere between the 10 node Pentium II beowulf sitting on
>>> > >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
>>> > >room that takes up half the basement the cost of the electricity
>>> > >(both for power and A/C) goes from  a minor expense to a major
>>> > >one.  Really major. For instance, in that hypothetical large machine,
>>> > >at 10 cents per kilowatt hour (a round number), assuming 100 watts
>>> > >per CPU (another round number) that's:
>>> > >
>>> > >  1000  (nodes) *
>>> > >     2  (cpus/node) *
>>> > >     .1 (kilowatts/cpu) *
>>> > >     .1 (dollars/kilowatt-hour) *
>>> > >  365   (days /year) *
>>> > >   24   (hours/day) =
>>> > >-----------------------
>>> > >  175200 dollars/year
>>> >
>>> > Complete academic nonsense calculation. If you use quite some electricity
>>> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10
>>> > reduction in usage bill is pretty easy if you negotiate properly.
>
>Just where do you live that such negotiations are possible. Here's some 

You aren't going to negotiate about a single small room with a few
lightbulbs obviously.

We're talking about huge usage, like when all supercomputers are located in
one central spot and the entire institute, with thousands of workplaces, is
powered centrally. Electricity companies usually aren't going to put their
reduced rates online, as that would give them a bad negotiation starting
point :)

More and more nuclear power gets exported from France to the rest of Europe.

For example Italy is importing 25% of its total power, the majority of it
from France. The Netherlands and Germany import roughly 20% of their power.

That share will keep growing, simply because power plants can only be built
for a limited period (like a 25 year contract) and then must be cleaned up.
Obviously such rules make it impossible for the electricity producing
companies to build their own plants.

I'm not taking a political viewpoint on what type of electricity is
damaging more than the other and what should happen in the future in that
respect. 

Yet closing your eyes to reality is something else. More and more power gets
used. Computers take a part of that power, industry the majority.

In the USA no new nuclear plants have been built for many years. Obviously
that means that the market there is different from Europe, where the energy
market seems to be more innovative than in the USA.

Even so, there are certain plans I have little respect for, like the 150
meter high windmills they want to build in Houten, just a few hundred meters
away from newly built houses where tens of thousands live. That is IMHO a
wrong idea.

I don't have the details here on how big the diameter and air speed of those
mills are, but obviously they can only get built because the government
wastes money on them, and at most they kill huge numbers of birds, which
have near zero chance to survive if they are near those mills.

Yet it's another innovation in Europe.

The energy market is one of the most complex financial markets and not only
because politicians prefer to close their eyes to its problems. Another
major problem is who owns what in Europe. Is the government again going to
own the transport infrastructure or can independent transport companies
keep doing the job? There is a difference between energy producers and the
companies that deliver energy to consumers, and so on.

Yet a washing machine eats thousands of watts and nearly everybody in Europe
has one at home. Certain products we use daily and just throw away get
produced by throwing tens of thousands of watts into battle in heavy
industry. So complaining about the energy usage of computers is a good
thing, but it shouldn't be overdone.

>real numbers from Southern California Edison. 
>http://www.sce.com/CustomerService/RateInformation/BusinessRates/LargeBusiness/
>
>First off, you're looking at a 200kW load for 1000 nodes, which is a hefty 
>load, just for the computers (not counting lights, HVAC, etc.)  But, no 
>matter, we'll assume your facility is sucking at least 500kW some of the 
>time, so that would put you in the large business TOU-8 tariff.
>http://www.sce.com/NR/sc3/tm2/pdf/ce54-12.pdf
>  All the large consumer tariffs are time-of-use sensitive.  I assume you 
>wouldn't want some sort of "Critical Peak Pricing Options" or "Demand 
>Bidding Programs"
>
>Let's assume you're being served at 240V (as opposed to having your own 
>distribution transformers, etc., although as a 200kW consumer, that's 
>something you should consider).
>
>Looks like the rates break down as about 0.016/kWh for the delivery, and 
>the actual power (generation) runs somewhere between 0.04/kWh  (off peak 
>summer) to 0.12/kWh on peak summer.
>
>There's also a raft of other charges (metering, demand (runs about $10/kW 
>of instantaneous demand), power factor, etc.)
>
>Compare this to Domestic service.. where the rates run from 0.11 to 
>0.18/kWh, depending on where you sit relative to baseline, season, etc. 
>(I'll also point out that I was paying SCE $0.26/kWh at home in the summer 
>of 2001, but rates are lower now.) Now, you might consider Residential to 
>be artificially constrained for political reasons, so we take a look at 
>GS-1 (general service..)
>Here, we have 0.07 for delivery and 0.085 (totalling 0.155/kWh) during the 
>summer.
>
>The point is, there isn't a 10:1 ratio... not even close. And, rgb's 
>ballpark of $0.10/kWh is a perfectly reasonable estimate, if a bit low for 
>Southern California. Even as far back as 1990,on peak, large customers were 
>paying on the order of $0.11/kWh.
>
>It is possible that if you were buying power directly (which is possible, 
>as a large end user), you can pay "market rate" for each kWh you 
>consume.  The price is quite volatile, though... At the peak of the market 
>failure a few years back, a kWh on the open market was something like $20 
>during peak times.  I doubt you want to schedule your cluster ops to take 
>advantage of electricity rate fluctuations.
>
>
>>>Well, it isn't complete nonsense, unless you care to dispute the
>>>number of days in a year, hours in a day, or cpus in a dual node
>>>computer!
>>>
>>>The only term you're complaining about is the price of
>>>electricity.  I'm not privy to the electrical rates that our
>>>school pays, they may well be an order of magnitude lower.  My
>>>home rates certainly aren't, but then, I don't buy as much
>>>power as the campus.  It's also not at all clear that the
>>>campus would sell power to the end users at the same rate
>>>which it pays the utility.
>
>CalTech probably buys their power from City of Pasadena, but it's probably 
>similar in rate structure to SCE's TOU-8.  Home rates are somewhat 
>artificially low for political reasons.  The folks really getting the short 
>end of the stick are small businesses who don't have the negotiating power 
>that a large business does, nor the political clout of elderly pensioners 
>dying from heat.
>
>
>I'll also note that if you start paying for kWh, you're going to want to 
>give serious consideration to buying more nodes than you strictly need, and 
>shutting down the cluster during on-peak times. A typical pricing strategy 
>might be 0.15/0.07/0.05 (on/mid/off peak): Peak lasts 6 hrs (1200-1800), 
>mid is 0800-1200,1800-2300, and offpeak is the rest.  There's 93 hours of 
>offpeak time out of 168 total in the week
>
>For the pricing and schedules above, it turns out that the optimum is to 
>shut down only during peak, but run during midpeak, for an average energy 
>cost of 0.056/kWh, but using 22% more computers.
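
The 0.056/kWh and 22% figures quoted above can be reproduced with a
short sketch. It assumes weekends are entirely off-peak, which is what
makes the off-peak total come out to the quoted 93 hours per week. The
names are made up for illustration, and hardware cost is not modelled,
so the sketch does not by itself show which strategy is optimal.

```python
RATES = {"peak": 0.15, "mid": 0.07, "off": 0.05}   # $/kWh, as quoted

# Hours per week. Assumption: peak and mid apply on weekdays only, and
# weekends are all off-peak (5*9 + 48 = 93 off-peak hours).
HOURS = {"peak": 5 * 6, "mid": 5 * 9, "off": 5 * 9 + 2 * 24}


def strategy(run_periods):
    """Return (average $/kWh, fraction of extra nodes needed) when the
    cluster runs only during run_periods but must do a full week's work."""
    hours = sum(HOURS[p] for p in run_periods)
    cost = sum(HOURS[p] * RATES[p] for p in run_periods)
    extra_nodes = sum(HOURS.values()) / hours - 1.0
    return cost / hours, extra_nodes


for name, periods in [("always on", ("peak", "mid", "off")),
                      ("skip peak", ("mid", "off")),
                      ("off-peak only", ("off",))]:
    avg, extra = strategy(periods)
    print("%-14s %.4f $/kWh  %3.0f%% more nodes" % (name, avg, 100 * extra))
```

Skipping only the peak hours gives 138 running hours out of 168, so
about 22% more nodes at an average of roughly 0.0565 $/kWh, in line with
the quoted numbers.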
>
>
>
>James Lux, P.E.
>Spacecraft Radio Frequency Subsystems Group
>Flight Communications Systems Section
>Jet Propulsion Laboratory, Mail Stop 161-213
>4800 Oak Grove Drive
>Pasadena CA 91109
>tel: (818)354-2075
>fax: (818)393-6875
>
>
>
>


From rgb at phy.duke.edu  Fri Feb 18 09:56:22 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 18 Feb 2005 12:56:22 -0500 (EST)
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <E1D2B0l-0004MO-00@mendel.bio.caltech.edu>
References: <E1D2B0l-0004MO-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.58.0502181208430.6001@ganesh.phy.duke.edu>

On Fri, 18 Feb 2005, David Mathog wrote:

> >I was really hoping a decent $200-300 UPS would come to
> > the rescue here. Oh well.
> 
> APC makes Power Distribution Units that can be set to 
> start loads at fixed intervals.  If you had access to a
> 208V/3 phase line the AP7990 costs about $650, uses that
> as input, and outputs 24 120V sockets.  They make quite
> a few of these PDUs in different configurations
> so maybe you can find one that does what you want?

Which brings to mind a safety as well as a practical point -- if you
indeed DO have three phases of power on the three separate circuits of
the room (not as unlikely as it sounds as in that case they may well
share a single common ground wire) then your power options are a little
different, because then you de facto run at a higher voltage and lower
line current, and can in principle get a small boost in what the lines
can safely tolerate by way of delivered power vs power dissipated in the
supply lines as heat if you use such a unit.

However, this then brings up a nasty, thorny issue concerning the power
factor and current draw pattern of "most" (cheap) PC power supplies.
Switching power supplies that are not power factor corrected tend to
draw most of their current only in the middle third of each voltage
half-sinusoid wave (true fact -- read about it on How Things Work
website or any of several websites that discuss computer room wiring,
some of which -- mirus international? -- are linked to some of my
beowulf pages on brahma).

This means several things:

  a) the peak current draw (for a given average power consumed) is much
higher than you'd expect based on simple RMS considerations -- the power
factor of the load is less than 1, if you know what that means.

  b) this causes higher order harmonics to appear in the voltage/current
curves -- in particular 3*60 Hz or 180 Hz (the "edge" frequency of where
the power draws switch on and off).  These voltage/current ripples may
make it past 60 Hz filters to reach internal components.

  c) three phases can share a neutral because three equal loads with
unit power factor (resistive loads like a light bulb) cause the neutral
current to >>cancel perfectly<<, and it can be shown (consider sin(wt) +
sin(wt + 2\pi/3)) that the neutral current has a strict upper bound in
this case that is within the safe limits of the neutral wire if the
loads on each delivery line are themselves safe.  This is NOT TRUE for
switched power supplies sharing a neutral.  In that case the currents
delivered to the neutral wire by each phase >>add<< instead of
cancelling, in three separate chunks per half cycle.  The neutral
current can actually approach 3I where I is the (average) current being
drawn by any single line (already high relative to RMS expectations
based on an assumption of unit power factor; see above).
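
Point c) can be checked numerically. The sketch below is an idealised
model, not a measurement: it assumes a non-PFC supply draws a constant
unit current during exactly the middle third of each half cycle and
nothing otherwise, and compares that with a purely resistive load. The
function names are made up for the illustration.

```python
import math

N = 3600                                          # samples per cycle (0.1 degree steps)
SHIFTS = (0.0, 2 * math.pi / 3, 4 * math.pi / 3)  # three phases, 120 degrees apart


def resistive(theta):
    """Unit power factor load: current follows the voltage sinusoid."""
    return math.sin(theta)


def pulsed(theta):
    """Idealised non-PFC supply: unit current only in the middle third
    of each half cycle, i.e. where |sin| exceeds sin(60 degrees)."""
    s = math.sin(theta)
    if abs(s) > math.sqrt(3) / 2:
        return math.copysign(1.0, s)
    return 0.0


def mean_abs_currents(load):
    """Return (mean |I| on one phase, mean |I| on the shared neutral)."""
    phase_total = 0.0
    neutral_total = 0.0
    for k in range(N):
        theta = 2 * math.pi * (k + 0.5) / N   # sample between the 60 degree edges
        currents = [load(theta - shift) for shift in SHIFTS]
        phase_total += abs(currents[0])
        neutral_total += abs(sum(currents))   # neutral carries the sum of the phases
    return phase_total / N, neutral_total / N


for name, load in (("resistive", resistive), ("pulsed", pulsed)):
    phase, neutral = mean_abs_currents(load)
    print(f"{name:9s} phase {phase:.3f}  neutral {neutral:.3f}")
```

In this model the resistive neutral current averages out to zero, while
the pulsed loads never overlap, so the neutral carries current all the
time and its average magnitude is three times that of a single phase.
Real supplies draw narrower, peakier pulses, so treat the factor of
three as the idealised worst case rather than a prediction.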

This final point is both dangerous and annoying.  It is dangerous
because the neutral line can be carrying enough current to make it much
hotter than permitted by spec assumptions, and if the wiring job is in
any other way marginal, the margins can add and produce a fire (perhaps
during one of those current-draw spikes).  It is annoying because your
CB's will tend to overheat and pop prematurely, your PS's will tend to
overheat and break, your computers will tend to brown out and fail in
mid run due to a voltage ground loop (high backvoltage on your neutral
line relative to prevailing/local/plumbing ground), a condition that is
also actively dangerous on non-ground-fault-protected circuits.  If (as
one might reasonably expect) they are underfusing your line because they
have exceeded the spec length for the gauge of wire that they are using,
then the current carrying neutral ALREADY has a higher resistance than
is technically safe, and all of these conditions are likely to be
exacerbated, possibly into the Danger Zone, especially the ground loop
thing.  Shorting that neutral to local building ground might be very
dangerous indeed.

Note that this state COULD ALREADY EXIST with your wiring, and that the
only way to test it is to measure the hot-to-hot voltage between
circuits, that is:

   v_a - v_b = (should be zero, could be 240 or 208 VAC)
   v_a - v_c = ( ditto )
   v_b - v_c = ( ditto )

where v_a is multimeter probe in VAC mode inserted into hot slot of
circuit a, v_b is other probe inserted into hot slot of circuit b.  If
these measurements are all zero, all three of your circuits have the
same phase (and actually could be combined "safely" into a 30 amp
circuit although you should NEVER DO THIS as there is nothing to prevent
somebody at the distribution panel from moving one of the lines onto a
different phase while rearranging things for some other reason so it
isn't at all safe, actually).

If they are all 208, you have three phase (wye) power and the power is
likely being delivered from the distribution panel as a single cable
with five internal wires, three of them insulated carrying one phase
each, a shared insulated neutral, and a bare ground.  If one pair is
zero and the other two are 240, you have two phase power at the
distribution box, and they are either running three separate lines
(likely) or (possibly) lines with four wires, two "hot" and carrying the
opposed phases, one neutral, and one ground and some other circuit in
your apartment is the partner of the odd line out.
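
The expected readings follow from the phase difference between the two
hot legs: two 120 V sinusoids that differ by an angle d give a
hot-to-hot voltage of 2 * 120 * sin(d/2). The sketch below works that
out and turns the three multimeter readings into a guess at the supply
type. The 10 V tolerance and the function names are arbitrary choices
for the illustration, and it is no substitute for an electrician.

```python
import math


def hot_to_hot(v_line_to_neutral, phase_difference_deg):
    """RMS voltage between two hot legs of equal amplitude.

    V*sin(wt) - V*sin(wt + d) has amplitude 2*V*sin(d/2); the same
    factor applies to the RMS values.
    """
    d = math.radians(phase_difference_deg)
    return 2.0 * v_line_to_neutral * abs(math.sin(d / 2.0))


def classify(v_ab, v_ac, v_bc, tol=10.0):
    """Guess the supply type from three hot-to-hot readings (volts RMS)."""
    lo, mid, hi = sorted((v_ab, v_ac, v_bc))

    def near(v, target):
        return abs(v - target) <= tol

    if near(hi, 0.0):
        return "same phase on all three circuits"
    if near(lo, 208.0) and near(hi, 208.0):
        return "three-phase wye, one phase per circuit"
    if near(lo, 0.0) and near(mid, 240.0) and near(hi, 240.0):
        return "two opposed 120 V legs, two circuits sharing one leg"
    return "unrecognised pattern"


for d in (0, 120, 180):
    print(f"{d:3d} degrees apart: {hot_to_hot(120.0, d):.1f} V")
print(classify(0.0, 240.0, 239.0))
```

With 120 V legs this gives 0 V for the same phase, about 208 V for
legs 120 degrees apart (wye), and 240 V for opposed legs.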

They could be running a lower current limit on the lines because they
exceeded the run length for the gauge of wire they used in these cables.
It is certainly cheaper to re-fuse than it is to put additional primary
or secondary panels with thicker wire and/or additional transformers in
locations from which standard wiring can reach and be within code.  A
lot of state codes prohibit this sort of thing and require a building's
wiring to be brought up to code any time anything is renovated, but it
wouldn't surprise me to learn that NYC either makes a general exception
or that individual landlords grease their way to a local exception.

It is perhaps worth your while to figure this out -- I'd certainly want
to know if it were my cluster.  I'd also be CERTAIN to test the phase
per circuit (is the hot wire really hot and not the wire that is
SUPPOSED to be the neutral wire?).  There are some really "interesting"
things that can happen if you cross connect devices in certain ways
between two miswired circuits, where running on each circuit alone is
safe enough in the sense that nothing breaks.  Interesting like
spattering partially vaporized liquid metal is interesting.  You also
very definitely want to learn about the shared neutral thing, if you're
using stock power supplies.  If it is three phase wye, then you could
think about the APC option above and other things.

Cluster room wiring is nontrivial (even when the "room" is in your
house), and where you are trying to push it to a limit, you're likely
going to have to educate yourself about it.

I think that I put some of this in my online book, in case this is
confusing.  There are also really good discussions of it in the list
archives (where Jim Lux and others made some great contributions).

   rgb

> 
> Another, much cheaper option, would be to set the slave
> node BIOS to use "Wake on LAN" (if it works on your systems)
> and NOT to start on power up.  Then when power came up the
> headnode would boot and the others would just warm up
> enough to listen to their ethernet cards.  I don't
> expect that the start up current going to that mostly off 
> state would be very high, even for 18 computers, since neither
> the disks nor fans start spinning. Once the head node
> comes up you can boot the slaves using etherwake.
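
A rough sketch of the staggered wake-up idea, for anyone who would
rather script it than call etherwake in a loop. It sends the standard
Wake-on-LAN magic packet (six 0xFF bytes followed by the target MAC
repeated 16 times) as a UDP broadcast, which is one common way of
delivering it. The port, broadcast address, delay and MAC addresses
shown are assumptions for the example, not values from this thread.

```python
import socket
import time


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def wake_staggered(macs, delay_s=5.0, broadcast="255.255.255.255", port=9):
    """Wake nodes one at a time, pausing between them to spread the inrush."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for mac in macs:
            sock.sendto(magic_packet(mac), (broadcast, port))
            time.sleep(delay_s)


# Example call from the head node (hypothetical MAC addresses):
# wake_staggered(["00:11:22:33:44:55", "00:11:22:33:44:56"], delay_s=10.0)
```

The delay between nodes is the knob that matters here: it should be
long enough for each node's start-up surge to settle before the next
one is woken.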
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From gropp at mcs.anl.gov  Fri Feb 18 10:29:10 2005
From: gropp at mcs.anl.gov (William Gropp)
Date: Fri, 18 Feb 2005 12:29:10 -0600
Subject: [Beowulf]  MPI ABI (Was Re: Re: Home beowulf - NIC
  latencies )
In-Reply-To: <20050218143823.8694B1D9B3@amd64.cownie.net>
References: <Pine.LNX.4.58.0502112034370.1141@terra.mcs.anl.gov>
	<web-520595@free.net>
	<20050214190737.GB1359@greglaptop.internal.keyresearch.com>
	<42112719.4060500@verarisoft.com>
	<Pine.LNX.4.58.0502142253450.1141@terra.mcs.anl.gov>
	<4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de>
	<4212070C.9050207@verarisoft.com>
	<20050218143823.8694B1D9B3@amd64.cownie.net>
Message-ID: <6.2.1.2.2.20050218121850.04c47cd0@pop.mcs.anl.gov>

At 08:38 AM 2/18/2005, James Cownie wrote:
>...
>
>If an ABI for MPI is so important to you and of such value to your (and
>Patrick's) clients, then there's nothing to stop you from formulating
>such a standard, or, at least starting a project to create one.
>

I wrote a paper that appeared in the EuroPVMMPI'02 meeting that discusses 
the issues of a common ABI.  The paper is "Building Library Components That 
Can Use Any MPI Implementation" and is available as 
http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/gmpishort.pdf .  This 
paper was relatively pragmatic and discussed an approach that allowed the 
user to link the same object files against two MPI libraries (MPICH and 
LAM/MPI were used in the example).  There are a few remaining tricky issues 
for handling the MPI opaque objects (specifically, how big are they) and 
the size and layout of MPI_Status (different implementations use different 
sizes, and the user is allowed to use an array of MPI_Status).  There are 
also some very minor tradeoffs in performance in the solution presented in 
the paper, but these probably aren't important in the context of clusters, 
and are likely to be smaller than requiring implementations to translate 
between internal and external representations.

The web site mentioned in the paper is out-of-date, mostly because there 
wasn't much interest in a (nearly) common ABI at the time.  I can make it 
available if there is interest now.

Bill


William Gropp
http://www.mcs.anl.gov/~gropp 


From James.P.Lux at jpl.nasa.gov  Fri Feb 18 10:30:39 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Fri, 18 Feb 2005 10:30:39 -0800
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <3.0.32.20050218173640.0109a100@pop.xs4all.nl>
References: <3.0.32.20050218173640.0109a100@pop.xs4all.nl>
Message-ID: <6.1.1.1.2.20050218102305.041ca4a0@mail.jpl.nasa.gov>

At 08:36 AM 2/18/2005, Vincent Diepeveen wrote:

> >>> > Complete academic nonsense calculation. If you use quite some
>electricity
> >>> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10
> >>> > reduction in usage bill is pretty easy if you negotiate properly.
> >
> >Just where do you live that such negotiations are possible. Here's some
>
>You aren't going to negotiate about a single small room with a few
>lightbulbs obviously.
>
>We're talking about huge usage, like if all supercomputers are located at 1
>central spot and the entire institute with thousands of working places gets
>powered in a central way, and usually electricity offering companies aren't
>going to put online their rates for reduced usage, as that would give them
>a bad negotiation starting point :)

In the U.S., at least, electricity at the retail level is fairly regulated, 
and anyone selling electricity must post official "tariffs" that give the 
rates and so forth. A significant part (perhaps 20-30%) of the "as 
delivered" cost of electricity is the amount you pay for the transmission 
and distribution system (all those HV power lines, etc.), which is, again, 
somewhat regulated (or at least, the past pricing data is readily 
available, by law and regulation).


The significant exception to this might be if you co-locate with an 
independent generator of power, in which case there's no interconnection to 
the grid.  But, even in this case, if your generator interties with the 
rest of the system, so you're not the only customer, then your supply will 
be affected by the fluctuations in the supply and demand on the overall 
grid.  In general, the price that a generator is paid is almost totally 
unregulated (unlike retail rates, which is what caused the problems in 
California a few years back), and so, while you may be able to negotiate a 
very low price for power for some times of day, etc., you'll probably also 
get an "interruptible load" clause in the contract, or you'll have to pay 
high rates at other times.


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From hahn at physics.mcmaster.ca  Fri Feb 18 20:25:28 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Fri, 18 Feb 2005 23:25:28 -0500 (EST)
Subject: [Beowulf] Mare Nostrum (not quite COTS)
In-Reply-To: <20050216103153.GH1404@leitl.org>
Message-ID: <Pine.LNX.4.44.0502182259350.1788-100000@coffee.psychology.mcmaster.ca>

> ranked number four in the world in speed in November 2004, is constructed of
> such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit

it's funny how "off the shelf" means different things to different people.
I consider blades to be a qualitatively different category of hardware
than, say, a tyan motherboard in an AICPC chassis.  AFAICT, blades still
run at a premium vs "normal" servers from the same vendor (which is also
at a premium vs whitebox.)

> The thinking behind MareNostrum's construction represents a new way of
> looking at these and other compute-intensive areas. Today's typical
> high-performance computing installation runs a large, parallel RISC-based
> UNIX system with performance instead of reliability being of utmost
> importance.

egads, did the author of this really believe it?!?

> computer density in the industry, which results in high performance with a
> small footprint. The BladeCenter technology allows for 84 dual processor
> servers in a single 42 U rack,

uh, no, sorry, maybe someone should have fact-checked this.  HP has 
both xeon and opteron-based blades which put 96 duals in a rack.
it would be rather shocking if HP didn't jump on dual-core opterons, too...

> giving more than 1.4 teraflops of compute
> power in a single rack.

fused-mul-add is really a wonderful marketing tool, isn't it?

> When the power of MareNostrum is unleashed later this year, it will be at the

hmm, to me, the bigger the computer, the more money is evaporating
for every day between delivery and full user utilization.


From hahn at physics.mcmaster.ca  Fri Feb 18 21:18:31 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sat, 19 Feb 2005 00:18:31 -0500 (EST)
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <Pine.LNX.4.58.0502161146060.883@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.44.0502182336580.1788-100000@coffee.psychology.mcmaster.ca>

> > room that takes up half the basement the cost of the electricity
> > (both for power and A/C) goes from  a minor expense to a major
> > one.

but it's not as if the university will fail to notice a new 1K-node cluster.
so if they choose not to install metering equipment, it's on them.
we recently built a machineroom for 1.5Kcpus, and sure enough, the U
made us buy metering crap (IMO, the U should pay for it.)  the U also
made us buy harmonic mitigating PDUs because they didn't understand 
what PFC means in a power supply.

> > "Typical" lab usage is widely variable but I'd be amazed
> > if most biology or chemistry labs burn through even 1/10th this
> > much for the equivalent lab area.  Some physics lab running
> > a tokamak might come close.

I guess, I figure around 300W/sq ft is pretty typical for a new HPC
machineroom.  that's obviously more than a bio lab, but how about 
comparing to that glassblower over in arts, or someone pouring a new
alloy in materials engineering?

> > Anyway, the question is, have any of the universities said "enough
> > is enough" and started charging these electricity costs directly?
> > If so, what did they use for a cutover level, where usage was
> > "above and beyond" overhead?
> 
> This issue has most definitely come up at Duke, although we're still
> seeking a formula that will permit us to deal with it equitably.  This

I guess I'm a little surprised - one main Canadian funding agency
(CFI) has a program that provides infrastructure operating funds
(you apply for this after being awarded a capital grant.)  it pays
for electricity as well as sysadmins.

> As our Dean of A&S recently remarked, if there aren't any checks and
> balances or cost-equity in funding and installing clusters, they may
> well continue to grow nearly exponentially, without bound (Duke's

I find that most faculty who have compute needs (and funding) will
seriously consider buying into a shared facility instead.  that's our
(SHARCnet's) usual pitch: let us help you spend your grant, and we'll
give you first cut at that resource, but otherwise take all the pain
off your hands.  most people realize that running a cluster is a pain:
heat/noise, but more importantly the fact that it soaks up tons of 
the most expensive resource, human attention.  do you want your grad
students spending even 20% of their time screwing around with cluster
maintenance?

not to mention the fact that most computer use is bursty, and therefore
very profitably pooled.  a shared resource means that a researcher 
can burst to 200 cpus, rather than just the 20 that his grant might 
have bought.  and after the burst, someone else can use them...

> to hold them, the power to run them, and the people to operate them, all
> grow roughly linearly with the number of nodes.  This much is known.

well, operator cost scales linearly, but that line certainly does not 
pass through zero, and is nearly flat (a 100p cluster takes almost the 
same effort as a 200p one.)

> Finding out isn't trivial -- it involves running down ALL the clusters
> on campus, figuring out whom ALL those nodes "belong" to, determining
> ALL the grant support associated with all those people and projects and

here at least, the office of research services sees all funding traffic,
and so is sensitized to the value of cluster pooling.  the major funding
agencies have also expressed some desire to see more facility centralization,
since the bad economics of a bunch of little clusters is so clear...

regards, mark hahn.


From srgadmin at cs.hku.hk  Thu Feb 17 21:41:23 2005
From: srgadmin at cs.hku.hk (srg-admin)
Date: Fri, 18 Feb 2005 13:41:23 +0800
Subject: [Beowulf] CCGrid 2005: CALL FOR PARTICIPATION
Message-ID: <42158003.9040708@cs.hku.hk>


Apologies if you received multiple copies of this message.

------------------------------------------
               CLUSTER COMPUTING AND GRID
                   (CCGrid 2005)
         http://www.cs.cf.ac.uk/ccgrid2005/
                      9-12 May 2005
                           Cardiff, UK
--------------------------------------------

*******************************************************
          CALL FOR PARTICIPATION
*******************************************************

SCOPE
=====

Commodity-based clusters and Grid computing technologies are rapidly
developing, and are key components in the emergence of a novel
service-based fabric for high capability computing. Cluster-powered
Grids not only provide access to cost-effective problem-solving power,
but also promise to enable a more collaborative approach to the use of
distributed resources, and new economic products and services.
CCGrid2005, sponsored by the IEEE Computer Society is designed to bring
together international leaders who are pioneering researchers,
developers, and users of clusters, networks, and Grid architectures and
applications. The symposium will also serve as a forum to present the
latest work, and highlight related activities from around the world.
CCGrid2005 is interested in topics including, but not limited to:

o    Hardware and Software (based on PCs, Workstations, SMPs or
     Supercomputers)
o    Middleware for Clusters and Grids
o    Dynamic Optical Network Architectures for Grid Computing
o    Parallel File Systems, including wide area file systems, and
     Parallel I/O
o    Scheduling and Load Balancing
o    Programming Models, Tools, and Environments
o    Performance Evaluation and Modeling
o    Resource Management and Scheduling
o    Computational, Data, and Information Grid Architectures and Systems
o    Grid Economies, Service Architectures, and Resource Exchange
     Architectures
o    Grid-based Problem Solving Environments
o    Scientific, Engineering, and Commercial Grid Applications
o    Portal Computing / Science Portals

PROGRAMME
=========

The conference will contain:

o  Over 75 papers in the main track (33% Acceptance Rate)

o  9 workshops:  Chair: Craig Lee (Aerospace Corporation, US)

  Workshop 1: Collaborative and Learning Applications of Grid Technology
  Organisers:
      Oscar Ardaiz-Villanueva, Technical University Catalunya, Spain
      Miguel L. Bote-Lorenzo, University of Valladolid, Spain
      Amy Apon, University of Arkansas, US
      Barry Wilkinson, Western Carolina University, US

  Workshop 2: Cluster-Sec 2005: Cluster Security -- The Paradigm Shift
  Organiser: William Yurcik, NCSA, US

  Workshop 3: Semantic Infrastructure for Grid Computing Applications
  Organisers:
      Line C. Pouchard, Oak Ridge National Lab, US
      Luc Moreau, University of Southampton, UK
      Valentina Tamma, University of Liverpool, UK

  Workshop 4: Fifth International Workshop on Global and Peer-2-Peer
  Computing: "Theory and Experience of Desktop Grids and P2P systems"
  Organisers:
      Franck Cappello, INRIA, France
      Adriana Iamnitchi, Duke University, US  
      Mitsuhisa Sato, Tsukuba University, Japan
 
  Workshop 5: DSM 2005: Fifth International Workshop on Distributed
  Shared Memory
  Organisers:
      Laurent Lefevre, INRIA RESO/LIP, France
      Michael Schoettner, University of Ulm, Germany

  Workshop 6: GAN'05: Third Workshop on Grids and Advanced Networks
  Organisers:
      Laurent Lefevre, INRIA RESO/LIP, France
      Pascale Primet, INRIA RESO/LIP, France
 
  Workshop 7: Bio-Medical Computations on the Grid (BioGrid)
  Organisers:
      Chun-Hsi Huang, University of Connecticut, US
      Sanguthevar Rajasekaran, University of Connecticut, US

  Workshop 8: 1st International Workshop on Grid Performability
  Organisers:
      Nigel Thomas, University of Newcastle upon Tyne, UK
      Stephen Jarvis, University of Warwick, UK

  Workshop 9: Workshop on Agent-based Grid Economics (AGE-2005)
  Organisers:
      Daniel Veit, University of Karlsruhe, Germany
      Bjoern Schnizler, University of Karlsruhe, Germany

o  4 Tutorials (31% acceptance rate -- 13 tutorial proposals were
   submitted)

  Chair: Michael Gerndt (TUM, Germany)

  Tutorial 1:  High Performance I/O for Scientific Applications
  Robert Latham
  Mathematics and Computer Science Division, Argonne National Lab, IL,
  USA

  Tutorial 2: Practical Performance Measurement and Analysis of Parallel
  Programs on Clusters (and Grids)
  Bernd Mohr
  Forschungszentrum Juelich, Zentralinstitut fuer Angewandte Mathematik,
  52428 Juelich, Germany

  Tutorial 3: Grid Computing Security - Issues, Concerns and
  Counter-measures
  Anirban Chakrabarti
  Grid Computing Focus Group, Software Engineering Technology Lab,
  Infosys Technologies, Electronics City, Bangalore, Karnataka 560100,
  India

  Tutorial 4: The Gridbus Toolkit: Creating and Managing Utility Grids
  for eScience and eBusiness Applications
  Rajkumar Buyya
  Senior Lecturer and StorageTek (USA) Fellow of Grid Computing
  Grid Computing and Distributed Systems (GRIDS) Lab
  Dept. of Computer Science and Software Engineering, The University of
  Melbourne, ICT Building, 111, Barry Street, Carlton, Melbourne, VIC
  3053, Australia

o A poster session with over 16 posters
   Posters Chair: Yan Huang (Cardiff University, UK)

o Two General Keynote Talks:

   Talk 1: "e-Science, Cyberinfrastructure and Web Service Grids"
   Professor Tony Hey, University of Southampton/EPSRC, UK

   Talk 2: "Experiences with System X"
   Professor Srinidhi Varadarajan, Virginia Tech, US

o An Industry Track
  Chair: Alistair Dunlop (OMII, UK)
  Industry Keynote: "WS-Agreement"
  Heiko Ludwig, IBM T.J.Watson Research Centre, US

o A "Work in Progress" Session


REGISTRATION DATES
==================
http://www.cs.cf.ac.uk/ccgrid2005/registration.html

Early bird registration  : March 21, 2005
Accommodation (cut-off) date: March 21, 2005


SPECIAL EVENT
=============
Conference Banquet will be at the National Museum and Galleries
of Wales (within walking distance of the conference venue and hotels).
More details at: http://www.nmgw.ac.uk/nmgc/

==========
Honorary Chair
--------------
    Tony Hey, EPSRC, UK

Conference Chairs
-----------------
    David W. Walker, Cardiff University, UK
    Carl Kesselman, USC/ISI, US

Programme Committee Chair
-------------------------
    Omer F. Rana, Cardiff University, UK

Programme Committee Vice-Chairs
-------------------------------
    Jack Dongarra, University of Tennessee, US
    Luc Moreau, University of Southampton, UK
    Sven Graupner, HP Labs, US
    Peter Sloot, University of Amsterdam, The Netherlands
    Craig Lee, The Aerospace Corporation, US

Publications Chair: Rajkumar Buyya, University of Melbourne, Australia
Workshops Chair: Craig Lee, Aerospace Corporation, US
Tutorials Chair : Michael Gerndt, TU Munich, Germany
Industry Track Chair: Alistair Dunlop, OMII, UK
Exhibits Chair: Steven Newhouse, OMII, UK
Posters Chair : Yan Huang, Cardiff University, UK
Finance Chair: John Oliver, Welsh eScience Centre, UK
Registration Chair : Tracey Lavis, Cardiff University, UK
Local Arrangements Chair: Linda Wilson, Welsh eScience Centre, UK

Publicity Chairs
----------------
    Vladimir Getov, University of Westminster, UK (Europe)
    Marcin Paprzycki, Oklahoma State University, US (Europe)
    Cho-Li Wang, University of Hong Kong (Asia Pacific)
    Ken Hawick, Massey University, New Zealand (Asia Pacific)
    Manish Parashar, Rutgers University, US (America)


PROGRAMME COMMITTEE
-------------------
Thierry Priol, IRISA, France
Seif Haridi, KTH Stockholm, Sweden
Bruno Schulze, Laboratório Nacional de Computação Científica, Brazil
David Abramson, Monash University, Australia
Steven Willmott, Universitat Politècnica de Catalunya, Spain
Xian-He Sun, Illinois Institute of Technology, US
Yun-Heh (Jessica) Chen-Burger, University of Edinburgh, UK
Thilo Kielmann, Vrije Universiteit, The Netherlands
Brian Matthews, RAL/CCLRC and Oxford Brookes University, UK
Maozhen Li, Brunel University, UK
Greg Astfalk, HP Labs, US
Marty Humphrey, University of Virginia, US
Geoffrey Fox, University of Indiana, US
Martin Berzins, University of Leeds, UK
Hai Jin, Huazhong University of Science and Technology, China
Giovanni Chiola, Universita' di Genova, Italy
Domenico Talia, Universita' della Calabria/ICAR-CNR, Italy
José Cunha, Universidade Nova de Lisboa, Portugal
Ron Perrott, Queens University Belfast, UK
Ewa Deelman, ISI/USC, US
Stephen Jarvis, Warwick University, UK
Niclas Andersson, Linköping University, Sweden
Putchong Uthayopas, Kasetsart University, Thailand
John Morrison, University College Cork, Ireland
Stephen Scott, Oak Ridge National Lab, US
Luciano Serafini, ITC-IRST, Italy
David A. Bader, University of New Mexico, US
Mark Baker, University of Portsmouth, UK
Emilio Luque, Universitat Autònoma de Barcelona, Spain
Akhil Sahai, HP Labs, US
Gregor von Laszewski, Argonne National Lab, US
Fethi Rabhi, University of New South Wales, Sydney, Australia
Fabrizio Petrini, Los Alamos National Lab, US
Kate Keahey, Argonne National Lab, US
Sergei Gorlatch, Universität Münster, Germany
Brian Tierney, Lawrence Berkeley National Lab, US
Rauf Izmailov, NEC Labs, US
Stephen J. Turner, Nanyang Technological University, Singapore
Savas Parastatidis, University of Newcastle, UK
Elias Houstis, University of Thessaly, Greece -- and Purdue University, US
Karl Aberer, EPFL, Switzerland
Rolf Hempel, DLR, Germany
Anne Elster, NTNU, Norway
Artur Andrzejak, Zuse Institute Berlin, Germany
Jennifer Schopf, Argonne National Laboratory, US
John Gurd, University of Manchester, UK
Domenico Laforenza, ISTI/CNR, Italy
Wolfgang Rehm, TU Chemnitz, Germany
Gabriel Antoniu, IRISA, France
Beniamino di Martino, Seconda Universita di Napoli, Italy
Frank Z. Wang, Cranfield University, UK
Daniel S. Katz, JPL/Caltech, US
Nigel Thomas, University of Newcastle, UK
Moustafa Ghanem, Imperial College, London, UK
Kenneth Hurst, JPL/Caltech, US


From eno at dorsai.org  Thu Feb 17 23:25:22 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Fri, 18 Feb 2005 02:25:22 -0500
Subject: [Beowulf] powering up 18 motherboards
Message-ID: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net>

I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I
suppose it should be 15amp circuits but every circuit breaker in the box is
clearly a 10. This is an old house, seems like any renovations over the
years have been only for aesthetics. The wiring in the walls is probably
disintegrating - that would explain why the new looking circuit breakers are
rated for 10 amps.

I think I can get use of 3 circuits which gives me some room to play with
all the nodes and hopefully the assortment of switches and power supply. I
have to figure out what the draw will be on the rest of the equip. Now where
the hell am I going to plug in this air conditioner????

Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds
like a job for an electrician? Thanks for the help guys.

Alpay

-----Original Message-----
From: Bari Ari [mailto:bari at onelabs.com] 
Sent: Thursday, February 17, 2005 8:00 PM
To: Jim Lux
Cc: Dean Johnson; Alpay Kasal; beowulf at beowulf.org
Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards

Jim Lux wrote:

> A 10 amp circuit would be highly unusual in the U.S., but might be 
> common practice elsewhere.  In the U.S., a 15 amp circuit is standard.

I thought this was odd when I first read this as well. This may be a 
case where to save dollars or in rehabbing old buildings where you may 
have more than 3 current carrying conductors in one raceway and you have 
to derate the current protection. In this case it may be that they ran 
more than three #14 current carrying conductors (as defined by the 
NEC)in the same raceway and had to derate the usual 15 amp circuit 
protection down to 10 amps.

-Bari Ari


From emac at cybergps.net  Thu Feb 17 23:36:34 2005
From: emac at cybergps.net (Eric Machala)
Date: Fri, 18 Feb 2005 02:36:34 -0500
Subject: [Beowulf] powering up 18 motherboards
References: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net>
Message-ID: <00ed01c5158c$92df7550$6e45a8c0@masstivy>

This is not true. The UPS will never draw more from the wall than it is
set to. A UPS has a set trickle (charge) rate, which can be set to a
cost-effective level. The exception is when the load has overloaded it,
in which case it charges at double the trickle rate because the system
wants to restore a full battery before the next loss of power. I could
get the actual specs on this, but for your actual load, if you were at
50-65% load, I'm sure there would be no spikes in power draw over the
normal trickle.
----- Original Message ----- 
From: "Alpay Kasal" <eno at dorsai.org>
To: "'Jim Lux'" <James.P.Lux at jpl.nasa.gov>; "'Dean Johnson'" 
<dtj at uberh4x0r.org>
Cc: <beowulf at beowulf.org>
Sent: Friday, February 18, 2005 12:43 AM
Subject: RE: [Beowulf] powering up 18 motherboards


> James Lux, thanks for the extremely useful explanation. Btw, I'm in
> Brooklyn, NY. 120volts, 60cycles, regular AC power. I don't know the gauge
> of the wiring in the walls but (as mentioned in another response just now)
> I suspect it is old wiring and is the reason for the strange 10amp circuit
> breakers.
>
> I looked at the x10 modules. Seems like it could be very useful, just
> script all of them from my headend. For now I'm going to try to handle the
> power-on sequence myself. I figured I could steal 3 10amp circuits from the
> house.
>
> Follow me... Turn on 4 nodes (on 1 strip) which will peak at 5.2amps. let
> that settle down to a steady 3.48amps and hit another strip of 4 nodes.
> Total draw while the 2nd batch is starting is 8.68amps. It should steady
> at 6.96. I then have room to turn 1 more node on. Then one more after
> that. A 4 step process to get 10 nodes powered up without going over
> 10amps. Perform the same exact steps on a 2nd circuit. Annoying but
> possible without spending anymore money.
>
> I was really hoping a decent $200-300 UPS would come to the rescue here.
> Oh well.
>
> I just had a thought... I planned on making use of wake-on-lan. I can just
> start sending jobs to the whole network though if all of it is asleep, I'd
> have to still be careful of the powerup-sequence. Grrrr. Maybe a script to
> perform WOL before starting any number crunching.
>
> Boy did I take nice big fat electrical lines for granted in the past!
>
> Alpay
>
>
> -----Original Message-----
> From: Jim Lux [mailto:James.P.Lux at jpl.nasa.gov]
> Sent: Thursday, February 17, 2005 7:18 PM
> To: Dean Johnson; Alpay Kasal
> Cc: beowulf at beowulf.org
> Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards
>
> No, the UPS won't help.  It might make things worse, because as you flip
> on all that load, the voltage will sag, causing the UPS to turn on, which
> then might trip from the overcurrent (assuming you're not out buying a 2kW
> UPS).
>
> You could use the X-10 type (aka Plug n Power) remote controlled relays
> (don't use Lamp modules.. you need Appliance modules, which are relays
> inside).
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
> 


From ole at scali.com  Fri Feb 18 02:20:04 2005
From: ole at scali.com (Ole W. Saastad)
Date: Fri, 18 Feb 2005 11:20:04 +0100
Subject: [Beowulf] Re: Home beowulf - NIC latencies
In-Reply-To: <200502162001.j1GK0vgt019156@bluewest.scyld.com>
References: <200502162001.j1GK0vgt019156@bluewest.scyld.com>
Message-ID: <1108722004.16480.36.camel@pc-2.office.scali.no>

Hi,
with all the argument about performance of so called Swiss
Army Knife (SAK) MPIs I have uploaded four runs of the HPCC 
benchmark to the HPCC benchmark web site to show the performance 
of a single cluster with a single SAK MPI running with four different
Interconnects, GigaBit Ethernet (tcp), SCI, Myrinet and InfiniBand.

The results can be found at :

http://icl.cs.utk.edu/hpcc/

Look for the Dell PowerEdge 2650 cluster with 32 CPUs.

The results show that it is possible to have a SAK MPI that 
shows acceptable performance across multiple interconnects.  

The SAK MPIs are of great value for the application
vendors as they are free from the extra work involved with 
a new MPI implementation for every interconnect. 
In addition an application can be moved without changes 
from one cluster with one interconnect to another cluster 
with yet another interconnect.


-- 
Ole W. Saastad, Dr.Scient. 
Manager Cluster Expert Center
dir. +47 22 62 89 68
fax. +47 22 62 89 51
mob. +47 93 05 74 87
ole at scali.com

Scali - www.scali.com
High Performance Clustering


From vaidya.anand at gmail.com  Fri Feb 18 19:25:19 2005
From: vaidya.anand at gmail.com (vaidya.anand at gmail.com)
Date: Fri, 18 Feb 2005 19:25:19 -0800
Subject: [Beowulf] Entertaining article on cooling hot processors on IBM
	developerWorks
Message-ID: <200502181925.20179.ar3107@gmail.com>

http://www-106.ibm.com/developerworks/library/pa-chipschall5/?ca=dnl


From bari at onelabs.com  Fri Feb 18 06:28:37 2005
From: bari at onelabs.com (Bari Ari)
Date: Fri, 18 Feb 2005 08:28:37 -0600
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net>
References: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net>
Message-ID: <4215FB95.5030908@onelabs.com>

Alpay Kasal wrote:

> Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds
> like a job for an electrician? Thanks for the help guys.

Electrical codes don't allow paralleling conductors to increase the 
current handling capacity for small circuits like yours. It looks like 
you'll have to try an idea like Jim's and use a staggered startup or 
install a new 30A circuit. If you look around Brooklyn you'll see lots 
of this done on the outside of homes in order to run large window A/C's.
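
The staggered startup can be worked through numerically. What follows is 
only a rough Python sketch, not a tested procedure. It assumes the per-node 
figures implied by Alpay's measurements (4 nodes peaking at 5.2 A and 
settling to 3.48 A, so about 1.3 A peak and 0.87 A steady per node) and a 
10 A breaker. The name plan_startup and the max_batch parameter (nodes per 
power strip) are made up for illustration. Currents are kept in integer 
milliamps to avoid rounding surprises.

```python
def plan_startup(total_nodes, limit_ma=10_000, peak_ma=1_300,
                 steady_ma=870, max_batch=None):
    """Greedy staggered power-up plan for one circuit.

    Returns a list of batch sizes. Each batch is switched on only after
    the previous one has settled to its steady draw, and is sized so the
    total draw during its startup peak stays strictly under the limit.
    """
    batches = []
    running = 0  # nodes already up and drawing steady current
    while running < total_nodes:
        remaining = total_nodes - running
        cap = remaining if max_batch is None else min(remaining, max_batch)
        batch = 0
        # Grow the batch while the combined draw still fits under the limit.
        while (batch < cap and
               running * steady_ma + (batch + 1) * peak_ma < limit_ma):
            batch += 1
        if batch == 0:
            break  # no headroom left on this circuit
        batches.append(batch)
        running += batch
    return batches


# 18 nodes, at most 4 per strip, on a single 10 A circuit.
print(plan_startup(18, max_batch=4))
```

With these assumed figures the plan stops at 10 nodes per circuit, which 
agrees with the 10-node count given earlier in the thread. Real breakers 
and power supply inrush are less tidy than this model, so leave margin.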

-Bari


From camm at enhanced.com  Fri Feb 18 07:26:46 2005
From: camm at enhanced.com (Camm Maguire)
Date: 18 Feb 2005 10:26:46 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <Pine.LNX.4.58.0502170833510.3892@lilith.rgb.private.net>
References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>
	<Pine.LNX.4.58.0502161245530.883@ganesh.phy.duke.edu>
	<20050217000455.GF2018@greglaptop.internal.keyresearch.com>
	<Pine.LNX.4.58.0502170833510.3892@lilith.rgb.private.net>
Message-ID: <54k6p5nb2h.fsf@intech19.enhanced.com>

Greetings!

"Robert G. Brown" <rgb at phy.duke.edu> writes:

> But maybe this is all too complicated, or doesn't belong in the standard
> per se.  It is indeed like the ATLAS thing, but then, I think that ATLAS
> is sheer genius although it is also cumbersome and clunky to build...;-)
> I just dream of the day that ATLAS-like runtime optimization isn't so
> clunky and is based on tools that create tables of microbenchmark
> numbers that ARE sufficiently accurate and rich to achieve
> near-optimization without running a build loop that sweeps and searches
> a high-dimensional space...:-)
> 

I'm of the opinion that the bulk of the benefit can be had by
providing alternate binaries tuned for the coarse grained cpu
differentials - isa extension, cache, etc. -- coupled with some smarts
in ld.so to select the proper version at runtime depending on the
running cpu.  We do something like this, (alternate isa extension
only), in the Debian atlas packages, which are now at the base of
quite a large application tree -- no recompilation required.
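
The selection logic described above can be illustrated with a short Python 
sketch. This is not how ld.so or the Debian atlas packages are implemented; 
it only shows the idea of choosing the most specific binary variant that 
the running cpu supports. The directory names and the VARIANTS table are 
hypothetical, and the "flags" field is an assumption that holds for x86 
Linux /proc/cpuinfo output.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical variants, most specific first: (required flags, directory).
VARIANTS = [
    ({"sse", "sse2"}, "/usr/lib/sse2"),
    ({"sse"}, "/usr/lib/sse"),
    ({"3dnow"}, "/usr/lib/3dnow"),
    (set(), "/usr/lib"),  # generic fallback, always usable
]


def pick_library_dir(flags, variants=VARIANTS):
    """Return the first directory whose required flags are all present."""
    for required, directory in variants:
        if required <= flags:
            return directory
    raise LookupError("no usable variant")


sample = "processor\t: 0\nflags\t\t: fpu mmx sse sse2\n"
print(pick_library_dir(cpu_flags(sample)))
```

Because the table is ordered from most to least specific, an application 
linked against the generic name picks up the tuned variant without being 
recompiled, which is the property described above.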

Am always appreciative of other thoughts on these matters.

Take care,


>   rgb
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 

-- 
Camm Maguire			     			camm at enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah


From streich at uwm.edu  Fri Feb 18 08:44:16 2005
From: streich at uwm.edu (streich at uwm.edu)
Date: Fri, 18 Feb 2005 10:44:16 -0600
Subject: [Beowulf] A hello, and an introduction
Message-ID: <1108745056.42161b60d311c@panthermail.uwm.edu>

Hello all,

I'm new to the list and just thought I'd introduce myself, as I will probably
be posting to the list a bit.  I'm a system administrator for a Beowulf
cluster at UW-Milwaukee.  It's a 22 node 2.4GHz Intel cluster running Linux
that is dedicated to studying clouds (using wrf and COAMPS (MPI based
software)).  It's a student job, and a lot of fun.  I'm a Computer Science
major, and have all the Computer Science courses done (just have a few math
classes left).

I'm starting to think about Grad school and Master's Thesis stuff, though that
is a little way off.  Along this vein, if anyone has any suggestions as to hot
research topics that a CS major with access to a few spare clock cycles on a
Beowulf cluster might be interested in, please feel free to send them to me. ;)

I've only been admin-ing the cluster for about a year, so I don't know how much
I'll be able to help people with questions...  But I may throw an idea out once
in a while.  I suppose here I may be asking more than answering the questions,
as it seems a lot of you have quite a bit of experience with larger clusters.

- Jeremy


From mit2005 at vreme.yubc.net  Fri Feb 18 08:57:09 2005
From: mit2005 at vreme.yubc.net (IPSI-2005 Italy and USA)
Date: Fri, 18 Feb 2005 17:57:09 +0100
Subject: [Beowulf] Invitation to Italy and USA 2005; c/bb
Message-ID: <200502181657.j1IGv9uW024833@vreme.yubc.net>

Dear potential Speaker:

On behalf of the organizing committee, I would like to extend a cordial invitation for you to attend one or both of the upcoming IPSI BgD multidisciplinary, interdisciplinary, and transdisciplinary conferences.

The first one will take place in Cambridge, Massachusetts, USA:

IPSI-2005 USA
Hotel at MIT, Cambridge (arrival: 7 July 05 / departure: 10 July 05)
New deadlines: 20 February 05 (abstract) / 15 April 05 (full paper)

The second one will take place in Loreto Aprutino, Italy:

IPSI-2005 ITALY
Hotel Castello Chiola (arrival: 27 July 05 / departure: 1 August 05)
New deadlines: 20 February 05 (abstract) / 15 April 05 (full paper)

All IPSI BgD conferences are non-profit. They bring together the elite of the world science; so far, we have had seven Nobel Laureates speaking at the opening ceremonies. The conferences always take place in some of the most attractive places of the world. All those who come to IPSI conferences once, always love to come back (because of the unique professional quality and the extremely creative atmosphere); lists of past participants are on the web, as well as details of future conferences.

These conferences are in line with the newest recommendations of the US National Science Foundation and of the EU research sponsoring agencies, to stress multidisciplinary, interdisciplinary, and transdisciplinary research (M+I+T++ research). The speakers and activities at the conferences truly support this type of scientific interaction.

One of the main topics of this conference is "E-education and E-business with Special Emphasis on Semantic Web and Web Datamining"

Other topics of interest include, but are not limited to:

* Internet
* Computer Science and Engineering
* Mobile Communications/Computing for Science and Business
* Management and Business Administration
* Education
* e-Medicine
* e-Oriented Bio Engineering/Science and Molecular Engineering/Science
* Environmental Protection
* e-Economy
* e-Law
* Technology Based Art and Art to Inspire Technology Developments
* Internet Psychology

If you would like more information on either conference, please reply to this e-mail message.

If you plan to submit an abstract and paper, please let us know immediately for planning purposes. Note that you can submit your paper also to the IPSI Transactions journal.

Sincerely Yours,

Prof. V. Milutinovic, Chairman,
IPSI BgD Conferences


* * * CONTROLLING OUR E-MAILS TO YOU * * *

If you would like to continue to be informed about future IPSI BgD conferences, please reply to this e-mail message with a subject line of SUBSCRIBE.

If you would like to be removed from our mailing list, please reply to this e-mail message with a subject line of REMOVE.


From laytonjb at charter.net  Sat Feb 19 04:53:55 2005
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sat, 19 Feb 2005 07:53:55 -0500
Subject: [Beowulf] Mare Nostrum (not quite COTS)
In-Reply-To: <Pine.LNX.4.44.0502182259350.1788-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502182259350.1788-100000@coffee.psychology.mcmaster.ca>
Message-ID: <421736E3.3050001@charter.net>

Mark Hahn wrote:

>>When the power of MareNostrum is unleashed later this year, it will be at the
>>    
>>
>
>hmm, to me, the bigger the computer, the more money is evaporating
>for every day between delivery and full user utilization.
>  
>

This is a very interesting comment and one I agree with.
Does anyone care to post numbers of some sort regarding
the size of the cluster (and the size of the storage
system) and the time to get the system up and stabilized?

Thanks!

Jeff


From hahn at physics.mcmaster.ca  Sat Feb 19 06:23:20 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sat, 19 Feb 2005 09:23:20 -0500 (EST)
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <E1D2B0l-0004MO-00@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.44.0502181953230.1433-100000@coffee.psychology.mcmaster.ca>

> Another, much cheaper option, would be to set the slave
> node BIOS to use "Wake on LAN" (if it works on your systems)

I really, really like lan-IPMI.  not only do you get a nice way
to turn on/off/reset machines remotely, but you also can query
their internal sensors.  not to mention that it's an open standard.
my main experience with it is a cluster of HP DL145's, which are 
rebadged Celestica white-ish-box dual-opterons.  I've heard that 
lan-IPMI is also available for real whitebox (tyan, supermicro),
but have never managed to get hands on. 

IPMI-like functionality is one of those odd places where customers
are often hurt by vendors who produce proprietary, less-interoperable
mechanisms.  sort of embrace-extend-bastardize-rebrand.  product teams
should not listen to marketing/business people...
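
for comparison, the "Wake on LAN" option quoted at the top needs nothing 
more than a UDP broadcast.  the magic packet is six 0xFF bytes followed by 
the target MAC address repeated sixteen times.  a minimal Python sketch, 
assuming a made-up MAC address and the commonly used (but not mandatory) 
UDP port 9; the function names are illustrative only:

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # Six 0xFF bytes, then the MAC repeated 16 times: 102 bytes in total.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


packet = magic_packet("00:11:22:33:44:55")  # hypothetical node MAC
print(len(packet))
```

unlike IPMI this only turns a node on, with no power-off, reset or sensor 
readout, and a head node would still need to pause between calls to 
send_wol to respect the power-up sequence discussed in this thread.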


From rgb at phy.duke.edu  Sat Feb 19 11:49:28 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 19 Feb 2005 14:49:28 -0500 (EST)
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: <Pine.LNX.4.44.0502182336580.1788-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502182336580.1788-100000@coffee.psychology.mcmaster.ca>
Message-ID: <Pine.LNX.4.58.0502191417100.7876@lilith.rgb.private.net>

On Sat, 19 Feb 2005, Mark Hahn wrote:

> > As our Dean of A&S recently remarked, if there aren't any checks and
> > balances or cost-equity in funding and installing clusters, they may
> > well continue to grow nearly exponentially, without bound (Duke's
> 
> I find that most faculty who have compute needs (and funding) will
> seriously consider buying into a shared facility instead.  that's our
> (SHARCnet's) usual pitch: let us help you spend your grant, and we'll
> give you first cut at that resource, but otherwise take all the pain
> off your hands.  most people realize that running a cluster is a pain:
> heat/noise, but more importantly the fact that it soaks up tons of 
> the most expensive resource, human attention.  do you want your grad
> students spending even 20% of their time screwing around with cluster
> maintenance?
> 
> not to mention the fact that most computer use is bursty, and therefore
> very profitably pooled.  a shared resource means that a researcher 
> can burst to 200 cpus, rather than just the 20 that his grant might 
> have bought.  and after the burst, someone else can use them...

A model that we have emulated at Duke, actually, and I mean emulated
literally.  You'll recall that we had that offline discussion about your
share cluster operation a few years ago -- that was a fairly important
component that I took into various discussions with provosty-level folks
that led ultimately to the creation of Duke's CSEM.  So we owe you a
debt of gratitude, and possibly even a beer (standard compensation for
cluster support services on list, is it not:-).  However we, like many
institutions, are not pure anything -- the high-cycle-consuming
departments (including physics, but also CPS, chemistry, statistics,
biology, econ) still run their own clusters, although individuals with
BIG needs are encouraged to join the campus grid project and share their
unused cycles.

I feel like we're likely in a gradual and purely voluntary transition
to a point where this model will ultimately dominate and perhaps even
become universal because of the various economies of scale.  Frankly,
even where people DON'T participate in the sharing of resources there
will have to be colocation, as physical space with adequate
infrastructure is both scarce and expensive.  So physics may become part
of the campus grid, etc, although the way things look now there will
continue to be a half dozen separate server-class spaces all over campus
for the hardware.  

I actually feel that this "decentralized centralization" is a good thing
-- we have historical reasons to strongly mistrust overcentralization of
resources at Duke.  One gets economies of scale, perhaps, but at the
expense of permitting empire builders in the bureaucracy to entrench and
start to dictate policy, often to groups who are a hell of a lot smarter
and cost effective when left to their own devices.  I mean, we'd still
be using mainframes if it were up to the folks that run centralized
mainframe compute centers.  Hell, at Duke I believe we still ARE running
some mainframes.  Beowulfs themselves are another example of an
innovation that could only have come about from the bottom up.  So we're
trying to operate according to a model where individual enterprise and
innovation aren't squelched and "power" (planning and control) isn't
totally centralized, but we still can centralize primary infrastructure
(the network, the main server rooms) and help coordinate the planning
and operation and sharing of cluster resources without trying to dictate
them.

This isn't as difficult as it might sound, at a University.  All the
faculty are innately mistrustful, cynical, and jealous of control, and
at the same time tend to be really bright and recognize the benefits of
cooperation on some issues.  We've also been blessed for the last 5-8
years with some really good people in the Arts and Science Network
Administrators group (from Melissa Mills as the assistant dean at the
top, through the actual sysadmins down to, well, me:-), at the Provost
level (leading to the birth of CSEM), and in the Office of Information
Technology (OIT, which runs the campus network and student/academic
computing).  When you have enlightened leadership that ISN'T empire
building, things are good and even if you try something and it turns out
to be wrong it doesn't matter -- you just fix it or try something else
and move on.

> > to hold them, the power to run them, and the people to operate them, all
> > grow roughly linearly with the number of nodes.  This much is known.
> 
> well, operator cost scales linearly, but that line certainly does not 
> pass through zero, and is nearly flat (a 100p cluster takes almost the 
> same effort as a 200p one.)

Yeah, yeah, yeah -- it's an irregular scale with flat patches and a
minimum buy-in.  However, the discussion concerns the planning of a 1000
node, 2000 CPU cluster, and at that point I think you can start to talk
meaningfully about the number of nodes per support person in planning
discussions where you recognize that you CAN'T run 1000 nodes with the
same effort as 100 nodes.  The hardware support component most
definitely scales per node, and that large a cluster could eat a person
alive with hardware maintenance as the cluster ages or if you hit a bad
patch.  Hardware support is in some sense the scale limiting resource
for a well-designed cluster (whether you provide it locally or contract
it out), especially now that PXE, yum, etc have made it possible for a
single person to install and maintain 1000 nodes on the SOFTWARE side of
things.

But to avoid going into all this is why I used the word "roughly".
Perhaps I should have said "in the limit of a large number of machines".

> > Finding out isn't trivial -- it involves running down ALL the clusters
> > on campus, figuring out whom ALL those nodes "belong" to, determining
> > ALL the grant support associated with all those people and projects and
> 
> here at least, the office of research services sees all funding traffic,
> and so is sensitized to the value of cluster pooling.  the major funding
> agencies have also expressed some desire to see more facility centralization,
> since the bad economics of a bunch of little clusters is so clear...

Yes, ditto here as well, with exceptions as noted above.  Although
granting agencies can swing both ways -- they like to see resource and
cost sharing, but they are also jealous of resource "ownership" and
don't want to fund project A (including cluster) and then find that that
cluster has been hijacked by project B, possibly run by somebody else
entirely.  There's also a WIDE range of what the different agencies view
as reasonable "cost sharing".  What Duke has tried to do is use a
thoughtful model and not a one-size-fits-all plan with mandatory
participation.  The one thing that I think Duke still really needs is
the detailed CBA of existing cluster operations.  I suspect that they'd
find that they are remarkably efficient as is, but I'm not certain.

As always, a pleasure to read what you write.

  rgb

> 
> regards, mark hahn.
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From john.hearns at streamline-computing.com  Sun Feb 20 00:53:21 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun, 20 Feb 2005 08:53:21 +0000
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <Pine.LNX.4.44.0502181953230.1433-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502181953230.1433-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1108889601.5617.3.camel@Vigor11>

On Sat, 2005-02-19 at 09:23 -0500, Mark Hahn wrote
> I really, really like lan-IPMI.  not only do you get a nice way
> to turn on/off/reset machines remotely, but you also can query
> their internal sensors.  not to mention that it's an open standard.
> my main experience with it is a cluster of HP DL145's, which are 
> rebadged Celestica white-ish-box dual-opterons.  I've heard that 
> lan-IPMI is also available for real whitebox (tyan, supermicro),
> but have never managed to get hands on. 

Tyan need an extra card, which isn't included by default, if I'm not
wrong.

We have MSI systems which have IPMI.

Also don't forget locatelight capability.
It is SO useful if you have 500 nodes, and have to ask someone to swap a
component on a problem node, for instance.


From patrick at myri.com  Sun Feb 20 23:46:59 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 21 Feb 2005 02:46:59 -0500
Subject: [Beowulf] Mare Nostrum (not quite COTS)
In-Reply-To: <3.0.32.20050216131329.0106f960@pop.xs4all.nl>
References: <3.0.32.20050216131329.0106f960@pop.xs4all.nl>
Message-ID: <421991F3.3060803@myri.com>

Hi Vincent,

Vincent Diepeveen wrote:
> Which myrinet cards are in Mare Nostrum?

Single link PCI-X D cards, with 4 MB of SRAM to be comfortable with 
multiple routes between a lot of nodes. The NICs for the BladeCenter 
have a specific format: smaller than regular half-size PCI boards, the 
NIC is parallel to the mainboard and has no fiber transceiver (it goes 
to the OPM via the BladeCenter backplane). Available only via IBM.

> What one way pingpong latency can it get from 1 end of the machine to the
> other end of the machine?

I don't know, I didn't work on this machine. It would depend on the 
number of crossbars and lengths of fiber. It uses the new switches with 
32-port crossbars, so the max path should not be longer than 7 hops if 
my memory is right. At one time, the PCI-X on the JS-20 blades was 
clocked at 100 MHz; I don't know if it has been bumped to 133 MHz since.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From patrick at myri.com  Mon Feb 21 00:02:44 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 21 Feb 2005 03:02:44 -0500
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050216141222.00924be0@pop.xs4all.nl>
References: <3.0.32.20050216141222.00924be0@pop.xs4all.nl>
Message-ID: <421995A4.6030801@myri.com>

Vincent Diepeveen wrote:
> A problem of MPI over DSM type forms of parallellism has been described
> very well by Chrilly Donninger with respect to his chessprogram Hydra which
> runs at a few nodes MPI :
> 
> For every write :
> 
> MPI_Isend(....)
> MPI_Test(&Reg,&flg,&Stat);
> while(!flg) {
>     // Important: read in messages and process them while waiting on
>     // completion. Otherwise our own Input-Buffer can overflow and we
>     // get a deadlock.
>     Hydra_MsgPending();
>     MPI_Test(&Reg,&flg,&Stat);
> }
> 
> The above is dead slow simply and delays the software.

You are effectively waiting for the send completion, and that can 
require synchronization with the receive side if the message size is 
large enough.

> In a DSM model like Quadrics you don't have all these delays.

You don't have these delays with message passing if you do it 
differently. You can post multiple sends and wait on all of them at the 
same time, or post a send and then compute the next step before waiting. 
RMA would remove the synchronization with the remote side, but you need 
to know where to Put the data over there.

> Can Myri memory on the card (4MB and 8MB in the $1500 version) get used to
> directly write to the RAM on a remote network card?

The memory on the NIC is not related to Remote Memory Access. The SRAM 
is used to host the firmware code and some data such as the routes, 
physical addresses, the name of the captain, whatever. More memory means 
that you can fit more routes (on a 7-hop topology, you need to store 
8 bytes, 7 routing bytes and a length, for every route, and you need 8 
different routes for each destination, per link, to have an effective 
route dispersion scheme) or do something special (you can write your own 
firmware if you are crazy or you know what you are doing). 2MB is fine 
for most cases.
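
The route storage arithmetic above can be written out as a small Python 
sketch. It uses only the figures given in this message (one byte per hop 
plus a length byte, 8 routes per destination, per link); the function name 
and the example node count are illustrative, and the firmware and other 
data that share the SRAM are not modelled.

```python
def route_table_bytes(destinations, links=1, max_hops=7,
                      routes_per_destination=8):
    """Estimate the SRAM needed for routes, in bytes."""
    bytes_per_route = max_hops + 1  # routing bytes plus one length byte
    return bytes_per_route * routes_per_destination * destinations * links


# Example: 1024 destinations reached over a single link.
print(route_table_bytes(1024) // 1024, "KiB")
```

At 64 bytes per destination per link, 1024 destinations take 64 KiB, which 
gives a feel for why 2MB is enough in most cases.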

> If so which library can i download for that for myri cards?

GM supports RMA (PUT and GET) but do not expect the same latency as 
Quadrics. MX is not yet available with one-sided operations, 
but the latency is much better.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com


From ashley at quadrics.com  Mon Feb 21 03:22:18 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Mon, 21 Feb 2005 11:22:18 +0000
Subject: [Beowulf] Mare Nostrum (not quite COTS)
In-Reply-To: <421736E3.3050001@charter.net>
References: <Pine.LNX.4.44.0502182259350.1788-100000@coffee.psychology.mcmaster.ca>
	<421736E3.3050001@charter.net>
Message-ID: <1108984938.6139.4.camel@localhost.localdomain>

On Sat, 2005-02-19 at 07:53 -0500, Jeffrey B. Layton wrote:
> Mark Hahn wrote:
> 
> >>When the power of MareNostrum is unleashed later this year, it will be at the
> >>    
> >>
> >
> >hmm, to me, the bigger the computer, the more money is evaporating
> >for every day between delivery and full user utilization.
> >  
> >
> 
> This is a very interesting comment and one I agree with.
> Does anyone care to post numbers of some sort regarding
> the size of the cluster (and the size of the storage
> system) and the time to get the system up and stabilized?

In my experience it's not a lot to do with size, the first one of a
given size is always a new experience but the second and third ones
onwards are fairly safe.

Far more relevant are the exact hardware used and experience with it.
Expect as many problems making a 64 way machine with a new
motherboard/file-system/integrator as you would ramping this same system
up to 1024 way.

Ashley,


From lindahl at pathscale.com  Mon Feb 21 21:08:24 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Mon, 21 Feb 2005 21:08:24 -0800
Subject: [Beowulf] The Case for an MPI ABI
Message-ID: <20050222050824.GA2195@greglaptop.attbi.com>

Those of you who were at the Open IB conference last week saw me give
a talk entitled "The Case for an MPI ABI". It seems that Patrick and I
have been channeling each other AGAIN; see what happens when I move to
California?

The first question is: Does an ABI provide enough benefit for people
to care? To care enough to sit on a committee?

If the answer is "yes", then I think we'll have one. The minimum
technical issues revolve around the contents of <mpi.h> and the names
of shared libraries. The amount of work for MPICH or OpenMPI to
support that part of an ABI is modest.

If we wanted to go farther, I have a strawman proposal which addresses
a generic startup procedure which would allow user applications, MPI
implementations, and queue systems to all live in peace and harmony.

This talk:

http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf

mostly talks about why we need an ABI, who wins and loses as a result
of having one, and the pieces that could be in it. Please give it a
look.

-- greg


From eugen at leitl.org  Tue Feb 22 08:32:37 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 22 Feb 2005 17:32:37 +0100
Subject: [Beowulf] [Clusters_sig] The kernel and cluster issues (fwd from
	cherry@osdl.org)
Message-ID: <20050222163237.GT1404@leitl.org>

----- Forwarded message from John Cherry <cherry at osdl.org> -----

From: John Cherry <cherry at osdl.org>
Date: Tue, 22 Feb 2005 08:29:47 -0800
To: clusters_sig at osdl.org
Cc: 
Subject: [Clusters_sig] The kernel and cluster issues
X-Mailer: Evolution 2.0.1 

LinuxWorld Conference and Expo (August 8-11) has a content track for
"Kernel and Cluster Issues".

http://www.linuxworldexpo.com/live/12/speakers//callforpapers

Is anyone in this forum planning to present a paper for this conference?
This may be a good venue to let the "cluster community" have a unified
voice for proposing a minimal set of common cluster services/hooks for
the kernel.  OLS may be another good forum for this.

John


_______________________________________________
Clusters_sig mailing list
Clusters_sig at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/clusters_sig

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From lusk at mcs.anl.gov  Tue Feb 22 09:06:05 2005
From: lusk at mcs.anl.gov (Rusty Lusk)
Date: Tue, 22 Feb 2005 11:06:05 -0600 (CST)
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com>
References: <20050222050824.GA2195@greglaptop.attbi.com>
Message-ID: <20050222.110605.03066059.lusk@localhost>

From: Greg Lindahl <lindahl at pathscale.com>
Subject: [Beowulf] The Case for an MPI ABI
Date: Mon, 21 Feb 2005 21:08:24 -0800
> This talk:
> 
> http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf
> 
> mostly talks about why we need an ABI, who wins and loses as a result
> of having one, and the pieces that could be in it. Please give it a
> look.

One piece you include is the replacement of the non-portable mpirun with
an mpistart with standard arguments.  You might note that the MPI-2
forum addressed this issue with the specification of an mpiexec with
standard arguments.  MPICH2 implements it.

Rusty


From rcmanglekar at rediffmail.com  Fri Feb 18 22:38:23 2005
From: rcmanglekar at rediffmail.com (Rahul Manglekar)
Date: 19 Feb 2005 06:38:23 -0000
Subject: [Beowulf] How to run services/daemon on Cluster.
Message-ID: <20050219063823.10268.qmail@webmail17.rediffmail.com>

  
Hi all,

  I need more processing resources for the services/daemons on my server machine. Some of these services, such as mysqld and httpd, consume a lot of CPU: usage sits at around 90-94%.

  Can we build a cluster that shares the processors of all the client nodes with the server machine, so that the services/daemons running on the server (e.g. mysqld, apache) get more processing power?

Can anyone guide me?


Thanks in advance.

Regards..,

-- Rahul.

From ipsitrans at vreme.yubc.net  Sat Feb 19 00:46:10 2005
From: ipsitrans at vreme.yubc.net (IPSI Transactions Special Issues)
Date: Sat, 19 Feb 2005 09:46:10 +0100
Subject: [Beowulf] Call for IPSI Transactions Special Issues in 2005/6; c/ba
Message-ID: <200502190846.j1J8kArD008261@vreme.yubc.net>

Dear potential Speaker:

We are pleased to inform you that both IPSI Transactions journals are planning some special issues in late 2005 and early 2006, and you are welcome to submit your paper(s) until the deadlines listed below!

IPSI Transactions on Internet Research:

March 31, 2005 -
Special Issue on E-Education: Concepts and Infrastructure

June 30, 2005 -
Special Issue on E-Business: Concepts and Infrastructure


IPSI Transactions on Advanced Research:

March 31, 2005 -
Special Issue on the Research with Multidisciplinary Elements

June 30, 2005 -
Special Issue on the Research with Interdisciplinary Elements

Each submitted paper first undergoes an editor review, and those that pass this first stage are sent to 12 external experts for a rigorous review; decisions are made after at least 6 external reviewers respond! The review is free of charge, but the authors of accepted papers are expected to pay a publication fee of E400 per paper (for 4, 5, or 6 pages in the TIR/TAR format), and an additional fee of E100 for each extra page, up to a maximum of 10 pages.

Rigorous reviewing is the major strength of IPSI journals, and the major contributor to their high quality! Soft copies of the existing issues of TIR and TAR can be seen on the web, and hard copies can be obtained on special request by email, as indicated on the web, where you can find all information!

Sincerely yours,

Prof. Dr. Veljko Milutinovic, Editor-in-Chief

P.S. If you need additional information, please reply to this e-mail.


From eno at dorsai.org  Sat Feb 19 02:58:59 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Sat, 19 Feb 2005 05:58:59 -0500
Subject: [Beowulf] powering up 18 motherboards
In-Reply-To: <00ed01c5158c$92df7550$6e45a8c0@masstivy>
Message-ID: <0IC500I3ZNOKNN@mta6.srv.hcvlny.cv.net>

Thanks to all for the info over the past few days. I decided to write a WOL
app that turns the machines on from a windows box on the network with a user
defined delay. I'll stick it on the web if anyone is interested in taking a
look at it. I plan to add the ability to selectively shut machines down too
(since the whole thing doesn't do much good from a cold boot). 
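For anyone who wants to try the same idea, here is a minimal sketch of such a staggered wake-up in Python (not the app described above). It assumes the standard Wake-on-LAN "magic packet" layout (6 bytes of 0xFF followed by the target MAC repeated 16 times) sent as a UDP broadcast; the function names, the default port 9, and the default delay are illustrative choices, not anything taken from this thread.

```python
import socket
import time


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # bytes.fromhex raises ValueError on non-hex characters.
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def wake_staggered(macs, delay_s=5.0, broadcast="255.255.255.255", port=9,
                   sleep=time.sleep):
    """Send one magic packet per MAC, pausing delay_s seconds between nodes.

    The pause spreads out the power-on inrush; delay_s is the
    user-defined delay (an assumed default, tune it for your circuits).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for index, mac in enumerate(macs):
            if index:
                sleep(delay_s)
            sock.sendto(magic_packet(mac), (broadcast, port))
```

A call such as `wake_staggered(["00:11:22:33:44:55", "00:11:22:33:44:56"], delay_s=10)` (placeholder MACs) would wake two nodes ten seconds apart, provided the NICs and BIOS actually honor WOL.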

@Eric Machala and David Mathog
Thanks for the info on the UPS's. I will be looking into some of the bigger
refurbed APC's on eBay (I don't have 240v here though). I'll leave some
safety room on the circuits for peaks, now I figure I'd want the UPS's to
get me through the problems RGB was describing - I don't get many brownouts
in NY with our underground cabling but that occasional power hiccup would
drive me nuts.

@Patrick Michael Kane
I must say that I have NOT been cranking away with the 3D apps I'll be using
regularly. Just trying to put load on the cpu's for my tests. I'll be
properly set up to do everything right over the weekend. I needed to first
finish the build-out for cooling and these power issues. I'll be sure to
report back with my details.

@Jim and RGB and Bari
Understood. Loud and clear. No ganging of circuits. I didn't know if it was
easy to run them in parallel to support bigger peaks. I won't be touching
whatever is behind the circuit breakers by myself.

@RGB again...
VERY informative stuff about power and PS's. Thank you.

-Alpay


-----Original Message-----
From: Eric Machala [mailto:emac at cybergps.net] 
Sent: Friday, February 18, 2005 2:37 AM
To: Alpay Kasal; 'Jim Lux'; 'Dean Johnson'
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] powering up 18 motherboards

This is not true: the UPS will never draw more from the wall than it 
is set to. A UPS has a set trickle (charging) rate that can be set to a 
cost-effective value. The exception is when it has been overloaded; in that 
case it draws double the trickle rate, because the system wants to restore a 
full battery before the next loss of power. I could get the actual specs on 
this, but for your actual load, if you were at a 50-65% load, I'm sure 
there would be no spikes in power draw over the normal trickle.


From maurice at harddata.com  Sat Feb 19 10:00:05 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Sat, 19 Feb 2005 11:00:05 -0700
Subject: [Beowulf] Re: "Off teh shelf"  (was:Mare Nostrum)
In-Reply-To: <200502190603.j1J62SnO003210@bluewest.scyld.com>
References: <200502190603.j1J62SnO003210@bluewest.scyld.com>
Message-ID: <42177EA5.4070606@harddata.com>

Mark Hahn <hahn at physics.mcmaster.ca> wrote:

>it's funny how "off the shelf" means different things to different people.
>I consider blades to be a qualitatively different category of hardware
>than, say, a Tyan motherboard in an AICPC chassis.  AFAICT, blades still
>run at a premium vs "normal" servers from the same vendor (which is also
>at a premium vs whitebox.)
>  
>
No Kidding!!

If the "blades" use proprietary components, then the useful lifespan of 
the investment is what? 2 years perhaps?
If, OTOH, they truly do use "off the shelf" components, they can be 
readily upgraded.
As you mention, commodity motherboards share standard form factors, and 
usually power requirements.
If one builds a "blade box" using these it is fairly trivial to change 
out motherboards and CPUs on a scheduled basis, easily doubling, and 
often tripling the lifespan of the investment. Further these upgrades 
may be done incrementally, reducing downtime to nothing, in practical terms.

Buying a "solution" using proprietary components is the age-old 
"sucker's game" that has been played by the larger vendors for decades to 
allow forced obsolescence.


Maurice W. Hilarius
Hard Data Ltd.
maurice at harddata.com


From billk01 at metrumrg.com  Sun Feb 20 15:41:19 2005
From: billk01 at metrumrg.com (BillKnebel)
Date: Sun, 20 Feb 2005 15:41:19 -0800
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
In-Reply-To: <421516B7.3060906@sonsorol.org>
References: <4214A0E5.3010804@metrumrg.com> <421516B7.3060906@sonsorol.org>
Message-ID: <4219201F.70105@metrumrg.com>


Chris,

I was able to get grid engine to run on the Scyld cluster using the 
approach of setting the master (head) node as the submit, admin, and 
execute host.  Unfortunately, if only grid engine commands are used, 
starting a set of jobs on the cluster results in all jobs being run on 
the head node.  Alternatively, I can integrate the grid engine "qsub" 
command with some of the Scyld tools to get jobs started and then 
migrated (to a point) over the cluster.  However, I am still running 
into problems because all of the queueing variables for grid engine read 
the headnode info, and since all jobs run on the compute nodes, the 
headnode always appears to be free, which results in all jobs being 
started at once. This is not ideal. 

I am waiting on some feedback from Scyld/Penguin computing on some 
related issues that will hopefully solve some of these problems. 

Bill

Chris Dagdigian wrote:

>
> I know Grid Engine well but not Scyld so forgive my ignorance if I say 
> something stupid and given the level of expertise on this list I'm 
> quite certain I'm about to make a fool of myself :)
>
> If Scyld is presenting you with a single system image (ie a single 
> linux server that can farm out tasks to all those nodes) then you 
> would install SGE in the same way that you would install it on a big 
> SMP box:
>
> 1. Install the SGE qmaster and scheduler on the master node
> 2. Install the execution host on the master node as well
>
> You will only have 1 execd per queue but each queue can be configured 
> with N number of "job slots" which actually control how many jobs can 
> run at the same time on the same machine.
>
> Try setting your # of job slots within your single SGE queue to the 
> number of nodes in your cluster. This is similar to what you would do 
> on a big SMP machine -- small number of queues each supporting a 
> decent jobslot count.
>
> Then submit a bunch of jobs and see if SGE causes the master node to 
> fall over under load. If not then Scyld is doing its thing behind the 
> scenes to migrate stuff around to the other nodes.
>
> -Chris
>
>
>
> billk01 wrote:
>
>> I am in the process of installing SGE on a Scyld beowulf cluster.  As
>> most people are aware, the Scyld cluster runs a complete OS (linux) only
>> on the master node and the compute nodes are simply for executing.
>> During the SGE install, it requires adding the compute nodes as execute
>> hosts.  I do not understand how to do this given the current setup of a
>> scyld cluster since you can't "login" to the nodes to execute the
>> install script.  The script does exist on an NFS shared directory
>> (cluster wide).  Has anybody else run into this problem?
>>
>
>
>


From cflau at clustertech.com  Sun Feb 20 19:01:16 2005
From: cflau at clustertech.com (John Lau)
Date: Mon, 21 Feb 2005 11:01:16 +0800
Subject: [Beowulf] MPICH question
Message-ID: <1108954876.15965.16.camel@cattail.clustertech.com>

Hi,

I have a question on the number of processes spawned by MPICH. I am
using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
start an MPI job, a total of 8 processes are spawned, but only 4 of
them put any load on the CPUs. Is this the correct behavior of MPICH?
And what are the 4 idle processes for? Thank you.

Best regards,
John Lau

-- 
John Lau Chi Fai
Center For Large-Scale Computation
Tel: (852) 2994-3727
Fax: (852) 2994-2101


From brian at cypher.acomp.usf.edu  Mon Feb 21 06:22:39 2005
From: brian at cypher.acomp.usf.edu (Brian R Smith)
Date: Mon, 21 Feb 2005 09:22:39 -0500
Subject: [Beowulf] A hello, and an introduction
In-Reply-To: <1108745056.42161b60d311c@panthermail.uwm.edu>
References: <1108745056.42161b60d311c@panthermail.uwm.edu>
Message-ID: <1108995759.24297.41.camel@daemon>

Hey Jeremy,

It's good to see another student admin at a university on here.  Welcome
to the list.  There are a lot of top-notch people on here that you can
learn a lot from.  I've been admin-ing at my univ. for about 3 years now
and plan on doing so even after I graduate.  

With a C.S. background, you'll probably find lots of interesting things
involved with Cryptography or Image/Video processing.  I'm working on a
video compression format right now and will likely write up a parallel
encoder for AVIs into my format.

Maybe my boss will post and give you some ideas on what to research as
he's working on his PhD and probably has a better idea than I do on what
you can accomplish on a cluster as a C.S. major.  And I'm sure RGB could
come up with some mind-blowers if you are really up to the task.

Good luck and welcome to the list.

Brian Smith

On Fri, 2005-02-18 at 10:44 -0600, streich at uwm.edu wrote:
> Hello all,
> 
> I'm new to the list and just thought I'd introduce myself, as I will probably be
> posting to the list a bit.  I'm a system administrator for a Beowulf cluster at
> UW-Milwaukee.  It's a 22 node 2.4GHz Intel cluster running Linux that is
> dedicated to studying clouds (using wrf and COAMPS (MPI based software)).  It's
> a student job, and a lot of fun.  I'm a Computer Science major, and have all
> the Computer Science courses done (just have a few math classes left).
> 
> I'm starting to think about Grad school and Master's thesis stuff, though that is
> a little way off.  Along this vein, if anyone has any suggestions as to hot
> research topics a CS major with access to a few spare clock cycles on a Beowulf
> cluster might be interested in, please feel free to send them to me. ;)
> 
> I've only been admin-ing the cluster for about a year, so I don't know how much
> I'll be able to help people with questions...  But I may throw an idea out once
> in a while.  I suppose here I may be asking more than answering the questions,
> as it seems a lot of you have quite a bit of experience with larger clusters.
> 
> - Jeremy
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-- 
_______________________________________
|  Brian R Smith                      |
|  Systems Administrator              |
|  Research Computing Core Facility   |
|  University of South Florida        |
|  Phone: 1(813)974-1467              |
|  4202 E Fowler Ave, LIB 613         |
_______________________________________


From cflau at clustertech.com  Mon Feb 21 17:30:29 2005
From: cflau at clustertech.com (John Lau)
Date: Tue, 22 Feb 2005 09:30:29 +0800
Subject: [Beowulf] MPICH question
Message-ID: <1109035829.18387.5.camel@cattail.clustertech.com>

Hi,

I have a question on the number of processes spawned by MPICH. I am
using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
start an MPI job, a total of 8 processes are spawned, but only 4 of
them put any load on the CPUs. Is this the correct behavior of MPICH?
And what are the 4 idle processes for? Thank you.

Best regards,
John Lau
-- 
John Lau Chi Fai
Cluster Technology Ltd.
cflau at clustertech.com
Tel: (852) 2994-3727
Fax: (852) 2994-2101


From mark.westwood at ohmsurveys.com  Tue Feb 22 00:23:42 2005
From: mark.westwood at ohmsurveys.com (Mark Westwood)
Date: Tue, 22 Feb 2005 08:23:42 +0000
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com>
References: <20050222050824.GA2195@greglaptop.attbi.com>
Message-ID: <421AEC0E.402@ohmsurveys.com>

Greg

You make a very persuasive case for an ABI.  As an end-user of MPI, and 
with no ambitions to be anything else, many of the benefits of an ABI 
you suggest would be very useful.  My recent experience of porting our 
MPI / Fortran codes to other platforms has been that getting the code to 
compile has been almost trivial (replace 'call flush(6)' by 'call 
flush_(6)' a few times, that sort of thing) but that getting to grips 
with the foreign environment (memory management, job submission, job 
start-up) is a real pain.

So, to all you salesmen in the group, come back and try to sell me the 
ABI when it's ready.

Regards
Mark

Greg Lindahl wrote:
> Those of you who were at the Open IB conference last week saw me give
> a talk entitled "The Case for an MPI ABI". It seems that Patrick and I
> have been channeling each other AGAIN; see what happens when I move to
> California?
> 
> The first question is: Does an ABI provide enough benefit for people
> to care? To care enough to sit on a committee?
> 
> If the answer is "yes", then I think we'll have one. The minimum
> technical issues revolve around the contents of <mpi.h> and the names
> of shared libraries. The amount of work for MPICH or OpenMPI to
> support that part of an ABI is modest.
> 
> If we wanted to go farther, I have a strawman proposal which addresses
> a generic startup procedure which would allow user applications, MPI
> implementations, and queue systems to all live in peace and harmony.
> 
> This talk:
> 
> http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf
> 
> mostly talks about why we need an ABI, who wins and loses as a result
> of having one, and the pieces that could be in it. Please give it a
> look.
> 
> -- greg

-- 
Mark Westwood
Parallel Programmer
OHM Ltd
The Technology Centre
Offshore Technology Park
Claymore Drive
Aberdeen
AB23 8GD
United Kingdom

+44 (0)870 429 6586
www.ohmsurveys.com


From cousins at limpet.umeoce.maine.edu  Tue Feb 22 11:18:19 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Tue, 22 Feb 2005 14:18:19 -0500 (EST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <200502212000.j1LK0B42019265@bluewest.scyld.com>
Message-ID: <Pine.LNX.4.10.10502221343480.6214-100000@limpet.umeoce.maine.edu>


I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
vendors who have given me quotes have a SATA-SCSI version for around
$14,000 each.  I can get a similarly equipped StorCase unit for around
$11,000. With the StorCase I envision that I'd be more on my own if
anything went wrong.  However, with the cheaper price, I'd be able to buy
a spare RAID controller to have on hand in case one of them failed and
still save a couple grand.

All three of these (the other two are Infortrend and I believe a re-badged
Jet) use the same Intel i80321 CPU on their controllers.  

All are configured the same (with spare PS module and Fan module)  except
the StorCase doesn't include a 3 year Express Swap warranty. This is why
I'd also want to get the spare RAID controller to share between the two
units in case one went bad.  So, for price comparison, it is probably
closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase
units with spares of most everything.
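As a rough check of the arithmetic above, the figures can be worked through in a few lines of Python. The unit prices and totals are the ones quoted in this message; the "implied spares budget" is only inferred from the difference and was not itemized.

```python
# Figures quoted in the message (USD).
name_brand_unit = 14_000
storcase_unit = 11_000
units = 2
storcase_total_with_spares = 25_500

# Two name-brand units, warranty included.
name_brand_total = name_brand_unit * units

# Two bare StorCase units, before any spares.
storcase_base = storcase_unit * units

# Inferred, not quoted: what is left for the spare controller and other spares.
implied_spares_budget = storcase_total_with_spares - storcase_base

# Net saving of the StorCase route even after buying spares.
saving = name_brand_total - storcase_total_with_spares

print(name_brand_total, storcase_base, implied_spares_budget, saving)
```

So the comparison is between a 3 year swap warranty on one side and roughly $3,500 of on-hand spares plus $2,500 in pocket on the other.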

Does anyone have experience with any or all of these?  Is it worth the
extra money to have a "burned-in" device supported by some company?  

I know this isn't a Beowulf specific question but it seems that storage is
a big part of Beowulfery and I'd bet that a lot of people are looking into
similar devices.  I hope it is relevant. 

I'm not after any more quotes from vendors (please don't email or call).
I'm happy with the alternatives that I have right now.

Thanks,

Steve
______________________________________________________________________
 Steve Cousins, Ocean Modeling Group    Email: cousins at umit.maine.edu
 Marine Sciences, 208 Libby Hall        http://rocky.umeoce.maine.edu
 Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302


From rgb at phy.duke.edu  Tue Feb 22 11:56:45 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 22 Feb 2005 14:56:45 -0500 (EST)
Subject: [Beowulf] A hello, and an introduction
In-Reply-To: <1108995759.24297.41.camel@daemon>
References: <1108745056.42161b60d311c@panthermail.uwm.edu>
	<1108995759.24297.41.camel@daemon>
Message-ID: <Pine.LNX.4.58.0502221445210.7746@ganesh.phy.duke.edu>

On Mon, 21 Feb 2005, Brian R Smith wrote:

> Hey Jeremy,
> 
> It's good to see another student admin at a university on here.  Welcome
> to the list.  There are a lot of top-notch people on here that you can
> learn a lot from.  I've been admin-ing at my univ. for about 3 years now
> and plan on doing so even after I graduate.  
> 
> With a C.S. background, you'll probably find lots of interesting things
> involved with Cryptography or Image/Video processing.  I'm working on a
> video compression format right now and will likely write up a parallel
> encoder for AVIs into my format.
> 
> Maybe my boss will post and give you some ideas on what to research as
> he's working on his PhD and probably has a better idea than I do on what
> you can accomplish on a cluster as a C.S. major.  And I'm sure RGB could
> come up with some mind-blowers if you are really up to the task.

I don't know about mind blowers, but my column for I think April's CWM
is on "things you can do with your starter cluster".  It's far from
exhaustive, but it provides a bit of direction for this perennial
question.

As far as RESEARCH topics are concerned, you should probably contact me
off the list if you really do want any suggestions.  There is a bit of
difference between "generally interesting stuff you can do with a
beowulf" and "computer science research you can do" with a beowulf.  The
former is concerned with applications and simple demonstrations -- the
latter with tools, algorithms, timings, latency and so forth.

I do have one idea for god's own project that I've offered up to the
list a few times before (one inspired by work of Jack Dongarra and
others, in case you were wondering which god:-).  In a nutshell, it is
to build a microbenchmarking daemon that could be included in a standard
linux distribution and run as an initd-controlled task during startup.
During normal operation, it would borrow idle cycles and accumulate
benchmark/performance statistics and make them available via a socket
interface (probably UDP).

Any application, local or remote, could then query any host/node and get
a performance profile -- a matrix of key microbenchmark numbers.  These
in turn could be used to "autotune" both serial and parallel/distributed
applications.
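To make the shape of the idea concrete, here is a minimal sketch in Python, under loud assumptions: the JSON wire format, the `memcpy_mb_per_s` key, the `profile?` request string, and every class and function name are invented for illustration. It runs one toy measurement on demand rather than borrowing idle cycles, so it shows only the "query a host over UDP, get back a table of numbers" half of the proposal.

```python
import json
import socket
import threading
import time


def measure_memcpy_mb_per_s(size=1 << 20, repeats=5):
    """Toy microbenchmark: bulk copy throughput of a byte buffer, in MB/s."""
    buf = bytearray(size)
    start = time.perf_counter()
    for _ in range(repeats):
        _ = bytes(buf)  # forces a full copy of the buffer
    elapsed = time.perf_counter() - start
    return (size * repeats / 1e6) / max(elapsed, 1e-9)


class BenchDaemon:
    """Keeps the latest numbers and answers UDP queries with a JSON profile."""

    def __init__(self, host="127.0.0.1", port=0):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((host, port))  # port 0: let the OS pick a free port
        self.address = self.sock.getsockname()
        self.profile = {}
        self._lock = threading.Lock()

    def refresh(self):
        """Re-run the benchmark; a real daemon would do this when idle."""
        value = measure_memcpy_mb_per_s()
        with self._lock:
            self.profile["memcpy_mb_per_s"] = value

    def serve_once(self):
        """Answer a single query, whatever its content, with the profile."""
        _, client = self.sock.recvfrom(512)
        with self._lock:
            payload = json.dumps(self.profile).encode()
        self.sock.sendto(payload, client)


def query(address, timeout=2.0):
    """Ask a daemon at (host, port) for its performance profile."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(b"profile?", address)
        data, _ = sock.recvfrom(65535)
    return json.loads(data)
```

An application wanting to autotune would call `query((host, port))` against each node and pick block sizes or partitionings from the returned numbers; what those numbers should be is exactly the open research question.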

Once the daemon existed and a fairly standard set of numbers developed
for a first cut at its output, one could then start thinking about
e.g. rewriting ATLAS so that it autotunes from the daemon results
instead of during build, so that a parallel application that partitions
does so automatically to take advantage of superlinear speedups that
might occur for certain partitionings, and so forth.

I'd think that there were all sorts of papers in there, no?  And a damn
nice GPL toolset in the end that could be tremendously useful to lots of
people.

And as a final benefit for those seeking fame, it would obviously become
THE microbenchmark tool for linux and likely other distros, as it would
be the one that is built right in.  In fact, the very first application
that uses it would be a simple command line or GUI interface to read and
plot its cumulated results...

I'd be happy to direct this and maybe even contribute, if any CPS
student-geeks out there find this interesting...

   rgb

> 
> Good luck and welcome to the list.
> 
> Brian Smith
> 
> On Fri, 2005-02-18 at 10:44 -0600, streich at uwm.edu wrote:
> > Hello all,
> > 
> > I'm new to the list and just thought I'd introduce myself, as I will probably be
> > posting to the list a bit.  I'm a system administrator for a Beowulf cluster at
> > UW-Milwaukee.  It's a 22 node 2.4GHz Intel cluster running Linux that is
> > dedicated to studying clouds (using wrf and COAMPS (MPI based software)).  It's
> > a student job, and a lot of fun.  I'm a Computer Science major, and have all
> > the Computer Science courses done (just have a few math classes left).
> > 
> > I'm starting to think about Grad school and Master's thesis stuff, though that is
> > a little way off.  Along this vein, if anyone has any suggestions as to hot
> > research topics a CS major with access to a few spare clock cycles on a Beowulf
> > cluster might be interested in, please feel free to send them to me. ;)
> > 
> > I've only been admin-ing the cluster for about a year, so I don't know how much
> > I'll be able to help people with questions...  But I may throw an idea out once
> > in a while.  I suppose here I may be asking more than answering the questions,
> > as it seems a lot of you have quite a bit of experience with larger clusters.
> > 
> > - Jeremy

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From mathog at mendel.bio.caltech.edu  Tue Feb 22 12:23:45 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Tue, 22 Feb 2005 12:23:45 -0800
Subject: [Beowulf] S2466 Wake on Lan working, anyone?
Message-ID: <E1D3gZN-00073w-00@mendel.bio.caltech.edu>

Does _anybody_ have wake on lan working with Tyan's S2466
motherboards?  

I know this has been asked before but hopefully
since the last time around somebody has made it work.

With:
   Tyan S2466M
   2.6.8-1 kernel
   3c59x driver 
   v4.06 BIOS

I tried putting:

options 3c59x enable_wol=1

in /etc/modprobe.conf and then did a poweroff. This was modified from
the instructions here (URL may wrap):

http://homepage.mac.com/felipe_alfaro/iblog/B1004527421/C1515218762/E66260423/

Unfortunately a subsequent 

  ether-wake -D -i eth1  00:e0:81:22:ba:84  (also with -b)

did nothing.  The ethernet activity light is blinking on the powered
off node, so there is at least power to the (on board) NIC.

/var/log/messages shows these possibly relevant lines:

Feb 22 11:55:04 monkey04 kernel: PCI: PCI BIOS revision 2.10 entry at 0xfd7d0, last bus=2
Feb 22 11:55:04 monkey04 kernel: 0000:02:08.0: 3Com PCI 3c905C Tornado at 0x2000. Vers LK1.1.19

Now as I understand it PCI 2.1 requires a header cable for WOL.
The little blue Tyan user's manual does not indicate the location
of such a header.  It could be on J12 I suppose, since half the
pins there are not documented.  The book does document a
LAN disable header which describes the on board ethernet
as "3COM 3C905C."  The little blue book also says that the board
is a PCI 2.2 spec, and 2.2 doesn't require such a header.  In any
case as far as the PCI version goes there's some discrepancy
between what is showing up in messages and what Tyan has documented.

The BIOS appears not to have a WOL entry that can be turned on/off
or a password set.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From alvin at Mail.Linux-Consulting.com  Tue Feb 22 13:34:36 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Tue, 22 Feb 2005 13:34:36 -0800 (PST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.4.10.10502221343480.6214-100000@limpet.umeoce.maine.edu>
Message-ID: <Pine.LNX.3.96.1050222132458.18109A-100000@Maggie.Linux-Consulting.com>


hi ya steve

On Tue, 22 Feb 2005, Steve Cousins wrote:

> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
> vendors who have given me quotes have a SATA-SCSI version for around
> $14,000 each.  I can get a similarly equipped StorCase unit for around
> $11,000. With the StorCase I envision that I'd be more on my own if
> anything went wrong.  However, with the cheaper price, I'd be able to buy
> a spare RAID controller to have on hand in case one of them failed and
> still save a couple grand.

let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ...
	- pata disk is $150-$200 for 300GB disks

	- i'd put 2 or 3 raid controllers in it instead of 1 ..
	unless there is an absolute requirement that there is only
	1 disk subsystem that has the capacity of 6TB on "one disk"
	
i'd prefer to have 2 6TB systems for the same $$$ than to have one
name brand or commercial system

backup of the 6TB or 100TB/rack of data is more important in my book than
"brand name"

> Does anyone have experience with any or all of these?  Is it worth the
> extra money to have a "burned-in" device supported by some company?  

a good burn-in test will last about 30 days .. of 24x7x30 continuous
disk exercise ... 
 
it's not worth the "burn in time" and it'd be more important to find
out how and what their rma process and timing is for replacing bad
parts/subsystems ( 4hr turn around vs 4 day turn around etc )

after it passes, and it ships to you, all burn-in tests are sorta void,
since the disks and cages could shift a few tenths of a mm, and the cards
and disks won't be seated properly again ... things move(shift) in 
"cheap cases"

c ya
alvin


From eugen at leitl.org  Tue Feb 22 14:19:30 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 22 Feb 2005 23:19:30 +0100
Subject: [Beowulf] New MPI tutorials (fwd from d@daugerresearch.com)
Message-ID: <20050222221930.GD1404@leitl.org>

----- Forwarded message from "Dr. Dean Dauger" <d at daugerresearch.com> -----

From: "Dr. Dean Dauger" <d at daugerresearch.com>
Date: Tue, 22 Feb 2005 11:47:10 -0800
To: scitech at lists.apple.com
Cc: "Dr. Dean Dauger" <d at daugerresearch.com>
Subject: New MPI tutorials
X-Mailer: Apple Mail (2.619)

Hello All,

I wanted to let you know about the debut of three tutorials about 
programming using MPI we just posted:

http://daugerresearch.com/pooch/parallellife.html
http://daugerresearch.com/pooch/parallelcirclepi.html
http://daugerresearch.com/pooch/macmpitutorial.html

featuring working source code examples using MPI and descriptions of 
the parallel code.  We've also updated three earlier source-code 
tutorials on writing parallel code:

http://daugerresearch.com/pooch/parallelknock.html
http://daugerresearch.com/pooch/paralleladder.html
http://daugerresearch.com/pooch/parallelpascalstriangle.html

and updated an introduction to parallelization and an exhibition of 
parallel computing types:

http://daugerresearch.com/pooch/parallelization.html
http://daugerresearch.com/pooch/parallelzoology.html

Also, Dauger Research authored a new Apple Developer Connection article 
about MPI:

http://developer.apple.com/hardware/hpc/mpionmacosx.html

All of these are linked from the Tutorials page:

http://daugerresearch.com/pooch/tutorials.html

Thank you very much,
    Dean

_______________________________________________
Do not post admin requests to the list. They will be ignored.
Scitech mailing list      (Scitech at lists.apple.com)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org

This email sent to eugen at leitl.org

----- End forwarded message -----
-- 
Eugen* Leitl <http://leitl.org>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From Ron.Jerome at nrc-cnrc.gc.ca  Tue Feb 22 12:11:08 2005
From: Ron.Jerome at nrc-cnrc.gc.ca (Jerome, Ron)
Date: Tue, 22 Feb 2005 15:11:08 -0500
Subject: [Beowulf] RAID storage: Vendor vs. parts
Message-ID: <A55621C72DDF974491D432AEE2F7D9410CB8879E@nrcmrdex1b.imsb.nrc.ca>

I've been running a 16 bay Maxtronix ATA raid unit on my 80 node cluster,
24x7 for the last couple of years without a single issue.

In fact I just ordered another similar SATA unit here...
http://www.raidweb.com/sata.html.  

_________________________________________
Ron Jerome
Programmer/Analyst
National Research Council Canada
M-2, 1200 Montreal Road, Ottawa, Ontario K1A 0R6
Government of Canada
Phone: 613-993-5346
FAX:   613-941-1571
_________________________________________

> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
> Behalf Of Steve Cousins
> Sent: Tuesday, February 22, 2005 2:18 PM
> To: beowulf at beowulf.org
> Subject: [Beowulf] RAID storage: Vendor vs. parts
> 
> 
> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
> vendors who have given me quotes have a SATA-SCSI version for around
> $14,000 each.  I can get a similarly equipped StorCase unit for around
> $11,000. With the StorCase I envision that I'd be more on my own if
> anything went wrong.  However, with the cheaper price, I'd be able to buy
> a spare RAID controller to have on hand in case one of them failed and
> still save a couple grand.
> 
> All three of these (the other two are Infortrend and I believe a re-badged
> Jet) use the same Intel i80321 CPU on their controllers.
> 
> All are configured the same (with spare PS module and Fan module)  except
> the StorCase doesn't include a 3 year Express Swap warranty. This is why
> I'd also want to get the spare RAID controller to share between the two
> units in case one went bad.  So, for price comparison, it is probably
> closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase
> units with spares of most everything.
> 
> Does anyone have experience with any or all of these?  Is it worth the
> extra money to have a "burned-in" device supported by some company?
> 
> I know this isn't a Beowulf specific question but it seems that storage is
> a big part of Beowulfery and I'd bet that a lot of people are looking into
> similar devices.  I hope it is relevant.
> 
> I'm not after any more quotes from vendors (please don't email or call).
> I'm happy with the alternatives that I have right now.
> 
> Thanks,
> 
> Steve
> ______________________________________________________________________
>  Steve Cousins, Ocean Modeling Group    Email: cousins at umit.maine.edu
>  Marine Sciences, 208 Libby Hall        http://rocky.umeoce.maine.edu
>  Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From emac at cybergps.net  Tue Feb 22 12:43:55 2005
From: emac at cybergps.net (Eric Machala)
Date: Tue, 22 Feb 2005 15:43:55 -0500
Subject: [BEOwulf] WW Fedora Student questions help
Message-ID: <001d01c5191f$3ac29b40$6e45a8c0@masstivy>

Hi, for my networking and Linux class I set up a Beowulf cluster using
Fedora 2 and Warewulf. It's up and running, but part of the class is also
setting up monitoring and benchmarking tools and tests to measure the
overall performance of my cluster, and setting up and running some
parallel applications.

I'm very interested in getting more into clusters, so I was wondering if
anyone has any tools or scripts I can set up and test on my Fedora/Warewulf
setup just to get experience. I also really need some computational or
simulation-type apps, or any kind of parallel cluster application I can
play with to get the hang of them, so I can move on to making my own
applications. This would be a HUGE help.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20050222/4bfa6f59/attachment.html>

From canon at nersc.gov  Tue Feb 22 12:54:41 2005
From: canon at nersc.gov (Shane Canon)
Date: Tue, 22 Feb 2005 12:54:41 -0800
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.4.10.10502221343480.6214-100000@limpet.umeoce.maine.edu>
References: <Pine.LNX.4.10.10502221343480.6214-100000@limpet.umeoce.maine.edu>
Message-ID: <421B9C11.1070204@nersc.gov>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Steve,

We currently have an Infortrend based box (older FC/IDE) and I'm
currently testing a JetStor (AC&NC) box (FC/SATA).

The Infortrend box has been in production for over a year now.  It has
been reasonably stable. However, in my opinion, the management interface
stinks.  They clearly were thinking more about the Winders crowd than
the Linux crowd.

The JetStor box is pretty nice so far.  You can have redundant
controllers, PS, etc.  The management interface is very nice and very
flexible.  For example, you can take one pool of disks and create
multiple LUNs on the same pool with different RAID levels.  This means
you can create several LUNs under the 2TB limit and not waste disks on
extra parity drives for example.
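
To put rough numbers on the parity point: the sketch below assumes 16
disks of 400 GB each (6.4 TB / 16 bays), RAID 5 everywhere, and a
self-imposed 2000 GB ceiling per LUN to stay under the 2 TB limit.  The
function names and figures are made up for illustration; they are not
taken from any vendor's tool.

```python
# Illustrative only: compare parity overhead for "one RAID 5 array per LUN"
# versus "one RAID 5 pool carved into several LUNs".
# Assumed figures: 16 disks x 400 GB, LUNs capped at 2000 GB.

DISK_GB = 400
N_DISKS = 16
LUN_LIMIT_GB = 2000


def separate_arrays(n_disks, disk_gb, limit_gb):
    """Each LUN is its own RAID 5 array, so each pays for one parity disk."""
    max_data = limit_gb // disk_gb            # data disks that fit under the cap
    arrays = []
    remaining = n_disks
    while remaining >= 3:                     # RAID 5 needs at least 3 disks
        size = min(remaining, max_data + 1)   # data disks plus one parity disk
        arrays.append(size)
        remaining -= size
    usable_gb = sum(size - 1 for size in arrays) * disk_gb
    return arrays, usable_gb


def shared_pool(n_disks, disk_gb, limit_gb):
    """One RAID 5 pool (one disk's worth of parity) split into capped LUNs."""
    usable_gb = (n_disks - 1) * disk_gb
    luns = []
    left = usable_gb
    while left > 0:
        lun = min(left, limit_gb)
        luns.append(lun)
        left -= lun
    return luns, usable_gb


if __name__ == "__main__":
    print(separate_arrays(N_DISKS, DISK_GB, LUN_LIMIT_GB))
    print(shared_pool(N_DISKS, DISK_GB, LUN_LIMIT_GB))
```

Under those assumptions the separate arrays come out as 6 + 6 + 4 disks
with 5200 GB usable, while the shared pool gives three 2000 GB LUNs and
6000 GB usable.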

I don't have any strong opinion on the support question though.

- --Shane

Steve Cousins wrote:
| I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
| vendors who have given me quotes have a SATA-SCSI version for around
| $14,000 each.  I can get a similarly equipped StorCase unit for around
| $11,000. With the StorCase I envision that I'd be more on my own if
| anything went wrong.  However, with the cheaper price, I'd be able to buy
| a spare RAID controller to have on hand in case one of them failed and
| still save a couple grand.
|
| All three of these (the other two are Infortrend and I believe a re-badged
| Jet) use the same Intel i80321 CPU on their controllers.
|
| All are configured the same (with spare PS module and Fan module)  except
| the StorCase doesn't include a 3 year Express Swap warranty. This is why
| I'd also want to get the spare RAID controller to share between the two
| units in case one went bad.  So, for price comparison, it is probably
| closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase
| units with spares of most everything.
|
| Does anyone have experience with any or all of these?  Is it worth the
| extra money to have a "burned-in" device supported by some company?
|
| I know this isn't a Beowulf specific question but it seems that storage is
| a big part of Beowulfery and I'd bet that a lot of people are looking into
| similar devices.  I hope it is relevant.
|
| I'm not after any more quotes from vendors (please don't email or call).
| I'm happy with the alternatives that I have right now.
|
| Thanks,
|
| Steve
| ______________________________________________________________________
|  Steve Cousins, Ocean Modeling Group    Email: cousins at umit.maine.edu
|  Marine Sciences, 208 Libby Hall        http://rocky.umeoce.maine.edu
|  Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302
|
|
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCG5wRZd/2zrI5CioRAvDSAKCLkHuIIKe1pv3n0HkM4fYRRAbPsQCgq/6/
1t1D1DvbJea8PUp6K49iajE=
=cZ8o
-----END PGP SIGNATURE-----


From cousins at limpet.umeoce.maine.edu  Tue Feb 22 16:05:39 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Tue, 22 Feb 2005 19:05:39 -0500 (EST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <fc.004c4d191db8d4273b9aca00148260b1.1db8d428@umit.maine.edu>
Message-ID: <Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>


On Tue, 22 Feb 2005, Alvin Oga wrote:

> hi ya steve
> 
> On Tue, 22 Feb 2005, Steve Cousins wrote:
> 
> >> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
> >> vendors who have given me quotes have a SATA-SCSI version for around
> >> $14,000 each.  I can get a similarly equipped StorCase unit for around
> >> $11,000. With the StorCase I envision that I'd be more on my own if
> >> anything went wrong.  However, with the cheaper price, I'd be able to buy
> >> a spare RAID controller to have on hand in case one of them failed and
> >> still save a couple grand.
> >
> let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ...
> 	- pata disk is $150-$200 for 300GB disks
> 
> 	- i'd put 2 or 3 raid controllers in it instead of 1 ..
> 	unless there is an absolute requirement that there is only
> 	1 disk subsystem that has the capacity of 6TB on "one disk"

That's what I'm shooting for. Anybody have good luck with volumes greater
than 2 TB with Linux?  I think LSI SCSI cards are needed (?) and the 2.6
Kernel is needed with CONFIG_LBD=y.  Any hints or notes about doing this
would be greatly appreciated.  Google has not been much of a friend on
this, unfortunately. I'm guessing I'd run into NFS limits too.
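
A rough sketch of where the 2 TB figure comes from, assuming 512-byte
sectors and a 32-bit sector index (which, as far as I understand it, is
the restriction CONFIG_LBD lifts on 32-bit kernels).  The helper names
are hypothetical:

```python
# Illustrative arithmetic only; assumes 512-byte sectors.
SECTOR_BYTES = 512


def max_device_bytes(lba_bits, sector_bytes=SECTOR_BYTES):
    """Largest device addressable with an lba_bits-wide sector index."""
    return (2 ** lba_bits) * sector_bytes


def needs_large_block_support(size_bytes):
    """True if a device of this size exceeds 32-bit sector addressing."""
    return size_bytes > max_device_bytes(32)


if __name__ == "__main__":
    limit = max_device_bytes(32)
    print(limit, "bytes =", limit // 2**40, "TiB")
    # A single 6.4 TB (decimal) volume, as in the units discussed above:
    print(needs_large_block_support(6400 * 10**9))
```

So a single 6.4 TB volume is well past the 32-bit limit, while a 2.0 TB
(decimal) LUN still fits under it.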

Also, am I being overly cautious about having a spare RAID controller on
hand?  How frequent do RAID controllers go bad compared to disks, power
supplies and fan modules?  I'd guess that it would be very infrequent.
Looking back at my own experience I think I've had to return one out of 15
in the last eight years, and that was bad as soon as I bought it.

If this is too off-topic let me know and I'll move it elsewhere.

Thanks,

Steve


From lindahl at pathscale.com  Tue Feb 22 16:31:03 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 22 Feb 2005 16:31:03 -0800
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>
References: <fc.004c4d191db8d4273b9aca00148260b1.1db8d428@umit.maine.edu>
	<Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>
Message-ID: <20050223003103.GB3291@greglaptop.internal.keyresearch.com>

On Tue, Feb 22, 2005 at 07:05:39PM -0500, Steve Cousins wrote:

> Also, am I being overly cautious about having a spare RAID controller on
> hand?

No. It depends on what kind of uptime you want to support.

>  How frequent do RAID controllers go bad compared to disks, power
> supplies and fan modules?

Which of these can you buy at Fry's?

-- greg


From alvin at Mail.Linux-Consulting.com  Tue Feb 22 16:43:34 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Tue, 22 Feb 2005 16:43:34 -0800 (PST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>
Message-ID: <Pine.LNX.3.96.1050222163505.4645B-100000@Maggie.Linux-Consulting.com>


hi ya steve

On Tue, 22 Feb 2005, Steve Cousins wrote:

> That's what I'm shooting for. Anybody have good luck with volumes greater
> than 2 TB with Linux?  I think LSI SCSI cards are needed (?) and the 2.6
> Kernel is needed with CONFIG_LBD=y.  Any hints or notes about doing this
> would be greatly appreciated.  Google has not been much of a friend on
> this, unfortunately. I'm guessing I'd run into NFS limits too.

for files/volumes over 2TB ... it's a question of libs, apps and kernel 
	everything has to work ... which is not always the case

	i don't play much with 2.6 kernels other than on suse-9.x boxes
  
> Also, am I being overly cautious about having a spare RAID controller on
> hand?  How frequent do RAID controllers go bad compared to disks, power
> supplies and fan modules?  I'd guess that it would be very infrequent.

it's always better to have spare parts ... ( part of my requirement ) if
they expect the systems to be available 24x7 ... 

	- more importantly, how long can they wait, when silly inexpensive
	things die, before it gets replaced

	- dead fans are $2.00 - $15 each to replace, to keep the disks cool

	- power supply is $50 range ... but if one bought n+1 power supplies
	then it's supposed to not be an issue anymore, but you will need to
	have its replacement handy

	- raid controllers should NOT die, nor cpu, mem, mb, nic, etc
	and it's not cheap to have these items floating around as spare
	parts

	- ethernet cables will go funky if random people have access
	to the patch panels ... ( keep the fingers away )

	- ups will go bonkers too

	- what failure mode can one protect against and what will happen
	if "it" dies 

	- best protection against downtime for users is to have a
	warm-swap server which is updated hourly or daily ... 
	( my preference - 2nd identical or bigger-disk capacity system )

> Looking back at my own experience I think I've had to return one out of 15
> in the last eight years, and that was bad as soon as I bought it.

seems too high of a return rate ?? 1 out of 15 ??

> If this is too off-topic let me know and I'll move it elsewhere.

ditto here 

24x7x365 uptime compute environment is fun/frustrating stuff on tight
budgets

c ya
alvin


From alvin at Mail.Linux-Consulting.com  Tue Feb 22 16:50:08 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Tue, 22 Feb 2005 16:50:08 -0800 (PST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <20050223003103.GB3291@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.3.96.1050222164513.4645D-100000@Maggie.Linux-Consulting.com>


hi ya greg

On Tue, 22 Feb 2005, Greg Lindahl wrote:

> >  How frequent do RAID controllers go bad compared to disks, power
> > supplies and fan modules?
> 
> Which of these can you buy at Fry's?

if you're buying parts/systems from "fries" or compusa/dell/etc..
then one is in deep kaka ...

 	- consumer grade parts are not as good as "industrial strength"
	that does not necessarily mean higher prices
	
	- fries et al. carry the lower grade junk of the same parts

	marginal mtbf parts of the same identical items, sold by
	higher end distributors vs retail store

	my conspiracy theory about why the same Model xx from Manufacturer
	are good for some and bad for others 
	( it'd depend on where you bought it )

- most consumer stores carry 3ware cards ... but not necessarily
  adaptec/lsi raid cards but some do carry all 3 of 'em but none
  carry the $1K - $20k raid controllers in stock

c ya
alvin


From lusk at mcs.anl.gov  Tue Feb 22 21:12:17 2005
From: lusk at mcs.anl.gov (Rusty Lusk)
Date: Tue, 22 Feb 2005 23:12:17 -0600 (CST)
Subject: [Beowulf] MPICH question
In-Reply-To: <1109035829.18387.5.camel@cattail.clustertech.com>
References: <1109035829.18387.5.camel@cattail.clustertech.com>
Message-ID: <20050222.231217.74740023.lusk@localhost>

From: John Lau <cflau at clustertech.com>
Subject: [Beowulf] MPICH question
Date: Tue, 22 Feb 2005 09:30:29 +0800

> Hi,
> 
> I have a question on the number of processes spawned by MPICH. I am
> using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
> start a mpi process, total 8 processes will be spawned. But only 4
> processes have loadings on CPUs. I would like to know if it is the
> correct behavior of MPICH? And what's the use of the 4 no-loading
> process? Thank you.

The other 4 processes are "listener" processes that permit dynamic
creation of connections as they are needed.  What you see is the correct
behavior.  We recommend that MPICH1 users switch to MPICH2.

Regards,
Rusty Lusk


From joachim at ccrl-nece.de  Wed Feb 23 03:19:55 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed, 23 Feb 2005 12:19:55 +0100
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com>
References: <20050222050824.GA2195@greglaptop.attbi.com>
Message-ID: <421C66DB.5080309@ccrl-nece.de>

Greg Lindahl wrote:
> The first question is: Does an ABI provide enough benefit for people
> to care? To care enough to sit on a committee?

Unfortunately, the value of an ABI is much reduced by the fact that the 
most important target platform Linux itself has no stable ABI (think of 
libc and other version nightmares). On an OS like Solaris or Windows, 
this is much more of a benefit.

Another problem is, e.g., vendor-specific assertions that could conflict. 
A solution for this could be "numerical namespaces" for such extensions, 
but how should they be managed?

And what about the different calling-conventions in Fortran? Different 
library names for each variant? The different symbol names are also a 
problem, but a solvable one if a limited, but sufficient set of 
uppercase-lowercase-underscore permutations is defined.
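
To make the symbol-name point concrete, here is a small sketch that
lists the usual external-name variants a Fortran compiler might emit
for one routine.  The set of four conventions is my assumption about
what a "limited, but sufficient" set could look like; the function name
is hypothetical.

```python
def fortran_symbol_variants(name):
    """Return common linker-level spellings of a Fortran routine name.

    Assumed conventions (not an official list):
      - lowercase, no underscore
      - lowercase, one trailing underscore
      - lowercase, two trailing underscores (g77-style, applied to
        names that already contain an underscore)
      - uppercase, no underscore
    """
    lower = name.lower()
    variants = [lower, lower + "_"]
    if "_" in lower:
        variants.append(lower + "__")
    variants.append(name.upper())
    return variants


if __name__ == "__main__":
    for symbol in fortran_symbol_variants("MPI_Init"):
        print(symbol)
```

An MPI library that wants to serve all of these from one binary would
have to export every spelling for every routine, which is presumably
the "solvable" part.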

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From ashley at quadrics.com  Wed Feb 23 03:31:35 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed, 23 Feb 2005 11:31:35 +0000
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050222.110605.03066059.lusk@localhost>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<20050222.110605.03066059.lusk@localhost>
Message-ID: <1109158295.9025.88.camel@localhost.localdomain>

On Tue, 2005-02-22 at 11:06 -0600, Rusty Lusk wrote:
> From: Greg Lindahl <lindahl at pathscale.com>
> Subject: [Beowulf] The Case for an MPI ABI
> Date: Mon, 21 Feb 2005 21:08:24 -0800
> > This talk:
> > 
> > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf
> > 
> > mostly talks about why we need an ABI, who wins and loses as a result
> > of having one, and the pieces that could be in it. Please give it a
> > look.
> 
> One piece you include is the replacement of the non-portable mpirun with
> an mpistart with standard arguments.  You might note that the MPI-2
> forum addressed this issue with the specification of an mpiexec with
> standard arguments.  MPICH2 implements it.

Can you provide a link to the part of the MPI-2 spec which says what
these arguments are?  I can't seem to find it on-line.

Ashley,


From gropp at mcs.anl.gov  Tue Feb 22 20:21:33 2005
From: gropp at mcs.anl.gov (William Gropp)
Date: Tue, 22 Feb 2005 22:21:33 -0600
Subject: [Beowulf] MPICH question
In-Reply-To: <1109035829.18387.5.camel@cattail.clustertech.com>
References: <1109035829.18387.5.camel@cattail.clustertech.com>
Message-ID: <6.2.1.2.2.20050222221818.04faf120@pop.mcs.anl.gov>

At 07:30 PM 2/21/2005, John Lau wrote:
>Hi,
>
>I have a question on the number of processes spawned by MPICH. I am
>using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
>start a mpi process, total 8 processes will be spawned. But only 4
>processes have loadings on CPUs. I would like to know if it is the
>correct behavior of MPICH? And what's the use of the 4 no-loading
>process? Thank you.

Those four processes are used to listen for connection requests.  They are 
part of the ch_p4 device, which is built on top of the p4 communication 
layer.  The p4 layer is quite venerable (it may be older than you are), and 
predates the wide use of threads.  There is an option to use a thread 
instead of a process to listen for connection requests, but your best bet is 
to switch to MPICH2, which uses a very different (and more scalable and 
modern) architecture.  Specific questions should be directed to 
mpi-maint at mcs.anl.gov (for MPICH) or mpich2-maint at mcs.anl.gov (for MPICH2).

Bill


>Best regards,
>John Lau
>--
>John Lau Chi Fai
>Cluster Technology Ltd.
>cflau at clustertech.com
>Tel: (852) 2994-3727
>Fax: (852) 2994-2101
>

William Gropp
http://www.mcs.anl.gov/~gropp 


From gropp at mcs.anl.gov  Wed Feb 23 05:38:03 2005
From: gropp at mcs.anl.gov (William Gropp)
Date: Wed, 23 Feb 2005 07:38:03 -0600
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <1109158295.9025.88.camel@localhost.localdomain>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<20050222.110605.03066059.lusk@localhost>
	<1109158295.9025.88.camel@localhost.localdomain>
Message-ID: <6.2.1.2.2.20050223073608.04f50400@pop.mcs.anl.gov>

At 05:31 AM 2/23/2005, Ashley Pittman wrote:
>On Tue, 2005-02-22 at 11:06 -0600, Rusty Lusk wrote:
> > From: Greg Lindahl <lindahl at pathscale.com>
> > Subject: [Beowulf] The Case for an MPI ABI
> > Date: Mon, 21 Feb 2005 21:08:24 -0800
> > > This talk:
> > >
> > > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf
> > >
> > > mostly talks about why we need an ABI, who wins and loses as a result
> > > of having one, and the pieces that could be in it. Please give it a
> > > look.
> >
> > One piece you include is the replacement of the non-portable mpirun with
> > an mpistart with standard arguments.  You might note that the MPI-2
> > forum addressed this issue with the specification of an mpiexec with
> > standard arguments.  MPICH2 implements it.
>
>Can you provide a link to the part of the MPI-2 spec which says what
>these arguments are, I can't seem to find it on-line.

It is under "Portable MPI Process Startup"; see 
http://www.mpi-forum.org/docs/mpi-20-html/node42.htm#Node42 .
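
As a quick illustration, that section suggests a startup line of the
form "mpiexec -n <numprocs> <program>", with optional arguments such as
-host and -wdir.  The sketch below only assembles such a command line;
the helper name is made up, and whether a given implementation accepts
every option is something to check locally.

```python
def mpiexec_argv(program, nprocs, host=None, wdir=None, args=()):
    """Build an mpiexec command line in the form suggested by MPI-2.

    Only -n, -host and -wdir are handled here; this is a sketch, not a
    complete rendering of the section.
    """
    argv = ["mpiexec", "-n", str(nprocs)]
    if host is not None:
        argv += ["-host", host]
    if wdir is not None:
        argv += ["-wdir", wdir]
    argv.append(program)
    argv.extend(args)
    return argv


if __name__ == "__main__":
    # "./a.out" is a placeholder program name.
    print(" ".join(mpiexec_argv("./a.out", 4)))
```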

Bill


>Ashley,

William Gropp
http://www.mcs.anl.gov/~gropp 


From cousins at limpet.umeoce.maine.edu  Wed Feb 23 07:36:19 2005
From: cousins at limpet.umeoce.maine.edu (Steve Cousins)
Date: Wed, 23 Feb 2005 10:36:19 -0500 (EST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <421C0D6D.5020008@harddata.com>
Message-ID: <Pine.LNX.4.10.10502230957330.6214-100000@limpet.umeoce.maine.edu>


On Tue, 22 Feb 2005, Maurice Hilarius wrote:

> I am not sure what you are asking here?
> If you had the experience to build this yourself with confidence, then it is not a question.
> And if you do not, then why the uncertainty? You will NEED the support.

My question was simply:

  Does anyone have experience with any or all of these?  Is it worth the
  extra money to have a "burned-in" device supported by some company?  

where "these" referred to Storcase, Infortrend, and Jetstor 16 Bay
SATA-SCSI RAID units.  I've heard good things about Infortrend and Jetstor
but nothing about the Storcase unit so I primarily was interested in
hearing if anyone had used these and what their impression was vs. what
I'd heard about the others.  Based partly on this, I will make a decision
on what to buy.
 
> Or am I missing a vital part of your question?
> 
> BTW, those prices are WAY too high.  Further, there is a very strong
> case now for a purely software RAID. Still use the RAID controllers as
> disk interface controllers, such as LSI or AMCC (Was 3Ware). But use a
> commodity dual CPU motherboard as the "RAID controller" under mdadm.
> LOTS of people are doing that very successfully, and it is both
> economical and generally higher performance than pure RAID controllers
> under RAID5 or 6.

I agree that you can do it cheaper this way however it does have its
drawbacks.  I set up a 2 TB system a year ago using a 3Ware 8506-12 card
and ten 250 GB drives.  It took months to get it to be rock-solid due to a
combination of problems between the 3Ware driver, the 2.6 kernel using
Opterons, and failing Maxtor hard drives as they worked their way to the
bottom of the Bathtub curve. When a drive would fail or show signs of
failing the driver would OOPS the system.  We're now trying to get away
from this mode of storage and transition to more of a SAN system (we may
get FC versions of these boxes instead of SCSI) which will give better
performance to all nodes as well as (hopefully) have less management
overhead (my time).
 
In any case, back to the main question:  I haven't heard of anyone with
experience with the Storcase systems.  Has anyone considered them and
decided to pay more for a name brand?  If so, can you tell me why?

Thanks,

Steve


From josip at lanl.gov  Wed Feb 23 08:26:21 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Wed, 23 Feb 2005 09:26:21 -0700
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.3.96.1050222164513.4645D-100000@Maggie.Linux-Consulting.com>
References: <Pine.LNX.3.96.1050222164513.4645D-100000@Maggie.Linux-Consulting.com>
Message-ID: <421CAEAD.8070207@lanl.gov>

Alvin Oga wrote:
> if you're buying parts/systems from "fries" or compusa/dell/etc..
> than one is in deep kaka ...
> 
>  	- consumer grade parts is not as good as "industrial strength"
> 	that does not necessarily mean higher prices
> 	
> 	- fries et.al carry the lower grade junk of the same parts
> 
> 	marginal mtbf parts of the same identical items, sold by
> 	higher end distributors vs retail store
> 
> 	my conspiracy theory about why the same Model xx from Manufacturer
> 	are good for some and bad for others 
> 	( it'd depend on where you bought it )

My experience with boxed drives bought from retailers was better than 
OEM bare drives from reputable sources.  Retail boxed drives often carry 
3yr warranty, so there is more at stake for the manufacturer if they go bad.

This is just one observation, possibly not statistically significant, 
but my retail store purchases worked out just fine.

Sincerely,
Josip

P.S.  You would not want to build a cluster that way (limited selection 
& higher prices) but for spare parts, retail stores are quick and 
convenient.

----------------------------------------------------------------------
  "Technical data or Software Publicly Available" or "Correspondence".
----------------------------------------------------------------------


From reuti at staff.uni-marburg.de  Tue Feb 22 15:23:41 2005
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed, 23 Feb 2005 00:23:41 +0100
Subject: [Beowulf] MPICH question
In-Reply-To: <1109035829.18387.5.camel@cattail.clustertech.com>
References: <1109035829.18387.5.camel@cattail.clustertech.com>
Message-ID: <1109114621.421bbefda1ca0@home.staff.uni-marburg.de>

Hi,

you are not using shared memory, and so you will also see the rsh tasks. This 
is the usual behavior of mpich (and the other forks are most likely responsible 
for the async network communication between the tasks, although they are just 
on the same machine).

But you can try this: add the following switch when you ./configure 
mpich: -comm=shared , recompile (or at least relink) your program, and use a 
machinefile with only one line:

myhost:4

and you will now have only 4 processes and use shared memory.
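
For what it's worth, the "host:count" machinefile format is easy to
check with a few lines of script.  This is only a sketch of how such a
file could be read (the function name is made up, and treating "#" as a
comment marker is an assumption); it is not how mpirun itself parses it.

```python
def parse_machinefile(text):
    """Parse 'host' or 'host:count' lines into (host, count) pairs.

    Blank lines and anything after '#' are skipped (assumed convention).
    A host without ':count' is taken to offer one process slot.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, sep, count = line.partition(":")
        entries.append((host, int(count) if sep else 1))
    return entries


if __name__ == "__main__":
    machines = parse_machinefile("myhost:4\n")
    print(machines, "total slots:", sum(n for _, n in machines))
```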

Cheers - Reuti


Quoting John Lau <cflau at clustertech.com>:

> Hi,
> 
> I have a question on the number of processes spawned by MPICH. I am
> using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
> start a mpi process, total 8 processes will be spawned. But only 4
> processes have loadings on CPUs. I would like to know if it is the
> correct behavior of MPICH? And what's the use of the 4 no-loading
> process? Thank you.
> 
> Best regards,
> John Lau
> -- 
> John Lau Chi Fai
> Cluster Technology Ltd.
> cflau at clustertech.com
> Tel: (852) 2994-3727
> Fax: (852) 2994-2101
> 
> 


From michael at halligan.org  Tue Feb 22 16:24:48 2005
From: michael at halligan.org (Michael T. Halligan)
Date: Tue, 22 Feb 2005 16:24:48 -0800 (PST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>
References: <fc.004c4d191db8d4273b9aca00148260b1.1db8d428@umit.maine.edu>
	<Pine.LNX.4.10.10502221805160.6214-100000@limpet.umeoce.maine.edu>
Message-ID: <56919.66.150.251.142.1109118288.squirrel@mail3.bitpusher.com>


>> >> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
>> >> vendors who have given me quotes have a SATA-SCSI version for around
>> >> $14,000 each.  I can get a similarly equipped StorCase unit for
>> around
>> >> $11,000. With the StorCase I envision that I'd be more on my own if
>> >> anything went wrong.  However, with the cheaper price, I'd be able to
>> buy
>> >> a spare RAID controller to have on hand in case one of them failed
>> and
>> >> still save a couple grand.
>> >
>> let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ...
>> 	- pata disk is $150-$200 for 300GB disks
>>
>> 	- i'd put 2 or 3 raid controllers in it instead of 1 ..
>> 	unless there is an absolute requirement that there is only
>> 	1 disk subsystem that has the capacity of 6TB on "one disk"
>
> That's what I'm shooting for. Anybody have good luck with volumes greater
> than 2 TB with Linux?  I think LSI SCSI cards are needed (?) and the 2.6
> Kernel is needed with CONFIG_LBD=y.  Any hints or notes about doing this
> would be greatly appreciated.  Google has not been much of a friend on
> this, unfortunately. I'm guessing I'd run into NFS limits too.
>
> Also, am I being overly cautious about having a spare RAID controller on
> hand?  How frequent do RAID controllers go bad compared to disks, power
> supplies and fan modules?  I'd guess that it would be very infrequent.
> Looking back at my own experience I think I've had to return one out of 15
> in the last eight years, and that was bad as soon as I bought it.
>
> If this is too off-topic let me know and I'll move it elsewhere.
>
> Thanks,
>


I've had no problems with RAID volumes 4-8 terabytes in size recently,
but I use the 2.6.8 kernel and only use ICP Vortex cards; I've found the
LSI cards to be sketchy at best.


-------------------
BitPusher, LLC
http://www.bitpusher.com/
1.888.9PUSHER
(415) 724.7998 - Mobile


From maurice at harddata.com  Tue Feb 22 20:58:21 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Tue, 22 Feb 2005 21:58:21 -0700
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <200502222000.j1MK0AEf018502@bluewest.scyld.com>
References: <200502222000.j1MK0AEf018502@bluewest.scyld.com>
Message-ID: <421C0D6D.5020008@harddata.com>

Steve Cousins wrote:
Subject: [Beowulf] RAID storage: Vendor vs. parts


>I'm in the process of getting a couple 16 Bay 6.4 TB RAID units.  The
>vendors who have given me quotes have a SATA-SCSI version for around
>$14,000 each.  I can get a similarly equipped StorCase unit for around
>$11,000. With the StorCase I envision that I'd be more on my own if
>anything went wrong.  However, with the cheaper price, I'd be able to buy
>a spare RAID controller to have on hand in case one of them failed and
>still save a couple grand.
>
>All three of these (the other two are Infortrend and I believe a re-badged
>Jet) use the same Intel i80321 CPU on their controllers.  
>
>All are configured the same (with spare PS module and Fan module)  except
>the StorCase doesn't include a 3 year Express Swap warranty. This is why
>I'd also want to get the spare RAID controller to share between the two
>units in case one went bad.  So, for price comparison, it is probably
>closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase
>units with spares of most everything.
>
>Does anyone have experience with any or all of these?  Is it worth the
>extra money to have a "burned-in" device supported by some company?  
>
>I know this isn't a Beowulf specific question but it seems that storage is
>a big part of Beowulfery and I'd bet that a lot of people are looking into
>similar devices.  I hope it is relevant. 
>
>I'm not after any more quotes from vendors (please don't email or call).
>I'm happy with the alternatives that I have right now.
>
>Thanks,
>
>Steve


I am not sure what you are asking here?
If you had the experience to build this yourself with confidence, then it is not a question.
And if you do not, then why the uncertainty? You will NEED the support.

Or am I missing a vital part of your question?

BTW, those prices are WAY too high.
Further, there is a very strong case now for a purely software RAID. Still use the RAID controllers as disk interface controllers, such as LSI or AMCC (Was 3Ware).
But use a commodity dual CPU motherboard as the "RAID controller" under mdadm.
LOTS of people are doing that very successfully, and it is both economical and generally higher performance than pure RAID controllers under RAID5 or 6.


With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.  FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3


From list-beowulf at onerussian.com  Tue Feb 22 21:44:08 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed, 23 Feb 2005 00:44:08 -0500
Subject: [Beowulf] managing debian packages
Message-ID: <20050223054408.GS3124@washoe.rutgers.edu>

Hello to all Beowulfers,

A simple question: we have cfengine2 to manage configs across the
hosts, but its "packages" section is quite handicapped. So how do you
manage installing packages on nodes which sometimes differ a bit
but most often have the same set of packages? I have the Debian
packaging system in mind.

In my case what I do is:

1. Install the required package on a main node, so if it has any dialog
which tweaks configuration, I adjust it to fit my needs.

2. Run cfengine across the cluster, so the nodes pick up the updated
/var/cache/debconf/config.dat

3. Using my favorite dsh, install the same package in parallel on all the
nodes in non-interactive mode, so it grabs answers to possible
questions from debconf.

This way everything is kinda right.
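
Step 3 above could be sketched roughly as follows. This only builds the per-node command lines and prints them; it does not run anything. Plain ssh stands in for dsh here, and the node names and package name are hypothetical placeholders. DEBIAN_FRONTEND=noninteractive is the standard debconf switch for suppressing dialogs.

```python
import shlex


def noninteractive_install_cmds(nodes, package):
    """Build one ssh command line per node for an unattended apt-get install.

    With DEBIAN_FRONTEND=noninteractive, debconf does not prompt and uses
    the answers it already has (or the package defaults).
    """
    remote = ("DEBIAN_FRONTEND=noninteractive apt-get -y install "
              + shlex.quote(package))
    return [["ssh", node, remote] for node in nodes]


# Hypothetical node and package names, for illustration only.
cmds = noninteractive_install_cmds(["node01", "node02"], "example-package")
for cmd in cmds:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list could be handed to subprocess.run() once the debconf database has been distributed to the nodes; a real dsh run would do the same thing in parallel.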

The other way would be to install on all the nodes from the beginning in
non-interactive mode. I don't like that option because quite often the
default config has to be tweaked in some slight way suggested by debconf,
and if you don't get the dialog, you will not tweak it...

Or is my way indeed overkill?

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]



From streich at uwm.edu  Tue Feb 22 23:12:33 2005
From: streich at uwm.edu (streich at uwm.edu)
Date: Wed, 23 Feb 2005 01:12:33 -0600
Subject: [Beowulf] Re: WW Fedora Student questions help
Message-ID: <1109142753.421c2ce1638f8@panthermail.uwm.edu>

*SNIP*
> but part of this class is also setting up monitoring and benchmarking tools
> and tests to get overall proformance and what not of my cluster...

Check out MRTG and Ganglia on the net for monitoring.  I know that MRTG is
available as a RedHat package that can be installed from the distro's install CDs,
but it is probably best to get the source code and configure it to meet your
needs.  Ganglia is powerful; it is used by a lot of clusters running Rocks.

> And setting up and running some parrallel Applications ... Im very interested
> in
> getting more into clusters so was wondering if anyone has any tools or scripts
> or anything i can setup and test on my fedora warewulf setup just to get
> experience Also i really need some computational or Sim type app's or any type
> of cluster parrallel application i can play with to get the hang of them so i
> can move on to makeing my own applications this would be a HUGE help

Hot topics that use clusters include astronomical models, biology (human
genome stuff), and all sorts of other things.  Our cluster is dedicated to
atmospheric research (in particular, studying the fluffy white clouds that most
atmospheric researchers aren't as interested in as storms and such), running
COAMPS and WRF (MPI-based programs).

I suppose the real question is: what type of problem are you interested in?


From pauln at psc.edu  Tue Feb 22 23:15:48 2005
From: pauln at psc.edu (Paul Nowoczynski)
Date: Wed, 23 Feb 2005 02:15:48 -0500
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <Pine.LNX.3.96.1050222163505.4645B-100000@Maggie.Linux-Consulting.com>
References: <Pine.LNX.3.96.1050222163505.4645B-100000@Maggie.Linux-Consulting.com>
Message-ID: <421C2DA4.8090608@psc.edu>

Alvin Oga wrote:

>hi ya steve
>
>On Tue, 22 Feb 2005, Steve Cousins wrote:
>
>  
>
>>That's what I'm shooting for. Anybody have good luck with volumes greater
>>than 2 TB with Linux?  I think LSI SCSI cards are needed (?) and the 2.6
>>Kernel is needed with CONFIG_LBD=y.  Any hints or notes about doing this
>>would be greatly appreciated.  Google has not been much of a friend on
>>this unfortunatlely. I'm guessing I'd run into NFS limits too.
>>    
>>
>
>for files/volumes over 2TB ... it's a question of libs, apps and kernel 
>	everything has to work ... which is not always the case
>
>  
>
We've got this working at PSC without too much pain, even with SCSI
block devices >2TB.  LBD is needed, but it doesn't solve all the problems
with large disks, especially if you have a single volume which is larger
than 2TB.  The issue we ran into was that many disk-related apps like mdadm
and [s]fdisk don't support the BLKGETSIZE64 ioctl.  So even though your
kernel is using 64 bits, some needed apps are not.

There are also issues with disklabels for devices >2TB.  The normal
DOS-style disklabel used by Linux doesn't support them, so you'll need a
kernel patch for the "plaintext" partition table made by Andries Brouwer.
If you're interested in running this on 2.6 I can give you the patch.

As far as cards go, I think the Adaptec U320 cards are better.  I've seen
less SCSI timeout weirdness with them (this could be related to our
disks).  Performance-wise the LSI and Adaptec are about the same: we see
~400MB/sec when using both channels, even with a sub-PCI-X bus.
For a couple hundred bucks a card this is really good news.
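
For anyone wondering where the 2TB figure comes from, here is the arithmetic as a small sketch. It assumes 512-byte sectors. The older BLKGETSIZE ioctl reports a sector count in an unsigned long (32 bits on a 32-bit system), and the DOS partition table stores start and length as 32-bit sector numbers, so both run out at the same point. The 2 TB and 3 TB device sizes below are illustrative only.

```python
SECTOR_BYTES = 512  # sector size assumed by DOS labels and by BLKGETSIZE


def max_addressable_bytes(count_bits, sector_bytes=SECTOR_BYTES):
    """Exclusive upper bound on the size a count_bits-wide sector count can describe."""
    return (2 ** count_bits) * sector_bytes


def fits_in_32bit_sector_count(device_bytes, sector_bytes=SECTOR_BYTES):
    """True if the device's sector count fits in an unsigned 32-bit field."""
    return device_bytes // sector_bytes < 2 ** 32


limit = max_addressable_bytes(32)
print(limit, "bytes =", limit // 2 ** 40, "TiB")

# Illustrative device sizes in decimal terabytes, as drive vendors count them.
print(fits_in_32bit_sector_count(2 * 10 ** 12))
print(fits_in_32bit_sector_count(3 * 10 ** 12))
```

BLKGETSIZE64 sidesteps this by returning the size in bytes as a 64-bit value, which is why tools that only know the old ioctl misreport large devices.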

--paul

>	i don't play much with 2.6 kernels other than on suse-9.x boxes
>  
>  
>
>>Also, am I being overly cautious about having a spare RAID controller on
>>hand?  How frequent do RAID controllers go bad compared to disks, power
>>supplies and fan modules?  I'd guess that it would be very infrequent.
>>    
>>
>
>it's always better to have spare parts ... ( part of my requirement ) if
>they expect the systems to be available 24x7 ... 
>
>	- more importantly, how long can they wait, when silly inexpensive
>	things die, before it gets replaced
>
>	- dead fans is $2.oo - $15 each to keep the disks cool
>
>	- power supply is $50 range ... but if one bought n+1 powersupply
>	than its supposed to not be an issue anymore, but you will need to
>	have its replacement handy
>
>	- raid controllers should NOT die, nor cpu, mem, mb, nic, etc
>	and it's not cheap to have these items floating around as spare
>	parts
>
>	- ethernet cables will go funky if random people have access
>	to the patch panels ... ( keep the fingers away )
>
>	- ups will go bonkers too
>
>	- what failure mode can one protect against and what will happen
>	if "it" dies 
>
>	- best protection against downtime for users is to have an
>	warm-swap server which is updated a hourly or daily ... 
>	( my preference - 2nd identical or bigger-disk capacity system )
>
>  
>
>>Looking back at my own experience I think I've had to return one out of 15
>>in the last eight years, and that was bad as soon as I bought it.
>>    
>>
>
>seems too high of a return rate ?? 1 out of 15 ??
>
>  
>
>>If this is too off-topic let me know and I'll move it elsewhere.
>>    
>>
>
>ditto here 
>
>24x7x365 uptime compute environment is fun/frustrating stuff on tight
>budgets
>
>c ya
>alvin
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>  
>


From gotero at linuxprophet.com  Wed Feb 23 00:48:53 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Wed, 23 Feb 2005 00:48:53 -0800
Subject: [Beowulf] New BioBrew Release
Message-ID: <0b130210c33a2235b11fc4ff101381f2@linuxprophet.com>

BioBrew-3.1 for x86 is here.

BioBrew is an open source Linux cluster distribution based on the 
popular Rocks (www.rocksclusters.org) cluster software and enhanced for 
bioinformatics. BioBrew includes popular cluster software e.g. MPICH, 
PVM, Modules, PVFS, Myrinet GM, Sun Grid Engine, gcc, Ganglia, and 
Globus, *and* popular bioinformatics software e.g. the NCBI toolkit, 
BLAST, mpiBLAST, HMMER, ClustalW, GROMACS, PHYLIP, WISE, FASTA, 
MrBayes, and EMBOSS. A BioBrew DVD iso for x86 is freely available for 
download at BioBrew.org, a Bioinformatics.org sponsored and hosted 
website. README and INSTALL docs are also available on the website.

Features you'll find in this release that differ slightly from 
Rocks-3.1 include:
-Infiniband support from Mellanox (/usr/mellanox)
-Myrinet support with gm-2.0.11
-Virtual Machine Interface (VMI) 2.0. (/opt/vmi-2.0-gcc)
-mpich built for VMI (/opt/mpich-vmi-2.0-gcc)

Application upgrades for 3.1:
added modules back to distro with modulesenv 3.1.6-2
upgraded modulefiles to 1.0.2, including corrections to profile.d 
scripts submitted by Humberto Zuazaga
upgraded hmmer to 2.3.2-1
upgraded gromacs to 3.2.1-1
upgraded EMBOSS to 2.9.0-6; EMBOSS now installs to /usr/share and not 
/opt/BioBrew
upgraded NCBI BLAST (with ncbitools 6.1.0-2 and ncbi.tar.gz from 
10/20/04)
upgraded mpiBLAST to 1.3; mpiBLAST lives under /opt/NCBI/6.1.0/bin with 
the rest of the binaries
upgraded Phylip to 3.61-5 and separated it from EMBOSS

New applications:
MrBayes 3.0-1 has been added to BioBrew

BioBrew rolls are also here:
BioBrew rolls for Rocks-3.1 and Rocks-3.3 on x86 are also available on 
the website, as is a BioBrew roll for Rocks-3.3 on x86_64. The rolls 
contain the same bio apps as the full BioBrew release. But know that if 
you build a cluster with Rocks + BioBrew roll, you will not have access 
to the vmi, mpich-vmi, Infiniband, or Myrinet packages in the full 
BioBrew release. You will be relying on Rocks' Infiniband, mpich, and 
Myrinet support, which is slightly different than BioBrew's, when using 
these rolls. The BioBrew rolls include the SRPMS as well as the RPMS.

This is the first BioBrew release for x86_64:
The BioBrew roll for Rocks-3.3 on x86_64 includes the same apps as the 
x86 releases, except for Java and EMBOSS. EMBOSS is not included 
because Java for Rocks-3.3-x86_64 was not available at the time this 
release was built.

Future BioBrew development is likely to consist of BioBrew rolls only. 
I've got more apps planned. I'm also soliciting requests. Hint, hint...

Thanks to Joe Landman of Scalable Informatics and Luc Ducazu of 
BioLinux for the SRPMS they make available to the community, and from 
which I borrowed liberally. They helped make this release a reality. 
Thanks to the beta testers that downloaded the releases over the past 
few weeks and pointed out a few problems. Thanks to Aaron Darling and 
Humberto Zuazaga for their help with mpiBLAST and Modules, 
respectively. Special thanks to the folks running BioBrew mirrors!

Glen Otero Ph.D.
Linux Prophet

From reuti at staff.uni-marburg.de  Wed Feb 23 00:23:07 2005
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed, 23 Feb 2005 09:23:07 +0100
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
In-Reply-To: <4219201F.70105@metrumrg.com>
References: <4214A0E5.3010804@metrumrg.com> <421516B7.3060906@sonsorol.org>
	<4219201F.70105@metrumrg.com>
Message-ID: <1109146987.421c3d6b7c926@home.staff.uni-marburg.de>

Hi,

maybe this is of help:

http://noel.feld.cvut.cz/magi/sge+bproc.html

Cheers - Reuti


Quoting BillKnebel <billk01 at metrumrg.com>:

> 
> Chris,
> 
> I was able to get grid engine to run on the Scyld cluster using the 
> approach of setting the master (head) node as the submit, admin, and 
> execute host.  Unfortunately, starting a set of jobs on the cluster 
> results in all jobs being run on the head node only (if grid engine only 
> commands are used) or I can integrate grid engine "qsub" command with  
> some of the Scyld tools to get jobs started then migrated ( to a point) 
> over the cluster.  However, I am still running into problems because all 
> of the queueing variables for grid engine read the headnode info and 
> since all jobs run on the compute nodes, the headnode appears to be 
> always free which results in all jobs being started at once. This is not 
> ideal. 
> 
> I am waiting on some feedback from Scyld/Penguin computing on some 
> related issues that will hopefully solve some of these problems. 
> 
> Bill
> Chris Dagdigian wrote:
> 
> >
> > I know Grid Engine well but not Scyld so forgive my ignorance if I say 
> > something stupid and given the level of expertise on this list I'm 
> > quite certain I'm about to make a fool of myself :)
> >
> > If Scyld is presenting you with a single system image (ie a single 
> > linux server that can farm out tasks to all those nodes) then you 
> > would install SGE in the same way that you would install it on a big 
> > SMP box:
> >
> > 1. Install the SGE qmaster and scheduler on the master node
> > 2. Install the execution host on the master node as well
> >
> > You will only have 1 execd per queue but each queue can be configured 
> > with N number of "job slots" which actually control how many jobs can 
> > run at the same time on the same machine.
> >
> > Try setting your # of job slots within your single SGE queue to the 
> > number of nodes in your cluster. This is similar to what you would do 
> > on a big SMP machine -- small number of queues each supporting a 
> > decent jobslot count.
> >
> > Then submit a bunch of jobs and see if SGE causes the master node to 
> > fall over under load. If not then Scyld is doing its thing behind the 
> > scenes to migrate stuff around to the other nodes.
> >
> > -Chris
> >
> >
> >
> > billk01 wrote:
> >
> >> I am in the process of installing SGE on a Scyld beowulf cluster.  As
> >> most people are aware, the Scyld cluster runs a complete OS (linux) only
> >> on the master node and the compute nodes are simply for executing.
> >> During the SGE install, it requires adding the compute nodes as execute
> >> hosts.  I do not understand how to do this given the current setup of a
> >> scyld cluster since you can't "login" to the nodes to execute the
> >> install script.  The script does exist on an NFS shared directory
> >> (cluster wide).  Has anybody else ran into this problem?
> >>
> >
> >
> >
> 


From eugen at leitl.org  Wed Feb 23 12:02:46 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 23 Feb 2005 21:02:46 +0100
Subject: [Beowulf] [BioBrew-discuss] Re: Biobrew 3.1 (fwd from
	jeff@bioinformatics.org)
Message-ID: <20050223200245.GJ1404@leitl.org>


Site's hammered; no mirrors nor torrents, so won't be of much use yet.

----- Forwarded message from "J.W. Bizzaro" <jeff at bioinformatics.org> -----

From: "J.W. Bizzaro" <jeff at bioinformatics.org>
Date: Wed, 23 Feb 2005 14:42:11 -0500
To: The Virtual BioBrew Think Tank <biobrew-discuss at bioinformatics.org>
Subject: [BioBrew-discuss] Re: Biobrew 3.1
User-Agent: Mozilla Thunderbird 1.0 (X11/20041206)
Reply-To: The Virtual BioBrew Think Tank <biobrew-discuss at bioinformatics.org>

Hi guys.

Since the DVD ISOs are > 2 GB, you need to use FTP:

  ftp://ftp.bioinformatics.org/pub/biobrew/BioBrew-v3.1/x86/

Apache is probably responsible for all of the screwy errors vis-a-vis these 
files.

Cheers.
Jeff

>On Feb 23, 2005, at 10:59 AM, Eugen Leitl wrote:
>    The access rights on
>    http://ftp.bioinformatics.org/pub/biobrew/BioBrew-v3.1/x86/BioBrew-Pro-3.1.0.i386.iso
>
>    seem to be screwed.

-- 
J.W. Bizzaro
Bioinformatics Organization, Inc. (Bioinformatics.Org)
E-mail: jeff at bioinformatics.org
Phone:  +1 508 890 8600
--
_______________________________________________
BioBrew-discuss mailing list
BioBrew-discuss at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/biobrew-discuss

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From lindahl at pathscale.com  Wed Feb 23 12:04:41 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 23 Feb 2005 12:04:41 -0800
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421C66DB.5080309@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<421C66DB.5080309@ccrl-nece.de>
Message-ID: <20050223200440.GD2227@greglaptop.internal.keyresearch.com>

On Wed, Feb 23, 2005 at 12:19:55PM +0100, Joachim Worringen wrote:

> Unfortunately, the value of an ABI is much reduced by the fact that the 
> most important target platform Linux itself has no stable ABI (think of 
> libc and other version nightmares). On a OS like Solaris or Windows, 
> this is much more of a benefit.

I don't think it's "much reduced" by this, but I think it's clear this
would be a matter of opinion. What you'll definitely be able to do is
run an application built on a particular Linux version with different
MPI libraries compiled for that same Linux version.  You are correct
that if the MPI library was built for a wildly different Linux distro
than the app, you can't necessarily put them together.

> Another problem are i.e. vendor-specific assertions that could conflict. 
> A solution for this could be "numerical namespaces" for such extensions, 
> but how should they be managed?

This is certainly something that a committee would discuss. There are
plenty of examples of this problem being solved successfully by
handing out numeric ranges.
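
To make "handing out numeric ranges" concrete, here is a toy sketch of such a registry. Everything in it (the class name, the starting value, the block size, the vendor names) is a hypothetical illustration, not a proposal for how an MPI ABI committee would actually number extensions.

```python
class RangeRegistry:
    """Hands out disjoint, fixed-size numeric ranges, one per vendor."""

    def __init__(self, start, block_size):
        self._next = start        # first value not yet assigned
        self._block = block_size  # how many values each vendor receives
        self._owners = {}         # vendor name -> assigned range

    def allocate(self, vendor):
        """Return the vendor's range, assigning a fresh one on first request."""
        if vendor not in self._owners:
            self._owners[vendor] = range(self._next, self._next + self._block)
            self._next += self._block
        return self._owners[vendor]

    def owner_of(self, value):
        """Return the vendor owning value, or None if it is unassigned."""
        for vendor, assigned in self._owners.items():
            if value in assigned:
                return vendor
        return None


# Hypothetical vendors and numbering, for illustration only.
registry = RangeRegistry(start=1000, block_size=100)
range_a = registry.allocate("vendor_a")
range_b = registry.allocate("vendor_b")
print(range_a, range_b, registry.owner_of(1150))
```

Because the ranges never overlap, two vendors' extension values cannot collide, and any value can be traced back to whoever defined it.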

> And what about the different calling-conventions in Fortran?

The calling conventions differences (in Linux) revolve around the
f2c-abi issue, and it so happens that no MPI routines trip on this
issue, as it only affects functions that return REAL*4 or COMPLEX
types. Did I miss a function that has those return types?

-- greg


From rgb at phy.duke.edu  Wed Feb 23 12:09:34 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 23 Feb 2005 15:09:34 -0500 (EST)
Subject: [Beowulf] managing debian packages
In-Reply-To: <20050223054408.GS3124@washoe.rutgers.edu>
References: <20050223054408.GS3124@washoe.rutgers.edu>
Message-ID: <Pine.LNX.4.58.0502231504460.15737@ganesh.phy.duke.edu>

On Wed, 23 Feb 2005, Yaroslav Halchenko wrote:

> Hello to all Beowulfers,
> 
> A simple question: so we have cfengine2 to manage configs  through the
> hosts. But its "packages" section is quite handicaped so there is a
> question: how do you manage installing packages on  nodes which
> some times might differ a bit but most often have the same set of
> packages. I have in mind Debian packaging system

Does this mean you only want answers that apply to Debian-based
clusters?  I mean, another answer for rpm-based systems might involve
kickstart and yum, with or without cfengine or dsh.  Another KIND of
possibility altogether is warewulf (beowulf in a non-commercial box),
which runs a single template on all nodes (so updating the template
updates everything).  Still another is e.g. Scyld (beowulf in a
commercial box).  And this still isn't exhaustive, I'm sure.

So there are many ways to do it, but which sort of solution you look for
is likely predicated as much on the particular Linux distro you choose
for a base as anything, and beyond that on whether you choose to use a
cluster-specific packaging that manages all this with provided tools.

   rgb

> 
> In my case what I do is
> 
> 1. Install required package on a main node, so if it has any dialog
> which tweaks configuration - I adjust it so it fits my needs.
> 
> 2. I ran cfegines through the cluster, so they pick up updated 
> /var/cache/debconf/config.dat
> 
> 3. Using favorite dsh I install the same package in parallel on all the
> nodes in non-interactive regime, so it grabs answers for possible
> questions from debconf.
> 
> This way everything is kinda right.
> 
> Other ways would be: install on all the nodes from the beginning in
> non-interactive. I don't like such option because it is quite often that
> default config has to be tweaked in  slight way suggested by debconf,
> but if you don't get dialog - you will not tweak it... 
> 
> or indeed my way is overkill?
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Wed Feb 23 12:15:13 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 23 Feb 2005 15:15:13 -0500 (EST)
Subject: [Beowulf] Re: WW Fedora Student questions help
In-Reply-To: <1109142753.421c2ce1638f8@panthermail.uwm.edu>
References: <1109142753.421c2ce1638f8@panthermail.uwm.edu>
Message-ID: <Pine.LNX.4.58.0502231510320.15737@ganesh.phy.duke.edu>

On Wed, 23 Feb 2005 streich at uwm.edu wrote:

> *SNIP*
> > but part of this class is also setting up monitoring and benchmarking tools
> > and tests to get overall proformance and what not of my cluster...
> 
> Check out MRTG and Ganglia on the net for monitoring.  I know that MRTG is
> avalible as a RedHat package that can be installed from distro's install CDs,
> but it is probably best to get it in source code and configure it to meet your
> needs.  Ganglia is powerful, it is used by a lot of clusters running ROCKs.

Or you can try wulfware (xmlsysd and e.g. wulfstat or wulflogger).
Depends on what you want to do with the data.  wulflogger makes it very
easy to log a variety of cluster load metrics into a file at a selected
interval, in case you want to actually run programs to analyze it or
plot it with standalone tools.

It should be prebuilt for FC2, RH9 and CentOS 3.3 here:

  http://www.phy.duke.edu/~rgb/Beowulf/wulfware.php

(or available in source rpm or tarball if you prefer).  I've tried to
make this a yummified repository, so you can if you wish autoupdate from
it via yum, or of course you can just grab source or binary rpms and put
them into a local repository.

    rgb

> 
> > And setting up and running some parrallel Applications ... Im very interested
> > in
> > getting more into clusters so was wondering if anyone has any tools or scripts
> > or anything i can setup and test on my fedora warewulf setup just to get
> > experience Also i really need some computational or Sim type app's or any type
> > of cluster parrallel application i can play with to get the hang of them so i
> > can move on to makeing my own applications this would be a HUGE help
> 
> Hot topics that are using clusters include astronomical models, biology (human
> genome stuff), and all sorts of things.  Our cluster is dedicated to
> atmospheric research (in particular studying fluffy white clouds most
> atmospheric researchers aren't as intrested in as storms and such) running
> COAMPs and wrf (MPI based programs).
> 
> I suppose the real question is: what type of problem are you intrested in?
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From landman at scalableinformatics.com  Wed Feb 23 12:32:42 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 23 Feb 2005 15:32:42 -0500
Subject: [Beowulf] final CFP: HiPCoMB-2005
Message-ID: <421CE86A.7000707@scalableinformatics.com>

I didn't see it posted here, so I figured I would send it out.
Apologies to those with low meeting-spam tolerance.


********************************************************************************
We apologize if you received multiple copies of this Call for Papers.
Please feel free to distribute it to those who might be interested.
********************************************************************************
___________________________________________________________________

                          CALL FOR PAPERS

1st IEEE Workshop on High Performance Computing in Medicine and Biology
                               (HiPCoMB-2005)


                         held in conjunction with
The 11th International Conference on Parallel and Distributed Systems
                                (ICPADS 2005)

         Fukuoka Institute of Technology (FIT), Fukuoka, Japan
                              July 20-22, 2005

HiPCoMB-2005 Home Page: http://www.pdcl.wayne.edu/HiPCoMB-2005

_______________________________________________________________________

IMPORTANT DEADLINES:

    Paper submission:        February 28, 2005
    Author Notification:     April 07, 2005
    Camera-Ready Papers:     April 21, 2005

WORKSHOP INFORMATION:

The First Workshop on High Performance Computing in Medicine and
Biology (HiPCoMB-05), held in conjunction with ICPADS 2005 in Fukuoka
City, Japan, brings together researchers in computer science and
engineering, medicine, and biology who use high performance computing
to solve computationally expensive problems in medicine and biology.
The workshop will provide a forum for presenting and exchanging new
ideas and experiences in this area.

Topics of interest include, but are not limited to, high performance
algorithms, systems, architecture, and tools for the following:

      * Microarray Analysis
      * RNAi Analysis
      * Systems Biology
      * Computational Genomics
      * Comparative Genomics
      * DNA Assembly, Clustering, and Mapping
      * Gene identification and annotation
      * Computational Proteomics
      * Evolution and Phylogenetics
      * Protein Structure Prediction and Modeling
      * Medical Image Processing
      * Computer Assisted Surgery
      * Computational Medicine Modeling
      * Computational Biology Modeling
      * Augmented Reality
      * Medical Informatics

SUBMISSION INFORMATION:

Talks will be accepted on the basis of a paper of approximately 15
single-column pages that describes the work, its significance, and the
current status of the research.  Submit one electronic copy of the
paper in PostScript or PDF format by February 28, 2005.  Please visit
the workshop home page for submission instructions.

Notification of acceptance will be given by April 7, 2005, and
camera-ready papers will be due April 21, 2005.  Authors of accepted
papers will be given guidelines for preparing and submitting the final
manuscript(s) together with the notification of acceptance.  All
accepted papers will be presented at the workshop and included in
proceedings that will be distributed at the workshop.  In addition,
authors of selected papers from the workshop will be invited to submit
extended versions of their papers for publication in a special issue
of the International Journal of Bioinformatics Research and
Applications.


GENERAL INFORMATION:

      GENERAL CO-CHAIRS:
         Laurence T. Yang      St. Francis Xavier University, Canada
                                     email: lyang at stfx.ca

         Albert Zomaya              University of Sydney, Australia
                                     email:  zomaya at it.usyd.edu.au

      PROGRAM CO-CHAIRS:
          Vipin Chaudhary       Wayne State University, USA
                                      email: vipin at wayne.edu
          Andrei Doncescu       LAAS, National Center for Scientific
                                      Research, France
                                      email:  adoncesc at laas.fr
          Yi Pan                Georgia State University, USA
                                      email:  pan at cs.gsu.edu

      PROGRAM COMMITTEE MEMBERS:

         David Abramson, Monash University, Australia
         davida at csse.monash.edu.au

         Enrique Alba, University of Malaga, Spain
         eat at lcc.uma.es

         Srinivas Aluru, Iowa State University, USA
         aluru at iastate.edu
         http://www.ece.iastate.edu/~aluru

         Shahid H. Bokhari, University of Engineering & Technology, Pakistan
         shb at acm.org

         Vincent Breton, CNRS/IN2P3, LPC Clermont-Ferrand, France
         breton at clermont.in2p3.fr

         Kevin Burrage, University of Queensland, Australia
         kb at maths.uq.edu.au

         Amitava Datta, University of Western Australia, Australia
         datta at csse.uwa.edu.au

         Hans de Sterck, University of Waterloo, Canada
         hdesterck at math.uwaterloo.ca

         Mario Rosario Guarracino, ICAR-CNR, Italy
         mario.guarracino at cps.na.cnr.it
         http://pixel.dma.unina.it/~mariog/

         Ryoko Hayashi, Advanced Institute of Science and Technology (JAIST),
         Japan
         ryoko at jaist.ac.jp

         Matthew He, Nova Southeastern University, USA
         hem at nsu.nova.edu

         Alfons Hoekstra, University of Amsterdam, The Netherlands
         alfons at science.uva.nl


         Xiaohua (Tony) Hu, Drexel University, USA
         thu at cis.drexel.edu
         http://www.cis.drexel.edu/faculty/thu

         Chun-Hsi Huang, University of Connecticut, USA
         huang at engr.uconn.edu

         Arun Krishnan, Bioinformatics Institute, Singapore
         arun at bii.a-star.edu.sg

         Joseph Landman, Scalable Informatics, LLC
         landman at scalableinformatics.com

         Wenjun Li, UT Southwestern Medical Center, USA
         liwenjun2k at yahoo.com


         Yiming Li, National Chiao Tung University, Taiwan
         ymli at mail.nctu.edu.tw


         Robert L. Martino, National Institutes of Health, USA
         Robert.Martino at nih.gov

         Maria Mirto, University of Lecce, Italy
         maria.mirto at unile.it

         Michael Mascagni, Florida State University, USA
         mascagni at fsu.edu

         Martin Middendorf, University of Leipzig, Germany
         middendorf at informatik.uni-leipzig.de

         Giri Narasimhan, Florida International University, USA
         giri at cs.fiu.edu

         Jun Ni, University of Iowa, USA
         jun-ni at uiowa.edu

         Sergei Petoukhov, Russian Academy of Sciences, Russia
         petoukhov at hotmail.com

         Pascal Poullet, French West Indies University, France
         Pascal.Poullet at univ-ag.fr

         Youxing Qu, University of Georgia, USA
         youxing at csbl.bmb.uga.edu

         Nagiza Samatova, Oak Ridge National Lab, USA
         samatovan at ornl.gov

         Bertil Schmidt, Nanyang Technological University, Singapore
         ASBSchmidt at ntu.edu.sg

         Tony Solomonides, University of the West of England, UK
         Tony.Solomonides at uwe.ac.uk

         El-Ghazali Talbi, LIFL, France
         talbi at lifl.fr

         Daming Wei, University of Aizu, Japan
         dm-wei at u-aizu.ac.jp

         Tiffani Williams, University of New Mexico, USA
         tlw at cs.unm.edu

         C. M. Yang, Nankai University, China
         yangchm at nankai.edu.cn

         Yanqing Zhang, Georgia State University, USA
         yzhang at cs.gsu.edu

         Bingbing Zhou, University of Sydney, Australia
         bbz at it.usyd.edu.au

Information related to ICPADS 2005 is available at the official ICPADS
2005 Web site: http://www.takilab.k.dendai.ac.jp/conf/icpads/2005/

ICPADS 2005 is co-sponsored by IEEE Computer Society TCDP and TCPP,
and Fukuoka Institute of Technology (FIT), in cooperation with Fukuoka
City, IPSJ (Information Processing Society of Japan) SIGDPS, IEEE
Taipei Section, IEEE Hong Kong Section, SCAT, and AOARD/AOR.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


From alvin at Mail.Linux-Consulting.com  Wed Feb 23 12:43:20 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Wed, 23 Feb 2005 12:43:20 -0800 (PST)
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <421CAEAD.8070207@lanl.gov>
Message-ID: <Pine.LNX.3.96.1050223123612.8358A-100000@Maggie.Linux-Consulting.com>


hi ya josip

On Wed, 23 Feb 2005, Josip Loncaric wrote:

> My experience with boxed drives bought from retailers was better than 
> OEM bare drives from reputable sources.  Retail boxed drives often carry 
> 3yr warranty, so there is more at stake for the manufacturer if they go bad.

warranty is the same from retailers or wholesalers

"reputable sources" that i'm talking about are the ones where
you are required to have a reseller permit to buy from them
and some do not have a website for consumers to buy from

the big question is does one call ibm or the place you bought it from
to return the item 

to me, the parts should NEVER fail within the warranty period
	- exceptions are doa when it's first powered on
	which has < 1% failure

	- exception was the ibm deathstar disks 
	( that resulted in a class action suit against ibm )

	we don't have any problems with our hardware ... other than
	deathstars and we buy 1,000's of um

> P.S.  You would not want to build a cluster that way (limited selection 
> & higher prices) but for spare parts, retail stores are quick and 
> convenient.

yup .. which is why you want to buy cots parts .... and nothing specific
to the system integrator that sells their modified versions;
those "modified/customized" systems are where all our
service/support calls to go fix things have come from

when a part dies, you want to go to the local pc store, buy
the $10 item or $100 replacement disk, and have people working
again in a matter of an hour or three .. etc ..

c ya
alvin 


From gmpc at sanger.ac.uk  Wed Feb 23 12:56:53 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Wed, 23 Feb 2005 20:56:53 +0000 (GMT)
Subject: [Beowulf] managing debian packages
In-Reply-To: <20050223054408.GS3124@washoe.rutgers.edu>
References: <20050223054408.GS3124@washoe.rutgers.edu>
Message-ID: <Pine.OSF.4.44.0502232034100.1848273-100000@ecs2c.internal.sanger.ac.uk>

On Wed, 23 Feb 2005, Yaroslav Halchenko wrote:

> Hello to all Beowulfers,
>
> A simple question: we have cfengine2 to manage configs across the
> hosts, but its "packages" section is quite handicapped, so the
> question is: how do you manage installing packages on nodes which
> sometimes differ a bit but most often have the same set of
> packages? I have in mind the Debian packaging system.


For the installs, take a look at FAI; it is the automated debian
installer. It is extremely flexible, so you can hack it about to do pretty
much anything you want.

Once the machines are up, dsh is a fine way of keeping things up to date
or installing new stuff.  If you don't want to be pestered by
configuration questions during package installs, you can change that
behaviour in /etc/debconf.conf to make everything non-interactive.


There is also a rather nifty debian package called jablicator. You run it
on a master system and it creates an empty deb file which depends on every
deb installed.

If you then install that deb on another machine, it will auto-magically
install whatever it needs to make it a clone of the master. It probably
won't handle config files, but dsh/cfengine can handle that part for you.
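
The idea behind such a metapackage can be sketched in a few lines of
Python. This is not jablicator's actual implementation; the package name
"cluster-clone", the version, and the maintainer address are made up for
illustration, and the input is assumed to be the text printed by
`dpkg --get-selections` on the master.

```python
def parse_selections(text):
    """Return the names of packages marked 'install' in `dpkg --get-selections` output."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "install":
            names.append(fields[0])
    return sorted(names)


def render_control(package, depends, version="1.0",
                   maintainer="Cluster Admin <root@localhost>"):
    """Render a Debian control stanza for an empty package depending on `depends`.

    The package name, version and maintainer are hypothetical placeholders.
    """
    lines = [
        f"Package: {package}",
        f"Version: {version}",
        "Architecture: all",
        f"Maintainer: {maintainer}",
    ]
    if depends:  # an empty Depends field would not be valid, so omit it
        lines.append("Depends: " + ", ".join(depends))
    lines.append("Description: empty package that pulls in the master's package set")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "bash\t\tinstall\nlam-runtime\tinstall\noldpkg\t\tdeinstall\n"
    print(render_control("cluster-clone", parse_selections(sample)), end="")
```

Installing a package built from such a stanza would make apt pull in
everything listed under Depends; as noted above, config files still need
to be handled separately.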

Cheers,

Guy


-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199


From mathog at mendel.bio.caltech.edu  Wed Feb 23 15:28:07 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed, 23 Feb 2005 15:28:07 -0800
Subject: [Beowulf] Re: S2466 Wake on Lan working, anyone?
Message-ID: <E1D45vL-00024l-00@mendel.bio.caltech.edu>

One other tidbit, both the node with enable_wol=1 and the other
nodes show exactly the same thing for "lspci -vv", which is:

02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
        Subsystem: Tyan Computer: Unknown device 2466
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 80 (2500ns min, 2500ns max), cache line size 10
        Interrupt: pin A routed to IRQ 19
        Region 0: I/O ports at 2000 [size=128]
        Region 1: Memory at f6001000 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-
As I understand it, the status should have been D3 (from some
old posts by Donald Becker).  Or maybe it should be in D3 at
shutdown, which of course I can't see because, um, the system
is shut down!  The node was started with this lilo entry:


image=/boot/vmlinuz
        label="linuxserial"
        root=/dev/hda3
        initrd=/boot/initrd.img
        append="devfs=mount acpi=on resume=/dev/hda2 splash=silent console=tty0 console=ttyS0,38400"
        vga=788
        read-only

How does one test that enable_wol is actually being
set during a shutdown initiated by poweroff?
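
One empirical check, independent of the driver option, is to power the
node off and send it a Wake-on-LAN magic packet from another machine on
the same segment; if the node boots, WOL was armed. A minimal Python
sketch follows. The packet layout (six 0xFF bytes followed by the MAC
address repeated sixteen times) is the standard one; the UDP port 9, the
broadcast address, and any MAC you pass in are assumptions to adjust for
the local network.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common, assumed choice)."""
    packet = magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_magic_packet() with the MAC address of the 3c905C on the
powered-off node, from a host on the same subnet, is enough for the test.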

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From hvidal at tesseract-tech.com  Thu Feb 24 05:28:14 2005
From: hvidal at tesseract-tech.com (H.Vidal, Jr.)
Date: Thu, 24 Feb 2005 08:28:14 -0500
Subject: [Beowulf] test
Message-ID: <421DD66E.6000302@tesseract-tech.com>

sorry, had a lapsed domain and wanted to see if I am still here..


From joachim at ccrl-nece.de  Thu Feb 24 05:30:09 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 24 Feb 2005 14:30:09 +0100
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050223200440.GD2227@greglaptop.internal.keyresearch.com>
References: <20050222050824.GA2195@greglaptop.attbi.com>	<421C66DB.5080309@ccrl-nece.de>
	<20050223200440.GD2227@greglaptop.internal.keyresearch.com>
Message-ID: <421DD6E1.5000200@ccrl-nece.de>

Greg Lindahl wrote:
> I don't think it's "much reduced" by this, but I think it's clear this
> would be a matter of opinion. What you'll definitely be able to do is
> run an application built on a particular Linux version with different
> MPI libraries compiled for that same Linux version.  You are correct
> that if the MPI library was built for a wildly different Linux distro
> than the app, you can't necessarily put them together.

This problem left apart, do you know of ISV's that would at least be 
willing to think about giving support to an MPI ABI no matter which 
implementation and interconnect, and not a specific MPI library? Because 
this is what matters.

For open source software packages alone, an ABI is not of critical 
importance, as people with a tcp/ip cluster can use pre-linked packages, 
and people with a high-performance interconnect cluster typically have 
enough competence to compile the software themselves.

>>Another problem are i.e. vendor-specific assertions that could conflict. 
>>A solution for this could be "numerical namespaces" for such extensions, 
>>but how should they be managed?
> 
> This is certainly something that a committee would discuss. There are
> plenty of examples of this problem being solved successfully by
> handing out numeric ranges.

Well, for MAC addresses, PCI device ids etc., there are professional 
organisations that take care of this. For MPI, there is no such institution. 
ANL? Maybe.

But maybe there's another technical solution, if the linked library 
could somehow know which variant of mpi.h the code was compiled against, 
which would then determine the meaning of all assertions beyond 1024 (or 
some other limit). Something coded into MPI_Init() or its arguments 
might be a way.. hacky, hacky.
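
As a rough illustration of that idea (in Python rather than C, and not
taken from any real MPI library): the library keeps one table per mpi.h
variant and uses the variant recorded at init time to interpret values
above the shared limit. The variant names and assertion names below are
entirely hypothetical; only the limit of 1024 comes from the text above.

```python
STANDARD_LIMIT = 1024  # values up to this limit have one shared meaning

# Hypothetical per-variant tables for values beyond the limit.
VENDOR_ASSERTIONS = {
    "vendor_a": {1025: "A_NO_LOCKS", 1026: "A_EAGER_SEND"},
    "vendor_b": {1025: "B_PINNED_BUFFERS"},
}


def decode_assertion(value, header_variant):
    """Interpret an assertion value given the mpi.h variant the code was built with."""
    if value <= STANDARD_LIMIT:
        return ("standard", value)
    table = VENDOR_ASSERTIONS.get(header_variant, {})
    if value not in table:
        raise ValueError(f"assertion {value} is undefined for variant {header_variant!r}")
    return (header_variant, table[value])
```

The same number 1025 then decodes differently depending on the header
variant, which is exactly the conflict the "numerical namespaces"
question is about.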

>>And what about the different calling-conventions in Fortran?
> 
> The calling conventions differences (in Linux) revolve around the
> f2c-abi issue, and it so happens that no MPI routines trip on this
> issue, as it only affects functions that return REAL*4 or COMPLEX
> types. Did I miss a function that has those return types?

I did not think of this, but more of issues like "string as an argument", 
since the way the string length is passed is not standardized. Then 
there are issues with getting access to global variables from COMMON 
blocks etc., which are hard (if possible at all) to solve with one shared 
object file for multiple compilers. We currently need to link a small 
extra object file depending on the compiler used.

This does not mean that we should not continue thinking about an ABI, 
but there's more than unifying mpi.h to be able to use a single shared 
library.

   Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From ashley at quadrics.com  Thu Feb 24 06:20:12 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Thu, 24 Feb 2005 14:20:12 +0000
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DD6E1.5000200@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<421C66DB.5080309@ccrl-nece.de>
	<20050223200440.GD2227@greglaptop.internal.keyresearch.com>
	<421DD6E1.5000200@ccrl-nece.de>
Message-ID: <1109254812.12270.46.camel@localhost.localdomain>

On Thu, 2005-02-24 at 14:30 +0100, Joachim Worringen wrote:

> For open source software packages alone, an ABI is not of critical 
> importance as people with a tcp/ip cluster can use pre-linked packages, 
> and people with a high-performance interconnect cluster typically have 
> enough competence to compile the software themselves.

It's not about competence, it's about time and effort spent.  I wouldn't
need to compile every application myself if there was an ABI.  It's not a
particularly difficult thing to do, it's just going through the hoops of
doing it every time you need an application.  The ability to install a
cluster and type '[apt-get|yum] install pmb' would be a truly wonderful
thing indeed.

You also make the assumption that it's the high-performance vendors who
do things differently, I don't believe this is the case.  Quadrics for
example (my employer) happen to use whatever ABI MPICH (1.2.x) provides
as we have never had a reason to modify it.  I believe the same holds
for Myrinet; I've certainly run binaries compiled against Myrinet MPI
on our MPI stack without obvious problems.  Having said that though I
have never attempted to verify binary compatibility and we don't support
such programs but insist they are correctly compiled before support
requests get more than a cursory glance.

Ashley,


From joachim at ccrl-nece.de  Thu Feb 24 07:45:27 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 24 Feb 2005 16:45:27 +0100
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <1109254812.12270.46.camel@localhost.localdomain>
References: <20050222050824.GA2195@greglaptop.attbi.com>	<421C66DB.5080309@ccrl-nece.de>	<20050223200440.GD2227@greglaptop.internal.keyresearch.com>	<421DD6E1.5000200@ccrl-nece.de>
	<1109254812.12270.46.camel@localhost.localdomain>
Message-ID: <421DF697.3070005@ccrl-nece.de>

Ashley Pittman wrote:
> It's not about competence, it's about time and effort spent.  I wouldn't
> need to compile every application myself if there was an ABI.  It's not a
> particularly difficult thing to do, it's just going through the hoops of
> doing it every time you need an application.  The ability to install a
> cluster and type '[apt-get|yum] install pmb' would be a truly wonderful
> thing indeed.

My experience is that such sites want to use their (expensive, 
commercial, Fortran) compiler to optimize the binaries to their 
platform. In some cases, this requires source-code changes (parameters) 
anyway. It's not about PMB.

> You also make the assumption that it's the high-performance vendors who
> do things differently, I don't believe this is the case.  Quadrics for
> example (my employer) happen to use whatever ABI MPICH (1.2.x) provides
[...]

I can't see where I made this assumption. Indeed, most interconnect 
vendors (Quadrics, various Infiniband, Myrinet, ...) happily plug their 
low-level stuff into the latest MPICH and are done. So, it's most often 
the cross-interconnect MPI vendors which create their own ABI for some 
reason. Other cases are vendors (like us) who started to provide MPI-2 
when there was no open-source MPI-2 around. We had to do our own 
definitions then.

But this doesn't really matter after all; what matters is if there are 
enough parties to take part in this effort, and to understand the 
related issues as much as possible and as early as possible.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From ashley at quadrics.com  Thu Feb 24 08:23:47 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Thu, 24 Feb 2005 16:23:47 +0000
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DF697.3070005@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<421C66DB.5080309@ccrl-nece.de>
	<20050223200440.GD2227@greglaptop.internal.keyresearch.com>
	<421DD6E1.5000200@ccrl-nece.de>
	<1109254812.12270.46.camel@localhost.localdomain>
	<421DF697.3070005@ccrl-nece.de>
Message-ID: <1109262227.12272.64.camel@localhost.localdomain>

On Thu, 2005-02-24 at 16:45 +0100, Joachim Worringen wrote:
> My experience is that such sites want to use their (expensive, 
> commercial, Fortran) compiler to optimize the binaries to their 
> platform. In some cases, this requires source-code changes (parameters) 
> anyway. It's not about PMB.

The difference here is between "want" and "need".  If they want to do it
then well done, congratulations, it is often the right thing to do for
performance reasons.  In terms of setup time to get a working cluster
though there is a difference; having things work out of the box is a good
thing (tm).

Of course there is a potential downside that if it works out of the box
then they won't play with compilers (why bother - it works?) and won't
even try to get the extra performance.  This isn't a valid reason not to
have an ABI though.

> > You also make the assumption that it's the high-performance vendors who
> > do things differently, I don't believe this is the case.  Quadrics for
> > example (my employer) happen to use whatever ABI MPICH (1.2.x) provides
> [...]
> 
> I can't see where I made this assumption.

I was referring to this quote from earlier.  I thought you were implying
that code should come pre-linked with TCP/IP MPICH (in effect a de facto
standard) and us "exotic" people would be on our own.  In this context
there is nothing exotic or unusual about our MPI.  There is no
relationship between the performance of an MPI stack and how much its
interface varies.  Sounds like we are on the same wavelength here
though.

>> For open source software packages alone, an ABI is not of critical
>> importance as people with a tcp/ip cluster can use pre-linked
>> packages, and people with a high-performance interconnect cluster
>> typically have enough competence to compile the software themselves.

> Indeed, most interconnect 
> vendors (Quadrics, various Infiniband, Myrinet, ...) happily plug their 
> low-level stuff into the latest MPICH and are done. So, it's most often 
> the cross-interconnect MPI vendors which create their own ABI for some 
> reason. Other cases are vendors (like us) who started to provide MPI-2 
> when there was no open-source MPI-2 around. We had to do our own 
> definitions then.

Hhmm.  Does this mean that the only reasons for not having an ABI are
historical, purely because we have never had one and there isn't the
momentum to change this?  Are there valid technical/performance reasons
for the need to change mpi.h?

> But this doesn't really matter after all; what matters is if there are 
> enough parties to take part in this effort, and to understand the 
> related issues as much as possible and as early as possible.

Plus the fact that someone (everyone?) has to take the hit of breaking
binary compatibility with all their previous MPI releases when they make
the jump to being compliant.  Any volunteers?  This might not actually
be so bad with the shared library number versioning scheme, in fact it
might be possible to avoid it completely at the cost of a bit more
effort in the packaging.

Ashley,


From lindahl at pathscale.com  Thu Feb 24 11:13:26 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 24 Feb 2005 11:13:26 -0800
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DD6E1.5000200@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com>
	<421C66DB.5080309@ccrl-nece.de>
	<20050223200440.GD2227@greglaptop.internal.keyresearch.com>
	<421DD6E1.5000200@ccrl-nece.de>
Message-ID: <20050224191325.GA1903@greglaptop.internal.keyresearch.com>

On Thu, Feb 24, 2005 at 02:30:09PM +0100, Joachim Worringen wrote:

> This problem left apart, do you know of ISV's that would at least be 
> willing to think about giving support to an MPI ABI no matter which 
> implementation and interconnect, and not a specific MPI library? Because 
> this is what matters.

As I wrote in my talk, no. Some ISVs will balk because of the testing
issue.  However, it is much easier to test against new libraries if
there is an ABI, and no recompilation means no worrying about
recompilation bugs.

> For open source software packages alone, an ABI is not of critical 
> importance as people with a tcp/ip cluster can use pre-linked packages, 
> and people with a high-performance interconnect cluster typically have 
> enough competence to compile the software themselves.

An ABI is still useful for these people. Even if you are using
ethernet and TCP/IP, if you have a queue system and have integrated
with LAM, a pre-linked package using MPICH is not so useful.
Critically important? Probably not. Important? Yes.

> I did not think of this, but more of issues like "string as an argument" 
> as the way how the string length is passed is not standardized.

Yes, but on x86 and x86-64, Intel and PGI and g77 and PathScale all
use the same convention.

> Then there are issues with getting access to global variables from
> COMMON blocks

This is not an issue, because the MPI interface does not use COMMON
blocks.

The most annoying issue is command-line args, but I am about to write
(and will give away) an Amazing Universal Fortran Command Line Arg
Fetcher (AUFCLAF).

> This does not mean that we should not continue thinking about an ABI, 
> but there's more than unifying mpi.h to be able to use a single shared 
> library.

I agree. I wasn't claiming to have a final solution for the issues.
The important thing at this stage is not to solve all the issues, but
to see if the benefits of an ABI are compelling enough to form a
committee and work on it.

-- greg


From list-beowulf at onerussian.com  Wed Feb 23 13:11:58 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed, 23 Feb 2005 16:11:58 -0500
Subject: [Beowulf] managing debian packages
In-Reply-To: <Pine.OSF.4.44.0502232034100.1848273-100000@ecs2c.internal.sanger.ac.uk>
References: <20050223054408.GS3124@washoe.rutgers.edu>
	<Pine.OSF.4.44.0502232034100.1848273-100000@ecs2c.internal.sanger.ac.uk>
Message-ID: <20050223211158.GQ3124@washoe.rutgers.edu>

On Wed, Feb 23, 2005 at 08:56:53PM +0000, Guy Coates wrote:
> For the installs, take a look at FAI; it is the automated debian
> installer. It is extremely flexible, so you can hack it about to do pretty
> much anything you want.
yeap - that is what I used to install all the nodes... And actually FAI
has a rather neat idea of classes of machines, with the installed packages
depending on the class. It resembles the cfengine approach, which is why I
actually first tried to use the FAI config file as the source of packages to
be installed, and then wrote a cf.fai config which installed
packages using FAI's functionality.

The problems with that were: I needed to manually type in the packages I want
on a per-class basis, and there were issues with the dpkg installation process
probably not closing all FIDs, so the remote shell never returned, which
was annoying... I've mentioned a simple trick to get around that on
cfengine's Wiki, but I haven't used this way of installing packages for a
long time....

> Once the machines are up, dsh is a fine way of keeping things up to date
> or installing new stuff.  If you don't want to be pestered by
> configuration questions during package installs, you can change that
> behaviour in /etc/debconf.conf to make everything non-interactive.
That is exactly what I'm doing pretty much

> There is also a rather nifty debian package called jablicator. You run it
> on a master system and it creates an empty deb file which depends on every
> deb installed.
Cool - didn't know about this one... I see limited applicability though,
it is pretty much the same as
dpkg --get-selections | rsh remotehost dpkg --set-selections
and it probably doesn't handle conflicts, i.e. which packages to remove if
they are not installed on the system where you run jablicator

My problem is that I'm not really sure what I'm looking for...
The FAI way of configuring installed packages is very appealing to me, but its
implementation is what keeps me away.

Let me characterize: 
I want to have some classes as cfengine does, and depending on which
class a machine belongs to, it gets the necessary packages installed in
non-interactive fashion, using debconf values from the machine on which
the package was installed interactively.

for instance, such a command could look like

apt-getclass lam-nodes install libblablah-dev

it would then run an interactive installation on the first machine from the
lam-nodes netgroup, for instance, clone debconf to the related machines, and
run a noninteractive install there

Simply running 
dsh -g @lam-nodes apt-get install libblablah-dev

will not suffice, because the package may need to be configured

Hm... probably I just need to write a tiny wrapper around apt-get and cfrun
:-)
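
A minimal sketch of what such a wrapper could plan, in Python. It only
builds the command lines and does not run them. Assumptions not in the
text above: ssh is used instead of dsh/cfrun, the debconf answers are
copied with debconf-get-selections (from debconf-utils) piped into
debconf-set-selections, and the host names are placeholders.

```python
import shlex


def plan_class_install(hosts, packages):
    """Return the shell commands, in order, for a class-wide package install.

    The first host is installed interactively; its debconf answers are then
    copied to every other host, which installs non-interactively.
    """
    if not hosts:
        raise ValueError("need at least one host")
    first, rest = shlex.quote(hosts[0]), [shlex.quote(h) for h in hosts[1:]]
    pkgs = " ".join(shlex.quote(p) for p in packages)
    plan = [f"ssh -t {first} apt-get install {pkgs}"]  # answer questions once
    for host in rest:
        # copy the answers given on the first host
        plan.append(f"ssh {first} debconf-get-selections | "
                    f"ssh {host} debconf-set-selections")
        # install without prompting
        plan.append(f"ssh {host} DEBIAN_FRONTEND=noninteractive "
                    f"apt-get -y install {pkgs}")
    return plan


if __name__ == "__main__":
    for command in plan_class_install(["node01", "node02"], ["libblablah-dev"]):
        print(command)
```

Copying the whole debconf database is the crude part of this sketch; a
real wrapper would probably filter the selections down to the packages
being installed.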

-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]



From rmiguel at usmp.edu.pe  Thu Feb 24 06:50:57 2005
From: rmiguel at usmp.edu.pe (rmiguel at usmp.edu.pe)
Date: Thu, 24 Feb 2005 09:50:57 -0500
Subject: [Beowulf] about concept of beowulf clusters
In-Reply-To: <E1D45vL-00024l-00@mendel.bio.caltech.edu>
References: <E1D45vL-00024l-00@mendel.bio.caltech.edu>
Message-ID: <1109256657.421de9d113bb3@mail.usmp.edu.pe>

Hi, I have a question about the strict definition of a Beowulf cluster. Is it a
cluster built with commodity hardware only? What about when I build a cluster
using tools such as OSCAR or ROCKS on servers, or using some kind of high-speed
network?
If I have two Alpha servers with Linux and open source software connected by a
high-speed network, is this a Beowulf cluster?

Thanks for your answers ..


-- 
--------------------------------------------------------
    ~       |
   . .      |  "In a world without fences ... who needs
  (0 0 )    |   GATES?"
  / \/ \    |
 //    \\   |
/(( _  ))\  |
 oo0  0oo   |
----------------------------------------------------------

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


From apseyed at bu.edu  Thu Feb 24 12:44:30 2005
From: apseyed at bu.edu (Patrice Seyed)
Date: Thu, 24 Feb 2005 15:44:30 -0500
Subject: [Beowulf] about concept of beowulf clusters
In-Reply-To: <1109256657.421de9d113bb3@mail.usmp.edu.pe>
References: <E1D45vL-00024l-00@mendel.bio.caltech.edu>
	<1109256657.421de9d113bb3@mail.usmp.edu.pe>
Message-ID: <1109277874.348029A3@bd8.dngr.org>

A beowulf cluster is in itself a loose definition, but it is basically 
linux machines that perform some kind of HPC (high performance
computing) tasks. The hardware could be commodity or not. Also, there is no set of 
software that makes it a beowulf, but there is software that makes the machines 
useful, like mpi libraries such as lam for parallel computing, or pbs/maui 
and sge as scheduler/resource manager for batch/interactive/parallel 
computing in the cluster.  Rocks and Oscar are simply toolkits that make 
installing and managing the systems easier, e.g. for installing nodes, 
software management, and parallel commands. Whether you're using low-latency 
proprietary switches or gigE, as long as the tasks are related to 
hpc I think it falls within the "beowulf" definition.
-Patrice

  On Thu, 24 Feb 2005 3:27 pm, rmiguel at usmp.edu.pe wrote:
> Hi, i have a doubt about the strict concept of Beowulf cluster. Is a 
> cluster
> build with comodity hardware only?.. what's up when i build a cluster 
> using
> some tools as OSCAR, or ROCKS, etc on servers or using some kind of 
> high speed
> networks?.
> If I have two Alpha servers with Linux and open source software 
> conected by a
> high speed network.. is this a beowulf cluster?.
>
> Thanks for your answers ..


From alex at DSRLab.com  Thu Feb 24 12:48:13 2005
From: alex at DSRLab.com (Alex Vrenios)
Date: Thu, 24 Feb 2005 13:48:13 -0700
Subject: FW: [Beowulf] about concept of beowulf clusters
Message-ID: <200502242040.j1OKeQ1R023076@bluewest.scyld.com>

> -----Original Message-----
> From: beowulf-bounces at beowulf.org 
> [mailto:beowulf-bounces at beowulf.org] On Behalf Of rmiguel at usmp.edu.pe
> Sent: Thursday, February 24, 2005 7:51 AM
> To: beowulf at beowulf.org
> Subject: [Beowulf] about concept of beowulf clusters
> 
> Hi, i have a doubt about the strict concept of Beowulf 
> cluster. Is a cluster build with comodity hardware only?.. 
> what's up when i build a cluster using some tools as OSCAR, 
> or ROCKS, etc on servers or using some kind of high speed networks?.
> If I have two Alpha servers with Linux and open source 
> software conected by a high speed network.. is this a beowulf 
> cluster?.
> 
> Thanks for your answers ..
> 
A mid-1990s paper (I'll dig it out if necessary) described differences
between Beowulf and Beowulf II systems. The initial thrust had high ideals
and essentially put Linux and Beowulf clusters in the limelight. I didn't
follow the argument very well, but my "feeling" from that distinction was
that the reality of a comparatively slow external data network caused them to
relax the restrictions when it came to networking hardware and software.
Keeping these things in balance is hard to do when the high-end store parts
are at Gigabit Ethernet and 3+ GHz processors.

Early Beowulf clusters, in my opinion, were out to see what could be done
with off-the-shelf components, and it was an inspiration to many of us
without the US Treasury behind our projects.

Alex

P.S. Have a look through Issue #100 of Linux Journal for further info. It
had follow-ups to earlier introductory articles on Beowulfs, and that
earlier issue number is referenced there.


From jrollins at ligo.mit.edu  Thu Feb 24 15:20:21 2005
From: jrollins at ligo.mit.edu (Jamie Rollins)
Date: Thu, 24 Feb 2005 18:20:21 -0500 (EST)
Subject: [Beowulf] motherboards for diskless nodes
Message-ID: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>

Hello.  I am new to this list, and to beowulfery in general.  I am working
at a physics lab and we have decided to put together a relatively small
beowulf cluster for doing data analysis.  I was wondering if people on
this list could answer a couple of my newbie questions.

The basic idea of the system is that it would be a collection of 16 to 32
off-the-shelf motherboards, all booting off the network and operating
completely disklessly.  We're looking at amd64 architecture running
Debian, although we're flexible (at least with the architecture ;).  Most
of my questions have to do with diskless operation.

How netboot-capable are modern motherboards with on-board nics?  I
have experience with a couple that support PXE.  However, I have
been having a hard time finding information on-line stating explicitly that
a given motherboard and/or bios supports netbooting.  The only thing I've
been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0.  I
get the impression that most MB's that have gigabit probably support PXE
booting, but I was curious what others' impressions are.

Something else that we're looking for that I believe is far more esoteric
and has been equally hard to find information about is BIOS serial console
redirect, ie. being able to control the bios from the serial port.  I've
been getting more and more into accessing machines through the serial
port.  The only thing holding me back from throwing out the video and
keyboard entirely is being able to access and control the BIOS through the
serial port as well.  This would also eliminate the need for on-board
video, which can only be good.  This question is obviously related to the
one in the previous paragraph about netboot, since they are both about
features of the MB and the BIOS.  The Tyan K8SR and the like with the AMI
BIOS 8.0 also claim to support this feature.

If anyone has any suggestions for specific MBs that would fit these three
requirements (netboot, serial BIOS redirect, Debian), or at least some
ideas about where to look, that would be a huge help.

A related question is whether anyone has any experience with LinuxBios and
if so, on what hardware.  This might be too big of an issue to bring up,
but I would love to hear people's experiences using LinuxBios to boot
diskless cluster nodes over the network.

Thanks so much for the help, and I really look forward to becoming part of
this community.

Jamie Rollins.


From James.P.Lux at jpl.nasa.gov  Thu Feb 24 16:19:30 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 24 Feb 2005 16:19:30 -0800
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
Message-ID: <6.1.1.1.2.20050224161442.07749138@mail.jpl.nasa.gov>

At 03:20 PM 2/24/2005, Jamie Rollins wrote:
>Hello.  I am new to this list, and to beowulfery in general.  I am working
>at a physics lab and we have decided to put together a relatively small
>beowulf cluster for doing data analysis.  I was wondering if people on
>this list could answer a couple of my newbie questions.
>
>The basic idea of the system is that it would be a collection of 16 to 32
>off-the-shelf motherboards, all booting off the network and operating
>completely disklessly.  We're looking at amd64 architecture running
>Debian, although we're flexible (at least with the architecture ;).  Most
>of my questions have to do with diskless operation.
>
>How netboot-capable are modern motherboards with on-board nics?

Very capable.  Easier to do netboot than almost anything else.

>I
>have experience with a couple that support PXE.  However, I have
>been having a hard time finding information on-line stating explicitly that
>a given motherboard and/or bios supports netbooting.  The only thing I've
>been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0.  I
>get the impression that most MB's that have gigabit probably support PXE
>booting, but I was curious what other's impressions are.

One way to check is to look at the BIOS manual for your mobo (they're 
typically online) and see if they mention a "boot from network" option.

As a practical matter, I think almost ALL mobos these days can do network 
boot. Now, if someone could answer whether they can netboot over an 802.11 
card, I'd be real interested. (the question is whether the bios has enough 
smarts to bring up the wireless interface)


>Something else that we're looking for that I believe is far more esoteric
>and has been equally hard to find information about is BIOS serial console
>redirect, ie. being able to control the bios from the serial port.  I've
>been getting more and more into accessing machines through the serial
>port.

That's probably a bit dicey. While netboot is an "essential" feature for 
modern business environments (which drives mobo development, to a large 
degree), serial access is not. In fact, a lot of "legacy free" mobos have 
NO serial port.

However, for mobos aimed at the "server" application, remote management is 
a big deal, so serial port management might be more likely.


>A related question is whether anyone has any experience with LinuxBios and
>if so, on what hardware.  This might be too big of an issue to bring up,
>but I would love to hear people experiences using LinuxBios to boot
>diskless cluster nodes over the network.

Google for it, and you'll find several examples of LinuxBIOS and clusters.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875


From becker at scyld.com  Thu Feb 24 16:41:22 2005
From: becker at scyld.com (Donald Becker)
Date: Thu, 24 Feb 2005 19:41:22 -0500 (EST)
Subject: [Beowulf] about concept of beowulf clusters
In-Reply-To: <1109256657.421de9d113bb3@mail.usmp.edu.pe>
References: <E1D45vL-00024l-00@mendel.bio.caltech.edu>
	<1109256657.421de9d113bb3@mail.usmp.edu.pe>
Message-ID: <Pine.LNX.4.62.0502241849030.17823@localhost.localdomain>

On Thu, 24 Feb 2005 rmiguel at usmp.edu.pe wrote:

> Hi, I have a question about the strict concept of a Beowulf cluster. Is it a
> cluster built with commodity hardware only? What happens when I build a
> cluster using tools such as OSCAR or ROCKS on servers, or using some kind of
> high-speed network?
> If I have two Alpha servers with Linux and open source software connected by a
> high-speed network, is this a Beowulf cluster?

My definition of a cluster
   independent machines
   combined into a unified system
   through software and networking

The Beowulf definition is
   commodity machines
   connected by a private cluster network
   running an open source software infrastructure
   for scalable performance computing

Traditionally the term "Beowulf Cluster" has included non-PC
architectures such as the Alpha and somewhat specialized networks such
as Myrinet, but excluded the purpose-built tightly coupled machines such
as the Cray T3E and Digital SC.

We can go back to the "cluster" definition.  We are starting with general
purpose machines capable of independent operation, generally those with
a broad market appeal.  The goal is to make them appear to be a single
machine.  We start by networking them together, then we add a software
layer to smooth over the ugliness caused because we couldn't custom
design the hardware.

To distinguish independent machines from the aggregate machine we call the
former "nodes" and the latter the "cluster".

The Beowulf definition sets a category by excluding other important
classes:
    commodity machines
      We are excluding custom built hardware e.g. a single Altix is not a
      Beowulf cluster (or even a cluster by the strict definition)
    connected by a cluster network
      These machines are dedicated to being a cluster, at least
      temporarily.  This excludes cycle scavenging from NOWs and wide
      area grids.
    running an open source infrastructure
      The core elements of the system are open source and verifiable
    for scalable performance computing
      The goal is to scale up performance over many dimensions, rather
      than simulate a single more reliable machine e.g. fail-over.
      Ideally a cluster incrementally scales both up and down, rather
      than being a fixed size.

The original challenges for building clusters were very basic:
    can we build them at all?
    how can we get the nodes to communicate?
    do they do anything useful?
In the early days the answers were
    you have to build them yourself
    writing and improving the basic networking
    for a few applications you can use basic message passing

There were many intermediate steps, but those problems were solved a
half decade ago
    You can buy stock cluster configurations from many vendors
    Good OS networking and libraries such as MPI are established
    Most HPTC applications run well on small scale clusters
The real challenges were obvious
    Can we remove compute density as an obstacle to adoption?
    The nodes can talk to each other; now how do we provision and manage
      clusters that scale in production deployments?
    How can we support essentially all applications, and solve the
      programming problem?

Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993


From dld at cmb.usc.edu  Thu Feb 24 17:00:53 2005
From: dld at cmb.usc.edu (Drake Diedrich)
Date: Thu, 24 Feb 2005 17:00:53 -0800
Subject: [Beowulf] Re: motherboards for diskless nodes
In-Reply-To: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
Message-ID: <20050225010053.GA31456@app1.cmb.usc.edu>

On Thu, Feb 24, 2005 at 06:20:21PM -0500, Jamie Rollins wrote:
> 
> How netboot-capable are modern motherboards with on-board nics?  I
> have experience with a couple that support PXE.  However, I have
> been having a hard time finding information on-line stating explicitly that
> a given motherboard and/or bios supports netbooting.  The only thing I've
> been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0.  I
> get the impression that most MB's that have gigabit probably support PXE
> booting, but I was curious what other's impressions are.

   You probably want to buy one in advance to test how reliable it is when
PXE booting.  We have a 64-node cluster with local disks that have no CDROMs
or floppies, and we do maintenance and installs by net booting.  It isn't
reliable.  We have to reboot several times to get the things to hear the
PXE/DHCP replies and boot the pxelinux.0 image when attempting to reinstall
a node.  The motherboard is the MSI K8D Master with 2x Broadcom tg3 gigE. 
Many computers do netboot reliably in our environment (including laptops,
older P-III Tyan boards, new Xeon Supermicro/e1000 boards, etc).  Not all
motherboard PXE/DHCP boot implementations are equal and up to the task for
completely diskless use.  If you switch to a slightly newer motherboard on
deployment, all bets are off again (yes, made that mistake once, but had a
friendly supplier who let us exchange parts until it all worked).

   If you deal with temporary files, want to suspend and swap out large
low-priority jobs, etc, you probably want a local disk on each node anyway.
Spending a couple gigs of that for a locally installed O/S isn't much of a
drama, especially on ~16 nodes.  It makes updates more reliable, as
libraries/binaries that are in use remain on local disk even when dpkg
replaces them, and only get deleted when no longer in use.  NFS (being
stateless) doesn't have this behavior, so after an update you may
occasionally have jobs/daemons fail when they try to page in a file that has
already been replaced.
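
The local-disk behaviour described above can be checked with a short script.
This is only a minimal sketch in Python, assuming a local POSIX filesystem
(for example ext3 on Linux). The file names are made up for illustration, and
os.replace() stands in for the rename step that a package manager such as
dpkg performs. The same guarantee does not hold when the file is replaced on
an NFS server while another client still has it open.

```python
import os
import tempfile


def read_after_replace(directory):
    """Open a file, replace it on disk, then read through the old handle."""
    lib_path = os.path.join(directory, "libexample.so")      # hypothetical name
    new_path = os.path.join(directory, "libexample.so.new")  # hypothetical name

    with open(lib_path, "w") as f:
        f.write("version 1")

    # A running job holds the old file open (as a mapped library would be).
    in_use = open(lib_path)

    # The update writes the new version and renames it over the old path.
    with open(new_path, "w") as f:
        f.write("version 2")
    os.replace(new_path, lib_path)

    # On a local POSIX filesystem the old inode survives until the last
    # open handle is closed, so the running job still sees the old bytes.
    old_contents = in_use.read()
    in_use.close()

    # Anything opened after the update sees the new version.
    with open(lib_path) as f:
        new_contents = f.read()
    return old_contents, new_contents


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(read_after_replace(d))
```

On a local Linux filesystem the old handle should keep returning the old
contents while new opens see the replacement, which is the property a
diskless NFS root gives up.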
   If you don't have a central fileserver yet, you can also spread your
users' home directories among the disks on the nodes to avoid NFS contention
(though this means no RAID unless you buy two disks per node).  If one user
launches some sort of cluster-wide NFS bomb, they only take out themselves
and the job running on the node with their home directory.  Users do this -
they launch a stack of simultaneous jobs that all load lots of data off the
filesystem, flattening whichever fileserver has their home directory. 
Building a single high end fileserver that can survive the same load without
severely impacting all other users is expensive and tough.

> 
> Something else that we're looking for that I believe is far more esoteric
> and has been equally hard to find information about is BIOS serial console
> redirect, ie. being able to control the bios from the serial port.  I've
> been getting more and more into accessing machines through the serial
> port.  The only thing holding me back from throwing out the video and
> keyboard entirely is being able to access and control the BIOS through the
> serial port as well.

   We're using an Appro blade rack with devices that bring the critical
ports (serial, power, and reset) back to the "blade control center" (BCC). 
It does more than just a serial console, and has a much smaller cable
footprint.  To be honest though, I never use it.  We power down
automatically using ACPI when the A/C fails (far too often), rather than
attempt to coordinate the BCC pulling the power plug after a node shuts
down.  For console/BIOS, I prefer to just use a long monitor/keyboard cable
and plug it directly into the node with the problems.  A 64-way KVM switch
with 64 cables would block airflow and access to the nodes, as well as being
expensive.  The long cable also works for all the other non-BCC nodes in the
adjacent racks (long cable that can reach the whole row of racks).  There
aren't really that many motherboards out there without onboard video, and
anything works for a text console.

   Don't be afraid to be a Luddite if it works.  :)  You can spend the money
saved on serial cables and KVM switching hardware on noise cancelling
headphones.

-Drake


From eno at dorsai.org  Thu Feb 24 18:39:31 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Thu, 24 Feb 2005 21:39:31 -0500
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
Message-ID: <0ICG009WS4JWON@mta6.srv.hcvlny.cv.net>

Jamie, I went with Pentium4 with my recent purchases and I assumed any new
motherboards would be able to boot off the network without issue. I was
wrong - bigtime. I was buying for small footprint & low cost and I went
through about 5 motherboards before I found one that hit the server properly
every time (the time wasted in finding hardware really hurt). I am using the
Gigabyte 81845gvm now (but beware of BIOS revs 4, 5, or 6; they break PXE,
only rev 3 works every time). I did not test EVERY BIOS version with all the
mobo's I tested but learned the same chipset and bios on different mobo's
are not all created equal.

Do your research, then buy one for evaluation from a vendor who will take
the hardware back. That's the only way to know for sure that your entire
setup will work. Then come back here and report your findings :)

PS: PXE is not the only embedded method of bootnic. I forget the other
method but it's often paired with onboard realtek8139's, an old intel
standard (?) which is useless these days but still found on some new
motherboards. Maybe someone else here knows what I'm referring to.

Pps: I'm a Beowulf newb and I am using XP Embedded in my setup.

Alpay


-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
Behalf Of Jamie Rollins
Sent: Thursday, February 24, 2005 6:20 PM
To: beowulf at beowulf.org
Cc: debian-beowulf at lists.debian.org
Subject: [Beowulf] motherboards for diskless nodes

How netboot-capable are modern motherboards with on-board nics?  I
have experience with a couple that support PXE.  However, I have
been having a hard time finding information on-line stating explicitly that
a given motherboard and/or bios supports netbooting.  The only thing I've
been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0.  I
get the impression that most MB's that have gigabit probably support PXE
booting, but I was curious what other's impressions are.


From becker at scyld.com  Thu Feb 24 19:03:59 2005
From: becker at scyld.com (Donald Becker)
Date: Thu, 24 Feb 2005 22:03:59 -0500 (EST)
Subject: [Beowulf] Re: motherboards for diskless nodesy
In-Reply-To: <20050225010053.GA31456@app1.cmb.usc.edu>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<20050225010053.GA31456@app1.cmb.usc.edu>
Message-ID: <Pine.LNX.4.62.0502242038180.18632@localhost.localdomain>

On Thu, 24 Feb 2005, Drake Diedrich wrote:
> On Thu, Feb 24, 2005 at 06:20:21PM -0500, Jamie Rollins wrote:
>>
>> How netboot-capable are modern motherboards with on-board nics?  I
>> have experience with a couple that support PXE.

PXE is the only standard way to netboot a PC.  It has swept away the
few other proprietary approaches (e.g. RPL).

I'm not a fan of the PXE protocol and specification details.  It's ugly.
They picked bad semantics.  They picked bad protocols.  They picked
exceptionally bad parameters.

I'm a huge fan of PXE as a standard.
It didn't need to be right.  It just needed to be good enough that we can
make it work reliably.  PXE, and all of its ugliness, is gone in two
seconds.

And with 50+ million installations, it's everywhere.
That Is Good.

>> However, I have
>> been having a hard time finding information on-line stating explicitly that
>> a given motherboard and/or bios supports netbooting.

Virtually every current motherboard with Ethernet supports PXE booting.

A few years ago, when Gigabit Ethernet was new, some motherboards had
both Fast and Gb Ethernet just because there was no PXE and WOL support
for the GbE chips.

>   You probably want to buy one in advance to test how reliable it is when
> PXE booting.  We have a 64-node cluster with local disks that have no CDROMs
> or floppies, and we do maintenance and installs by net booting.  It isn't
> reliable.  We have to reboot several times to get the things to hear the
> PXE/DHCP replies and boot the pxelinux.0 image when attempting to reinstall
> a node.

The specific problem here is very likely the PXE server implementation,
not the client side.

I'm guessing that you are using the ISC DHCP server combined with a
stand-alone TFTP server.  This can't provide true PXE service, it cannot
work around more than a single version of PXE bugs, and it has
significant "scalability challenges" when many machines are
simultaneously booting.  Almost every BIOS uses the Intel PXE client
code unchanged, and it accepts DHCP responses with static PXE
information.

We ended up writing our own integrated PXE server to reliably boot
compute nodes.  A purpose-built PXE server can
   - interpret the initial request to work around different generations of
     PXE client bugs.  The BIOS code is unlikely to be fixed, and there
     are some pretty ugly bugs.  (What does a file name of '' mean?  Use
     the last file requested...)
   - work around the TFTP capture effect, where clients that drop a
     packet are squeezed out and quickly give up, leaving the machine
     powered on but useless.
   - defer answering new requests when especially busy, but always
     respond before the client times out.

Just as importantly, our PXE server uses and updates the single cluster
configuration file.  Before writing the server we went through several
rounds of writing configuration files from other configuration files,
and each time we ended up with a fragile implementation that was
difficult to debug.


> older P-III Tyan boards, new Xeon Supermicro/e1000 boards, etc).  Not all
> motherboard PXE/DHCP boot implementations are equal and up to the task for
> completely diskless use.  If you switch to a slightly newer motherboard on
> deployment, all bets are off again (yes, made that mistake once, but had a
> friendly supplier who let us exchange parts until it all worked).

Yup, they are using the Intel code, but with different bugs.  Read notes
on the web about using the ISC DHCP server: "You can work around 
bug #1 with canned response #1, which is incompatible with the response
to fix bug #2."

>   If you deal with temporary files, want to suspend and swap out large
> low-priority jobs, etc, you probably want a local disk on each node anyway.

I completely agree.  Local disk is the best I/O bandwidth for the buck.

> Spending a couple gigs of that for a locally installed O/S isn't much of a
> drama, especially on ~16 nodes.

But it's the long-term administrative effort that costs, not the disk
hardware.  The need to maintain and update a persistent local O/S is the
root of most of that cost.

> It makes updates more reliable, as libraries/binaries that are
> in use remain on local disk even when dpkg replaces them, and only get
> deleted when no longer in use.  NFS (being stateless) doesn't have
> this behavior, so after an update you may occasionally have
> jobs/daemons fail when they try to page in a file that has already been
> replaced.

A cluster has a richer versioning environment than a local machine.
Simultaneously using different versions of a long-running application
is something you have to consider when running a cluster system.  And as
you point out, NFS sometimes does do the Right Thing.

But a persistent local install isn't the only way to accomplish this.
We put a specialized whole-file-caching filesystem underneath our
system.  It's only used for libraries and executables, and it tracks them
by version not just path name.  Since it retains the whole file, we
don't encounter the unrecoverable problem of a page-in failure.  And
since the node only fully accepts a process after caching the required
files we avoid other failure points.
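
To make the idea concrete, here is a toy sketch in Python of a cache that
keeps whole files and tracks them by version as well as path.  It is not the
filesystem described above: the class name, the method names and the use of a
SHA-256 digest as the version key are all assumptions made for illustration.

```python
import hashlib


class WholeFileCache:
    """Toy whole-file cache keyed by (path, content digest).

    Hypothetical illustration only; the digest stands in for whatever
    version identifier a real system would use.
    """

    def __init__(self):
        self._store = {}

    def put(self, path, data):
        """Cache the complete file contents and return its version key."""
        version = hashlib.sha256(data).hexdigest()
        self._store[(path, version)] = data
        return version

    def get(self, path, version):
        """Return the exact bytes for one version of a path."""
        return self._store[(path, version)]

    def has_all(self, requirements):
        """True if every required (path, version) pair is fully cached.

        Models the rule that a node accepts a process only after the
        files it needs are present.
        """
        return all(key in self._store for key in requirements)
```

Because two builds of the same path get different keys, a long-running job
can keep reading the old build after a newer one has been cached, and there
is no partially fetched file to fail on at page-in time.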

> If you don't have a central fileserver yet, you can also spread your
> users' home directories among the disks on the nodes to avoid NFS contention
> (though this means no RAID unless you buy two disks per node).

NFS isn't bad.  Nor does it necessarily doom a server to unbearable
loads.  For some types of file access, especially read-only access to
small (<8KB) configuration files such as ~/.foo.conf, it's pretty close
to optimal.

What you don't want to use it for is
   - paging in programs and libraries.
       Especially not in a big cluster with big applications.
   - writing files that are used for synchronization
       NFS uses semi-synchronous writes, which kills performance, but
       unpredictable time-based cache flushing, which kills consistency

>> Something else that we're looking for that I believe is far more esoteric
>> and has been equally hard to find information about is BIOS serial console
>> redirect, ie. being able to control the bios from the serial port.

Serial consoles are pretty common today, albeit not well documented.
But using them is a hardware problem.  You double your connection count,
with non-standard cables, large connectors and expensive serial port
concentrators.  Compare that to Ethernet, with standard cables, tiny but
robust connectors, link beat lights, cheap switches and simple
configuration rules.

A much better solution is using a software system that has reliable
booting.  That means
   - Unchanging boot firmware on the node
   - Minimal hardware configuration before contacting boot server
   - Configuration reporting as part of the boot negotiation
   - No boot dependence on node configuration e.g. file system contents
   - Complete replacement of the boot software by the boot server
   - Immediate status logging over the network
   - Non-boot drivers and configuration controlled by the boot server

These are simple principles, but almost every system out there misses at
least one.

> down.  For console/BIOS, I prefer to just use a long monitor/keyboard cable
> and plug it directly into the node with the problems.

The "crash cart" approach.  I believe in it.  You should only need to
use the console when you have a hardware problem, and you'll need to be 
right by the machine anyway.

Footnote b1tch: Yup, PXE is ugly and stupid.  Most of it is hidden, but one
observable stupidity is how it tries to locate a server.  The client
tries for 1+2+4+16+32 seconds to locate a PXE server.  A switch with
spanning tree protocol enabled doesn't pass traffic for 60 seconds to
avoid network loops.  It should try for a slightly longer or much
shorter period.  And the exponential backoff is pointless.  Apparently
someone thought that it would be clever to use an Ethernet-like
backoff, imagining it would avoid congestion.  But it would take
thousands of machines to saturate even 10Mbps Ethernet.  It's common for
the first packet or two to be dropped as the network link stabilizes,
leading to a two or four second delay.
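
The timing arithmetic can be worked through in a few lines of Python.  This
is a sketch using only the figures quoted above (retry intervals of 1, 2, 4,
16 and 32 seconds and a 60 second spanning-tree hold).  It assumes the first
attempt goes out at t=0 and that each interval is the wait that follows an
attempt; the function names are made up for the example.

```python
from itertools import accumulate

# Retry intervals in seconds, exactly as quoted in the text above.
PXE_RETRY_INTERVALS = [1, 2, 4, 16, 32]
# Seconds a spanning-tree port holds traffic, per the figure in the text.
STP_BLOCK_SECONDS = 60


def send_times(intervals):
    """Start time of each attempt, assuming the first goes out at t=0
    and each interval is the wait that follows an attempt."""
    return [0] + list(accumulate(intervals))[:-1]


def give_up_time(intervals):
    """Total seconds the client spends before abandoning the search."""
    return sum(intervals)


def attempts_lost(intervals, blocked_for):
    """Number of attempts sent while the switch port is still blocking."""
    return sum(1 for t in send_times(intervals) if t < blocked_for)


if __name__ == "__main__":
    print("attempts sent at:", send_times(PXE_RETRY_INTERVALS))
    print("client gives up at:", give_up_time(PXE_RETRY_INTERVALS))
    print("attempts lost:", attempts_lost(PXE_RETRY_INTERVALS, STP_BLOCK_SECONDS))
```

Under these assumptions the client gives up at 55 seconds, so every attempt
falls inside the 60 second window and none reaches the server.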

Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993


From john.hearns at streamline-computing.com  Fri Feb 25 00:16:13 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Fri, 25 Feb 2005 08:16:13 +0000
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
Message-ID: <1109319374.6055.17.camel@Vigor45>

On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote:
> Hello.  I am new to this list, and to beowulfery in general.  I am working
> at a physics lab and we have decided to put together a relatively small
> beowulf cluster for doing data analysis.  I was wondering if people on
> this list could answer a couple of my newbie questions.
> 
> The basic idea of the system is that it would be a collection of 16 to 32
> off-the-shelf motherboards, all booting off the network and operating
> completely disklessly.  We're looking at amd64 architecture running
> Debian, although we're flexible (at least with the architecture ;).  Most
> of my questions have to do with diskless operation.

Jamie, 
  why are you going diskless?
IDE hard drives cost very little, and you can still do your network
install.
Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.


BTW, have a look at Clusterworld http://www.clusterworld.com
They have a project for a low-cost cluster which is similar to your
thoughts.


Also, with the caveat that I work for a clustering company,
why not look at a small turnkey cluster?
I fully acknowledge that building a small cluster from scratch will be
a good learning exercise, and you can get to grips with the motherboard,
PXE etc. 
However if you are spending a research grant, I'd argue that it would be
cost effective to buy a system with support from any one of the
companies that do this.
If you get a prebuilt cluster, the company will have done the research
on PXE booting, chosen gigabit interfaces and switches which perform
well, chosen components which will last. And when your power supplies
fail, or a disk fails someone will come round to replace them.
And you can get on with doing your science.


From starship at mtaonline.net  Fri Feb 25 00:15:30 2005
From: starship at mtaonline.net (Starship Warrior)
Date: Thu, 24 Feb 2005 23:15:30 -0900
Subject: [Beowulf] where can i learn to build a cluster machine?
Message-ID: <421EDEA2.30206@mtaonline.net>

I am totally new to clusters but have been a list member for some time 
and read all the emails trying to learn more - so can anyone tell me 
where there is a good how to guide - I have three machines that I would 
like to run Linux on and cluster together just to learn more - two are 3.0 
Pentiums and the last one maybe a 3.5 or 6, not sure yet. The first two 
have the same ASUS MB and will use SCSI drives; the third will also 
have an ASUS MB, I'm just not sure yet which one I will get

thanks for any and all information

Starship Warrior
Cluster user wantabe LOL


From eno at dorsai.org  Fri Feb 25 00:37:33 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Fri, 25 Feb 2005 03:37:33 -0500
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <0ICG009WS4JWON@mta6.srv.hcvlny.cv.net>
Message-ID: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>

Donald Becker just mentioned "RPL", I do believe that is what I was
referring to in my last response.

>PS: PXE is not the only embedded method of bootnic. I forget the other
method but it's often paired with onboard realtek8139's, an old intel
standard (?) which is useless these days but still found on some new
motherboards. Maybe someone else here knows what I'm referring too.


From gotero at linuxprophet.com  Fri Feb 25 00:23:31 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Fri, 25 Feb 2005 00:23:31 -0800
Subject: [Beowulf] O'Reilly Clusters Book Review
Message-ID: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>

My review of O'Reilly's latest clusters book published at HPCwire 
(http://www.tgc.com/hpcwire.html):

> 'Crazy Talk' Clutters New Cluster Book
>  Glen Otero, Linux Prophet

>   When my colleagues and I heard that O'Reilly was releasing another
>   cluster book ("High Performance Linux Clusters with OSCAR, Rocks,
>   openMosix & MPI"), we knew it would not turn out well. One of my
>   colleagues even said, "It's going to be written by some guy that
>   doesn't know anything and [gets all excited] over clusters."
>
>   Why such a pessimistic prediction?
>
>   For one, it was uttered by the same cluster expert that O'Reilly
>   ignored while producing their first cluster book debacle several years
>   ago. When told that their first book ("Building Linux Clusters" by
>   David Spector) should be scrapped and rewritten, O'Reilly ignored
>   their reviewers. The advice only came from the knowledgeable folks at
>   VA Linux, *the* cluster company at that time. But what does VA Linux
>   know? It's O'Reilly, they obviously know better.
>   The first O'Reilly cluster book was a complete disaster. I wrote a
>   scathing review of it for Linux Journal in 2000. Completely void of
>   anything useful, the book and included software were simply not
>   finished. It was like reading a rough draft. Totally embarrassed, and
>   suddenly void of hubris, O'Reilly apologized to its audience and
>   pulled the book from print.
>   Not satisfied to sit around pointing fingers and complaining, I told
>   O'Reilly I would help them with their next cluster book attempt, if
>   there even was one. Before long, I signed a contract to write a
>   clusters book for O'Reilly. But in their infinite wisdom, they didn't
>   like the first few chapters that I submitted. Although I had gotten
>   other cluster experts to review what I had written, O'Reilly didn't
>   bother to get any experts to review what I was writing. They just
>   didn't like it, so they dismissed it out of hand. Needless to say the
>   "we know better" attitude was back, and that ended the contract.
>
>   Which brings us to present day. This latest cluster book suffers from
>   the same brain damaged, hubris-driven process at O'Reilly. Just like
>   the first book, it's written by a virtual unknown in the cluster
>   community (Joseph D. Sloan) and comes across as having been written in
>   a vacuum.
>
>   Let's start with the book's title, "High Performance Linux Clusters
>   with OSCAR, Rocks, openMosix & MPI." There's nothing high-performance
>   about this book because there's no discussion of using any high
>   performance networks like Myrinet, Infiniband, or Quadrics outside of
>   four paragraphs on page 40. There are so many ill-informed sweeping
>   generalizations made about cluster networks on that page that I threw
>   the book against the wall when I read them. For example, Quadrics and
>   Infiniband are clearly established networking technologies, not merely
>   "emerging," as the author believes. Sloan obviously hasn't attended a
>   Supercomputing conference in the last several years. Unfortunately,
>   the rest of the book is rife with several inaccurate cluster
>   oversimplifications and incorrect definitions of terms like single
>   system image (SSI) and virtual machine interface (VMI). The
>   "beginner's guide" design of the book is no excuse for inaccuracies
>   and oversimplifications.
>
>   In my eyes, this book was doomed for the trash after page 8. Sloan
>   states that the term "Beowulf" is a politically charged term that
>   would be avoided in the book. That is the most ridiculous thing I
>   have ever heard. It's impossible to take that comment seriously,
>   especially since the author doesn't even take the time to properly
>   define a Beowulf. For these reasons alone, I can't take this book
>   seriously. I've thrown back my share of adult beverages with Don
>   Becker, and trust me when I say that the political nature of Beowulf
>   has never come up. Adding to the confusion, the phrase "more
>   traditional Beowulf-style cluster" is then used on page 63. I hope now
>   you'll understand why I think this book is schizophrenic at best.
>
>   Defining a Beowulf shouldn't have been too difficult for Sloan. He
>   could have used a term that he introduced on page 10, "asymmetric
>   cluster." But I guess it's too much to ask that the Beowulf project,
>   Tom Sterling and Don Becker's brainchild that started the high
>   performance cluster phenomenon, be properly described and defined in a
>   clusters book. By the way, I've never heard the term "asymmetric
>   architecture" used when describing clusters. And, outside this book,
>   you won't either.
>
>   After page 8, it's apparent that the author has nothing original to
>   offer and is going to regurgitate what has already been written about
>   clusters. There is absolutely no value in this because the online
>   documentation for all of the cluster projects covered by the author is
>   far more informative than what is included in the book. For example,
>   while screenshots of a cluster install are included in the online
>   Rocks documentation, they are omitted in the book. Furthermore, after
>   regurgitating much of the online Rocks documentation, the author
>   doesn't offer any additional helpful hints or troubleshooting advice.
>   As someone who runs a company that provides and supports cluster
>   software based on Rocks, I can tell you that there are plenty of
>   pitfalls that should have been mentioned.
>
>   This underscores my major complaint with this book. There's nothing
>   new, nothing novel and no real help offered. Everything is just laid
>   out superficially in front of the reader for them to make the right
>   cluster decision. The book should guide the cluster decision-making
>   process, but it only offers a bunch of questions -- with no
>   substantial answers.
>
>   Sloan even admits on page 91 that there is a very detailed set of
>   installation instructions for OSCAR, including screen shots, available
>   online. So why is this book necessary again? Oh yeah, the author is
>   supposed to help the reader decide if OSCAR, or any cluster toolkit
>   for that matter, is right for the reader. Unfortunately, no help of
>   any kind is offered.
>
>   The typos and omissions weren't rampant this time, but the errors I
>   found on pages 76, 123, 127, 130, and 136 provided nasty flashbacks of
>   the first O'Reilly book. Good thing I resigned myself to do a shot of
>   tequila after every typo I found. It dulled the pain this book
>   inflicted.
>
>   OK. "Part I -- An Introduction to Clusters" is just inaccurate and
>   infuriating. "Part II -- Getting Started Quickly" contains recycled
>   and reformatted content easily found for free online. "Part III --
>   Building Custom Clusters" isn't really about building custom clusters,
>   but looks more closely at some software that was glossed over in Parts
>   I & II. While I don't agree with the inclusion of the parallel virtual
>   file system (PVFS) and the omission of Sun Grid Engine in Part III,
>   I'm sure this can be chalked up to one of the tough decisions the
>   author had to make, like the omission of PVM and Condor from the book.
>   "Part IV -- Cluster Programming" is actually a very good introduction
>   to programming, debugging, and profiling MPI programs.
>
>   It's obvious that this book has no clear identity. It's like a 5th
>   grader's book report: a lifeless facsimile of what's been read,
>   totally void of originality, wisdom or topic advancement. But it's a
>   quick read because it uses small words.
>
>   Should I be this harsh? After all, cluster computing is a complex
>   subject where the answer to most questions is "it depends." However,
>   I believe that O'Reilly owed us an excellent book after their first
>   cluster gaffe, so I'm disappointed that O'Reilly took the easy way out
>   by reorganizing and watering down documentation that is available
>   elsewhere. Even the content in the exemplary Part IV can be found in
>   several other places. It's just a lot less technical and intimidating
>   here.
>
>   There are better ways to write a clusters book. I know because I've
>   read several cluster book outlines by members of the cluster
>   intelligentsia that would have been better than this offering. So I'm
>   not going easy on O'Reilly, no matter how good their intentions. The
>   cluster community has a difficult enough time assisting people with
>   clusters without books like this dynamiting the proverbial cluster
>   well. The statement on page 28, "...benchmarking is probably a
>   meaningless activity and waste of time," is just plain wrong and
>   demonstrates a glaring lack of cluster understanding.
>
>   If you really want to learn about clusters, pick up a copy of
>   Sterling's "Beowulf Cluster Computing with Linux," 2nd edition, or
>   check out Warewulf, Rocks, OSCAR, OpenMosix, and ClusterWorld online.
>   You could join a mailing list, like the Beowulf mailing list, and
>   subscribe to ClusterWorld Magazine. This is where the creators and
>   maintainers of all that is clustering hang out, announce, debate,
>   rant, create, lurk, help, and publish. If you want to be part of
>   clustering's future, then you'll check out the community's Cluster
>   Agenda and attend this year's ClusterWorld conference.
>
>   =================================================
>   Glen Otero received his Ph.D. in Microbiology and Immunology from UCLA
>   in 1995 and immediately escaped to the more temperate climes and
>   better surf in San Diego. After some research on the molecular and
>   cellular biology of HIV and Herpes viruses at the Salk Institute for
>   Biological Sciences, Glen left the wet lab research bench in 1999.
>   Although leaving the research bench, he didn't leave science
>   altogether; traveling all the way across the street to the San Diego
>   Supercomputer Center (SDSC) for a stint at the Protein Data Bank. It
>   was while at SDSC that Glen had his Linux clusters and bioinformatics
>   epiphany. Soon after that illuminating event, Glen founded Linux
>   Prophet, a bioinformatics consultancy specializing in the
>   implementation, design, and deployment of Linux Beowulf clusters in
>   the life sciences. Late in 2002 Linux Prophet evolved into Callident,
>   a Linux cluster software and high performance computing company.
>
Glen Otero Ph.D.
Linux Prophet


From ctierney at HPTI.com  Fri Feb 25 09:17:44 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Fri, 25 Feb 2005 10:17:44 -0700
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109319374.6055.17.camel@Vigor45>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
Message-ID: <1109351864.2883.22.camel@localhost.localdomain>

On Fri, 2005-02-25 at 01:16, John Hearns wrote:
> On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote:
> > Hello.  I am new to this list, and to beowulfery in general.  I am working
> > at a physics lab and we have decided to put together a relatively small
> > beowulf cluster for doing data analysis.  I was wondering if people on
> > this list could answer a couple of my newbie questions.
> > 
> > The basic idea of the system is that it would be a collection of 16 to 32
> > off-the-shelf motherboards, all booting off the network and operating
> > completely disklessly.  We're looking at amd64 architecture running
> > Debian, although we're flexible (at least with the architecture ;).  Most
> > of my questions have to do with diskless operation.
> 
> Jamie, 
>   why are you going diskless?
> IDE hard drives cost very little, and you can still do your network
> install.
> Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
> 

IDE drives fail, they use power, you waste time cloning, and
depending on the toolkit you use you will run into problems
with image consistency.

I have run large systems of both kinds.  The last system was
diskless and I don't see myself going back.  I like changing
one file in one place and having the changes show up immediately.
I like installing a package once, and having it show up immediately,
so I don't have to reclone or take the node offline to update
the image.

Craig


> 
> BTW, have a look at Clusterworld http://www.clusterworld.com
> They have a project for a low-cost cluster which is similar to your
> thoughts.
> 
> 
> Also, with the caveat that I work for a clustering company,
> why not look at a small turnkey cluster?
> I fully acknowledge that building a small cluster from scratch will be
> a good learning exercise, and you can get to grips with the motherboard,
> PXE etc. 
> However if you are spending a research grant, I'd argue that it would be
> cost effective to buy a system with support from any one of the
> companies that do this.
> If you get a prebuilt cluster, the company will have done the research
> on PXE booting, chosen gigabit interfaces and switches which perform
> well, chosen components which will last. And when your power supplies
> fail, or a disk fails someone will come round to replace them.
> And you can get on with doing your science.
> 


From alvin at iplink.net  Fri Feb 25 09:22:05 2005
From: alvin at iplink.net (alvin)
Date: Fri, 25 Feb 2005 12:22:05 -0500
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
References: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
Message-ID: <421F5EBD.2030305@iplink.net>

Alpay Kasal wrote:

>Donald Becker just mentioned "RPL", I do believe that is what I was
>referring to in my last response.
>
>  
>
>>PS: PXE is not the only embedded method of bootnic. I forget the other
>>    
>>
>method but it's often paired with onboard realtek8139's, an old intel
>standard (?) which is useless these days but still found on some new
>motherboards. Maybe someone else here knows what I'm referring to.
>
>  
>
RPL, if I remember correctly, was an IBM/Novell protocol for booting.  
There are reports it can be made to boot linux directly but I had it 
boot an etherboot image.

Etherboot is another boot protocol that is in some use. I have a funny 
feeling it was based on a Sun protocol that was around before Open Boot.

Etherboot can be put into non PXE flash bioses and has support for 
things like serial console.  There could be an argument made to rip the 
PXE out of your new motherboard BIOS and replace it with etherboot.


Alvin


From rgb at phy.duke.edu  Fri Feb 25 09:38:10 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 25 Feb 2005 12:38:10 -0500 (EST)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
References: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
Message-ID: <Pine.LNX.4.58.0502251158200.23104@ganesh.phy.duke.edu>

On Fri, 25 Feb 2005, Alpay Kasal wrote:

> Donald Becker just mentioned "RPL", I do believe that is what I was
> referring to in my last response.
> 
> >PS: PXE is not the only embedded method of bootnic. I forget the other
> method but it's often paired with onboard realtek8139's, an old intel
> standard (?) which is useless these days but still found on some new
> motherboards. Maybe someone else here knows what I'm referring to.

I imagine you're referring to bootp or etherboot.  bootp is a part of
PXE.  Most PXE implementations currently use dhcp and/or bootp followed
by tftp.  These in order do NIC TCP/IP initialization, path/config
information exchange, and retrieval of a bootable image.  At that point
control is passed to a second stage bootloader (usually in ROM BIOS)
that boots the retrieved image.  Google is as always your friend and can
find you anything from HOWTOs to Intel white papers that precisely
define the process (that I'm describing very coarsely).

The reason I put and/or is that dhcpd typically does both of the first
two steps nowadays -- it isn't necessary to have a separate bootp
daemon.  Once upon a time when Sun sold lots of diskless workstations
(by design) there was, but the loader sequence was more or less the same
but without dhcp -- a bootparamd handling bootp and network
initialization, tftp to retrieve an image in RAM, a local ROM loader to
boot it.  There is also the issue of PXE (per se) and etherboot (per se)
-- see e.g.

  http://www.ltsp.org/documentation/pxe.howto.html

Basically, this just involves what kind of image is passed back to the
boot loader -- a lzpxe image (bootable image packed up for PXE) or a nbi
(network bootable image) and what actually does the boot loading.

For MOST users, all of this is irrelevant and not worth knowing or
worrying about -- at most it alters how you prepare the bootable image
for actual transfer and loading at the other end.  To boot with PXE, you
just set up dhcp and tftp (on a server) appropriately, using
instructions available lots of places on the web.  The image you boot
and what happens afterward is completely under your control -- we
routinely boot a dos image via pxe, for example, to reflash a BIOS or run
memtest86 via PXE.  You can find a free open source DOS floppy image in a
variety of places on the web, e.g. here:

  http://www.freedos.org/freedos/files/

(Note that dos on such a floppy doesn't do very much, typically -- it
functions pretty much strictly as a bootable program loader and
execution environment with an associated (fairly small) set of BIOS
calls and resource hooks built in.  Not much of a kernel...)
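
As a rough illustration of the "set up dhcp and tftp appropriately" step, here is a minimal Python sketch that renders one ISC dhcpd.conf host entry for a PXE-booting node. The node name, MAC address, IP addresses and the boot file name pxelinux.0 are all assumptions made up for the example, not values from this thread; the dhcpd statements used (hardware ethernet, fixed-address, next-server, filename) are the standard ones, but check them against your own dhcpd version.

```python
def dhcpd_host_stanza(name, mac, ip, boot_server, boot_file="pxelinux.0"):
    """Render one ISC dhcpd.conf host entry for a PXE-booting node.

    All argument values are supplied by the caller; nothing here is
    discovered from the network.
    """
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"  next-server {boot_server};\n"   # TFTP server that holds the boot image
        f'  filename "{boot_file}";\n'      # file the PXE ROM asks TFTP for
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical node and server addresses, for illustration only.
    print(dhcpd_host_stanza("node01", "00:11:22:33:44:55",
                            "192.168.1.101", "192.168.1.1"))
```

The output is meant to be pasted into (or included from) dhcpd.conf on the boot server; the TFTP daemon still has to be configured separately to serve the named file.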

HTH -- I'm skipping a lot of detail (and may even be getting some of it
wrong:-) but you can find that detail on the web and read to your
heart's content, if it matters to you.

   rgb

> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rsweet at aoes.com  Fri Feb 25 09:33:58 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Fri, 25 Feb 2005 18:33:58 +0100 (CET)
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book Review
In-Reply-To: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
Message-ID: <Pine.LNX.4.61.0502251807340.27564@lapp-0>


Glen,

I have also had a look at the new ORA cluster book, hoping that they had 
learned their lesson, and had a similar reaction.  While I don't wish to 
discredit M. Sloan, as writing any sort of book is always going to be a lot of 
work and filled with compromise, I felt from the very beginning that the 
community can do better.  After reading over your review, which, while 
scathing, was entirely accurate, I feel resolved that the beowulf community 
_should_ do better.

Here's what I propose: let's make a "Beowulf.org Guide to Linux Clustering", 
or whatever the heck else you want to call it.   Let us outline, review, 
improve, and comment on it here on this mailing list.  Here's the hard part - 
let's also set a deadline, with realistic goals, and try to stick to it.  Let's 
assign any publishing rights or other "details" like that to the FSF or Linux 
Documentation Project.

Robert Brown has already done a lot of work on such a book, and generously 
made it freely available.  Maybe he is amenable to this being a starting 
point?

In any case, I would gladly provide hosting for something like this, and 
coordinate the project, as well as edit or write content.

There are many questions that arise:
 	Most importantly - What should go in the book?
 	In what order should these topics be covered?
 	Should there be an attempt to have a common style?
 	How (and how often) should it be revised?
 	Does the book target new beowulf admins, seasoned experts, or both and 
some in-between?
 	Should mentioning vendors be allowed? What are the guidelines?

and so on.

Firstly, now that I've proposed the idea, I'll also start by volunteering to 
write a chapter on diskless clustering.

Second, please take this opportunity to tell me why this is a bad idea, and 
while you're at it send your comments on the questions above.

regards,
-Ryan

-- 
Ryan Sweet             <ryan.sweet at aoes.com>
Advanced Operations and Engineering Services
AOES Group BV            http://www.aoes.com
Phone +31(0)71 5795521  Fax +31(0)71572 1277


From rgb at phy.duke.edu  Fri Feb 25 09:53:39 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 25 Feb 2005 12:53:39 -0500 (EST)
Subject: [Beowulf] where can i learn to build a cluster machine?
In-Reply-To: <421EDEA2.30206@mtaonline.net>
References: <421EDEA2.30206@mtaonline.net>
Message-ID: <Pine.LNX.4.58.0502251238260.23104@ganesh.phy.duke.edu>

On Thu, 24 Feb 2005, Starship Warrior wrote:

> I am totally new to clusters but have been a list member for some time 
> and read all the emails trying to learn more - so can anyone tell me 
> where there is a good how to guide - I have three machines that I would 
> like to use Linux and cluster together just to learn more - two are 3.0 
> Pentiums and the last one maybe a 3.5 or 6 not sure yet they the two 
> have the same ASUS MB  and will use SCSI  drives the third will also 
> have a ASUS MB just not sure yet what I will get
> 
> thanks for any and all information

A "good howto guide" is a tough thing to define, because of the breadth
of the problem, but here goes.

See:

  http://www.phy.duke.edu/brahma

(for example) as a cluster resource clearinghouse, with links to other
cluster resource clearinghouses such as beowulf.org, beowulf underground
and others.  In particular there are links to the beowulf howto, the
beowulf FAQ, and an online book on cluster engineering that can probably
suffice to get you started.

See also a series of columns I wrote last year for the then brand new
Cluster World Magazine on this very topic.

See also the list archives -- this is a FAQ beyond the FAQ and has been
discussed/described on list repeatedly over years.

See also sites like the warewulf site -- there are now free
cluster-in-a-box (so to speak) distributions that should permit you to
build and configure a learning cluster very easily indeed, often without
even installing an operating system image on the nodes (so they can
continue to run whatever you like except when you boot them into a
cluster).

To give you the direct answer, it goes something like the following:

  a) Hook systems into a common switched LAN e.g. an ethernet switch.
  b) If possible use decent quality PXE-aware NICs
  c) If possible use nodes with a decent amount of installed memory (>=
192 MB) although it is possible to get by with less, with effort.
  d) Node hard disk is optional for at least some installation methods
(e.g. warewulf) but is useful and enables others.
  e) At least one system NEEDS ample hard disk and will serve as a
"server" or "head node" to your cluster.  This node will manage boot
images, the distro you wish to install, NFS or other shared filesystems,
authentication, and gives you a place to "login to the cluster".  Note
that this is a sloppy requirement -- there are many different ways to
manage this and I'm just describing one of the simplest and most
straightforward ones.

It then goes like this.  Set up linux (of your choice) on the boot
server/head node.  Learn to PXE boot (dhcp, tftp) and set up head node
as boot server.  Learn to set up an installation repository for e.g.
kickstart or the distribution and packaging scheme of your choice and do
so via a mirror on your head node.  Create accounts and NFS home
directories and so on on your head node.

Then pick a kind of cluster and install it.  This could be a bootable
remote mountable node image on your server (warewulf) or kickstarting an
installation onto all your nodes native or using FAI or something else.
What you choose here depends on what linux distro you're comfortable with and
how serious a cluster you want to build.  For most beginners, I tend to
suggest either using a canned cluster (warewulf) or a simple NOW cluster
(e.g. kickstarted FC2 on all nodes).  Install "parallel clustering"
packages as desired, e.g. pvm, lam mpi or mpich, ganglia, wulfware
(xmlsysd/wulfstat), whatever.  Or not, you can actually do and learn
about clustering without them in at least some modes of operation using
just plain old compilers, modern perl, ssh, and some scripts.

That's it.  Write a parallel program (using PVM or MPI) or run a serial
program in parallel (lots of ways) and you can start to learn about
parallel speedup and scaling and everything...
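
To make "parallel speedup and scaling" concrete, here is a small Python sketch of the usual definitions: speedup is the single-node wall-clock time divided by the N-node time, and efficiency is speedup divided by N. The timings in the example are invented purely for illustration; substitute measurements from your own runs.

```python
def speedup(t_serial, t_parallel):
    """Speedup = time on one node / time on N nodes (same problem size)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_nodes):
    """Parallel efficiency = speedup / number of nodes (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_nodes


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by node count.
    timings = {1: 120.0, 2: 65.0, 3: 48.0}
    for n, t in sorted(timings.items()):
        print(n, round(speedup(timings[1], t), 2),
              round(efficiency(timings[1], t, n), 2))
```

Plotting speedup against node count for a three-node learning cluster is usually enough to see where communication overhead starts to eat into the gain.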

   rgb

> 
> Starship Warrior
> Cluster user wantabe LOL
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rsweet at aoes.com  Fri Feb 25 09:52:59 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Fri, 25 Feb 2005 18:52:59 +0100 (CET)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
Message-ID: <Pine.LNX.4.61.0502251834350.27564@lapp-0>

On Fri, 25 Feb 2005, Craig Tierney wrote:

>>   why are you going diskless?
>> IDE hard drives cost very little, and you can still do your network
>> install.
>> Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
>>
>
> IDE drives fail, they use power, you waste time cloning, and
> depending on the toolkit you use you will run into problems
> with image consistency.
>
> I have run large systems of both kinds.  The last system was
> diskless and I don't see myself going back.  I like changing
> one file in one place and having the changes show up immediately.
> I like installing a package once, and having it show up immediately,
> so I don't have to reclone or take the node offline to update
> the image.

I think the term "diskless" is sometimes the problem when discussing centrally 
installed and managed systems.  Lots of "diskless" clusters have GB and GB of 
local disks, only they are used for swap and temp I/O, not for the OS.

In 2000 I switched from locally installed system images (using the very good - 
even back then - system-imager) to using either nfsroot or warewulf style 
diskless systems, but have retained the local disk for scratch I/O.  While I 
can understand debating over the merits of nfsroot vs RAM-disk root, I fail to 
see many useful arguments for maintaining a local OS install.  However, that 
doesn't mean that local disks are bad.  It all depends upon the application, 
of course, but in many cases it's hard to beat the local disk for temporary 
I/O, especially if you don't have gobs and gobs of RAM to spare.  Also PVFS is 
sufficiently mature that you can easily combine all of the (very cheap) local 
disks into a large parallel filesystem.  Using nfsroot you can switch from one 
"system image" (really just an nfsroot file tree) to another one with a simple 
reboot. You have all of the advantages of central configuration and control 
combined with the convenience and speed of local I/O and local swap.

It can be _very_ useful in a situation where you have to support multiple user 
communities with weird apps or strange requirements.  Using pxeboot and 
pxelinux, I've set up systems where the queue system could even request that a 
node use a specific system configuration before starting the job (eg: must 
have linux 2.4 with checkpointing in the kernel).  Nodes might be available, 
but running another nfsroot cluster system image (say they are running RHEL, 
with no checkpointing, or for compatibility with some other commercial app 
they are running RH 7.2).  The queue system tells the cluster master to 
reconfigure pxelinux so that the requested nodes default to the required 
config, by pointing them at another nfsroot tree.  The cluster master tells 
the nodes to reboot, and when they are rebooted and running the appropriate 
image, the job runs.  That sort of config requires a lot of glue, but it would 
be way too much headache to even attempt without "diskless" systems.
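
The "reconfigure pxelinux" step could look roughly like the Python sketch below. It relies on two things that are standard: pxelinux looks for a per-node config file named after the node's IPv4 address in uppercase hex, and the Linux kernel accepts root=/dev/nfs nfsroot=server:/path ip=dhcp for an NFS root. Everything else (the directory layout, kernel file name, server address and export path) is a hypothetical placeholder, and this is not the glue actually used on the systems described above.

```python
import os
import socket
import tempfile


def pxelinux_cfg_name(node_ip):
    """pxelinux looks up a config file named by the IPv4 address in upper-case hex."""
    return socket.inet_aton(node_ip).hex().upper()


def render_cfg(kernel, nfs_server, nfsroot_path):
    """Build a one-entry pxelinux config that boots `kernel` with an NFS root."""
    return (
        "DEFAULT cluster\n"
        "LABEL cluster\n"
        f"  KERNEL {kernel}\n"
        f"  APPEND root=/dev/nfs nfsroot={nfs_server}:{nfsroot_path} ip=dhcp\n"
    )


def point_node_at_image(cfg_dir, node_ip, kernel, nfs_server, nfsroot_path):
    """Write the node's config via a temp file and rename, so a node that
    boots mid-update never reads a half-written file."""
    path = os.path.join(cfg_dir, pxelinux_cfg_name(node_ip))
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(render_cfg(kernel, nfs_server, nfsroot_path))
    os.replace(tmp, path)
    return path


if __name__ == "__main__":
    # A temp directory stands in for the TFTP server's pxelinux.cfg/ directory.
    with tempfile.TemporaryDirectory() as cfg_dir:
        p = point_node_at_image(cfg_dir, "192.168.1.101", "vmlinuz-2.4-ckpt",
                                "192.168.1.1", "/export/roots/ckpt24")
        print(p)
        print(open(p).read())
```

A queue-system prolog would call something like point_node_at_image() for each requested node and then trigger the reboot; that reboot and wait logic is the "glue" and is deliberately left out here.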

regards,
-Ryan

-- 
Ryan Sweet             <ryan.sweet at aoes.com>
Advanced Operations and Engineering Services
AOES Group BV            http://www.aoes.com
Phone +31(0)71 5795521  Fax +31(0)71572 1277


From lindahl at pathscale.com  Fri Feb 25 10:31:46 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 25 Feb 2005 10:31:46 -0800
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.LNX.4.61.0502251834350.27564@lapp-0>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
Message-ID: <20050225183146.GA1563@greglaptop.internal.keyresearch.com>

On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote:

> While I can understand debating over the merits of nfsroot vs RAM-disk 
> root, I fail to see many useful arguments for maintaining a local OS 
> install.

An example of something that goes very wrong with NFS is upgrading a
file to a new file with the same name. If that file is a binary or
library that's in use anywhere in the cluster, you are likely to have
a problem. Local disks and Scyld, on the other hand, do the right
thing: existing processes using the binary or library continue to use
the old version, while new ones use the new version.
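
The local-disk behaviour described here can be demonstrated with a short Python sketch, assuming a local POSIX filesystem: if the new version is written to a temporary file and renamed over the old name, anything that already has the old file open keeps seeing the old contents, while new opens get the new contents. The file name and contents are invented for the demo, and the open file handle merely stands in for a running process that has the library mapped.

```python
import os
import tempfile


def install_by_rename(path, new_bytes):
    """Install new contents under `path` without modifying the old inode.

    The new file is created next to the target and renamed into place, so
    existing open handles on the old file are left untouched.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(new_bytes)
    os.replace(tmp, path)


def demo():
    """Return (what an already-open handle reads, what a fresh open reads)."""
    with tempfile.TemporaryDirectory() as d:
        lib = os.path.join(d, "libdemo.so")
        with open(lib, "wb") as f:
            f.write(b"version 1")
        running = open(lib, "rb")        # stands in for a process using the library
        install_by_rename(lib, b"version 2")
        old_view = running.read()
        running.close()
        with open(lib, "rb") as fresh:
            new_view = fresh.read()
        return old_view, new_view


if __name__ == "__main__":
    print(demo())
```

Over NFS the same upgrade is seen by other clients only through the server, which is where the trouble described above comes from; this sketch does not attempt to reproduce that case.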

This disagreement is as old as the hills, by the way: in the good old
days, when Sun was young, lots of people ran their pizza-box
workstations diskless, but that went out of style when Ethernet's
performance was stuck in place for a bunch of years.

It's important to understand arguments you disagree with; your
dismissal is not a good sign.

> It can be _very_ useful in a situation where you have to support multiple 
> user communities with weird apps or strange requirements.

Yep. But your conclusion:

> That sort of config requires 
> a lot of glue, but it would be way too much headache to even attempt 
> without "diskless" systems.

Doesn't make any sense; I have seen people describe such systems where
they download a disk image when a batch job wants a different software
load. It's certainly doable that way: it does have different tradeoffs
from the diskless case, but if it gives you a headache, it's probably
because you don't like it, not because it's hard to do.

-- greg


From lindahl at pathscale.com  Fri Feb 25 10:34:08 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 25 Feb 2005 10:34:08 -0800
Subject: [Beowulf] Microsoft HPC survey spam?
Message-ID: <20050225183408.GA1624@greglaptop.internal.keyresearch.com>

I got 2 emails from some survey company doing an HPC survey for
Microsoft, one at pathscale.com and one at our previous name,
keyresearch.com. Am I just unlucky, or are they spamming a list of
posters to, say, this mailing list?

-- greg


From mathog at mendel.bio.caltech.edu  Fri Feb 25 11:06:58 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 25 Feb 2005 11:06:58 -0800
Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone?
Message-ID: <E1D4kni-0006I5-00@mendel.bio.caltech.edu>

After messing around with this for a couple of days, with some
very helpful messages from Donald Becker, I still had not
managed to make WOL work on the S2466.  Using pci-config it
was possible to put the onboard NIC into the D3 state and
to observe that the correct bit was set in the NIC when
a magic packet hit it. But no matter how late into the
poweroff sequence it was put into this state, once
powered off (poweroff or shutdown -h), it wouldn't power
back on after the magic packet arrived.
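
For anyone testing WOL on other hardware, the magic packet itself is simple: six 0xFF bytes followed by the target MAC address repeated sixteen times. The Python sketch below builds and broadcasts one. The MAC address is a placeholder, and UDP port 9 with the limited broadcast address is only a common convention, not something taken from the S2466 documentation.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    hw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(hw) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    return b"\xff" * 6 + hw * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Placeholder MAC address; replace with the NIC you want to wake.
    pkt = magic_packet("00:11:22:33:44:55")
    print(len(pkt))
```

Calling send_magic_packet() from another machine on the same segment is enough to exercise the NIC; whether the board then powers on is exactly the part that depends on the motherboard.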

I had emailed Tyan support but they never sent anything back.
So today I called them, and the word came down that:
(The following quotes may not be exact but are as close as I can
remember.)

  "WOL only works on the AMD MPX boards from the S1 state".

S1 state has the motherboard fully powered but the disks
may be spun down.

The rationale given for this appalling design decision
(or more likely, cover up for the design error) was that
 
  "Most people who care about such things use a remote
   management card in the machine, so WOL would have
   been redundant". 

Sure, and that's why there are so many posts from people
trying to make WOL work on the S2466 mobos!

Well, we can all stop asking because apparently you can't
get there from here.  It's possible this is still a BIOS
problem that they could fix but I'm leaning more towards the
hypothesis that they neglected to put in the hardware support
required for the NIC to actually trigger a power on
event.

The more I learn about Tyan the less I like them. Surely they've
known all there is to know about the broken WOL for at least
3 years, yet they didn't ever post the reason for the lack of WOL
function in their FAQ section for this motherboard.  Could it
possibly be that they didn't want to lose sales by admitting
the board couldn't do WOL in a manner that was of any use
to anybody?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From roger at ERC.MsState.Edu  Fri Feb 25 11:33:53 2005
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Fri, 25 Feb 2005 13:33:53 -0600 (CST)
Subject: [Beowulf] Microsoft HPC survey spam?
In-Reply-To: <20050225183408.GA1624@greglaptop.internal.keyresearch.com>
References: <20050225183408.GA1624@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.56.0502251332470.11770@Senna.ERC.MsState.Edu>

On Fri, 25 Feb 2005, Greg Lindahl wrote:

> I got 2 emails from some survey company doing an HPC survey for
> Microsoft, one at pathscale.com and one at our previous name,
> keyresearch.com. Am I just unlucky, or are they spamming a list of
> posters to, say, this mailing list?

I got one too, but I'm not convinced it's a trend yet either.  We're both
also SC attendees.

 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Sr. Systems Administrator             FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|____________________________________ERC__________________________________|


From ctierney at HPTI.com  Fri Feb 25 11:38:44 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Fri, 25 Feb 2005 12:38:44 -0700
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <421F6CAA.3040706@mail2.vcu.edu>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<421F6CAA.3040706@mail2.vcu.edu>
Message-ID: <1109360324.2883.42.camel@localhost.localdomain>

On Fri, 2005-02-25 at 11:21, Mike Davis wrote:
> Craig,
> 
> Reasons to run disks for physics work.
> 
> 1. Large tmp files and checkpoints.

Good reason, except when a node fails you lose your checkpoints.

> 
> 2. Ability for distributed jobs to continue if master node fails.

Jobs will continue to run once libraries are loaded.  They just hang
at the end.  

It all ends up being a risk assessment.  We have been up for close
to 6 months now.  We have not had a failure of the NFS server.  The
load is all at boot time, but it does very little the rest of the time.

I suspect that by making that statement I will be up at 2am tomorrow
morning replacing hardware.....

> 
> 3. saving network io for jobs rather than admin
> 
> I actually seldom update compute nodes (unless an update is required for 
> software required for research). I mount a /usr/global that does 
> contain software. I also mount /home on each node.

I guess I wasn't as clear and someone else pointed out why
disks are good.  I actually have disks in some of my compute
nodes for exactly these reasons.  However, they are only for
/tmp and swap.  

You do want to consider how you design your network and the rest
of your system to boot diskless.  Is the cost justified?  For us,
either the systems are booting and all of the IO is image IO, or
the nodes are running and reading/writing files, so the IO doesn't
interfere.  We are exporting our IO over the HSN (myrinet in this
case) so the really fast IO isn't interfering anyway.

Your /usr/global does seem to be a good solution that is half way
between having everything local and pure diskless.
> 
> An example of item 1 above are Gaussian jobs that we are now running 
> that require >40GB of tmp space. For these jobs I have both an OS 20GB 
> and tmp 100GB disk in each node. Due to a problematic scsi to ide 
> converter, I have experienced item 2 too many times with one cluster, 
> but even on the others I like knowing that work can continue even if the 
> host is down (facilitated by a separate nfs server).
> 

If you know your job load needs /tmp, disk is great.  I have never had
users that needed to use space in this way, so moving away from diskful
nodes wasn't an issue.


> Of course, I am definitely old school. I use static IP's, individual 
> passwd files. and simple scripts to handle administration.
> 

I would probably still run the system this way if it were diskful.  I have
run both ways, and diskless has made my life much easier.  Faster to
get the system up, faster to make changes, easier to deal with hardware
failures.

Craig


> Mike
> 
> 
> 
> Craig Tierney wrote:
> 
> >On Fri, 2005-02-25 at 01:16, John Hearns wrote:
> >  
> >
> >>On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote:
> >>    
> >>
> >>>Hello.  I am new to this list, and to beowulfery in general.  I am working
> >>>at a physics lab and we have decided to put together a relatively small
> >>>beowulf cluster for doing data analysis.  I was wondering if people on
> >>>this list could answer a couple of my newbie questions.
> >>>
> >>>The basic idea of the system is that it would be a collection of 16 to 32
> >>>off-the-shelf motherboards, all booting off the network and operating
> >>>completely disklessly.  We're looking at amd64 architecture running
> >>>Debian, although we're flexible (at least with the architecture ;).  Most
> >>>of my questions have to do with diskless operation.
> >>>      
> >>>
> >>Jamie, 
> >>  why are you going diskless?
> >>IDE hard drives cost very little, and you can still do your network
> >>install.
> >>Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
> >>
> >>    
> >>
> >
> >IDE drives fail, they use power, you waste time cloning, and
> >depending on the toolkit you use you will run into problems
> >with image consistency.
> >
> >I have run large systems of both kinds.  The last system was
> >diskless and I don't see myself going back.  I like changing
> >one file in one place and having the changes show up immediately.
> >I like installing a package once, and having it show up immediately,
> >so I don't have to reclone or take the node offline to update
> >the image.
> >
> >Craig
> >
> >
> >  
> >
> >>BTW, have a look at Clusterworld http://www.clusterworld.com
> >>They have a project for a low-cost cluster which is similar to your
> >>thoughts.
> >>
> >>
> >>Also, with the caveat that I work for a clustering company,
> >>why not look at a small turnkey cluster?
> >>I fully acknowledge that building a small cluster from scratch will be
> >>a good learning exercise, and you can get to grips with the motherboard,
> >>PXE etc. 
> >>However if you are spending a research grant, I'd argue that it would be
> >>cost effective to buy a system with support from any one of the
> >>companies that do this.
> >>If you get a prebuilt cluster, the company will have done the research
> >>on PXE booting, chosen gigabit interfaces and switches which perform
> >>well, chosen components which will last. And when your power supplies
> >>fail, or a disk fails someone will come round to replace them.
> >>And you can get on with doing your science.
> >>
> >>    
> >>
> >
> >
> >
> >_______________________________________________
> >Beowulf mailing list, Beowulf at beowulf.org
> >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> >
> >  
> >
> 


From eugen at leitl.org  Fri Feb 25 12:31:06 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Fri, 25 Feb 2005 21:31:06 +0100
Subject: [Beowulf] [Bioclusters] Need some advice on a cluster for EST/cDNA
	assembly, clustering (fwd from gary@www.bioinformatics.org)
Message-ID: <20050225203106.GR1404@leitl.org>

----- Forwarded message from Gary Van Domselaar <gary at www.bioinformatics.org> -----

From: Gary Van Domselaar <gary at www.bioinformatics.org>
Date: Fri, 25 Feb 2005 14:26:04 -0500 (EST)
To: bioclusters at bioinformatics.org
Subject: [Bioclusters] Need some advice on a cluster for EST/cDNA assembly,
	clustering
Reply-To: "Clustering,  compute farming & distributed computing in life science informatics" <bioclusters at bioinformatics.org>


Hey Gang,

I've been called in at the last moment to "consult" on the purchase of a
cluster for a sequencing project.  Admittedly, I know nothing about life
science clusters, despite having been subscribed to this list from its
inception.  I am not making any money on this consulting, just helping out
a neighbouring academic lab.  So what I know at this point is that they
have about $Cdn 200K to spend.  They have already talked to Sun, and Sun is
offering them a "sweetheart" deal for (something) at about $120k.  My only
exposure is to a G4/G5 cluster from BioTeam.  I am impressed with it
and it works really well for my purposes, and I'm guessing it would work
well for theirs too.  I'm guessing a Linux cluster would perform nicely
too.  The lab currently does not have a bioinformatician, but I think they
have money for one.  I'll probably just end up pointing them to Glen, Joe,
and Chris, but any advice, suggestions, and pointers to where I can get
a little more familiarity, well shucks, that would really be swell.

Decidedly,

g.
--
Gary Van Domselaar, PhD.
Postdoctoral Fellow, Computing Science
and Biological Sciences
University of Alberta
Edmonton, AB, Canada
Phone: 780-492-5969


Assistant Director, Bioinformatics.Org
gary at bioinformatics.org


_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From eugen at leitl.org  Fri Feb 25 12:42:41 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Fri, 25 Feb 2005 21:42:41 +0100
Subject: [Beowulf] Re: [Bioclusters] Re: Login & home directory strategies
	for PVM? (fwd from mgutteri@fhcrc.org)
Message-ID: <20050225204241.GU1404@leitl.org>

----- Forwarded message from Michael Gutteridge <mgutteri at fhcrc.org> -----

From: Michael Gutteridge <mgutteri at fhcrc.org>
Date: Fri, 25 Feb 2005 11:51:56 -0800
To: "Clustering,  compute farming & distributed computing in life science informatics" <bioclusters at bioinformatics.org>
Subject: Re: [Bioclusters] Re: Login & home directory strategies for PVM?
X-Mailer: Apple Mail (2.619.2)
Reply-To: "Clustering,  compute farming & distributed computing in life science informatics" <bioclusters at bioinformatics.org>


Thanks... been thinking about local homes vs. pvfs since I don't really 
need anything but .bashrc.  However, managing local home directories on
62+ nodes gets boring after a while...  I rather prefer the idea of
running pvm as you indicate, but I haven't had any luck finding out how
to do this.  Do you have a pointer to something that describes that?  I
can't even pull together a good google term to find out how that's 
typically done.

I will very likely end up using pvfs for database directories if I can 
make it robust enough, though.  Sounds like pvfs2 has some great 
improvements in that area.

>Lastly, you can port to mpich on a bproc system like Scyld, and get 
>rid of
>pvmd's altogether.

From my conversations with the developers, sounds like a port to MPI is 
underway.

Thanks ...


On Feb 24, 2005, at 11:05 AM, Michael Will wrote:

>Just statically mount /home rather than doing automounting of 
>individual homes,
>and you are fine.  Also you could run the pvmd's as a user that does 
>not require
>or have an nfs-mounted home but uses local scratch instead.
>
>Lastly, you can port to mpich on a bproc system like Scyld, and get 
>rid of
>pvmd's altogether.
>
>Michael

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From hahn at physics.mcmaster.ca  Fri Feb 25 14:18:16 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Fri, 25 Feb 2005 17:18:16 -0500 (EST)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109360324.2883.42.camel@localhost.localdomain>
Message-ID: <Pine.LNX.4.44.0502251658230.31744-100000@coffee.psychology.mcmaster.ca>

> > Reasons to run disks for physics work.
> > 1. Large tmp files and checkpoints.
> 
> Good reason, except when a node fails you lose your checkpoints.

you mean s/node/disk/ right?  sure, but doing raid1 on a "diskless"
node is not insane.  though frankly, if your disk failure rate is 
that high, I'd probably do something like intermittently store
checkpoints off-node.
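
A minimal sketch of "intermittently store checkpoints off-node", assuming the
job can call a helper at each checkpoint step.  The function name, the file
naming scheme, the use of pickle, and the idea that the off-node location is
simply a mounted directory (for example an NFS path) are all assumptions made
for illustration, not anything described in this thread.

```python
import os
import pickle
import shutil

def save_checkpoint(step, state, local_dir, offnode_dir, every=10):
    """Write every checkpoint to local disk; copy every `every`-th one off-node.

    Returns True when this step was also copied off-node.
    """
    os.makedirs(local_dir, exist_ok=True)
    name = f"ckpt-{step:06d}.pkl"  # hypothetical naming scheme
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:
        pickle.dump(state, f)

    if step % every != 0:
        return False

    # Copy under a temporary name, then rename, so that a reader never
    # picks up a half-written off-node checkpoint.
    os.makedirs(offnode_dir, exist_ok=True)
    partial = os.path.join(offnode_dir, name + ".partial")
    shutil.copyfile(local_path, partial)
    os.replace(partial, os.path.join(offnode_dir, name))
    return True
```

With every=10, a node that dies loses at most the last nine local checkpoints;
the interval is the knob that trades network IO against lost work.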

> It all ends up being a risk assessment.  We have been up for close
> to 6 months now.  We have not had a failure of the NFS server.  The

I have two nothing-installed clusters; one in use for 2+ years,
the other for about 8 months.  the older one has never had an
NFS-related problem of any kind (it's a dual-xeon with 2 u160
channels and 3 disks on each; other than scsi, nothing gold-plated.)
this cluster started out with 48 dual-xeons and a single 48pt 
100bT switch with a gigabit uplink.

the newer cluster has been noticeably less stable, mainly because 
I've been lazy.  in this cluster, there are 3 racks of 32 dual-opterons 
(fc2 x86_64) that netboot from a single head node.  each rack has a 
gigabit switch which is 4x LACP'ed to a "top" switch, which has 
one measly gigabit to the head/fileserver.  worse yet, the head/FS
is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.

as far as I can tell, you simply have to think a bit about the 
bandwidths involved.  the first cluster has many nodes connected
via thin pipes, aggregated through a switch to gigabit
connecting to decent on-server bandwidth.

the second cluster has lots more high-bandwidth nodes, connected 
through 12 incoming gigabits, bottlenecked down to a single 
connection to the head/file server (which is itself poorly configured).
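
The bandwidth comparison above can be put into rough numbers.  This is only a
back-of-the-envelope sketch: the node counts and link speeds come from the
description above, but the assumption that each dual-opteron node has a single
gigabit NIC is mine and is not stated.

```python
GBIT = 1000  # link speeds in Mb/s

def oversubscription(offered_mbps, bottleneck_mbps):
    """Ratio of the bandwidth that can be offered to the bottleneck link."""
    return offered_mbps / bottleneck_mbps

# Older cluster: 48 nodes on 100bT, one gigabit uplink to the server.
old_cluster = oversubscription(48 * 100, 1 * GBIT)

# Newer cluster: 3 racks, each with 4 LACP'ed gigabit links to the top
# switch, and a single gigabit link from there to the head/fileserver.
new_top_switch = oversubscription(3 * 4 * GBIT, 1 * GBIT)

# Same cluster measured at the nodes (assumes one gigabit NIC per node).
new_end_to_end = oversubscription(3 * 32 * GBIT, 1 * GBIT)

print(f"old cluster:            {old_cluster:.1f}:1")
print(f"new cluster, top switch: {new_top_switch:.1f}:1")
print(f"new cluster, at nodes:   {new_end_to_end:.1f}:1")
```

Under those assumptions the older cluster is about 4.8:1 oversubscribed at the
server link, while the newer one is 12:1 at the top switch and 96:1 measured
from the nodes.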

one obvious fix to the latter is to move some IO load onto
a second fileserver, which I've done.  great increase in stability,
though enough IO from enough nodes can still cause problems.
shortly I'll have logins, home directories and work/scratch all on 
separate servers.

for a more scalable system, I would put a small fileserver in each rack, 
but still leave the compute nodes nothing-installed.  I know that 
the folks at RQCHP/Sherbrooke have done something like this, very nicely,
for their serial farm.  it does mean you have a potentially significant
number of other servers to manage, but they can be identically configured.
heck, they could even net-boot and just grab a copy of the compute-node
filesystems from a central source.  the Sherbrooke solution involves 
smart automation of the per-rack server for staging user files as well
(they're specifically trying to support parameterized montecarlo runs.)

regards, mark hahn.


From ctierney at HPTI.com  Fri Feb 25 15:02:06 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Fri, 25 Feb 2005 16:02:06 -0700
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.LNX.4.44.0502251658230.31744-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502251658230.31744-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1109372526.2883.81.camel@localhost.localdomain>

On Fri, 2005-02-25 at 15:18, Mark Hahn wrote:
> > > Reasons to run disks for physics work.
> > > 1. Large tmp files and checkpoints.
> > 
> > Good reason, except when a node fails you lose your checkpoints.
> 
> you mean s/node/disk/ right?  sure, but doing raid1 on a "diskless"
> node is not insane.  though frankly, if your disk failure rate is 
> that high, I'd probably do something like intermittently store
> checkpoints off-node.

Yes and no.  If the node is down, it is a bit tough for your model
to progress.  Raid1 works well enough in software so that you
don't need additional hardware except the disk.

> 
> > It all ends up being a risk assessment.  We have been up for close
> > to 6 months now.  We have not had a failure of the NFS server.  The
> 
> I have two nothing-installed clusters; one in use for 2+ years,
> the other for about 8 months.  the older one has never had an
> NFS-related problem of any kind (it's a dual-xeon with 2 u160
> channels and 3 disks on each; other than scsi, nothing gold-plated.)
> this cluster started out with 48 dual-xeons and a single 48pt 
> 100bT switch with a gigabit uplink.
> 
> the newer cluster has been noticeably less stable, mainly because 
> I've been lazy.  in this cluster, there are 3 racks of 32 dual-opterons 
> (fc2 x86_64) that netboot from a single head node.  each rack has a 
> gigabit switch which is 4x LACP'ed to a "top" switch, which has 
> one measly gigabit to the head/fileserver.  worse yet, the head/FS
> is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel.
> 
> as far as I can tell, you simply have to think a bit about the 
> bandwidths involved.  the first cluster has many nodes connected
> via thin pipes, aggregated through a switch to gigabit
> connecting to decent on-server bandwidth.
> 
> the second cluster has lots more high-bandwidth nodes, connected 
> through 12 incoming gigabits, bottlenecked down to a single 
> connection to the head/file server (which is itself poorly configured).
> 
> one obvious fix to the latter is to move some IO load onto
> a second fileserver, which I've done.  great increase in stability,
> though enough IO from enough nodes can still cause problems.
> shortly I'll have logins, home directories and work/scratch all on 
> separate servers.
> 
> for a more scalable system, I would put a small fileserver in each rack, 
> but still leave the compute nodes nothing-installed.  I know that 
> the folks at RQCHP/Sherbrooke have done something like this, very nicely,
> for their serial farm.  it does mean you have a potentially significant
> number of other servers to manage, but they can be identically configured.
> heck, they could even net-boot and just grab a copy of the compute-node
> filesystems from a central source.  the Sherbrooke solution involves 
> smart automation of the per-rack server for staging user files as well
> (they're specifically trying to support parameterized montecarlo runs.)

Sandia does something similar to this with their CIT toolkit,
but it is still diskless.  For every N nodes, they have an
NFS-redirector.  It boots diskless, and caches all of the files
that the clients read.  The clients hit the redirector, and not
the main filesystem.  

If you do have a disk in these nodes, there are probably some
interesting things you can do with CacheFS when it becomes stable.

Craig


From rgb at phy.duke.edu  Fri Feb 25 16:25:37 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 25 Feb 2005 19:25:37 -0500 (EST)
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book
	Review
In-Reply-To: <Pine.LNX.4.61.0502251807340.27564@lapp-0>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
Message-ID: <Pine.LNX.4.58.0502251920001.5780@lilith.rgb.private.net>

On Fri, 25 Feb 2005, Ryan Sweet wrote:

> 
> Glenn,
> 
> I have also had a look at the new ORA cluster book, hoping that they had 
> learned their lesson, and had a similar reaction.  While I don't wish to 
> discredit M. Sloan, as writing any sort of book is always going to be a lot of 
> work and filled with compromise, I felt from the very beginning that the 
> community can do better.  After reading over your review, which, while 
> scathing, was entirely accurate, I feel resolved that the beowulf community 
> _should_ do better.
> 
> Here's what I propose: let's make a "Beowulf.org Guide to Linux Clustering", 
> or whatever the heck else you want to call it.   Let us outline, review, 
> improve, and comment on it here on this mailing list.  Here's the hard part - 
> let's also set a deadline, with realistic goals, and try to stick to it.  Let's 
> assign any publishing rights or other "details" like that to the FSF or Linux 
> Documentation Project.
> 
> Robert Brown has already done a lot of work on such a book, and generously 
> made it freely available.  Maybe he is amenable to this being a starting 
> point?

Sure.  I periodically solicit help for such a project on the list --
this is the first time somebody has solicited me:-)

My experience is that it is really pretty difficult to get people to
actually contribute content.  However, I've already got a very decent
start going, I think, and as always if anybody wants to contribute
content (under the OPL it is published under) I will cheerily include
it, with attribution.

Based on Glenn's comments, I was actually feeling (once again) like I
ought to try to shake free enough time to do another full pass through
the content to bring it up to date and see if I can finish off some of
the missing chapters and -- possibly -- seek a paper publisher.  I want
to keep it online/free either way (and there are publishers out there
that are comfortable with this) but a lot of people want to own a paper
copy of stuff like this.  I get a lot of requests for a printable PDF
from people all over who found the html with google but missed the
online pdf images right next door...

> 
> In any case, I would gladly provide hosting for something like this, and 
> coordinate the project, as well as edit or write content.
> 
> There are many questions that arise:
>  	Most importantly - What should go in the book?
>  	In what order should these topics be covered?
>  	Should there be an attempt to have a common style?
>  	How (and how often) should it be revised?
>  	Does the book target new beowulf admins, seasoned experts, or both and 
> some in-between?
>  	Should mentioning vendors be allowed? What are the guidelines?
> 
> and so on.
> 
> Firstly, now that I've proposed the idea, I'll also start by volunteering to 
> write a chapter on diskless clustering.
> 
> Second, please take this opportunity to tell me why this is a bad idea, and 
> while you're at it send your comments on the questions above.

It isn't a bad idea, but I've had open requests for content on the list
for years now, so don't hold your breath.  My personal experience is
that if you want something written, you gotta write it yourself;-)

  rgb

> 
> regards,
> -Ryan
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Fri Feb 25 16:53:01 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 25 Feb 2005 19:53:01 -0500 (EST)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
	<20050225183146.GA1563@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.58.0502251927560.5780@lilith.rgb.private.net>

On Fri, 25 Feb 2005, Greg Lindahl wrote:

> On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote:
> 
> > While I can understand debating over the merits of nfsroot vs RAM-disk 
> > root, I fail to see many useful arguments for maintaining a local OS 
> > install.
> 
> An example of something that goes very wrong with NFS is upgrading a
> file to a new file with the same name. If that file is a binary or
> library that's in use anywhere in the cluster, you are likely to have
> a problem. Local disks and Scyld, on the other hand, do the right
> thing: existing processes using the binary or library continue to use
> the old version, while new ones use the new version.
> 
> This disagreement is as old as the hills, by the way: in the good old
> days, when Sun was young, lots of people ran their pizza-box
> workstations diskless, but that went out of style when Ethernet's
> performance was stuck in place for a bunch of years.
> 
> It's important to understand arguments you disagree with; your
> dismissal is not a good sign.

To add to Greg's remarks, yes diskless is a perfectly valid way to
structure a LAN (and I'm, alas, as old as the hills and actually ran
whole networks with sparc 1's, SLC's and ELC's diskless).  They typically
only had 4 MB or so of RAM (in later days as much as 16 or 32) and
actually did remote swap as well as NFS home directories and binaries
and so forth.  They worked amazingly well given the times.

The notion of "thin clients" goes back even before Sparcs -- I once ran a
weird IBM PC clone in the mid-80s that had a proprietary "network"
interface, a hook into the bios of a host PC, and ran "diskless" by
leeching on the floppies and hard drive of the host.  It was marginally
cheaper than getting a regular PC with its own hard drive because back
then disk cost a mint -- all peripherals cost a mint.  PC's cost
thousands of 1980's dollars.  

I've run diskless linux clusters off and on as well, mostly from
necessity.  If you have enough memory it's good.

Still, I actually think that there are excellent reasons to consider and
perform installs to local disk.  Robustness, speed, ease of installation
and maintenance with tools like PXE, kickstart and yum have taken just
about all the sting out of it.  Having local swap is good.  Having local
scratch is good.  Decreasing memory occupancy may be good -- having
local disk means local paging is possible with a small performance edge
(depending on your network and so forth).  Thin clients have been
proposed periodically over the years, but they never quite take off --
it is just too damn convenient to have some measure of local robustness,
and hard disks are cheap.

> 
> > It can be _very_ useful in a situation where you have to support multiple 
> > user communities with weird apps or strange requirements.
> 
> Yep. But your conclusion:
> 
> > That sort of config requires 
> > a lot of glue, but it would be way too much headache to even attempt 
> > without "diskless" systems.
> 
> Doesn't make any sense; I have seen people describe such systems where
> they download a disk image when a batch job wants a different software
> load. It's certainly doable that way: it does have different tradeoffs
> from the diskless case, but if it gives you a headache, it's probably
> because you don't like it, not because it's hard to do.

Ya.  Right now I think it is kinda cool just how MUCH one can do, all of
it pretty easy.  In the old days it was God's Own PITA to set up
diskless anything -- I wrote all kinds of stuff myself to get systems to
boot diskless and at least I COULD do it because I'd run Suns from the
old days, and then there was management of "packages" (binary
architectures and shared this and that) with no particular organization
on top of that. You young'uns have it all easy -- several different
toolsets to choose from to set up diskless operation, several different
toolsets to choose from to set up disk and manage packages, and far more
homogeneous hardware.

I certainly don't think that diskless is a knee-jerk obvious choice for
either LANs or clusters, although sure, there are some advantages to it
(just as there are some advantages to having local disks).


   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jmdavis at mail2.vcu.edu  Fri Feb 25 10:21:30 2005
From: jmdavis at mail2.vcu.edu (Mike Davis)
Date: Fri, 25 Feb 2005 13:21:30 -0500
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
Message-ID: <421F6CAA.3040706@mail2.vcu.edu>

Craig,

Reasons to run disks for physics work.

1. Large tmp files and checkpoints.

2. Ability for distributed jobs to continue if master node fails.

3. saving network io for jobs rather than admin

I actually seldom update compute nodes (unless an update is required for 
software required for research). I mount a /usr/global that does 
contain software. I also mount /home on each node.

An example of item 1 above are Gaussian jobs that we are now running 
that require >40GB of tmp space. For these jobs I have both an OS 20GB 
and tmp 100GB disk in each node. Due to a problematic scsi to ide 
converter, I have experienced item 2 too many times with one cluster, 
but even on the others I like knowing that work can continue even if the 
host is down (facilitated by a separate nfs server).

Of course, I am definitely old school. I use static IP's, individual 
passwd files, and simple scripts to handle administration.

Mike


Craig Tierney wrote:

>On Fri, 2005-02-25 at 01:16, John Hearns wrote:
>  
>
>>On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote:
>>    
>>
>>>Hello.  I am new to this list, and to beowulfery in general.  I am working
>>>at a physics lab and we have decided to put together a relatively small
>>>beowulf cluster for doing data analysis.  I was wondering if people on
>>>this list could answer a couple of my newbie questions.
>>>
>>>The basic idea of the system is that it would be a collection of 16 to 32
>>>off-the-shelf motherboards, all booting off the network and operating
>>>completely disklessly.  We're looking at amd64 architecture running
>>>Debian, although we're flexible (at least with the architecture ;).  Most
>>>of my questions have to do with diskless operation.
>>>      
>>>
>>Jamie, 
>>  why are you going diskless?
>>IDE hard drives cost very little, and you can still do your network
>>install.
>>Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
>>
>>    
>>
>
>IDE drives fail, they use power, you waste time cloning, and
>depending on the toolkit you use you will run into problems
>with image consistency.
>
>I have run large systems of both kinds.  The last system was
>diskless and I don't see myself going back.  I like changing
>one file in one place and having the changes show up immediately.
>I like installing a package once, and having it show up immediately,
>so I don't have to reclone or take the node offline to update
>the image.
>
>Craig
>
>
>  
>
>>BTW, have a look at Clusterworld http://www.clusterworld.com
>>They have a project for a low-cost cluster which is similar to your
>>thoughts.
>>
>>
>>Also, with the caveat that I work for a clustering company,
>>why not look at a small turnkey cluster?
>>I fully acknowledge that building a small cluster from scratch will be
>>a good learning exercise, and you can get to grips with the motherboard,
>>PXE etc. 
>>However if you are spending a research grant, I'd argue that it would be
>>cost effective to buy a system with support from any one of the
>>companies that do this.
>>If you get a prebuilt cluster, the company will have done the research
>>on PXE booting, chosen gigabit interfaces and switches which perform
>>well, chosen components which will last. And when your power supplies
>>fail, or a disk fails someone will come round to replace them.
>>And you can get on with doing your science.
>>
>>    
>>
>
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>  
>


From dld at cmb.usc.edu  Fri Feb 25 10:29:35 2005
From: dld at cmb.usc.edu (Drake Diedrich)
Date: Fri, 25 Feb 2005 10:29:35 -0800
Subject: [Beowulf] Re: motherboards for diskless nodesy
In-Reply-To: <Pine.LNX.4.62.0502242038180.18632@localhost.localdomain>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<20050225010053.GA31456@app1.cmb.usc.edu>
	<Pine.LNX.4.62.0502242038180.18632@localhost.localdomain>
Message-ID: <20050225182935.GC31456@app1.cmb.usc.edu>

On Thu, Feb 24, 2005 at 10:03:59PM -0500, Donald Becker wrote:
> 
> The specific problem here is very likely the PXE server implementation,
> not the client side.
> 

   Ah, thanks.  I'd read a bit about some of the alternate paths PXE used,
but didn't realize there were quite so many bugs in Intel's various
implementations.  I'm currently using Tim Hurman's PXE 1.42, ISC DHCPD 3.0.1
on the same segment (and if I think about it, I bet I'll find a race in
there somewhere), and Jean-Pierre Lefebvre's atftpd 3.0.1.  It's working
well on many of our systems, and works poorly for the largest class (the
compute nodes). The next time I have to do a major install I may try to get
it working better on whatever class of machine is being installed that day.

> 
> >Spending a couple gigs of that for a locally installed O/S isn't much of a
> >drama, especially on ~16 nodes.
> 
> But it's the long-term administrative effort that costs, not the disk
> hardware.  The need to maintain and update a persistent local O/S is the
> root of most of that cost.
> 

   For small clusters though, local admin just isn't that much of a burden
compared to the hassles of writing your own distributed filesystem, testing
images, scheduling reboots, or crashing long-term jobs.  I'm using Makefile
targets for each request, so I can later make the same changes to new nodes. 
eg:

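# Install GSL on every host listed in the file "nodes" (one hostname per
# line).  Recipe lines must begin with a tab.
.PHONY: gsl
gsl:
	for h in `cat nodes` ; do \
		ssh $$h apt-get install -y libgsl0 libgsl0-dev ; \
	done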

   There are more elaborate cluster management/install systems (FAI,
cfengine, ...), dsh to perform ssh in parallel, etc, but for a small
research cluster with installation requirements that change daily, being
able to make simple changes in-flight without any testing or scheduling
updates, getting administrative approval, or really doing any of that hard
stuff turns day-long tasks into a few minutes.
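
For comparison, a rough sketch of what running that install over ssh in
parallel might look like, in the spirit of dsh.  This is not dsh itself; the
function names, the default worker count, and the injectable runner argument
are assumptions made for illustration.  It builds the same ssh + apt-get
command as the Makefile target and reads the same "nodes" file.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def install_command(host, packages):
    """Build the ssh + apt-get command that the Makefile target runs per host."""
    return ["ssh", host, "apt-get", "install", "-y", *packages]

def read_nodes(path="nodes"):
    """Return the hostnames listed in the nodes file, split on whitespace."""
    with open(path) as f:
        return f.read().split()

def run_on_nodes(hosts, packages, runner=subprocess.run, max_workers=8):
    """Run the install on all hosts concurrently; return {host: exit status}.

    `runner` defaults to subprocess.run and can be swapped out for a dry run.
    """
    def one(host):
        result = runner(install_command(host, packages))
        return host, result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, hosts))

# Example use (needs a "nodes" file and working ssh access to each host):
#   statuses = run_on_nodes(read_nodes(), ["libgsl0", "libgsl0-dev"])
#   failed = [h for h, rc in statuses.items() if rc != 0]
```

Collecting the exit status per host is the main thing the plain shell loop
lacks: a node that was down during the install shows up in the result
instead of scrolling past.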

User: can I have...
Reply 2 minutes later: installed.

   It's not quite as simple as a single system image, but it's only about
twice as much work as doing one node and retains all the flexibility, and
doesn't require rebooting or re-imaging nodes and killing jobs.

> >deleted when no longer in use.  NFS (being stateless) doesn't have
> >this behavior, so after an update you may occaisionally have
> >jobs/daemons when they try to page in a file that has already been
               ^die  [oops]
> >replaced.
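
The local-disk half of the behaviour quoted above can be seen directly with a
small sketch.  It assumes a local POSIX filesystem; the file name is
hypothetical, and it does not attempt to reproduce the NFS failure, only the
"old handle keeps the old contents" case.

```python
import os
import tempfile

def replace_while_open(directory):
    """Replace a file by rename while an older handle is still open.

    Returns (contents seen through the old handle, contents seen by a new open).
    """
    path = os.path.join(directory, "libdemo.so")  # hypothetical file name
    with open(path, "w") as f:
        f.write("old version")

    old_handle = open(path)  # stands in for a process already using the file

    # Write the new version beside the old one, then rename it into place.
    new_path = path + ".new"
    with open(new_path, "w") as f:
        f.write("new version")
    os.replace(new_path, path)

    try:
        seen_by_old = old_handle.read()
    finally:
        old_handle.close()
    with open(path) as f:
        seen_by_new = f.read()
    return seen_by_old, seen_by_new

with tempfile.TemporaryDirectory() as scratch:
    print(replace_while_open(scratch))
```

On a local filesystem the old handle should still read the old contents,
because the replaced file's data is only freed once the last open handle goes
away, while a fresh open sees the new file.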

> 
> NFS isn't bad.  Nor does it necessarily doom a server to unbearable
> loads.  For some types of file access, especially read-only access to
> small (<8KB) configuration files such as ~/.foo.conf, it's pretty close
> to optimal.

   Oh, NFS is actually pretty good (most of gigE wire speed on large files),
and I really like being able to do maintenance on the fileservers without
killing jobs (shutdown and replace dead disks, switch kernels, etc).  It's
our fileservers that can't keep up: the cluster is able to pound them into
the ground over NFS.  If client-side NFS were worse, our fileservers would
remain responsive during major job launches.  :) Having users with on the
order of a million small files each, most of which they try to open during
the course of their jobs, is pretty damaging.  All of this is research code,
so it tends to get written once (data not consolidated in a database), run
once on the cluster, and then published.  Localizing the damage helps,
RAID10 helps, and convincing people to stage off a sacrificial scratch
fileserver also helps.  There are some relatively new distributed
filesystems out there (Lustre, GFS, ...) that might survive this load
better, but we haven't tested them, some aren't really unix filesystems at
all, and we are a long way from ready to commit /home to one.


From ballew at sublinear.net  Fri Feb 25 10:54:31 2005
From: ballew at sublinear.net (Mark C. Ballew)
Date: Fri, 25 Feb 2005 10:54:31 -0800
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book
	Review
In-Reply-To: <Pine.LNX.4.61.0502251807340.27564@lapp-0>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
Message-ID: <1109357671.4880.48.camel@sport>

On Fri, 2005-02-25 at 18:33 +0100, Ryan Sweet wrote:
> Glenn,
> 
> I have also had a look at the new ORA cluster book, hoping that they had 
> learned their lesson, and had a similar reaction.  While I don't wish to 
> discredit M. Sloan, as writing any sort of book is always going to be a lot of 
> work and filled with compromise, I felt from the very beginning that the 
> community can do better.  After reading over your review, which, while 
> scathing, was entirely accurate, I feel resolved that the beowulf community 
> _should_ do better.
> 
> Here's what I propose: let's make a "Beowulf.org Guide to Linux Clustering", 
> or whatever the heck else you want to call it.   Let us outline, review, 
> improve, and comment on it here on this mailing list.  Here's the hard part - 
> let's also set a deadline, with realistic goals, and try to stick to it.  Let's 
> assign any publishing rights or other "details" like that to the FSF or Linux 
> Documentation Project.
> 
> Robert Brown has already done a lot of work on such a book, and generously 
> made it freely available.  Maybe he is amenable to this being a starting 
> point?
> 
> In any case, I would gladly provide hosting for something like this, and 
> coordinate the project, as well as edit or write content.
> 
> There are many questions that arise:
>  	Most importantly - What should go in the book?
>  	In what order should these topics be covered?
>  	Should there be an attempt to have a common style?
>  	How (and how often) should it be revised?
>  	Does the book target new beowulf admins, seasoned experts, or both and 
> some in-between?
>  	Should mentioning vendors be allowed? What are the guidelines?
> 
> and so on.
> 
> Firstly, now that I've proposed the idea, I'll also start by volunteering to 
> write a chapter on diskless clustering.
> 
> Second, please take this opportunity to tell me why this is a bad idea, and 
> while you're at it send your comments on the questions above.

I think a community-written book is a spectacular idea. The question I
have is would it be better to just do a web-based book since the beowulf
community is basically a moving target, or just put these into printed
"editions" as well as a Copyleft'd book?

I volunteer for any editing or proofing if such a project comes to life.

What goes in the book? Types of clusters, cluster purposes, cluster
interconnects, common issues (HVAC, user admin), and scaling are
some things that come to mind.

Order? Start with the basics. What is a cluster? Why clusters and not
SMP beasts? What types are there? What software is there? 

Style: Good question

How often should it be revised? If it is web and dead tree based, the
web version would constantly be updated with perhaps a yearly dead tree
revision?

Who does it target? I think a book that targeted beowulf admins with
some stuff for the seasoned expert would be good. New beowulf admins
often find themselves swamped with options. Seasoned admins join the
mailing list.

Vendors: I think that vendors should be allowed but only if there is a
balance between free options and vendors, or in the case of hardware,
why you'd go with a particular vendor (Myrinet vs. Infiniband, etc.). I
think something like "you should go to PSSC or Penguin Computing" is a
little bit of a stretch beyond an appendix on various vendors.

Mark
-- 
Mark C. Ballew                          Reno, Nevada    
ballew at sublinear.net                    http://markballew.com
PGP: 0xB2A33008                         AIM: pdx110

From john.hearns at streamline-computing.com  Sat Feb 26 02:50:48 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat, 26 Feb 2005 10:50:48 +0000
Subject: [Beowulf] O'Reilly Clusters Book Review
In-Reply-To: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
Message-ID: <1109415052.6688.12.camel@ip13.2214.h2.fosdem.lan>

On Fri, 2005-02-25 at 00:23 -0800, Glen Otero wrote:
> 
> 
> ______________________________________________________________________
> 
> My review of O'Reilly's latest clusters book published at HPCwire
> (http://www.tgc.com/hpcwire.html):
> 

Glen,
  I am preparing a review of this book for the UK Unix Users Group.

I haven't finished reading your review yet, but I feel I should say that
I don't agree with the tone.
I'm sure that you have valid points - yes I agree that Myrinet and
Infiniband should not be dismissed as 'emerging'.

But on balance I think it is a decent book, and I would recommend it.
In fact I have flagged it up on this list.

If I was to make a criticism, it would be about the section on MPI
programming. There's nothing wrong with this - in fact I will refer to
it myself. I would just have made it shorter and put some references in
to existing tutorials on the web, or textbooks. But hey, I didn't write
or edit the book.


From john.hearns at streamline-computing.com  Sat Feb 26 02:55:30 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat, 26 Feb 2005 10:55:30 +0000
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
Message-ID: <1109415331.6688.16.camel@ip13.2214.h2.fosdem.lan>

On Fri, 2005-02-25 at 10:17 -0700, Craig Tierney wrote:

> >   why are you going diskless?
> > IDE hard drives cost very little, and you can still do your network
> > install.
> > Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
> > 
> 
> IDE drives fail, they use power, you waste time cloning, and
> depending on the toolkit you use you will run into problems
> with image consistency.
I agree - heck, I work with large Beowulves every day.

But listen to what I said. For THIS APPLICATION in a small lab,
where a researcher is looking to homebrew a system, 
I firmly believe that putting IDE drives on each node and then
installing over the network is the way ahead.

We here on the Beowulf list can argue the benefits of diskless versus
disks. 
But for someone who just wants to get something working and off the
ground, I say go the 'conventional' route.


> I have run large systems of both kinds.  The last system was
> diskless and I don't see myself going back.  I like changing
> one file in one place and having the changes show up immediately.
> I like installing a package once, and having it show up immediately,
> so I don't have to reclone or take the node offline to update
> the image.
Why take a node offline to do an update on a disk system?


From john.hearns at streamline-computing.com  Sat Feb 26 02:59:57 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat, 26 Feb 2005 10:59:57 +0000
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
	<20050225183146.GA1563@greglaptop.internal.keyresearch.com>
Message-ID: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>

On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:

> 
> Doesn't make any sense; I have seen people describe such systems where
> they download a disk image when a batch job wants a different software
> load. It's certainly doable that way: it does have different tradeoffs
> from the diskless case, but if it gives you a headache, it's probably

I've always dreamed of using User Mode Linux images for this.
In a Grid-based world, prepare a UML instance which has all the
libraries and runtime to run your code. Ship it across the grid with
your executable. 
The cluster at the receiving end can be running any distribution - it
runs your UML in a sandbox.

And before anyone says it, yes performance would be a dog,
and I don't see how UML could access all those nice Myrinet and
Infiniband cards. So I'm definitely blue-skying.


From john.hearns at streamline-computing.com  Sat Feb 26 03:04:47 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat, 26 Feb 2005 11:04:47 +0000
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.LNX.4.44.0502251658230.31744-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.44.0502251658230.31744-100000@coffee.psychology.mcmaster.ca>
Message-ID: <1109415887.6688.24.camel@ip13.2214.h2.fosdem.lan>

On Fri, 2005-02-25 at 17:18 -0500, Mark Hahn wrote:
> you mean s/node/disk/ right?  sure, but doing raid1 on a "diskless"
> node is not insane.  though frankly, if your disk failure rate is 
> that high, I'd probably do something like intermittently store
> checkpoints off-node.

We recently put in mirrored system disks on a cluster.
The nodes in question are beefy SunFire V40z's, and the cluster is
intended to run long (i.e. week-long) jobs; the concern was raised
about long jobs failing due to a disk failure.


From atp at piskorski.com  Sat Feb 26 13:32:17 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Sat, 26 Feb 2005 16:32:17 -0500
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
References: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
Message-ID: <20050226213217.GA85119@piskorski.com>

On Sat, Feb 26, 2005 at 10:59:57AM +0000, John Hearns wrote:
> On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:

> > Doesn't make any sense; I have seen people describe such systems where
> > they download a disk image when a batch job wants a different software
> > load. It's certainly doable that way: it does have different tradeoffs
> > from the diskless case, but if it gives you a headache, it's probably
> 
> I've always dreamed of using User Mode Linux images for this.

> And before anyone says it, yes performance would be a dog,

In that case, you should look into Xen.  I haven't heard of anyone
using it for HPC yet, but if I remember right, they claim only a 3% or
so performance loss running Linux virtualized under Xen vs. running on
the bare metal:

  http://www.cl.cam.ac.uk/Research/SRG/netos/xen/

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From mathog at mendel.bio.caltech.edu  Sat Feb 26 16:06:51 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Sat, 26 Feb 2005 16:06:51 -0800
Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone?
Message-ID: <E1D5BxT-0001HM-00@mendel.bio.caltech.edu>


> 
> A much clearer demonstration of this failure is to run "shutdown -h 0" 
> and then attempt to power the machine on again at the front panel switch.
> 
> Apparently the state machine that controls the power on these things 
> becomes locked in its "shutdown" state and never enters its "cold and 
> ready to be booted" state.

There is a solution to the problem you described.   When using
a 2.6.x kernel, force the "button" module to load by placing the
line

button

in

  /etc/modprobe.preload

and then run

reboot
poweroff

At that point you can use the front power switch to restart
the machine after a poweroff.  At least, it does with ACPI built into
the kernel and "acpi=on" present on the boot line. 
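
For several nodes, the edit to the preload file can be scripted so that running it twice does no harm. This is only a sketch: ensure_line is a hypothetical helper name, and /etc/modprobe.preload is the path given in this message; other distributions may keep their module preload list elsewhere.

```shell
#!/bin/sh
# ensure_line LINE FILE: append LINE to FILE unless an identical line
# is already present. Hypothetical helper, not from the original post.
ensure_line() {
    # -x matches whole lines only, -F treats LINE as a fixed string
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# On the node itself, as root, the step described above would be:
#   ensure_line button /etc/modprobe.preload
```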

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mathog at mendel.bio.caltech.edu  Sat Feb 26 16:33:34 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Sat, 26 Feb 2005 16:33:34 -0800
Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone?
Message-ID: <E1D5CNK-0001KQ-00@mendel.bio.caltech.edu>

> We sold a lot of those boards, and I could not remember any claims for 
> WOL support, so I went looking through the literature.

The _current_ literature says nothing.  The literature when
we bought this did.  Google for S2466 and WOL and you'll
find the old data sheets still available on the web, for
instance here:

http://www.bellmicro.com/product/HotProduct/tyan/d_s2466_210.pdf

that indicate the presence of a 3 pin WOL header. What is the point of
having a WOL header if the board won't do WOL?  Note that they still don't
claim that the board won't do WOL, only that it will only WOL 
from a state where the capability is of no use.

As you point out, that line is missing from the current literature
for the board here:

ftp://ftp.tyan.com/datasheets/d_s2466_270.pdf

so apparently somewhere between _210 and _270 WOL was
eliminated from the product sheet.  They fessed up to
(and fixed) the USB problem but seem to have swept the WOL
problem under the rug.  Or maybe the original documentation
was in error?  Either way, the story changed.

> I see no reference to WOL.
> Looking at their FAQ I see no mention of WOL:
> http://www.tyan.com/support/html/f_s2466.html

My point exactly.  Even if they don't support it people have
asked enough that it should be in the FAQ.

> While I understand that you want this to work, I see no reason to 
> complain that they do not properly support something they never claimed 
> to provide.

But they used to claim WOL, they just don't anymore.

> Could it be that you are expecting too much? And bitter about not 
> getting your way?

Nope, bitter about a company removing WOL from the product spec
rather than admitting explicitly, if only in the FAQ, that
it wasn't supported.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From maurice at harddata.com  Fri Feb 25 22:10:48 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 25 Feb 2005 23:10:48 -0700
Subject: [Beowulf] Re: RE: S2466 Wake on Lan working, anyone?
In-Reply-To: <200502252000.j1PK08cJ025090@bluewest.scyld.com>
References: <200502252000.j1PK08cJ025090@bluewest.scyld.com>
Message-ID: <422012E8.8030406@harddata.com>

"David Mathog" <mathog at mendel.bio.caltech.edu> wrote:

>The more I learn about Tyan the less I like them. Surely they've
>known all there is to know about the broken WOL for at least
>3 years, yet they didn't ever post the reason for the lack of WOL
>function in their FAQ section for this motherboard.  Could it
>possibly be that they didn't want to lose sales by admitting
>the board couldn't do WOL in a manner that was of any use
>to anybody?
>
We sold a lot of those boards, and I could not remember any claims for 
WOL support, so I went looking through the literature.
Can you show me an instance of something where Tyan claim they provide 
WOL support on this board?
Looking here:
http://www.tyan.com/products/html/tigermpx.html
ftp://ftp.tyan.com/datasheets/d_s2466_270.pdf

I see no reference to WOL.
Looking at their FAQ I see no mention of WOL:
http://www.tyan.com/support/html/f_s2466.html

They DO sell a server management card that provides equivalent 
capability plus other features:
http://www.tyan.com/products/html/m3289.html

While I understand that you want this to work, I see no reason to 
complain that they do not properly support something they never claimed 
to provide.
That they even tried to help at their tech support strikes me as going 
beyond the call of duty.

Could it be that you are expecting too much? And bitter about not 
getting your way?


From rsweet at aoes.com  Sat Feb 26 02:35:37 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Sat, 26 Feb 2005 11:35:37 +0100 (CET)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
	<20050225183146.GA1563@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.61.0502261118400.27564@lapp-0>


Greg,

Thanks for the injection of some perspective.  It is clear that I chose my 
phrases poorly, and some alternate wording would have been more constructive 
;-)

On Fri, 25 Feb 2005, Greg Lindahl wrote:

> On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote:
>
>> While I can understand debating over the merits of nfsroot vs RAM-disk
>> root, I fail to see many useful arguments for maintaining a local OS
>> install.
>
> An example of something that goes very wrong with NFS is upgrading a
> file to a new file with the same name. If that file is a binary or
> library that's in use anywhere in the cluster, you are likely to have
> a problem. Local disks and Scyld, on the other hand, do the right
> thing: existing processes using the binary or library continue to use
> the old version, while new ones use the new version.

In practice I think the importance of this depends upon the particular 
requirements of the site.  Definitely there must be some places where it's 
important, and should be planned for.  However most sites I've seen accept 
(prefer) that new versions of application software be installed alongside the 
old, rather than in replacement of it, and for system software updates they 
are usually willing to accept stopping running jobs and starting them again 
(either with the queue system or without).
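
The local-disk behaviour Greg describes can be seen with a small experiment. The sketch below assumes the update is done by writing a new file and renaming it over the old name; demo_replace and the file names are hypothetical. On a local filesystem the rename swaps the directory entry while the old inode lives on until the last open descriptor is closed. Over NFS the server keeps no record of opens on other hosts, so a replacement made elsewhere can leave a client with a stale handle instead.

```shell
#!/bin/sh
# demo_replace DIR: replace a file by rename while a reader holds it open.
# Hypothetical demonstration; run it in a scratch directory.
demo_replace() {
    dir=$1
    echo "old version" > "$dir/lib"
    exec 3< "$dir/lib"             # stands in for a running job using the file
    echo "new version" > "$dir/lib.new"
    mv "$dir/lib.new" "$dir/lib"   # rename: the name now points at the new inode
    echo "running job sees: $(cat <&3)"
    echo "new job sees:     $(cat "$dir/lib")"
    exec 3<&-                      # the old inode is freed once this closes
}
```

Calling `demo_replace "$(mktemp -d)"` on a local disk shows the already-open descriptor still reading the old contents while a fresh open gets the new ones.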

> This disagreement is as old as the hills, by the way: in the good old
> days, when Sun was young, lots of people ran their pizza-box
> workstations diskless, but that went out of style when Ethernet's
> performance was stuck in place for a bunch of years.

Or in some places it kept right on going.

> It's important to understand arguments you disagree with; your
> dismissal is not a good sign.
>
> Yep. But your conclusion:
> Doesn't make any sense; I have seen people describe such systems where
> they download a disk image when a batch job wants a different software
> load. It's certainly doable that way: it does have different tradeoffs
> from the diskless case, but if it gives you a headache, it's probably
> because you don't like it, not because it's hard to do.

Yes, I agree.  I chose my words poorly in the first post, and came down rather 
hard against local installs.

In the end the choice is about balancing managing complexity with the 
requirements of the particular site.  I think that in a large percentage of 
use cases the admin will find that managing "diskless" (local disk for 
swap/scratch) systems is highly advantageous.

regards,

-Ryan

-- 
Ryan Sweet             <ryan.sweet at aoes.com>
Advanced Operations and Engineering Services
AOES Group BV            http://www.aoes.com
Phone +31(0)71 5795521  Fax +31(0)71572 1277


From ajt at rri.sari.ac.uk  Sat Feb 26 03:51:34 2005
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Sat, 26 Feb 2005 11:51:34 +0000
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <Pine.LNX.4.61.0502251834350.27564@lapp-0>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>	<1109319374.6055.17.camel@Vigor45>	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
Message-ID: <422062C6.2020603@rri.sari.ac.uk>

Ryan Sweet wrote:
> [...]
> I think the term "diskless" is sometimes the problem when discussing
>  centrally installed and managed systems.  Lots of "diskless" clusters
>  have GB and GB of local disks, only they are used for swap and temp
> I/O, not for the OS.

Hello, Ryan.

I agree with you: It is common for so-called 'diskless' nodes to have a 
local disk for /tmp and swap. In our case I have also made a symbolic 
link /var/tmp -> /tmp on each node as well. In fact, Sun used to call 
this a 'dataless' client (i.e. no permanent data stored on the client, 
only local swap and temporary files). In the end, Sun abandoned support 
for 'dataless' clients in favour of their NFS-based cacheFS.

The important thing about diskless/dataless/cacheFS clients is that they 
can easily be replaced with a new one if they go wrong without loss of 
permanent 'data'. Of course, the data associated with processes actually 
running on a node is lost if the node fails in use, but the new node is 
a plug-in replacement for the old one, and just needs to be rebooted. In 
our case, there is a little bit more to it than that because we have to 
add the new MAC address to the DHCP server for the fixed IP address used 
by the node, and partition/format the new local disk, but this is done by 
a script and takes about 2 minutes!
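
The DHCP half of such a replacement script might look like the sketch below. It assumes the server is ISC dhcpd, which the message does not say, and the function name, node name, MAC and IP address are all made up for illustration. It only prints a host declaration; appending it to the server configuration and restarting the daemon is left to the site's own script.

```shell
#!/bin/sh
# dhcp_host_entry NAME MAC IP: print an ISC dhcpd "host" declaration that
# pins IP to the given MAC address. Assumes ISC dhcpd syntax.
dhcp_host_entry() {
    printf 'host %s {\n' "$1"
    printf '    hardware ethernet %s;\n' "$2"
    printf '    fixed-address %s;\n' "$3"
    printf '}\n'
}

# Example with made-up values:
#   dhcp_host_entry node07 00:11:22:33:44:55 192.168.1.107
```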

It might be a lot less confusing if we talked about PXE booting dataless 
clients/nodes...

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


From reuti at staff.uni-marburg.de  Sat Feb 26 04:27:50 2005
From: reuti at staff.uni-marburg.de (Reuti)
Date: Sat, 26 Feb 2005 13:27:50 +0100
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
	<20050225183146.GA1563@greglaptop.internal.keyresearch.com>
	<1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
Message-ID: <1109420870.42206b4640844@home.staff.uni-marburg.de>

Quoting John Hearns <john.hearns at streamline-computing.com>:

> On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:
> 
> > 
> > Doesn't make any sense; I have seen people describe such systems where
> > they download a disk image when a batch job wants a different software
> > load. It's certainly doable that way: it does have different tradeoffs
> > from the diskless case, but if it gives you a headache, it's probably
> 
> I've always dreamed of using User Mode Linux images for this.
> In a Grid-based world, prepare a UML instance which has all the
> libraries and runtime to run your code. Ship it across the grid with
> your executable. 
> The cluster at the receiving end can be running any distribution - it
> runs your UML in a sandbox.

I would like to have it also: if any queuing system wants to kill a job on a 
node, it can just shut down the virtual machine. And you also get rid of any 
semaphores and shared memory segments (and message queues), which may be left 
behind in other cases. I saw leftover semaphores not only on Linux, but also 
on AIX and SuperUX in case of a job abort. Is there any safe way to release 
them after a job? I already had the idea of catching them with a library which 
wraps the shmget(), ... calls by using LD_PRELOAD to get the IDs, and then 
releasing them in an epilog after the jobs (it seems to work, but of course 
only for dynamically linked applications).
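
A cruder alternative to the LD_PRELOAD wrapper is an epilog that removes whatever semaphore arrays the job owner still holds. The sketch below is not the approach described above: it assumes the column layout printed by the Linux ipcs -s command (key, semid, owner, ...), which differs on AIX and SuperUX, and it is only safe when the user has no other job on the node. The function names are hypothetical.

```shell
#!/bin/sh
# sem_ids_of USER: read "ipcs -s" output on stdin and print the semid of
# every semaphore array owned by USER. Assumes the Linux column layout.
sem_ids_of() {
    awk -v u="$1" '$1 ~ /^0x/ && $3 == u { print $2 }'
}

# release_sems USER: remove every semaphore array USER still owns.
# Only safe if USER has no other job running on this node.
release_sems() {
    ipcs -s | sem_ids_of "$1" | while read -r id; do
        ipcrm -s "$id"
    done
}
```

Shared memory segments and message queues would need the same treatment with the corresponding ipcs and ipcrm options.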

Just got the hint to look at Meiosys. Seems they have such features in their 
virtual machines.

Cheers - Reuti


From hepu.deng at rmit.edu.au  Fri Feb 25 16:56:05 2005
From: hepu.deng at rmit.edu.au (Hepu Deng)
Date: Sat, 26 Feb 2005 11:56:05 +1100
Subject: [Beowulf] ICNC'05-FSKD'05 Final Call for Papers/Special
	Sessions/Sponsorship: Changsha China
Message-ID: <s22063ea.033@its-gw-inet57.its.rmit.edu.au>


----------------------------------------------------------------------
    2005 International Conference on Natural Computation (ICNC'05)
   International Conference on Fuzzy Systems and Knowledge Discovery 
                          (FSKD'05)
----------------------------------------------------------------------

                27 - 29 August 2005, Changsha, China

           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               Home Page: http://www.xtu.edu.cn/nc2005
              http://www.ntu.edu.sg/home/elpwang/nc2005
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

             *** Submission Deadline: 15 March 2005 ***

      FINAL CALL FOR PAPERS, SPECIAL SESSIONS, AND SPONSORSHIP

The ICNC'05-FSKD'05 will feature the most up-to-date research results 
in computational algorithms inspired from nature, including biological,
ecological, and physical systems. It is an exciting and emerging
interdisciplinary area in which a wide range of techniques and methods
are being studied for dealing with large, complex, and dynamic problems.
The joint conferences will also promote cross-fertilization over these
exciting and yet closely-related areas. Registration to either
conference will entitle a participant to the proceedings and technical
sessions of both conferences, as well as the conference banquet,
buffet lunches, and tours to some attractions in Changsha.

Specific areas include, but are not limited to neural computation, 
evolutionary computation, quantum computation, DNA computation, 
chemical computation, information processing in cells and tissues, 
molecular computation, computation with words, fuzzy computation, 
granular computation, artificial life, swarm intelligence, ants 
colony, artificial immune systems, etc., with applications to 
knowledge discovery, finance, operations research, and more.

Publications
------------
The ICNC'05 and FSKD'05 conference proceedings will be published in 
Springer's Lecture Notes in Computer Science (LNCS) and Lecture Notes 
in Artificial Intelligence (LNAI), respectively. Both the LNCS and 
LNAI are indexed in SCI-Expanded. A selected number of authors will be
invited to expand and revise their papers for possible inclusions in 
peer-reviewed international journals / edited books.

Special Sessions
----------------
In addition to regular sessions, participants are encouraged to 
organize special sessions on specialized topics. Each special session 
should have at least 4 papers. Special session organizers will solicit
submissions and conduct reviews on the submitted papers. Proposals for
special sessions should be sent to the respective Program Chairs, 
i.e., 
        Ke Chen (neural computation, Ke.Chen at manchester.ac.uk)
        Yew Soon Ong (other topics in ICNC'05, asysong at ntu.edu.sg)
        Yaochu Jin (FSKD'05, yaochu.jin at honda-ri.de) 

Keynote Speakers
----------------
        Shun-ichi Amari, Japan
        Aike Guo, China
        Nikhil R. Pal, India
        Xin Yao, UK

About Changsha, Hunan, China
----------------------------
Changsha, the capital of Hunan Province, is a historic and cultural 
city in southern China and a busy port on the Xiangjiang River, with a
population over 6 million. Founded 3000 years ago, the city became the
capital of the Zhou state (951-960 AD) and a leading commercial center
during the Song dynasty (960-1279 AD). Changsha International Airport 
is easily accessible with direct flights to all major domestic and 
some international destinations. Other famous tourist destinations in 
Hunan include the Zhangjiajie National Park (natural heritage listed 
by UN) and Fenghuang (Phoenix) Ancient City.

Important Dates
---------------
        Paper Submission                    :   15 March 2005
        Decision Notification               :   15 April 2005
        Final Versions / Author Registration:   15 May 2005

Contact
-------
        Email: nc2005 at xtu.edu.cn 
        Phone/Fax: +86 732 829 2201 / 829 3249 

Submission of Papers 
--------------------
Authors are invited to submit a full paper as an electronic file 
(postscript, pdf or Word format) at the conference website. Templates 
are available at both the conference website and the Springer 
website. 

Sponsorship / Exhibition
------------------------
The conferences will offer product vendors a sponsorship package 
and/or an opportunity to interact with conference participants. 
Product demonstration and exhibition can also be arranged. For more 
information, please visit the conference web page.  

Sponsor / Organizer
-------------------
        Xiangtan University, China

Technical Co-Sponsor
--------------------
        IEEE Circuits and Systems Society
        IEEE Computational Intelligence Society
        IEEE Control Systems Society

In Co-operation with
--------------------
        International Neural Network Society
        International Fuzzy Systems Association
        Chinese Association for Artificial Intelligence
        European Neural Network Society
        Fuzzy Mathematics and Systems Association of China
        Japanese Neural Network Society
        Asia-Pacific Neural Network Assembly

Honorary Conference Chairs
--------------------------
        Shun-ichi Amari, Japan
        Lotfi A. Zadeh, USA

International Advisory Board
----------------------------
        Toshio Fukuda, Japan  
        Kunihiko Fukushima, Japan
        Tom Gedeon, Australia
        Aike Guo, China
        Zhenya He, China  
        Janusz Kacprzyk, Poland
        Nik Kasabov, New Zealand
        John A. Keane, UK
        Soo-Young Lee, Korea
        Erkki Oja, Finland 
        Nikhil R. Pal, India
        Witold Pedrycz, Canada
        Jose Principe, USA
        Harold Szu, USA
        Shiro Usui, Japan
        Xindong Wu, USA
        Lei Xu, Hong Kong, China
        Xin Yao, UK
        Syozo Yasui, Japan
        Bo Zhang, China
        Yixin Zhong, China  
        Jacek M. Zurada, USA

General Chair
-------------
        He-An Luo, China

General Co-Chairs
-----------------
        Lipo Wang, Singapore
        Yunqing Huang, China

Program Chairs
--------------
        ICNC'05: 	
           Ke Chen, UK
           Yew Soon Ong, Singapore
        FSKD'05: 	
           Yaochu Jin, Germany

Local Arrangement Chairs
------------------------
        Renren Liu, China
        Xieping Gao, China

Proceedings Chair
-----------------
        Fen Xiao, China

Publicity Chair
---------------
        Hepu Deng, Australia

Sponsorship/Exhibits Chairs
---------------------------
        Shaoping Ling, China
        Geok See Ng, Singapore

Webmaster
---------
        Linai Kuang, China
        Yanyu Liu, China


From jake at spiekerfamily.com  Sun Feb 27 19:27:23 2005
From: jake at spiekerfamily.com (Jake Thebault-Spieker)
Date: Sun, 27 Feb 2005 22:27:23 -0500
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
Message-ID: <42228F9B.2040503@spiekerfamily.com>

A couple of questions.

1. Is it possible to use the HDs in the nodes on the cluster as one 
HD (kind of like an extended RAID array)?

2. Does anybody know of a program that will calculate pi, one digit at a 
time, infinitely that will run in parallel?

3. What is the difference between Mosix (www.mosix.org) and 
openMosix (www.openmosix.org)?

I'm in the process of reading "Engineering a Beowulf Style Computer 
Cluster" by Robert Brown. I like it a lot and it contains lots of 
information. Thanks Mr. Brown! ;-)

-- 
I think computer viruses should count as life. 
I think it says something about human nature 
that the only form of life we have created so far is purely destructive. 
We've created life in our own image. 
--Stephen Hawking

Jake Thebault-Spieker


From rgb at phy.duke.edu  Sun Feb 27 23:10:38 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 28 Feb 2005 02:10:38 -0500 (EST)
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109420870.42206b4640844@home.staff.uni-marburg.de>
References: <Pine.GSO.4.56.0502221609400.24176@ligo.mit.edu>
	<1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
	<Pine.LNX.4.61.0502251834350.27564@lapp-0>
	<20050225183146.GA1563@greglaptop.internal.keyresearch.com>
	<1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
	<1109420870.42206b4640844@home.staff.uni-marburg.de>
Message-ID: <Pine.LNX.4.58.0502280203210.12512@lilith.rgb.private.net>

On Sat, 26 Feb 2005, Reuti wrote:

> Quoting John Hearns <john.hearns at streamline-computing.com>:
> 
> > On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:
> > 
> > > 
> > > Doesn't make any sense; I have seen people describe such systems where
> > > they download a disk image when a batch job wants a different software
> > > load. It's certainly doable that way: it does have different tradeoffs
> > > from the diskless case, but if it gives you a headache, it's probably
> > 
> > I've always dreamed of using User Mode Linux images for this.
> > In a Grid-based world, prepare a UML instance which has all the
> > libraries and runtime to run your code. Ship it across the grid with
> > your executable. 
> > The cluster at the receiving end can be running any distribution - it
> > runs your UML in a sandbox.
> 
> I would like to have it also: if any queuing system wants to kill a job on a 
> node: just shutdown the virtual machine. And you also get rid of any semaphores 
> and shared memory segments (and message queues), which may be left behind in 
> other cases. I saw leftover semaphores not only on Linux, but also on AIX and 
> SuperUX in case of a job abort. Is there any safe way to release them after a 
> job? I already had the idea to catch them with a library which wraps the 
> shmget(),.. calls by using LD_PRELOAD to get the IDs, and then release them in 
> an epilog after the jobs (seems to work, but of course only for dynamically 
> linked applications).
> 
> Just got the hint to look at Meiosys. Seems they have such features in their 
> virtual machines.

Another place to look for stuff not unlike this is the COD project at
Duke.  Except that with COD the "sandbox" is the whole computer.  If
your application needs a specific operating system or resource
collection, you just prepare an appropriate image and boot the cluster
(diskless or not) into that image long enough to run the application,
then boot it back into something else.

Clumsy as this sounds (and obviously overkill for certain classes of
things) it has some significant advantages to consider.  In addition to
very definitely having all the right libraries and resources there is
security -- not to worry, the images you load contain YOUR account
information and authentication information, so when you reboot into
something else later, the entire system is taken down.  Booting up a
cluster into a new image can take as little as a few minutes, which is
no big deal if the task will run for days.  It eliminates the need for
significant virtualization or something like vmware.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Sun Feb 27 23:28:51 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 28 Feb 2005 02:28:51 -0500 (EST)
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
In-Reply-To: <42228F9B.2040503@spiekerfamily.com>
References: <42228F9B.2040503@spiekerfamily.com>
Message-ID: <Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>

On Sun, 27 Feb 2005, Jake Thebault-Spieker wrote:

> A couple of questions.
> 
> 1. Is it possible to use the HD in the nodes on the cluster as one 
> HD(kind of like an extended RAID array)?

Yes.  Google for PVFS.

> 2. Does anybody know of a program that will calculate pi, one digit at a 
> time, infinitely that will run in parallel?

I don't know about one that will compute an infinite number of digits in
PI, but the computation of PI via the arctan series is trivially
partitionable in a variety of ways.  You'll spend more time working to
sum and align the digits you get (as they obviously will have to be
obtained and manipulated piecewise as strings) than you will doing the
computation per se.  It actually sounds like a decent exercise, as the
carry from small digits may have to propagate iteratively back to larger
ones as you extend the computation farther and farther.
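As a minimal sketch of the partitioning described above, assuming Machin's
formula (pi = 16 arctan(1/5) - 4 arctan(1/239)) as the arctan series and
fixed-point integers in place of digit strings: the terms of each series are
dealt out round-robin to a number of "workers", which here are only simulated
in a single process. The function names and the guard-digit count are
illustrative choices, not taken from any existing program.

```python
def arctan_inv_partial(x, scale, start, step):
    """Sum the terms k = start, start+step, ... of scale * arctan(1/x).

    Each worker owns every `step`-th term of the alternating series
    arctan(1/x) = sum_k (-1)^k / ((2k+1) * x^(2k+1)).
    """
    total = 0
    k = start
    while True:
        term = scale // ((2 * k + 1) * x ** (2 * k + 1))
        if term == 0:          # remaining terms are below the fixed-point resolution
            break
        total += -term if k % 2 else term
        k += step
    return total


def pi_digits(digits, workers=4):
    """Return '3' followed by `digits` decimals of pi, as a string."""
    guard = 10                               # extra digits to absorb truncation error
    scale = 10 ** (digits + guard)

    def arctan_inv(x):
        # In a real cluster each partial sum would be computed on its own node
        # and the big integers reduced (summed) at the end.
        return sum(arctan_inv_partial(x, scale, w, workers) for w in range(workers))

    pi_scaled = 16 * arctan_inv(5) - 4 * arctan_inv(239)
    return str(pi_scaled // 10 ** guard)


if __name__ == "__main__":
    print(pi_digits(30))
```

Python's arbitrary-precision integers do the carry propagation here, which
hides exactly the alignment work mentioned above; a version built on
fixed-width pieces would have to handle those carries itself.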

However, it ALSO sounds like one of those problems where parallelization
may not do too well trying to beat a well-written serial version.

Also, IIRC there are example programs for computing pi in parallel in
lam and mpich, but I don't think they are geared for returning all the
digits as a digit string.

You might look at the following:

  http://www.mathpages.com/home/kmath373.htm

or

  http://aemes.mae.ufl.edu/~uhk/PI.html

for some of many online articles on pi and its computation.  Google is
your friend.

> 
> 3. What is the difference between Mosix(www.mosix.org) and 
> openMosix(www.openmosix.org)?

I don't know, don't use Mosix.  But somebody on list probably does.

> I'm in the process of reading "Engineering a Beowulf Style Computer 
> Cluster" by Robert Brown. I like it a lot and it contains lots of 
> information. Thanks Mr. Brown! ;-)

You're welcome!

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From joachim at ccrl-nece.de  Mon Feb 28 01:46:08 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Mon, 28 Feb 2005 10:46:08 +0100
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book
	Review
In-Reply-To: <Pine.LNX.4.61.0502251807340.27564@lapp-0>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
Message-ID: <4222E860.2060508@ccrl-nece.de>

Ryan Sweet wrote:
> Here's what I propose: let's make a "Beowulf.org Guide to Linux 
> Clustering", or whatever the heck else you want to call it.   Let us 
> outline, review, improve, and comment on it here on this mailing list.  
> Here's the hard part - let's also set a deadline, with realistic goals, 
> and try to stick to it.  Let's assign any publishing rights or other 
> "details" like that to the FSF or Linux Documentation Project.
[...]
> Second, please take this opportunity to tell me why this is a bad idea, 
> and while you're at it send your comments on the questions above.

I'd say a wiki would be an easier start as it is self-organizing. 
Depending on how it develops, you can still turn it into a book.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de


From eugen at leitl.org  Mon Feb 28 02:01:38 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Mon, 28 Feb 2005 11:01:38 +0100
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book
	Review
In-Reply-To: <4222E860.2060508@ccrl-nece.de>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
	<4222E860.2060508@ccrl-nece.de>
Message-ID: <20050228100138.GM1404@leitl.org>

On Mon, Feb 28, 2005 at 10:46:08AM +0100, Joachim Worringen wrote:

> I'd say a wiki would be an easier start as it is self-organizing. 
> Depending on how it develops, you can still turn it into a book.

Yes, my vote goes to a Wiki, too. 

(I could host it, if necessary).

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From landman at scalableinformatics.com  Mon Feb 28 04:58:49 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 28 Feb 2005 07:58:49 -0500
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
In-Reply-To: <Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
References: <42228F9B.2040503@spiekerfamily.com>
	<Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
Message-ID: <42231589.4080706@scalableinformatics.com>


>>2. Does anybody know of a program that will calculate pi, one digit at a 
>>time, infinitely that will run in parallel?
> 
> 
> I don't know about one that will compute an infinite number of digits in
> PI, but the computation of PI via the arctan series is trivially
> partitionable in a variety of ways.  You'll spend more time working to
> sum and align the digits you get (as they obviously will have to be
> obtained and manipulated piecewise as strings) than you will doing the
> computation per se.  It actually sounds like a decent exercise, as the
> carry from small digits may have to propagate iteratively back to larger
> ones as you extend the computation farther and farther.
> 


http://mathworld.wolfram.com/PiDigits.html
http://mathworld.wolfram.com/PiFormulas.html
http://www.andrews.edu/~calkins/physics/Miracle.pdf

and others.

It is possible to calculate the digits individually using the Bailey et 
al algorithm.
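A minimal sketch of that digit extraction, assuming the
Bailey-Borwein-Plouffe formula is what is meant. Note that it yields
hexadecimal digits of pi, not decimal ones, and that this double-precision
version is only trustworthy for modest positions. The function names are
illustrative.

```python
def bbp_series(j, n):
    """Fractional part of sum over k of 16^(n-k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: keep only the fractional part via modular exponentiation.
    for k in range(n + 1):
        denom = 8 * k + j
        s = (s + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n: a rapidly shrinking tail, summed directly.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0


def pi_hex_digit(n):
    """Hex digit of pi at position n + 1 after the point (n = 0 is the first)."""
    x = (4 * bbp_series(1, n) - 2 * bbp_series(4, n)
         - bbp_series(5, n) - bbp_series(6, n)) % 1.0
    return "%x" % int(x * 16)


if __name__ == "__main__":
    # pi = 3.243f6a88... in hexadecimal
    print("".join(pi_hex_digit(n) for n in range(8)))
```

Because each position is computed independently of the ones before it,
different nodes could simply be handed different ranges of n.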

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


From eugen at leitl.org  Mon Feb 28 09:24:57 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Mon, 28 Feb 2005 18:24:57 +0100
Subject: [Beowulf] [Lustre-announce] CLUSTER FILE SYSTEMS,
	INC. RELEASES LUSTRE VERSION 1.2.4 TO THE GENERAL PUBLIC
	(fwd from jeff@clusterfs.com)
Message-ID: <20050228172456.GE1404@leitl.org>

----- Forwarded message from Jeffrey Denworth <jeff at clusterfs.com> -----

From: Jeffrey Denworth <jeff at clusterfs.com>
Date: Mon, 28 Feb 2005 10:19:22 -0500
To: lustre-announce at lists.clusterfs.com,
	lustre-discuss at lists.clusterfs.com
Subject: [Lustre-announce] CLUSTER FILE SYSTEMS, INC. RELEASES LUSTRE VERSION 1.2.4 TO THE GENERAL PUBLIC
X-Mailer: Apple Mail (2.619.2)

Boston, MA - February 28, 2005 - Cluster File Systems, Inc., the leader 
in high-performance parallel file systems, today released a major 
update to the Lustre file system on its public download site. Used on 
many of the world's largest Linux clusters, Lustre 1.2.4 represents a 
significant advancement in the development of a world-class open-source 
file system. This release further demonstrates Cluster File Systems' 
ongoing commitment to make new versions of Lustre available to the free 
software community on a regular basis.

The Lustre file system is a next-generation cluster storage solution, 
designed to serve clusters with up to tens of thousands of nodes, 
manage petabytes of storage, and move hundreds of gigabytes per second 
with state of the art security and management infrastructure.

Improvements in Lustre 1.2.4 over the previous publicly available 
version include:
- support for Intel Itanium, Intel EM64T, and AMD64 architectures
- support for Linux 2.6
- support for the Quadrics QsNet II (Elan 4) interconnect
- a disaster recovery tool (lfsck)
- support for Object Storage Server addition
- zero-configuration Lustre clients
- very many improvements to performance and stability
- demonstrated capability on systems with more than 1,200 nodes and 300 
TB of storage, at more than 11 GB/s

Downloading Lustre
To download Lustre 1.2.4, please visit 
http://www.clusterfs.com/download.html

About Cluster File Systems

Founded in 2001, Cluster File Systems has established itself as the 
recognized leader in high-performance, scalable cluster file system 
technology. The company's premier Lustre cluster file system has 
demonstrated acceptance and capability on the world's fastest cluster 
supercomputers. Through partnerships with leading HPC storage, server, 
and software vendors, Cluster File Systems is helping cluster customers 
worldwide realize the benefits of scalable, reliable storage with 
Lustre. The company is headquartered in Boston, Massachusetts, with 
operations in North America, Europe, and Asia.

Lustre, the Lustre logo, Cluster File Systems, and CFS are trademarks 
of Cluster File Systems, Inc in the United States. All other trademarks 
are the property of their respective holders.
_______________________________________________
Lustre-announce mailing list
Lustre-announce at lists.clusterfs.com
https://lists.clusterfs.com/mailman/listinfo/lustre-announce

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From mike at etek.chalmers.se  Mon Feb 28 00:32:48 2005
From: mike at etek.chalmers.se (Mikael Fredriksson)
Date: Mon, 28 Feb 2005 09:32:48 +0100
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
In-Reply-To: <Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
References: <42228F9B.2040503@spiekerfamily.com>
	<Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
Message-ID: <4222D730.2090702@etek.chalmers.se>

Robert G. Brown wrote:
>>3. What is the difference between Mosix(www.mosix.org) and 
>>openMosix(www.openmosix.org)?
> 
> 
> I don't know, don't use Mosix.  But somebody on list probably does.

Mosix is proprietary software; OpenMosix is open-source software.  OpenMosix 
has its roots in the Mosix project.


MF


From rsweet at aoes.com  Mon Feb 28 02:01:29 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Mon, 28 Feb 2005 11:01:29 +0100 (CET)
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
In-Reply-To: <Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
References: <42228F9B.2040503@spiekerfamily.com>
	<Pine.LNX.4.58.0502280211380.12512@lilith.rgb.private.net>
Message-ID: <Pine.LNX.4.61.0502281046240.27564@lapp-0>

>>
>> 3. What is the difference between Mosix(www.mosix.org) and
>> openMosix(www.openmosix.org)?
>
> I don't know, don't use Mosix.  But somebody on list probably does.

MOSIX started in the '70s, on PDP11.  It was ported to linux in the '90s.  It 
is a system for allowing a (pseudo)cluster-wide process id space and 
cluster-wide load balancing via process migration.  OpenMOSIX is a 
fork/re-write of MOSIX that was started a few years ago due to disagreements 
about the MOSIX license (which is not Open Source).  OpenMOSIX is released 
under the GPL, and has a much larger developer and user community than MOSIX.

If you are interested in single-system-image clustering maybe also check out 
bproc http://sourceforge.net/projects/bproc (note - see scyld for a complete 
bproc solution)
or 
OpenSSI http://www.openssi.org
and
(definitely more experimental than the above, but promising) Kerrighed 
http://www.kerrighed.org/

SSI clusters are certainly interesting, but if you are new to clustering 
you may want to get your feet wet with a more traditional model first, so that 
you have a good reference point when reviewing and weighing options for SSI.

regards,
-Ryan

-- 
Ryan Sweet             <ryan.sweet at aoes.com>
Advanced Operations and Engineering Services
AOES Group BV            http://www.aoes.com
Phone +31(0)71 5795521  Fax +31(0)71572 1277


From eno at dorsai.org  Mon Feb 28 03:42:05 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Mon, 28 Feb 2005 06:42:05 -0500
Subject: [Beowulf] Thermal Kill-Switch
Message-ID: <0ICM0051KDO1XP@mta9.srv.hcvlny.cv.net>

Hello all. Can someone please give me some pointers to a vendor of some kind
of Thermal Kill-Switch? If I'm not mistaken, it's an inline AC power device
that powers-off if the room reaches a certain temperature. I checked eBay,
Google, and Home Depot. No dice. Thanks for the help.

PS: I'm not really looking to start a DIY project. I am hoping someone can
point me to a low-cost off-the-shelf device.

And just in case this fits anyone's needs, I found this in the Beowulf.org
archives... http://www.apcc.com/products/family/index.cfm?id=47 I think it
might be too pricey for my budget.

Alpay


From rsweet at aoes.com  Mon Feb 28 04:24:19 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Mon, 28 Feb 2005 13:24:19 +0100 (CET)
Subject: [Beowulf] So we will write our own book - next steps...
In-Reply-To: <20050228100138.GM1404@leitl.org>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
	<4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org>
Message-ID: <Pine.LNX.4.61.0502281132420.27564@lapp-0>


On Mon, 28 Feb 2005, Eugen Leitl wrote:

> On Mon, Feb 28, 2005 at 10:46:08AM +0100, Joachim Worringen wrote:
>
>> I'd say a wiki would be an easier start as it is self-organizing.
>> Depending on how it develops, you can still turn it into a book.
>
> Yes, my vote goes to a Wiki, too.
>
> (I could host it, if necessary).

I'm willing to try this as it may be a good way to bootstrap an effort, but I 
see a few problems with it, which may be real problems or may be imagined ones 
(hopefully this doesn't wander too far off-topic - jab me with a stick if it 
does):

* I think it would be good to target the Linux Documentation Project, which 
uses DocBook (http://www.tldp.org/LDP/LDP-Author-Guide/html/docbook-why.html). 
DocBook has a lot of advantages for this sort of thing.  If a wiki were used 
to organise the content, what would the actual data look like?  Raw text in a 
database, or XML in a database, would be preferable for later conversion to 
DocBook.  A wiki that used DocBook articles as a backend would be great. 
Google turns up a dead project on freshmeat.  What I think would be bad is a 
wiki database containing 500 paragraphs of HTML, with different styles (if 
any), inconsistent tags, and so on.

* most wikis seem to make it difficult to generate a printed copy or pdf 
version of the whole document - similarly, is it possible to make entire wikis 
available as a download for offline reading?

* I've seen far more badly structured or confusing wikis than good ones.  The 
ones that I have seen that are good are much closer in form to a FAQ or a 
HOWTO, using the wiki more for collaborative editing than for organisational 
structure.  Maybe all this implies is a stronger editorial presence, I don't 
know.

* Drupal's collaborative book feature looks like it may be an interesting 
middle road: http://drupal.org/node/284 though maybe it would have the same 
problems.

Also re: 
>> Robert Brown has already done a lot of work on such a book, and generously
>> made it freely available.  Maybe he is amenable to this being a starting
>> point?
>
> Sure.  I periodically solicit help for such a project on the list --
> this is the first time somebody has solicited me:-)
>
> My experience is that it is really pretty difficult to get people to 
> actually contribute content.  However, I've already got a very decent
> start going, I think, and as always if anybody wants to contribute
> content (under the OPL it is published under) I will cheerily include
> it, with attribution.

Well, we've even received a testimonial just yesterday about how helpful it 
has been.  I do recall that the book has a rather personal style to it though 
(an asset that makes the publication less dull), which may (or may not?) make 
it seem awkward as the basis for a larger collaborative effort.

> Based on Glenn's comments, I was actually feeling (once again) like I
> ought to try to shake free enough time to do another full pass through
> the content to bring it up to date and see if I can finish off some of
> the missing chapters and -- possibly -- seek a paper publisher.  I want
> to keep it online/free either way (and there are publishers out there
> that are comfortable with this) but a lot of people want to own a paper
> copy of stuff like this.  I get a lot of requests for a printable PDF
> from people all over who found the html with google but missed the
> online pdf images right next door...

Which is another reason that maybe a wiki would not entirely serve (see 
above).

OK, this may be a loaded question - for Robert, how do you feel about people 
contributing to your book vs starting a new, collaborative effort which draws 
upon the strengths of what you have already done?

For others, how would you feel about contributing to Robert's book using his 
Latex template, vs starting a new collaborative effort?

Each approach has advantages, though if, as was mentioned, it's been difficult 
to get wider contributions for the book as it is, then maybe a more overtly 
collaborative approach would help.

BTW - If Mr. Joseph Sloan is around, I hope you aren't taking offense (er.. I 
guess I would have taken offense at the tone of Glen's review - but hopefully 
you appreciate brutal honesty;-)  - I just think there is a need for a book 
that draws upon all of the knowledge which is dispensed on this list and 
presents it in a balanced, well-considered and thorough fashion, offering 
something for beginners and advanced users alike.

In either case I think I can devote a certain amount of dedicated time to the 
effort, as it overlaps with my need to develop some training material, because 
it has been very difficult to find good HPC consultants lately.

Incidentally, with reference to the above, we're hiring:
http://www.aoes.com/en/jobs/vn0416.html
send me your CV if you are both able and interested.

-- 
Ryan Sweet             <ryan.sweet at aoes.com>
Advanced Operations and Engineering Services
AOES Group BV            http://www.aoes.com
Phone +31(0)71 5795521  Fax +31(0)71572 1277


From makcaym at gmail.com  Mon Feb 28 04:28:37 2005
From: makcaym at gmail.com (m a)
Date: Mon, 28 Feb 2005 14:28:37 +0200
Subject: [Beowulf] Beowulf cluster usage statistics
Message-ID: <d946647a050228042823cf1ca1@mail.gmail.com>

Hello,

I would like to install a new Beowulf cluster.
Is there any statistical info about which cluster software
distributions are in use around the world?
Is there any info about Beowulf cluster specs?
Where would you suggest newcomers start?
Thanks,
makcaym


From john.hearns at streamline-computing.com  Mon Feb 28 15:57:30 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Mon, 28 Feb 2005 23:57:30 +0000
Subject: [Beowulf] So we will write our own book - next steps...
In-Reply-To: <Pine.LNX.4.61.0502281132420.27564@lapp-0>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
	<4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org>
	<Pine.LNX.4.61.0502281132420.27564@lapp-0>
Message-ID: <1109635050.6244.17.camel@Vigor11>

On Mon, 2005-02-28 at 13:24 +0100, Ryan Sweet wrote:
> 

> 
> * Drupal's collaborative book feature looks like 
> maybe an interesting middle-road: http://drupal.org/node/284 though maybe it 
> would have the same problems.
> 
A first look at this Drupal thing looks good.

It would be nice to get together an overview of cluster interconnects.
Why they are important, what the various choices are, and what the
strengths and weaknesses are.
As Glen originally commented, this is a very poor part of the O'Reilly
book by Sloan.
On this list, we have many expert people from industry, working for
the companies which (for example) develop and support interconnects.


From rgb at phy.duke.edu  Mon Feb 28 17:28:12 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 28 Feb 2005 20:28:12 -0500 (EST)
Subject: [Beowulf] So we will write our own book - next steps...
In-Reply-To: <Pine.LNX.4.61.0502281132420.27564@lapp-0>
References: <e2be28c62386b9c35b2c364fd5739134@linuxprophet.com>
	<Pine.LNX.4.61.0502251807340.27564@lapp-0>
	<4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org>
	<Pine.LNX.4.61.0502281132420.27564@lapp-0>
Message-ID: <Pine.LNX.4.58.0502281840160.3885@lilith.rgb.private.net>

On Mon, 28 Feb 2005, Ryan Sweet wrote:

> > Based on Glenn's comments, I was actually feeling (once again) like I
> > ought to try to shake free enough time to do another full pass through
> > the content to bring it up to date and see if I can finish off some of
> > the missing chapters and -- possibly -- seek a paper publisher.  I want
> > to keep it online/free either way (and there are publishers out there
> > that are comfortable with this) but a lot of people want to own a paper
> > copy of stuff like this.  I get a lot of requests for a printable PDF
> > from people all over who found the html with google but missed the
> > online pdf images right next door...
> 
> Which is another reason that maybe a wiki would not entirely serve (see 
> above).
> 
> OK, this may be a loaded question - for Robert, how do you feel about people 
> contributing to your book vs starting a new, collaborative effort which draws 
> upon the strengths of what you have already done?

I'm perfectly happy either way, as long as no further demands are put on
my time (which is under a tension of a dozen kilonewtons or so:-).

I put an OPL on the document for a reason -- as long as you don't
publish an effort that contains a whole lot of my writing for money and
not give me any (or get permission to do so ahead of time), you're
welcome to steal, reuse, borrow, adapt, or otherwise mutilate my efforts
in anything you put together with some measure of attribution and the
viral copyleft thing in force.

As I also said, I welcome contributions -- if anybody wants to
contribute chapters that would be great, and I'll even leave your name
at the top of your chapters.  One concept that I had for the book some
time ago that isn't really implemented is to make it a kind of revolving
"journal of cluster computing".  I've written what might be viewed as a
core/intro to cluster computing, with fairly detailed sections on at
least some of the important stuff.

What it NEEDS is somebody who is a Myrinet expert to write a chapter or
article on "Using Myrinet in a Compute Cluster" -- stuff on getting it,
installing it, plugging in the hardware drivers and so forth so that
e.g. MPI can run on top of it, some example programs (toy code or real
applications) that run on 100 BT TCP/IP and Myrinet side by side for
timing and parallel speedup comparisons.

Ditto for SCI.  Ditto for Fibre Channel, InfiniBand, Gigabit Ethernet.
An article on diskless clustering.  An article on installing and using
warewulf on top of e.g. RHEL, FC2 or FC3, CentOS, Caosity.  An article
on SGE.  All by people who actually use all of the tools in their daily
work.

This is where I, or any possible author, come up short.  I know
something about all of the above, but I don't have direct experience
with all of it and don't know a lot of people that do.  Greg, probably,
and Don.  People in the business side of building clusters so that they
end up hands on with lots of hardware configurations.  A few people in
the REALLY big cluster compute centers.

So what we NEED is some of the real experts on the list to write expert
level but user-friendly contributions.  If these were done as "articles"
rather than chapters per se, it would also address the problem any such
documentation has with information getting "stale" quickly.  Without
updates every year, a lot of the technical stuff has such a short
half-life that any cluster book quickly becomes nearly useless beyond
the intro level.  I haven't done a major catchup on my book for a couple
of whole years, and it is already woefully behind.

Alas, my experience with co-authors so far hasn't been too positive. I
think no fewer than five or six people have offered to do everything
from write half the book with me as a full co-author to contribute a
chapter here or there, and I have yet to see a single line of actual
contributed text.  Hence my cynicism -- we are busy, we are all busy.
Writing is a LOT of work (I promise -- it is one of the things I "do").
Most folks don't realize how hard until they have to write a twenty or
thirty page chapter (maybe with references and figures) that needs to
pass some sort of review and that other people will read and everything.
Twenty or thirty hours later...

Well, apparently they quit before they get to the 20-30 hour mark.

So I will watch with great interest as you try to get something
together, and will applaud your energy and determination if you succeed.
A wiki/blog sort of thing actually isn't such a terrible idea if you can
push it to the critical point where enough people participate and
contribute.  Sort of an online freeform journal.  But then, this list is
(if and as google succeeds in getting to the online archives) already a
pretty hellacious resource in that regard.

> For others, how would you feel about contributing to Robert's book using his 
> Latex template, vs starting a new collaborative effort?
> 
> Each approach has advantages, though if, as was mentioned, it's been difficult 
> to get wider contributions for the book as it is, then maybe a more overtly 
> collaborative approach would help.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jake at spiekerfamily.com  Mon Feb 28 18:11:55 2005
From: jake at spiekerfamily.com (Jake Thebault-Spieker)
Date: Mon, 28 Feb 2005 21:11:55 -0500
Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix
Message-ID: <4223CF6B.5070805@spiekerfamily.com>

Ok, thanks for all the replies.

Does anybody know of a use for a cluster? I know I'm going to build one, 
and I have no programming experience. My thought was that I would just 
calculate Pi out infinitely, but that doesn't look like it will work. 
The nodes will be Cyrix x86 processors w/ clock speeds of 133 MHz. They 
each have about 3 GB of HD space. Thoughts on something that I can 
calculate and log in a MySQL database?

-- 
I think computer viruses should count as life. 
I think it says something about human nature 
that the only form of life we have created so far is purely destructive. 
We've created life in our own image. 
--Stephen Hawking

Jake Thebault-Spieker


From maillists at gauckler.ch  Mon Feb 28 22:44:14 2005
From: maillists at gauckler.ch (Michael Gauckler)
Date: Tue, 01 Mar 2005 07:44:14 +0100
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
Message-ID: <1109659454.6544.2.camel@localhost.localdomain>

Dear List, 

I would like to gather the data from several processes. 
Instead of the commonly used stride, I want to interleave 
the data:

Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
Rank 1: BBBBB ----^---^---^---^---^
Rank 2: CCCCC -----^---^---^---^---^
Rank 3: DDDDD ------^---^---^---^---^

Since the stride of the receive type is indicated 
in multiples of its mpi_type, no interleaving is 
possible (the smallest striping factor leads to 
AAAAABBBBBCCCCCDDDDD).

Is there a way to achieve this behaviour in an 
elegant way, as MPI_Gather promises it? Or do
I need to do Send/Recv with self-aligned offsets?
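
To pin down the layout being asked for, here is a minimal sketch in plain
Python (no MPI involved; the function name is made up for illustration):
element j of rank r lands at offset j * nranks + r in the result. Within MPI
itself, one commonly suggested route, not verified here, is a strided receive
datatype built with MPI_Type_vector whose extent is shrunk to a single element
with MPI_Type_create_resized, so that consecutive ranks' blocks start one
element apart.

```python
def interleave(chunks):
    """Interleave equal-length per-rank chunks element by element.

    chunks[r] stands for the data contributed by rank r; element j of
    rank r is written to offset j * nranks + r of the result.
    """
    nranks = len(chunks)
    count = len(chunks[0])
    if any(len(chunk) != count for chunk in chunks):
        raise ValueError("all ranks must contribute the same number of elements")
    out = [None] * (nranks * count)
    for r, chunk in enumerate(chunks):
        for j, item in enumerate(chunk):
            out[j * nranks + r] = item
    return out


if __name__ == "__main__":
    print("".join(interleave(["AAAAA", "BBBBB", "CCCCC", "DDDDD"])))
```

The same offset rule is what hand-written Send/Recv code with "self-aligned
offsets" would have to implement on the receiving side.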

Thank you for your help!

 Michael