From ispmarin at gmail.com Mon Oct 1 05:34:46 2007
From: ispmarin at gmail.com (Ivan Paganini)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <200709301202.01236.csamuel@vpac.org>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org>
Message-ID: <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com>

Hello Chris, everybody:

I am not using jumbo frames. I'm now considering that option, but first I want to be sure that there is no other problem, to keep the number of variables under control. Thanks for your help.

I ran strace on the hung process, and the output is this:
______________________________________________
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
read(4, "#\n# hosts This file desc"..., 4096) = 4096
read(4, "yriBlade077\n192.168.30.178 myri"..., 4096) = 4096
read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
read(4, "", 4096) = 0
close(4) = 0
munmap(0x40176000, 4096) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25994
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25995
brk(0x102ab000) = 0x102ab000
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25996
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
--- SIGWINCH (Window changed) @ 0 (0) ---
waitpid(-1,
______________________________________________
and just that. I'm now trying to get a better understanding of what is happening.

Thank you.

Ivan

2007/9/29, Chris Samuel :
> On Sat, 29 Sep 2007, Ivan Paganini wrote:
>
> > I sniffed the network in the store nodes interface, and I got lots
> > of TCP lost fragments, previous lost fragments, ack lost fragments
> > and TCP window size full.
>
> Some suggestions would be to check that all network interfaces are
> negotiating gigabit back to the switch, and that if you are using
> jumbo frames then all interfaces are indeed using jumbo frames.
>
> A useful check to verify 2 way jumbo frames connectivity is by using
> the ping command, doing:
>
> ping -c 1 -M do -s 8900 $hostname
>
> should tell you whether or not it is working.
>
> Best of luck!
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------

From ispmarin at gmail.com Mon Oct 1 06:05:43 2007
From: ispmarin at gmail.com (Ivan Paganini)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org> <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com>
Message-ID: <751c63ee0710010605g2a26bce8udbd955741b42650e@mail.gmail.com>

Just an update: over several tries, the strace stops at different points: the one specified in the other email, and here:
_______________________________________________
munmap(0x40176000, 4096) = 0
time([1191243868]) = 1191243868
open("/etc/hosts", O_RDONLY) = 4
fcntl64(4, F_GETFD) = 0
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=10247, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
read(4, "#\n# hosts This file desc"..., 4096) = 4096
read(4, "yriBlade077\n192.168.30.178 myri"..., 4096) = 4096
read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
read(4, "", 4096) = 0
close(4) = 0
munmap(0x40176000, 4096) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31382
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 
31383
brk(0x102ab000) = 0x102ab000
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31384
waitpid(-1,
_______________________________________________

Thanks.

--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------

From patrick at myri.com Mon Oct 1 06:35:38 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com>
Message-ID: <4700F7AA.9020407@myri.com>

Hi Ivan,

Ivan Paganini wrote:
> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,

Unless you are using blocking receives ("--mx-recv blocking" or "--mx-recv hybrid"), the default mode is polling. So, a process will only sleep if it is still in the spawning phase (in MPI_Init) or if it's blocking on something outside MPI (like disk IO).

> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) process go to sleep or never
> starts, and all the other processes just hangs. The mx diagnose tools

All processes wait on everybody at spawn time, so if one process never starts, the rest of the MPI world will wait for it, possibly forever. The root problem is the process not starting.

The spawning phase in MPICH-MX uses socket and ssh (or rsh).
Usually, ssh uses native Ethernet, but it could also use IPoM (Ethernet over Myrinet). Which case is it for you?

Patrick

From patrick at myri.com Mon Oct 1 06:38:34 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To:
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com>
Message-ID: <4700F85A.2030300@myri.com>

Mark Hahn wrote:
> here's an idea: configure ip-over-myrinet, and use it exclusively
> to start the jobs. if that works, then you know for sure that the
> problem is solely on the eth side (switch, perhaps, or maybe a nic
> that's jabbering or otherwise misbehaving?)

Ivan may have to stage the binary on local disk prior to spawning, so as not to rely on GPFS over Ethernet to serve it. Or even run GPFS over IPoM too.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From patrick at myri.com Mon Oct 1 06:42:42 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org> <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com>
Message-ID: <4700F952.10307@myri.com>

Ivan,

Ivan Paganini wrote:
> I did a strace on the hanged process, and the output is this:

Use "strace -f" to trace the children as well. Could you also send the output of mpirun.ch_mx -v, to see whether the process starts, sends some info to the mpirun perl script and hangs later, or never really starts?

What is your Myricom Tech Support ticket number?

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From hahn at mcmaster.ca Mon Oct 1 06:44:29 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <751c63ee0710010605g2a26bce8udbd955741b42650e@mail.gmail.com>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org> <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com> <751c63ee0710010605g2a26bce8udbd955741b42650e@mail.gmail.com>
Message-ID:

> clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x40046f68) = 31384
> waitpid(-1,

this looks like a fork/exec that's failing. as you might expect if, for instance, your shared FS doesn't supply a binary successfully. note also that ltrace -S often provides somewhat more intelligible diags for this kind of thing (since it might show what's actually being exec'ed.)

From ispmarin at gmail.com Mon Oct 1 07:06:45 2007
From: ispmarin at gmail.com (Ivan Paganini)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To:
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org> <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com> <751c63ee0710010605g2a26bce8udbd955741b42650e@mail.gmail.com>
Message-ID: <751c63ee0710010706x3444feeem77e560c82df05d07@mail.gmail.com>

Hello Mark, Patrick,

>>The spawning phase in MPICH-MX uses socket and ssh (or rsh). Usually,
>>ssh uses native Ethernet, but it could also use IPoM (Ethernet over
>>Myrinet). Which case is it for you?

As I said before, I'm also experiencing some Ethernet problems (in the service network), like TCP window full, lost segments and ack lost segments, and I'm trying to rule those out too. I'm using IPoM, as the manual says: I configured each node with ifconfig myri0 192.168.30. and associated this number in /etc/hosts with a hostname, like myriBlade. I am also using ssh and the polling method.
The mpirun.ch_mx -v output with a hung process is below:
___________________________________________
ivan@mamute:~/lib/mpich-mx-1.2.7-5-xl/examples> mpirun.ch_mx -v --mx-label --mx-kill 30 -machinefile list -np 3 ./cpi
Program binary is: /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel
Program binary is: /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi
Machines file is /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/list
Processes will be killed 30 after first exits.
mx receive mode used: polling.
3 processes will be spawned:
Process 0 (/mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi ) on mamute
Process 1 (/mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi ) on mamute
Process 2 (/mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi ) on myriBlade109
Open a socket on mamute...
Got a first socket opened on port 55353.
ssh mamute "cd /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples && exec env MXMPI_MAGIC=3366365 MXMPI_MASTER=mamute MXMPI_PORT=55353 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=0 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.15.1 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi "
ssh mamute -n "cd /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples && exec env MXMPI_MAGIC=3366365 MXMPI_MASTER=mamute MXMPI_PORT=55353 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=1 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.15.1 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi "
ssh myriBlade109 -n "cd /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples && exec env MXMPI_MAGIC=3366365 MXMPI_MASTER=mamute MXMPI_PORT=55353 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=2 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.30.209 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/examples/./cpi "
All processes have been spawned
MPI Id 0 is using mx port 0, board 0 (MAC 0060dd47afe7).
MPI Id 2 is using mx port 0, board 0 (MAC 0060dd478aff).
MPI Id 1 is using mx port 1, board 0 (MAC 0060dd47afe7).
Received data from all 3 MPI processes.
Sending mapping to MPI Id 0.
Sending mapping to MPI Id 1.
Sending mapping to MPI Id 2.
Data sent to all processes.
___________________________________________
and it hung. The list file contains

mamute:2
myriBlade109:4
myriBlade108:4

where mamute is my head node, so I can do all the traces.

>>Ivan may have to stage the binary on local disk prior to spawning, to
>>not rely on GPFS over Ethernet to serve it. Or even run GPFS over IPoM too.

GPFS over Myrinet is not an option right now. I compiled the executable statically and tested it. Same problem. Then I staged the binary in the scratch partition on each node, and the process hung the same way:
__________________________________________
ivan@mamute:/home/ivan> mpirun.ch_mx -v --mx-label --mx-kill 30 -machinefile list -np 3 ./cpi
Program binary is: /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel
Program binary is: /home/ivan/./cpi
Machines file is /home/ivan/list
Processes will be killed 30 after first exits.
mx receive mode used: polling.
3 processes will be spawned:
Process 0 (/home/ivan/./cpi ) on mamute
Process 1 (/home/ivan/./cpi ) on mamute
Process 2 (/home/ivan/./cpi ) on myriBlade109
Open a socket on mamute...
Got a first socket opened on port 55684.
ssh mamute "cd /home/ivan && exec env MXMPI_MAGIC=1802255 MXMPI_MASTER=mamute MXMPI_PORT=55684 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=0 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.15.1 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /home/ivan/./cpi "
ssh mamute -n "cd /home/ivan && exec env MXMPI_MAGIC=1802255 MXMPI_MASTER=mamute MXMPI_PORT=55684 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=1 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.15.1 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /home/ivan/./cpi "
ssh myriBlade109 -n "cd /home/ivan && exec env MXMPI_MAGIC=1802255 MXMPI_MASTER=mamute MXMPI_PORT=55684 MX_DISABLE_SHMEM=0 MXMPI_VERBOSE=1 MXMPI_SIGCATCH=1 LD_LIBRARY_PATH=/usr/lib:/usr/lib64 MXMPI_ID=2 MXMPI_NP=3 MXMPI_BOARD=-1 MXMPI_SLAVE=192.168.30.209 /mamuteData/ivan/lib/mpich-mx-1.2.7-5-xl/bin/mpimxlabel /home/ivan/./cpi "
All processes have been spawned
MPI Id 1 is using mx port 0, board 0 (MAC 0060dd47afe7).
MPI Id 2 is using mx port 0, board 0 (MAC 0060dd478aff).
MPI Id 0 is using mx port 1, board 0 (MAC 0060dd47afe7).
Received data from all 3 MPI processes.
Sending mapping to MPI Id 0.
Sending mapping to MPI Id 1.
Sending mapping to MPI Id 2.
Data sent to all processes.
__________________________________________
I notice, though, that the spawning is _much_ faster than when firing the process from the GPFS partition.
This is the output of strace -f (lots of things here!):
________________________________________
[pid 7498] ioctl(4, TCGETS or TCGETS, 0xffffda30) = -1 EINVAL (Invalid argument)
[pid 7498] _llseek(4, 0, 0xffffda98, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[pid 7498] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
[pid 7498] setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 7498] connect(4, {sa_family=AF_INET, sin_port=htons(55787), sin_addr=inet_addr("192.168.15.1")}, 16) = 0
[pid 7498] write(1, "Sending mapping to MPI Id 1.\n", 29Sending mapping to MPI Id 1.
) = 29
[pid 7498] send(4, "[[[<0:96:3712462823:0><1:96:3712"..., 72, 0) = 72
[pid 7498] close(4) = 0
[pid 7498] time([1191247146]) = 1191247146
[pid 7498] open("/etc/hosts", O_RDONLY) = 4
[pid 7498] fcntl64(4, F_GETFD) = 0
[pid 7498] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
[pid 7498] fstat64(4, {st_mode=S_IFREG|0644, st_size=10247, ...}) = 0
[pid 7498] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
[pid 7498] read(4, "#\n# hosts This file desc"..., 4096) = 4096
[pid 7498] read(4, "yriBlade077\n192.168.30.178 myri"..., 4096) = 4096
[pid 7498] read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
[pid 7498] read(4, "", 4096) = 0
[pid 7498] close(4) = 0
[pid 7498] munmap(0x40018000, 4096) = 0
[pid 7498] open("/etc/protocols", O_RDONLY) = 4
[pid 7498] fcntl64(4, F_GETFD) = 0
[pid 7498] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
[pid 7498] fstat64(4, {st_mode=S_IFREG|0644, st_size=6561, ...}) = 0
[pid 7498] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
[pid 7498] read(4, "#\n# protocols\tThis file describe"..., 4096) = 4096
[pid 7498] close(4) = 0
[pid 7498] munmap(0x40018000, 4096) = 0
[pid 7498] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
[pid 7498] ioctl(4, TCGETS or TCGETS, 0xffffda30) = -1 EINVAL (Invalid argument)
[pid 7498] _llseek(4, 0, 0xffffda98, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[pid 7498] ioctl(4, TCGETS or TCGETS, 0xffffda30) = -1 EINVAL (Invalid argument)
[pid 7498] _llseek(4, 0, 0xffffda98, SEEK_CUR) = -1 ESPIPE (Illegal seek)
[pid 7498] fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
[pid 7498] setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 7498] connect(4, {sa_family=AF_INET, sin_port=htons(45412), sin_addr=inet_addr("192.168.30.209")}, 16) = 0
[pid 7498] write(1, "Sending mapping to MPI Id 2.\n", 29Sending mapping to MPI Id 2.
) = 29
[pid 7498] send(4, "[[[<0:96:3712462823:0><1:96:3712"..., 69, 0) = 69
[pid 7498] close(4) = 0
[pid 7498] alarm(0) = 0
[pid 7498] write(1, "Data sent to all processes.\n", 28Data sent to all processes.
) = 28
[pid 7498] accept(3,
[pid 7499] <... select resumed> ) = 1 (in [3])
[pid 7499] read(3, "\302\317\32\275\357jD\230\222=\270N\341F\237\326@]\4\4"..., 8192) = 80
[pid 7499] select(7, [3 4], [6], NULL, NULL) = 1 (out [6])
[pid 7499] write(6, "0: Process 0 on mamute.lcca.usp."..., 350: Process 0 on mamute.lcca.usp.br
) = 35
[pid 7499] select(7, [3 4], [], NULL, NULL
________________________________________
and it hangs. This was with the binary _out_ of GPFS and statically compiled.

My ticket number is 53912, and Ruth and Scott are helping me.

Mark, ltrace does not accept mpirun.ch_mx as a valid ELF binary... it was compiled using the xlc compiler. Strange, because it works with other system binaries (like ls...)

Thank you very much!!

Ivan

2007/10/1, Mark Hahn :
> > clone(child_stack=0,
> > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > child_tidptr=0x40046f68) = 31384
> > waitpid(-1,
>
> this looks like a fork/exec that's failing. as you might expect
> if, for instance, your shared FS doesn't supply a binary successfully.
> note also that ltrace -S often provides somewhat more intelligible
> diags for this kind of thing (since it might show what's actually
> being exec'ed.)
>

--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------

From scheinin at crs4.it Mon Oct 1 10:11:32 2007
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Problems with a JS21 - Ah, the networking...
In-Reply-To: <751c63ee0710010706x3444feeem77e560c82df05d07@mail.gmail.com>
References: <751c63ee0709281343r535364d2rdfa8db9aed8426c@mail.gmail.com> <200709301202.01236.csamuel@vpac.org> <751c63ee0710010534u1c611f03x8d31bb1fa58165b9@mail.gmail.com> <751c63ee0710010605g2a26bce8udbd955741b42650e@mail.gmail.com> <751c63ee0710010706x3444feeem77e560c82df05d07@mail.gmail.com>
Message-ID: <47012A44.6090209@crs4.it>

I found that the blocking send of MPI blocks for the version of MPICH compiled for Myrinet (at least w.r.t. the old GM) but does not block for the MPICH from Argonne compiled for GCC and PGI. Or was it the other way around, I don't recall. Assuming it was the first case, it might be relevant. Someone wrote a program that had a deadlock: a simple send(); recv(); in the same order for two processes will cause deadlock. But the program always ran until he changed the connection and hence the underlying drivers.
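The send-then-receive deadlock described above can be illustrated without MPI. What follows is a minimal Python sketch, not MPI code: two threads stand in for the two processes, and a hypothetical `Link` class models a send that either returns as soon as the message is buffered ("eager") or blocks until the peer has actually received it ("rendezvous"). The names `Link` and `run_pair`, and the timeout used to detect the stall, are assumptions made purely for illustration; the sketch says nothing about which behaviour a particular MPICH build or interconnect driver actually has.

```python
import queue
import threading


class Link:
    """One-directional channel between two simulated ranks (hypothetical helper)."""

    def __init__(self, eager):
        self.eager = eager                   # True: send returns once the message is buffered
        self.box = queue.Queue()             # holds the in-flight message
        self.received = threading.Event()    # set when the peer has taken the message

    def send(self, msg, timeout):
        self.box.put(msg)
        if self.eager:
            return True
        # Rendezvous: wait until the peer receives; False means we gave up waiting.
        return self.received.wait(timeout)

    def recv(self, timeout):
        msg = self.box.get(timeout=timeout)
        self.received.set()
        return msg


def run_pair(eager, timeout=1.0):
    """Both ranks call send() first and recv() second, in the same order."""
    a_to_b, b_to_a = Link(eager), Link(eager)
    results = {}

    def rank(name, out_link, in_link):
        if not out_link.send("hello from " + name, timeout):
            results[name] = "stuck in send"  # a real blocking send would wait forever
            return
        results[name] = in_link.recv(timeout)

    threads = [
        threading.Thread(target=rank, args=("A", a_to_b, b_to_a)),
        threading.Thread(target=rank, args=("B", b_to_a, a_to_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("eager send:     ", run_pair(eager=True))
    print("rendezvous send:", run_pair(eager=False))
```

With the buffered variant both simulated ranks finish and each receives the other's greeting; with the rendezvous variant both report being stuck in send, because neither ever reaches its recv(). That is consistent with a program that runs for a long time and only deadlocks after the transport underneath it changes, if the blocking-send behaviour differs between the two as suggested above.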
From X.Jin at Bradford.ac.uk Mon Oct 1 08:22:36 2007
From: X.Jin at Bradford.ac.uk (X.Jin)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] CFP -- PMEO-UCNS'08 to be held with IPDPS'08
Message-ID: <200710011522.l91FMabL016908@radon.cen.brad.ac.uk>

[Please accept our apologies if you receive multiple copies of this email]

**************************************************************
(Selected high-quality papers from the PMEO-UCNS'08 workshop will appear in a Special Issue of the journal - International Journal of Parallel, Emergent, and Distributed Systems (IJPEDS), Taylor & Francis - http://www.tandf.co.uk/journals/titles/17445760.asp)
**************************************************************

CALL FOR PAPERS

7th International Workshop on Performance Modeling, Evaluation, and Optimization of Ubiquitous Computing and Networked Systems (PMEO-UCNS'08)

To be held in conjunction with IPDPS'08 (Supported by IEEE Computer Society in cooperation with ACM SIGARCH), 14-18 April, 2008, Miami, Florida, USA

http://www.inf.brad.ac.uk/~gmin/PMEO-08.html
http://www.ipdps.org/ipdps2008/2008_workshops.html

SCOPE:

The performance modeling, evaluation, and optimization of ubiquitous computing and networked systems has been an important research topic over the past years and poses challenging problems that require new tools and methods to keep up with the rapid evolution and increasing complexity of such systems. This workshop will bring together scientists, engineers, practitioners, and computer users to share and exchange their experiences, discuss challenges, and report state-of-the-art and in-progress research on all aspects of performance modeling, evaluation, and optimization of ubiquitous computing and networked systems.

The topics of interest include, but are not limited to:

-- Predictive performance models of ubiquitous computing systems
-- Predictive performance models of wired and wireless networks
-- Performance measurement and monitoring tools
-- Tracing and trace analysis
-- Simulation
-- Analytical modeling
-- Software tools for system performance and evaluation
-- Automatic performance analysis
-- Performance comparison
-- Performance of memory and I/O interconnect
-- Performance of communication networks
-- Performance of mobile distributed systems
-- Performance analysis and evaluation of ubiquitous computing and networked applications
-- Improvement in system performance through optimization and tuning
-- Case studies showing the role of evaluation in the design of systems

WORKSHOP CO-CHAIRS:

Geyong Min
Department of Computing
University of Bradford
Bradford, BD7 1DP, U.K.
E-mail: g.min@brad.ac.uk

Mohamed Ould-Khaoua
Department of Computing Science
University of Glasgow
Glasgow, G12 8RZ, U.K.
E-mail: mohamed@dcs.gla.ac.uk

PROGRAM CHAIR:

Ahmed Y. Al-Dubai
School of Computing
Napier University
Edinburgh, EH10 5DT, U.K.
E-mail: A.Al-Dubai@napier.ac.uk

PUBLICITY CO-CHAIRS:

Xiaolong Jin
Department of Computing
University of Bradford
Bradford, BD7 1DP, U.K.
E-mail: x.jin@brad.ac.uk

Mirela Sechi Moretti Annoni Notare
Barddal University
Florianopolis, SC
Brazil
Email: mirela@barddal.br

PROGRAM COMMITTEE:

K. Al-Begain, Univ. of Glamorgan (UK)
I. Romdhani, Napier University (UK)
H. R. Arabnia, Univ. of Georgia (USA)
D. K. Arvind, Edinburgh University (UK)
I. Awan, Univ. of Bradford (UK)
A. Boukerche, Univ. of North Texas (USA)
J. Bradley, Imperial College London (UK)
P. Cockshott, Univ. of Glasgow (UK)
M. Colajanni, Univ. of Modena (Italy)
K. Day, Sultan Qaboos Univ. (Oman)
K. Djemame, Univ. of Leeds (UK)
T. El-Ghazawi, George Washington University (USA)
R. Fatoohi, San Jose State University (USA)
M. Gueroui, University of Cergy-Pontoise (France)
S. Helal, University of Florida (USA)
H. Hassanein, Queen's University (Canada)
S. Jarvis, Univ. of Warwick (UK)
H. Karatza, Univ. of Thessaloniki (Greece)
A. Katangur, Texas A&M Univ. (USA)
A. Khonsari, IPM (Iran)
W. Knottenbelt, Imperial College London (UK)
K. Li, State Univ. of New York at New Paltz (USA)
H. Liu, Huazhong Univ. of Science and Technology (China)
S. Loucif, Moncton University (Canada)
L.M. Mackenzie, Univ. of Glasgow (UK)
Y. Pan, Georgia State Univ. (USA)
D. K. Pradhan, Univ. of Bristol (UK)
H. Sarbazi-Azad, Sharif Univ. & IPM (Iran)
A. Shahrabi, Glasgow Caledonian Univ. (UK)
E. Song, Huazhong Univ. of Science and Technology (China)
N. Thomas, Univ. of Newcastle (UK)
A. Touzene, Sultan Qaboos Univ. (Oman)
M. Woodward, Univ. of Bradford (UK)
J. Wu, Florida Atlantic Univ. (USA)
L. Xiao, Michigan State Univ. (USA)
T. Xie, San Diego State University (USA)
L. T. Yang, St Francis Xavier Univ. (Canada)
W. Vanderbauwhede, University of Glasgow (UK)
W. Buchanan, Napier University (UK)
A. Zomaya, Univ. of Sydney (Australia)

PAPER SUBMISSION:

Authors are invited to submit manuscripts reporting original unpublished research and recent developments in the topics related to the workshop. The length of the papers should not exceed 8 pages (IEEE Computer Society Proceedings Manuscripts style: two columns, single-spaced), including figures and references, using a 10-point font, with each page numbered. Papers should be submitted electronically in PDF format (or postscript) as an e-mail attachment to A.Al-Dubai@napier.ac.uk. All papers will be peer reviewed and the comments will be provided to the authors. The accepted papers will be published together with those of other IPDPS'08 workshops by the IEEE Computer Society Press.
IMPORTANT DATES: Submission Deadline: October 21, 2007 Author Notification: December 11, 2007 Final Manuscript Due: January 28, 2008 ************************************************************** From hcxckwk at hkucc.hku.hk Tue Oct 2 04:10:16 2007 From: hcxckwk at hkucc.hku.hk (Kwan Wing Keung) Date: Mon Mar 15 01:06:30 2010 Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs Message-ID: This is perhap a naive question. 10 years before we started using the SP2, but we later changed to Intel based linux beowulf in 2001. In our University there are quite a no. of mpi-based parallel programs running in a 178 node dual-Xeon PC cluster that was installed 4 years ago. We are now planning to upgrade our cluster in the coming year. Very likely blade servers with multi-core CPUs will be used. To port these mpi-based parallel programs to a multi-core CPU environment, someone suggested that OpenMP should be used, such that the programs can be converted to a multi-thread version. Nevertheless it may take time, and the users may be reluctant to do so. Also for some of the installed programs, we don't have the source code. Another user suggested that we may change slightly on the .machinefile before executing the "mpirun" command. Suppose we are going to run a 8 mpi-task program on a quad-core cluster, then only 2 CPUs should be selected, with the ".machinefile" looks like "cpu0 cpu1 cpu0 cpu1 cpu0 cpu1 cpu0 cpu1" created, i.e. 4 mpi-tasks will be spooled to CPU0 and 4 mpi-tasks will be spooled to CPU1. But the REAL question will be: Will EACH mpi-task be executed on ONE single core? If not, then could there be any Linux utility program to help? I asked this question to one of the potential vendor, and the sales suddenly suggested "Well, you can buy VMWARE to create virtual CPUs to do so." Do you think it is logical? Thanks in advance. W.K. 
Kwan
Computer Centre
University of Hongkong

From Li at mx2.buaa.edu.cn Tue Oct 2 07:40:57 2007
From: Li at mx2.buaa.edu.cn (Li@mx2.buaa.edu.cn)
Date: Mon Mar 15 01:06:30 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
Message-ID: <20071002144057.D417A37F7E@mx2.buaa.edu.cn>

Hello,

>Suppose we are going to run a 8 mpi-task program on a quad-core cluster,
>then only 2 CPUs should be selected, with the ".machinefile" looks like
>"cpu0 cpu1 cpu0 cpu1 cpu0 cpu1 cpu0 cpu1" created, i.e. 4 mpi-tasks will
>be spooled to CPU0 and 4 mpi-tasks will be spooled to CPU1. But the REAL
>question will be:
> Will EACH mpi-task be executed on ONE single core?
> If not, then could there be any Linux utility program to help?

Generally, each MPI task should be executed on a single core; if not, you can run 4 mpid on a single node.

>I asked this question to one of the potential vendor, and the sales
>suddenly suggested "Well, you can buy VMWARE to create virtual CPUs to do
>so." Do you think it is logical?

That was just a joke.

I suggest using OpenMP within a single node and MPI among nodes, for performance reasons. You may also need to find a good MPI implementation for your purpose.

Regards,
Li, Bo

From gerry.creager at tamu.edu Tue Oct 2 07:50:18 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
In-Reply-To: <20071002144057.D417A37F7E@mx2.buaa.edu.cn>
References: <20071002144057.D417A37F7E@mx2.buaa.edu.cn>
Message-ID: <47025AAA.9020108@tamu.edu>

Just for the record, I hate HTML-encoded e-mails.

Li@mx2.buaa.edu.cn wrote:
>> Will EACH mpi-task be executed on ONE single core?
>> If not, then could there be any Linux utility program to help?
>
> Generally, each mpi-task should be executed on a single core, and if not, you can run 4 mpid on a single node.
>
>> I asked this question to one of the potential vendor, and the sales
>> suddenly suggested "Well, you can buy VMWARE to create virtual CPUs to do
>> so." Do you think it is logical?

Selection of OpenMP vs MPI, or the combination of the two, depends considerably on how your code functions. We have been working with dual-core, dual-processor (4 cores/node) machines for a couple of years, running exclusively MPI codes, and have seen very good performance. Similarly (yeah, guys, I know it's not a Beowulf but...), I run weather forecast (WRF) codes on an IBM Cluster 1600 (p575) system using 8 of 16 cores and 32 nodes (don't ask why: silly sysadmin tricks). The WRF codes will run as SMP, DM or a combination of both but, for the domains I tend to forecast over, are more efficient using just MPI.

We see no problems with multicore machines running MPI. It would be worth encouraging your users to evaluate how their codes would run if enabled for a combination of OpenMP and MPI, but it is not mandatory.

gerry
-- 
Gerry Creager -- gerry.creager@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From ntmoore at gmail.com Tue Oct 2 07:20:58 2007
From: ntmoore at gmail.com (Nathan Moore)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
In-Reply-To:
References:
Message-ID: <6009416b0710020720p2c8b94a3ifcce5e9640e43938@mail.gmail.com>

My understanding is that on a multi-core machine, MPI communication routines (MPI_SEND, etc.) are implemented as memory copy instructions. Accordingly, message passing within a multi-core node should be very fast compared to your present cluster.
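On the question that opened this thread (will each MPI task stay on one core, and is there a Linux utility to help): a machinefile only names hosts, so the placement of tasks on cores is left to the kernel scheduler unless something pins them. The `taskset` utility can do that from the command line, and the same affinity call is available programmatically. Below is a minimal Python sketch of the idea, assuming Linux and Python 3.3 or later. The helper names `pin_to_core` and `core_for_rank`, and the `LOCAL_RANK` environment variable, are hypothetical and only there for illustration; many MPI launchers have their own binding options, which should be preferred where they exist.

```python
import os


def pin_to_core(core: int) -> set:
    """Restrict the calling process to one core and return the resulting mask."""
    allowed = os.sched_getaffinity(0)      # cores this process may use right now
    if core not in allowed:
        raise ValueError(f"core {core} not in allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {core})        # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


def core_for_rank(local_rank: int, allowed) -> int:
    """Map the n-th task on a node to the n-th allowed core, wrapping around."""
    cores = sorted(allowed)
    return cores[local_rank % len(cores)]


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    # LOCAL_RANK is a hypothetical variable: the task's index within its node.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    mask = pin_to_core(core_for_rank(local_rank, allowed))
    print(f"pid {os.getpid()} pinned to core(s) {sorted(mask)}")
```

With four tasks per quad-core node numbered 0 to 3, this mapping would give each task its own core. Whether pinning helps at all depends on the code, as the replies in this thread point out.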
That said, it seems like all the performance benchmarks suggest that a dual-core chip has the performance of 1.5-1.7 single-core chips, so for the same number of nodes (defined as a CPU core) you wouldn't see the same output.

All of this of course depends on the structure of the code, memory usage, etc. These are just scaling estimates on my part.

regards,

Nathan

-- 
- - - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Assistant Professor, Physics
Winona State University
AIM: nmoorewsu
- - - - - - - - - - - - - - - - - - - - - -
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071002/6d71e2a9/attachment.html

From gdjacobs at gmail.com Tue Oct 2 08:24:43 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
In-Reply-To: <20071002144057.D417A37F7E@mx2.buaa.edu.cn>
References: <20071002144057.D417A37F7E@mx2.buaa.edu.cn>
Message-ID: <470262BB.7070500@gmail.com>

Li@mx2.buaa.edu.cn wrote:

Using OpenMP per node is going to be much less portable and much more complex to implement. It's better to let your MPI library handle things the smart way.

-- 
Geoffrey D. Jacobs

To have no errors
would be life without meaning
No struggle, no joy

From tom.elken at qlogic.com Tue Oct 2 10:07:27 2007
From: tom.elken at qlogic.com (Tom Elken)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
In-Reply-To: <470262BB.7070500@gmail.com>
References: <20071002144057.D417A37F7E@mx2.buaa.edu.cn> <470262BB.7070500@gmail.com>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01594BF1@AVEXCH1.qlogic.org>

> -----Original Message-----
> From: beowulf-bounces@beowulf.org
> [mailto:beowulf-bounces@beowulf.org] On Behalf Of Geoff Jacobs
>
> Using OpenMP per node is going to be much less portable and
> much more complex to implement. It's better to let your MPI
> library handle things the smart way.

At QLogic we're quite familiar with benchmarking with good OpenMP compilers (esp.
the PathScale compiler, which used to be part of QLogic) and with various modern MPIs (InfiniPath MPI, MVAPICH, Open MPI, HP-MPI, Scali MPI), and I totally agree with the previous two answers. MPI developers pay a large and increasing amount of performance-tuning attention to the larger core counts, NUMA and SMP issues on cluster nodes. Even if you underwent the large effort to re-code an application for which you had the source to an OpenMP / MPI hybrid approach, it is as likely that you would reduce your application's performance as to improve it. That said, there is continuing interest in the Hybrid approach, and large SMP nodes will keep that interest going. A Google on "Hybrid OpenMP MPI" will provide you with some presentations and papers on recent work in this area. You will find that wins with the Hybrid approach are possible, but it is a lot of work, and by no means required, as you move towards clusters with multi-core CPUs. -Tom > > -- > Geoffrey D. Jacobs > > To have no errors > would be life without meaning > No struggle, no joy > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org To change your > subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From larry.stewart at sicortex.com Tue Oct 2 11:52:21 2007 From: larry.stewart at sicortex.com (Larry Stewart) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs In-Reply-To: <20071002144057.D417A37F7E@mx2.buaa.edu.cn> References: <20071002144057.D417A37F7E@mx2.buaa.edu.cn> Message-ID: <47029365.8090606@sicortex.com> The question of OpenMP vs MPI has been around for a long time, for example: http://www.beowulf.org/archive/2001-March/002718.html My general impression is that it is a waste of time to convert from pure MPI to a hybrid approach. 
For example: www.sc2000.org/techpapr/papers/pap.pap214.pdf On the other hand, here's a fellow who got a 4X speedup by going to hybrid: www.nersc.gov/nusers/services/training/classes/NUG/Jun04/NUG2004_yhe_hybrid.ppt My own view is that with a modern cluster with fast processors and with inter-node communications not that much slower than a cache miss to main memory, the unified MPI model makes more sense, but there are many many papers arguing about this topic. -L -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071002/567c759a/attachment.html From lindahl at pbm.com Tue Oct 2 14:16:00 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs In-Reply-To: <47029365.8090606@sicortex.com> References: <20071002144057.D417A37F7E@mx2.buaa.edu.cn> <47029365.8090606@sicortex.com> Message-ID: <20071002211559.GA1115@bx9.net> On Tue, Oct 02, 2007 at 02:52:21PM -0400, Larry Stewart wrote: > On the other hand, here's a fellow who got a 4X speedup by going to hybrid: > > www.nersc.gov/nusers/services/training/classes/NUG/Jun04/NUG2004_yhe_hybrid.ppt It's always hard to evaluate how real such claims are. But I will note that this guy is on an IBM SP, which for a while (dunno if it's fixed now) had some really horribly slow MPI library code for in-box communications. Code so bad that it made OpenMP and hybrid programming look good. He also claims MM5 gets a win with hybrid programming, but the numbers I've seen say that pure MPI is best on that code. 
-- greg

From gdjacobs at gmail.com Wed Oct 3 07:31:38 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Naive question: mpi-parallel program in multicore CPUs
In-Reply-To: <6009416b0710020720p2c8b94a3ifcce5e9640e43938@mail.gmail.com>
References: <6009416b0710020720p2c8b94a3ifcce5e9640e43938@mail.gmail.com>
Message-ID: <4703A7CA.6010604@gmail.com>

Nathan Moore wrote:
> My understanding is that on a multi-core machine, mpi communication
> routines (MPI_SEND, etc) are implemented as memory copy instructions.
> Accordingly, message passing within a multi-core node should be very
> fast compared to your present cluster.
>
> That said, It seems like all the performance benchmarks suggest that
> dual-core chips have the performances of 1.5-1.7 single core chips, so
> for the same number of nodes (defined as a CPU core) you wouldn't see
> the same output.
>
> All of this course depends on the structure of the code, memory usage,
> etc - these are just scaling estimates on my part.
>
> regards,
>
> Nathan

Aren't there two ways it's done? IIRC, MPICH2 has
(1) standard ch3: TCP/IP for all comms, but local comms bounced off a loopback interface
(2) Nemesis: TCP/IP for inter-node comms, but intra-node comms use shared memory.

And dual-core chip performance is, again, totally dependent on the code in question. If the code is highly pipelined and can run in cache, an embarrassingly parallel algorithm will scale linearly with cores. Code limited by memory bandwidth, I/O, or (in a cluster) network performance will not scale as well.

-- 
Geoffrey D.
Jacobs

To have no errors
would be life without meaning
No struggle, no joy

From herbert.fruchtl at st-andrews.ac.uk Wed Oct 3 01:28:05 2007
From: herbert.fruchtl at st-andrews.ac.uk (Herbert Fruchtl)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Re: Naive question: mpi-parallel program in multicore CPUs
In-Reply-To: <200710021900.l92J08Up016243@bluewest.scyld.com>
References: <200710021900.l92J08Up016243@bluewest.scyld.com>
Message-ID: <47035295.5060501@st-andrews.ac.uk>

What is really difficult with MPI is data distribution. A lot of applications are parallelised using replicated data. That's fine if you are CPU bound, but if you are limited by the amount of memory per processor, a shared memory approach (which is the default with OpenMP) is the easiest way of using all the memory. MPI may also add a lot of overhead if you parallelise inner loops, which is easy and cheap with OpenMP.

OTOH, coarse-grain parallelism with OpenMP is difficult; MPI is usually more suitable here. It depends on your application, and you may find candidates for both approaches in the same application.

Herbert

-- 
Herbert Fruchtl
EaStCHEM Fellow
School of Chemistry
University of St Andrews

From anandvaidya.ml at gmail.com Wed Oct 3 04:29:23 2007
From: anandvaidya.ml at gmail.com (Anand Vaidya)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
Message-ID: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>

Dear Beowulfers,

We ran into a problem with large disks which I suspect is fairly common; however, the usual solutions are not working. IBM and RedHat have not been able to provide any useful answers, so I am turning to this list for help. (Emulex is still helping, but I am not sure how far they can go without access to the hardware.)

Details:

* Linux cluster for weather modelling

* IBM Bladecenter blades and an IBM x3655 Opteron head node, FC attached to a Hitachi Tagmastore SAN storage, Emulex LightPulse FC HBA, PCI-Express, dual port

* RHEL 4 update 5, x86_64 kernel 2.6.9-55 SMP, with the RHEL-provided Emulex driver (lpfc); lpfcdfc is also installed

* GPT partition created with parted

There is one 2TB LUN, which works fine.

There is a 3TB LUN on the Hitachi SAN which is reported as "only" 2199GB (2.1TB).

We noticed that, when the Emulex driver loads, the following error message is reported:

Emulex LightPulse Fibre Channel SCSI driver 8.0.16.32
Copyright(c) 2003-2007 Emulex. All rights reserved.
ACPI: PCI Interrupt 0000:2d:00.0[A] -> GSI 18 (level, low) -> IRQ 185
PCI: Setting latency timer of device 0000:2d:00.0 to 64
lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: x2 x4 x1000
lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: x2 x4 x1000
lpfc 0000:2d:00.0: 0:1303 Link Up Event x3 received Data: x3 x1 x10 x0
scsi5 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 2d device 00 irq 185 port 0
Vendor: HITACHI Model: OPEN-V*3 Rev: 5007
Type: Direct-Access ANSI SCSI revision: 03
sdb : very big device. try to use READ CAPACITY(16).
sdb : READ CAPACITY(16) failed.
sdb : status=1, message=00, host=0, driver=08
sdb : use 0xffffffff as device size
SCSI device sdb: 4294967296 512-byte hdwr sectors (2199023 MB)
SCSI device sdb: drive cache: write back
sdb : very big device. try to use READ CAPACITY(16).
sdb : READ CAPACITY(16) failed.
sdb : status=1, message=00, host=0, driver=08
sdb : use 0xffffffff as device size
SCSI device sdb: 4294967296 512-byte hdwr sectors (2199023 MB)
SCSI device sdb: drive cache: write back

The problem is the failed READ CAPACITY(16), but we are unable to find the source of this error.

We conducted several experiments without success:

- Tried compiling the latest driver from Emulex (8.0.16.32) - same error
- Tried Knoppix (2.6.19), Gentoo LiveCD (2.6.19), and CentOS 4.4 - same error
- Tried to boot Belenix (Solaris 32-bit live), which failed to boot completely (may be an unrelated issue)

We have a temporary workaround in place: we created 3x1TB disks and used LVM to create a striped 3TB volume with an ext3 FS. This works fine.

RedHat claims ext3 and RHEL4 support disks up to 8TB and 16TB respectively (since RHEL4u2).

I would like to know if anyone on the list has any pointers that can help us solve the issue.

Regards
Anand Vaidya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071003/118f7440/attachment.html

From atchley at myri.com Wed Oct 3 08:14:05 2007
From: atchley at myri.com (Scott Atchley)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
Message-ID:

Is someone using a signed int to represent the 1 KB blocks?

2 * 1024 * 1024 * 1024 * 1024 = 2199023255552

Scott

From hahn at mcmaster.ca Wed Oct 3 08:49:42 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To:
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
Message-ID:

> Is someone using a signed int to represent the 1 KB blocks?
> 2 * 1024 * 1024 * 1024 * 1024 = 2199023255552

yes - you can see from the message that the kernel is trying to use a 16-byte read-capacity command, but it's failing. these days, scsi and especially fc and super-especially some obscure driver like this are less heavily scrutinized, so this is not hugely surprising. I don't know whether the error status indicates a flaw in the driver or perhaps in the target (which might not support large luns).

>> We ran into a problem with large disks which I suspect is fairly common,

in my world, large == sata raid, and scsi/fc is a vanishing breed, which also coincides with the stunted size of scsi/fc disks. that said, the only scsi my organization has (embedded in HP SFS clusters) is (annoyingly) chopped up into piddly little 2TB luns.

>> however the usual solutions are not working. IBM, RedHat have not been
>> able to provide any useful answers so I am turning to this list for help.
>> (Emulex is still helping, but I am not sure how far they can go without
>> access to the hardware)

I think you should ask them whether the driver supports the 16-byte read-capacity. and you should ask the target provider (hitachi) whether they support >2TB luns and whether they implement 16-byte commands. you paid through the nose for FC hardware; you should expect a high level of service.
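The figures in the kernel log above fit together if the reported size is read as a 32-bit count of 512-byte sectors. As far as I know, the 10-byte READ CAPACITY command returns the last block address in a 32-bit field, and a value of 0xffffffff there is the cue to retry with READ CAPACITY(16). When that retry fails, the log shows the kernel falling back to 0xffffffff, which gives exactly the 2199023 MB seen. A short Python check of that arithmetic follows; the function names are made up for illustration, and treating "3TB" as 3 * 10**12 bytes is an assumption.

```python
SECTOR_BYTES = 512                # "512-byte hdwr sectors" in the log
FALLBACK_LAST_LBA = 0xFFFFFFFF    # "use 0xffffffff as device size" in the log


def capacity_bytes(last_lba: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Capacity implied by a last-block address: (last_lba + 1) sectors."""
    return (last_lba + 1) * sector_bytes


def needs_64bit_lba(size_bytes: int, sector_bytes: int = SECTOR_BYTES) -> bool:
    """True if the last sector address does not fit in 32 bits."""
    last_lba = size_bytes // sector_bytes - 1
    return last_lba > 0xFFFFFFFF


size = capacity_bytes(FALLBACK_LAST_LBA)
print(FALLBACK_LAST_LBA + 1)          # sector count, compare with the log
print(size, size == 2 * 1024**4)      # same number as Scott's product
print(size // 10**6)                  # size in units of 10**6 bytes ("MB")

# Assumed sizes: decimal terabytes.
print(needs_64bit_lba(2 * 10**12))    # the 2TB LUN
print(needs_64bit_lba(3 * 10**12))    # the 3TB LUN
```

Note that a signed 32-bit count of 1 KB blocks would top out at the same 2 * 1024**4 bytes, so the number alone cannot tell the two explanations apart; it is the "0xffffffff" and "512-byte" lines in the log that point to the sector count.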
-- 
operator may differ from spokesperson. hahn@mcmaster.ca

From gmpc at sanger.ac.uk Wed Oct 3 09:03:29 2007
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
Message-ID: <4703BD51.9020001@sanger.ac.uk>

Luns over 2TB are a bad idea. There are just too many reasons why they might not work, and trying to track down the right one is a pain. Your workaround to use the LVM to stripe 3x1TB luns together is the way to go. (You really want to use LVM anyhow, as trying to do SAN storage without it is a pain.)
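One way to see why the striped workaround sidesteps the limit: each byte of the 3TB volume lands on one of three 1TB LUNs, so no single LUN is ever addressed beyond 1TB. The Python sketch below is a simplified model of round-robin striping, not LVM's actual code. The 64 KiB stripe size and the decimal-terabyte LUN size are assumptions, and `locate` is a hypothetical helper.

```python
SECTOR_BYTES = 512
TB = 10**12                   # assuming decimal terabytes
STRIPE_BYTES = 64 * 1024      # hypothetical stripe size
N_LUNS = 3
LUN_BYTES = 1 * TB


def locate(offset: int):
    """Map a byte offset in the striped volume to (lun_index, offset_in_lun)."""
    stripe_no, within = divmod(offset, STRIPE_BYTES)
    lun = stripe_no % N_LUNS      # stripes are dealt round-robin over the LUNs
    row = stripe_no // N_LUNS     # number of earlier stripes on that same LUN
    return lun, row * STRIPE_BYTES + within


# Usable size when only whole stripes are counted on each LUN.
volume_bytes = N_LUNS * (LUN_BYTES // STRIPE_BYTES) * STRIPE_BYTES

lun, off = locate(volume_bytes - 1)       # the very last byte of the volume
print(lun, off < LUN_BYTES)               # which LUN, and that it stays inside it
print(off // SECTOR_BYTES < 2**32)        # sector address still fits in 32 bits
```

The volume as a whole needs addresses past the 32-bit sector limit, but that addressing happens inside the kernel's block layer; each request that actually reaches the HBA and the array stays below it.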
You've covered the usual Gotchas with 2TB luns (dos partition tables and 32 bit kernels), so I'd suspect your storage array. Check with Hitachi that the array supports creating luns over 2TB. Many don't, and for maximum confusion, some allow you to create >2TB luns, even though you can't subsequently use them. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jlforrest at berkeley.edu Wed Oct 3 09:13:44 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux In-Reply-To: <4703BD51.9020001@sanger.ac.uk> References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com> <4703BD51.9020001@sanger.ac.uk> Message-ID: <4703BFB8.1020606@berkeley.edu> Guy Coates wrote: > Luns over 2TB are a bad idea. There are just too many reasons why they might not > work, and trying to track down the right one is a pain. Other than the usual putting all your eggs into one basket problem, why should a lun over 2TB containing an ext3 file system, be any more risky than those under 2TB, assuming all the partition table issues are resolved? Are there performance, reliability, compatibility, or other problems? (I'm not saying you're wrong but I'm always suspicious of claims like these without specific examples of why something is a bad idea.) 
Cordially, -- Jon Forrest Unix Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From gmpc at sanger.ac.uk Wed Oct 3 09:27:35 2007 From: gmpc at sanger.ac.uk (Guy Coates) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux In-Reply-To: <4703BFB8.1020606@berkeley.edu> References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com> <4703BD51.9020001@sanger.ac.uk> <4703BFB8.1020606@berkeley.edu> Message-ID: <4703C2F7.2010100@sanger.ac.uk> Jon Forrest wrote: > Guy Coates wrote: >> Luns over 2TB are a bad idea. There are just too many reasons why they >> might not >> work, and trying to track down the right one is a pain. > > Other than the usual putting all your eggs into one basket problem, > why should a lun over 2TB containing an ext3 file system, be any > more risky than those under 2TB, assuming all the partition table > issues are resolved? Are there performance, reliability, > compatibility, or other problems? You are right; luns over 2TB are not any more risky than luns under 2TB. If they work "out of the box" on your particular setup, great. However, if they don't work then you are into a world of pain working out whether it is your storage controller, SAN fabric, HBA, multipath implementation, kernel, partition table or filesystem which is broken. It gets especially bad in SAN environments where you have multiple kernel/storage combinations. Keeping luns < 2TB means that there is a reasonable expectation that things will keep on working when you change the storage presentation. Cheers, Guy -- Dr. 
Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802

From landman at scalableinformatics.com Wed Oct 3 09:46:43 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
Message-ID: <4703C773.2040102@scalableinformatics.com>

Hi Anand:

Anand Vaidya wrote:
> Dear Beowulfers,
>
> We ran into a problem with large disks which I suspect is fairly common,
> however the usual solutions are not working. IBM, RedHat have not been
> able to provide any useful answers so I am turning to this list for
> help. (Emulex is still helping, but I am not sure how far they can go
> without access to the hardware)
>
> Details:
>
> * Linux Cluster for Weather modelling
>
> * IBM Bladecenter blades and an IBM x3655 Opteron head node FC attached
> to a Hitachi Tagmastore SAN storage, Emulex LightPulse FC HBA,
> PCI-Express, Dual port
>
> * RHEL 4update5, x86_64 kernel 2.6.9-55 SMP and RHEL provided Emulex
> driver (lpfc) and lpfcdfc also installed

There is a problem in some parted versions (prior to 1.8.x) where they
munge the partition table on gpt/large disks.

Apart from suggesting a more modern kernel (2.6.22.6 or so), there may
be other things you can do.

> * GPT partition created with parted

I presume this is 1.6.19 parted?

> There is one 2TB LUN, works fine.
>
> There is a 3TB LUN on the Hitachi SAN which is reported as "only" 2199GB
> ( 2.1TB) ,

Yup.
Sounds like either the parted problem, or a driver issue.

What does

	parted /dev/3TBlun print

report where /dev/3TBlun is the device containing the 3TB lun?

We had seen this behavior with some 1.6.9 and 1.7.x versions of parted.
The only way to fix it was to rebuild parted with 1.8.x

> We noticed that, when the emulex driver loads, the following error
> message is reported:
>
> Emulex LightPulse Fibre Channel SCSI driver 8.0.16.32
> Copyright(c) 2003-2007 Emulex. All rights reserved.
> ACPI: PCI Interrupt 0000:2d:00.0[A] -> GSI 18 (level, low) -> IRQ 185
> PCI: Setting latency timer of device 0000:2d:00.0 to 64
> lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: x2 x4 x1000
> lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: x2 x4 x1000
> lpfc 0000:2d:00.0: 0:1303 Link Up Event x3 received Data: x3 x1 x10 x0
> scsi5 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on PCI bus 2d device 00 irq 185 port 0
> Vendor: HITACHI Model: OPEN-V*3 Rev: 5007
> Type: Direct-Access ANSI SCSI revision: 03
> sdb : very big device. try to use READ CAPACITY(16).

This is what our JackRabbit reports ...

> sdb : READ CAPACITY(16) failed.

This is not what our JackRabbit reports.

[...]

> The problem is with the READ CAPACITY(16) failed, but we are unable to
> find the source of this error.
>
> We conducted several experiments without success:
>
> - Tried compiling the latest driver from Emulex (8.0.16.32) - same error
> - Tried Knoppix (2.6.19), Gentoo LiveCD (2.6.19), and CentOS 4.4
>   - same error

Sounds a great deal like parted.

> - Tried to boot Belenix (Solaris 32 bit live), failed to boot completely
>   (may be unrelated issue)
>
> We have a temporary workaround in place: We created 3x1TB disks and used
> LVM to create a striped 3TB volume with ext3 FS. This works fine.
>
> RedHat claims ext3 and RHEL4 support disks up to 8TB and 16TB
> respectively (since RHEL4u2)

... yeah.
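
For reference, the 2199GB figure quoted above is what a 32-bit block count
works out to: READ CAPACITY(10) reports the last block address in a 32-bit
field, and a device too large for that field returns the saturated value
0xFFFFFFFF to signal that READ CAPACITY(16) is needed. The sketch below only
does that arithmetic. The 512-byte block size and the helper names are
assumptions for illustration; nothing here is taken from the kernel or the
lpfc driver.

```python
# Why a >2TB LUN can show up as "2199GB" when only a 32-bit block count is usable.
# Assumption: 512-byte logical blocks.

BLOCK_SIZE = 512           # bytes per logical block (assumed)
MAX_LBA_10 = 0xFFFFFFFF    # largest value the 32-bit LBA field of READ CAPACITY(10) can hold


def capacity_bytes(last_lba: int, block_size: int = BLOCK_SIZE) -> int:
    """Capacity implied by a READ CAPACITY reply: (last LBA + 1) blocks."""
    return (last_lba + 1) * block_size


def needs_read_capacity_16(last_lba_10: int) -> bool:
    """A saturated 32-bit LBA means the device is too big for the 10-byte command."""
    return last_lba_10 == MAX_LBA_10


ceiling = capacity_bytes(MAX_LBA_10)

if __name__ == "__main__":
    print(ceiling)              # bytes
    print(ceiling // 10**9)     # decimal gigabytes
    print(ceiling / 2**40)      # binary terabytes (TiB)
```

With these assumptions the ceiling is 2^32 * 512 = 2,199,023,255,552 bytes,
i.e. 2199 decimal GB, which matches the size reported for the 3TB LUN once
the 16-byte command has failed.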
> I would like to know if anyone on the list has any pointers that can
> help us solve the issue.

Please run the parted command as indicated. Let's see what the partition
table thinks it is.

Do you have data on that partition? Can you remake the label on that
device with a new version of parted?

> Regards
> Anand Vaidya

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615

From nixon at nsc.liu.se Wed Oct 3 14:13:48 2007
From: nixon at nsc.liu.se (Leif Nixon)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: (Mark Hahn's message of "Wed, 3 Oct 2007 11:49:42 -0400 (EDT)")
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com>
Message-ID:

Mark Hahn writes:

> I think you should ask them whether the driver supports the 16-byte
> read-capacity. and you should ask the target provider (hitachi) whether
> they support >2TB luns and whether they implement 16-byte commands.

As far as I know, neither the Hitachi AMS series nor the Hitachi NSC
series support LUNs > 2TB.
--
Leif Nixon - Systems expert
------------------------------------------------------------
National Supercomputer Centre - Linkoping University
------------------------------------------------------------

From xingqiuyuan at gmail.com Wed Oct 3 09:00:12 2007
From: xingqiuyuan at gmail.com (xingqiu yuan)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
Message-ID:

Dear All

We are going to buy a small Beowulf cluster with 32 multicore
processors (Intel or AMD multicore processors).
We don't care about the network, but we need big storage disks.
Can anybody give us some suggestions about the hardware and the price?

best regards

Xingqiu Yuan

From merc4krugger at gmail.com Wed Oct 3 09:59:18 2007
From: merc4krugger at gmail.com (Krugger)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: <4703C2F7.2010100@sanger.ac.uk>
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com> <4703BD51.9020001@sanger.ac.uk> <4703BFB8.1020606@berkeley.edu> <4703C2F7.2010100@sanger.ac.uk>
Message-ID:

Hi,

It usually takes some more time to get stable storage with over 2 TB
partitions. In my personal experience we had to update the SAN
firmware; after that we figured out that the SCSI board that connected
to the SAN hardware was also bad, and we had to have it replaced. Also
remember to enable large block device support. But after this the
storage has been stable with over 2TB partitions.

Krugger

> You are right; luns over 2TB are not any more risky than luns under 2TB. If they
> work "out of the box" on your particular setup, great.
>
> However, if they don't work then you are into a world of pain working out
> whether it is your storage controller, SAN fabric, HBA, multipath
> implementation, kernel, partition table or filesystem which is broken.
>
> It gets especially bad in SAN environments where you have multiple
> kernel/storage combinations. Keeping luns < 2TB means that there is a reasonable
> expectation that things will keep on working when you change the storage
> presentation.
>
> Cheers,
>
> Guy

From zahir.tari at rmit.edu.au Wed Oct 3 19:18:44 2007
From: zahir.tari at rmit.edu.au (Zahir Tari)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] OTM 07 Call for Participation
Message-ID: <8EE2E0AC-69D5-4F27-9E9F-CE1E8F62CE47@rmit.edu.au>

Dear all,

This is to let you know that the selection process for the OTM 2007
Federated event has just finished. This event will be held in Algarve
(Portugal) during the period of 25-30 November 2007, and it involves
the following major conferences:

- International Symposium on Distributed Objects and Applications (DOA'07)
- International Conference on Cooperative Information Systems (CoopIS'07)
- International Symposium on Grid computing, high-performAnce and
  Distributed Applications (GADA'07)
- International Conference on Ontologies, Databases and Applications of
  Semantics (ODBASE'07)
- International Symposium on Information Security (IS'07)

We are proud to announce the five exceptional keynote speakers for
the OTM 2007 event:

- Donald Ferguson (Microsoft) on "The Internet Service Bus"
- Mark Little (Red Hat) on "Transaction Processing in a Service Oriented
  Architecture"
- York Sure (SAP) on "Towards the next Generation Value Networks"
- Dennis Gannon (Indiana University) on "A Service Architecture for
  eScience Grid Gateways"
- Whitfield Diffie (Sun Microsystems) on "Cryptography: Past, Present
  and Future"

Details about the list of accepted papers can be found at
http://www.cs.rmit.edu.au/fedconf/index.html?page=accepted

In addition to the
conferences, OTM 2007 is also running exceptional workshops covering
various topics (e.g. agents in web services, privacy, context-aware
mobile services...).

Hope to see you at OTM 2007 in Portugal! Please e-mail
fedconf@cs.rmit.edu.au for any further information.

Regards,

Zahir [Tari] & Robert [Meersman]
-- OTM 2007 General Co-Chairs --
http://www.cs.rmit.edu.au/fedconf/

From hahn at mcmaster.ca Thu Oct 4 07:11:27 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To:
References:
Message-ID:

> We are going to buy a small Beowulf cluster with 32 multicore
> processors (Intel or AMD multicore processors).

does "multi" mean 2 or 4 to you? also, are you talking dual-socket
or single (or quad)?

> We don't care about the network, but we need big storage disks.

that's even more poorly defined. how big is big, and does it need to be
raid, and if so what level of redundancy and hotpluggability? by big do
you mean capacity, or do you also mean sustained bandwidth (presumably
you don't mean to maximize random seeks, or do you?) what space footprint
is acceptable? are you sure you can't tolerate shared storage of any
sort, even if over a high-bandwidth media?

> Can anybody give us some suggestions about
> the hardware and the price?

no. your query could mean "1-socket 1U server with 2x 750G local"
(probably around $1500), or it could mean "4-socket quadcore with 128G ram
and 12 15K rpm SAS disks" (more like $50k).

if you're really taking the extreme of high disk-to-cpu ratio,
HP has a product which puts 14xSATA disks in 2U with a single socket
for about $11k (US list). sun's thumper is 48x750 in 4U, I think; I don't
know what kind of cpu it has locally.
From anandvaidya.ml at gmail.com Wed Oct 3 21:36:56 2007
From: anandvaidya.ml at gmail.com (Anand Vaidya)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux
In-Reply-To: <4703C773.2040102@scalableinformatics.com>
References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com> <4703C773.2040102@scalableinformatics.com>
Message-ID: <200710041236.56399.anandvaidya.ml@gmail.com>

My apologies for top posting.

Thanks to all the respondents. I have collated your comments and questions
into one response. Hope it is more useful...

[Joe Landman]: recommended using a newer version of parted

[My Reply]: We are seeing the error right when the driver loads, even before
Linux is fully booted up. So I guess the parted version has no role to play.

[Leif Nixon]: As far as I know, neither the Hitachi AMS series nor the Hitachi
NSC series support LUNs > 2TB.

[My Reply]: The Hitachi folks here claim any LUN size (< physical capacity)
is configurable, so 3TB should definitely be fine on the Hitachi storage side,
but cannot comment on OS / HBA drivers etc. I did some searching and found a
document online that stated 2TB is the max LUN size for some enterprise
storage, but I am still trying to ascertain whether that applies to Hitachi.
However, we have seen the HDS Storage config GUI where a 3TB LUN was created
and zoned correctly.

[Guy Coates]: Recommended staying with 2TB LUNs.

[My Reply]: While I agree that staying under 2TB and using LVM to stripe will
avoid a lot of issues, this sizing is needed by the end customer... but I
thought x86_64 Linux was 64-bit clean everywhere! (SCSI LBA at least?)

Moreover, I can earn karma points if I can find a bug or two in Linux or a
Linux driver :-) and contribute to the improvement of GNU/Linux. So while
there is a temporary solution, I intend to pursue the matter to completion.

[Mark Hahn]: Suggested checking whether driver supports 16-byte READ CAPACITY.

[My Reply]: I did check with Emulex.
They say the "lpfc" driver does use 16-byte read cap. They think the Linux
SCSI layer is the underlying cause. I am still working with their support
staff on gathering detailed info via the "System Grab Diag Tool".

They say "The error would occur immediately after the lpfc driver loads
because the target is then available (the Fibre Channel controller of the HDS
storage) and the SCSI mid-layer has queried the LUNs, that then generates the
error. The lpfc driver does not query the LUNs and so would never request a
capacity nor any other information. The SCSI mid-layer and higher levels
would perform this function."

One thing I am very happy about is the level of support offered by Emulex,
even though I have not purchased the product from them directly and cannot
even furnish an Emulex part no. for the HBA. I wish other companies were as
good at support.

Regards
Anand

On Thursday 04 October 2007 00:46:43 Joe Landman wrote:
> Hi Anand:
>
> Anand Vaidya wrote:
> > Dear Beowulfers,
> >
> > We ran into a problem with large disks which I suspect is fairly common,
> > however the usual solutions are not working. IBM, RedHat have not been
> > able to provide any useful answers so I am turning to this list for
> > help. (Emulex is still helping, but I am not sure how far they can go
> > without access to the hardware)
> >
> > Details:
> >
> > * Linux Cluster for Weather modelling
> >
> > * IBM Bladecenter blades and an IBM x3655 Opteron head node FC attached
> > to a Hitachi Tagmastore SAN storage, Emulex LightPulse FC HBA,
> > PCI-Express, Dual port
> >
> > * RHEL 4update5, x86_64 kernel 2.6.9-55 SMP and RHEL provided Emulex
> > driver (lpfc) and lpfcdfc also installed
>
> There is a problem in some parted versions (prior to 1.8.x) where they
> munge the partition table on gpt/large disks.
>
> Apart from suggesting a more modern kernel (2.6.22.6 or so), there may
> be other things you can do.
>
> > * GPT partition created with parted
>
> I presume this is 1.6.19 parted?
> > > There is one 2TB LUN, works fine. > > > > There is a 3TB LUN on the Hitachi SAN which is reported as "only" 2199GB > > ( 2.1TB) , > > Yup. Sounds like either the parted problem, or a driver issue. > > What does > > parted /dev/3TBlun/ print > > report where /dev/3TBlun is the device containing the 3TB lun? > > We had seen this behavior with some 1.6.9 and 1.7.x versions of parted. > The only way to fix it was to rebuild parted with 1.8.x > > > We noticed that, when the emulex driver loads, the following error > > message is reported: > > > > Emulex LightPulse Fibre Channel SCSI driver 8.0.16.32 > > > > Copyright(c) 2003-2007 Emulex. All rights reserved. > > ACPI: PCI Interrupt 0000:2d:00.0[A] -> GSI 18 (level, low) > > -> IRQ 185 > > PCI: Setting latency timer of device 0000:2d:00.0 to 64 > > lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: > > x2 x4 x1000 > > lpfc 0000:2d:00.0: 0:1305 Link Down Event x2 received Data: > > x2 x4 x1000 > > lpfc 0000:2d:00.0: 0:1303 Link Up Event x3 received Data: x3 > > x1 x10 x0 > > scsi5 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x on > > PCI bus 2d device 00 irq 185 port 0 > > Vendor: HITACHI Model: OPEN-V*3 Rev: 5007 > > Type: Direct-Access ANSI SCSI > > revision: 03 > > sdb : very big device. try to use READ CAPACITY(16). > > This is what our JackRabbit reports ... > > > sdb : READ CAPACITY(16) failed. > > This is not what our JackRabbit reports. > > [...] > > > The problem is with the READ CAPACITY(16) failed, but we are unable to > > find the source of this error. > > > > We conducted several experiments without success: > > > > - Tried compiling the latest driver from Emulex (8.0.16.32 > > ) - same error > > - Tried Knoppix (2.6.19) and Gentoo LiveCD (2.6.19 ) , and CentOS 4.4 > > - same error > > Sounds a great deal like parted. 
> > > - Tried to boot Belenix (Solaris 32 bit live), failed to boot completely > > (may be unrelated issue) > > > > We have a temporary workaround in place: We created 3x1TB disks and used > > LVM to create a striped 3TB volume with ext3 FS. This works fine. > > > > RedHat claims ext3 and RHEL4 supports disks upto 8TB and 16TB > > respectively (since RHEL4u2) > > ... yeah. > > > I would like to know if anyone on the list has any pointers that can > > help us solve the issue. > > Please run the parted command as indicated. Lets see what the partition > table thinks it is. > > Do you have data on that partition? Can you remake the label on that > device with a new version of parted? > > > Regards > > Anand Vaidya > > > > > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf From rickyfingers2000 at gmail.com Thu Oct 4 07:14:04 2007 From: rickyfingers2000 at gmail.com (John Hancock) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] no shared state, shared state with explicit locking, shared state without explicit locks Message-ID: <222fc29f0710040714h66579620r26ac383379d3f01d@mail.gmail.com> Just ran across this piece of advice on slashdot: "I'm not familiar with all of those libraries mentioned in the story, but I'll bet that those three classifications (no shared state, shared state with explicit locking, shared state without explicit locks) probably cover the models used by most if not all of them. If you understand the trade-offs in those, you can produce a sensible design, and then the toolkit or framework you use to code it up is mostly just an implementation detail." 
-Anonymous Brave Guy (457657)

A quick google and a search of the mail list archives did not yield
anything that looked like a good explanation of what the differences
are between libraries with no shared state, shared state with explicit
locking, and shared state without explicit locks. Is Anonymous Brave Guy
using unusual terminology for concepts that usually go by other names?

If anyone on the list would care to share his/her explanation/opinion
about what Anonymous Brave Guy wrote, I'd be most honored.

-j

From hahn at mcmaster.ca Thu Oct 4 13:02:39 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] no shared state, shared state with explicit locking, shared state without explicit locks
In-Reply-To: <222fc29f0710040714h66579620r26ac383379d3f01d@mail.gmail.com>
References: <222fc29f0710040714h66579620r26ac383379d3f01d@mail.gmail.com>
Message-ID:

this question is referring to a thread here:
http://developers.slashdot.org/article.pl?sid=07/10/03/0021253&from=rss

> but I'll bet that those three classifications (no shared state, shared
> state with explicit locking, shared state without explicit locks)

I think there's one big distinction: shared or not shared.
I figured the author was talking about functional/dataflow languages
which don't have shared state (including MPI, in a sense) versus
languages which assume shared memory such as OpenMP.

within the shared-memory approach, explicit locking is pretty clear.
I presume "without explicit locks" just means crude systems like
Java's monitored data types.

IMO, the plenitude of parallel packages mainly shows that we don't
have good answers yet...

From landman at scalableinformatics.com Thu Oct 4 15:56:44 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To:
References:
Message-ID: <47056FAC.5090405@scalableinformatics.com>

Mark Hahn wrote:

[...]
> if you're really taking the extreme of high disk-to-cpu ratio,
> HP has a product which puts 14xSATA disks in 2U with a single socket
> for about $11k (US list). sun's thumper is 48x750 in 4U, I think; I don't
> know what kind of cpu it has locally.

... FWIW our JackRabbit unit supports 48 x 1TB (1000GB) drives in 5U ...

A terabyte here, a terabyte there, and soon you are talking about real
storage ... :)

Sun uses Opteron 280's, which are dual core socket 940 processors. They
are limited to 4 cores and 16 GB ram. Again, FWIW, our JackRabbit can
hit 16 cores and 64 GB ram.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615

From hunting at ix.netcom.com Thu Oct 4 18:16:10 2007
From: hunting at ix.netcom.com (Michael Huntingdon)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To: <47056FAC.5090405@scalableinformatics.com>
References: <47056FAC.5090405@scalableinformatics.com>
Message-ID: <200710050116.l951GLU9023688@bluewest.scyld.com>

Actually Mark, I just did an HP DL320 (the system I believe you were
talking about) with 9TB (12x750 SATA) for a client for under $8K.
This includes $400 or so for three years of on site services, before
requesting any additional (special) discount. So your $11K estimate
is more than a bit high.

~m

At 03:56 PM 10/4/2007, Joe Landman wrote:
>Mark Hahn wrote:
>
>[...]
>
> > if you're really taking the extreme of high disk-to-cpu ratio,
> > HP has a product which puts 14xSATA disks in 2U with a single socket
> > for about $11k (US list). sun's thumper is 48x750 in 4U, I think; I don't
> > know what kind of cpu it has locally.
>
>... FWIW our JackRabbit unit supports 48 x 1TB (1000GB) drives in 5U ...
>
>A terabyte here, a terabyte there, and soon you are talking about real
>storage ...
:) > >Sun uses Opteron 280's, which are dual core socket 940 processors. They >are limited to 4 cores and 16 GB ram. Again, FWIW, our JackRabbit can >hit 16 cores and 64 GB ram. > >-- >Joseph Landman, Ph.D >Founder and CEO >Scalable Informatics LLC, >email: landman@scalableinformatics.com >web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com >phone: +1 734 786 8423 >fax : +1 866 888 3112 >cell : +1 734 612 4615 >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf ************************************************************************************************ Systems Performance Consultants Michael Huntingdon Higher Education Technology Office (408) 294-6811 1981 Randolph Dr. Cell (707) 478-0226 San Jose, CA 95128 fax (601) 510-3808 Web: <http://www.spcnet.com> hunting@ix.netcom.com ************************************************************************************************ From hahn at mcmaster.ca Thu Oct 4 20:39:35 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> Message-ID: > Actually Mark..I just did an HP DL320 (the system I believe you were talking > about) with 9TB (12x750 SATA) for a client for under $8K. This include $400 > or so for three years of on site services, before requesting any additional > (special) discount. So your $11K estimate is more than a bit high. well, it was 14 disks (it's a model that mounts two more disks in the back), and the price came straight off HP's website... (I may have included an IB card, too.) if anyone from HP is listening, a model with two quad-core-capable sockets would be a good move. 
think of it as the perfect map-reduce brick ;) regards, mark hahn. From lindahl at pbm.com Thu Oct 4 21:42:57 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> Message-ID: <20071005044257.GA22517@bx9.net> On Thu, Oct 04, 2007 at 11:39:35PM -0400, Mark Hahn wrote: > if anyone from HP is listening, a model with two quad-core-capable > sockets would be a good move. think of it as the perfect map-reduce brick > ;) Quad cores use the same socket as duals. It's just a modest bios difference. I suspect map-reduce bricks are headed towards single-socket quad-cores, but what do I know? ;-) -- greg From nixon at nsc.liu.se Thu Oct 4 23:51:07 2007 From: nixon at nsc.liu.se (Leif Nixon) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: (Mark Hahn's message of "Thu, 4 Oct 2007 23:39:35 -0400 (EDT)") References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> Message-ID: Mark Hahn writes: >> Actually Mark..I just did an HP DL320 (the system I believe you were >> talking about) with 9TB (12x750 SATA) for a client for under $8K. >> This include $400 or so for three years of on site services, before >> requesting any additional (special) discount. So your $11K estimate >> is more than a bit high. > > well, it was 14 disks (it's a model that mounts two more disks in the back), > and the price came straight off HP's website... (I may have included an IB > card, too.) > > if anyone from HP is listening, a model with two quad-core-capable > sockets would be a good move. think of it as the perfect map-reduce > brick ;) I'd prefer they put their effort into supplying a raid controller that can actually detect corrupt raid sets. (cf. 
the recent thread "Big storage")

--
Leif Nixon - Systems expert
------------------------------------------------------------
National Supercomputer Centre - Linkoping University
------------------------------------------------------------

From andrew at moonet.co.uk Fri Oct 5 01:30:40 2007
From: andrew at moonet.co.uk (andrew holway)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To:
References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA>
Message-ID:

I just did a quote for 10TB storage.

3Ware Escalade 9650SE + battery backup (16 port)
12 x Hitachi 1TB Enterprise SATAII in raid 6
2 x Intel Xeon Woodcrest 5160 - 3GHz
8 x 2048MB DDRII-667

All for £6,886.00

Which I realise would buy a small house in the USA at the moment :)

Ta

Andy

On 05/10/2007, Leif Nixon wrote:
> Mark Hahn writes:
>
> >> Actually Mark, I just did an HP DL320 (the system I believe you were
> >> talking about) with 9TB (12x750 SATA) for a client for under $8K.
> >> This includes $400 or so for three years of on site services, before
> >> requesting any additional (special) discount. So your $11K estimate
> >> is more than a bit high.
> >
> > well, it was 14 disks (it's a model that mounts two more disks in the back),
> > and the price came straight off HP's website... (I may have included an IB
> > card, too.)
> >
> > if anyone from HP is listening, a model with two quad-core-capable
> > sockets would be a good move. think of it as the perfect map-reduce
> > brick ;)
>
> I'd prefer they put their effort into supplying a raid controller that
> can actually detect corrupt raid sets. (cf.
the recent thread "Big storage")
>
> --
> Leif Nixon - Systems expert
> ------------------------------------------------------------
> National Supercomputer Centre - Linkoping University
> ------------------------------------------------------------

From andrew at moonet.co.uk Fri Oct 5 01:31:05 2007
From: andrew at moonet.co.uk (andrew holway)
Date: Mon Mar 15 01:06:31 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To:
References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA>
Message-ID:

Oh and this includes 3 years onsite

On 05/10/2007, andrew holway wrote:
> I just did a quote for 10TB storage.
>
> 3Ware Escalade 9650SE + battery backup (16 port)
> 12 x Hitachi 1TB Enterprise SATAII in raid 6
> 2 x Intel Xeon Woodcrest 5160 - 3GHz
> 8 x 2048MB DDRII-667
>
> All for £6,886.00
>
> Which I realise would buy a small house in the USA at the moment :)
>
> Ta
>
> Andy
>
> On 05/10/2007, Leif Nixon wrote:
> > Mark Hahn writes:
> >
> > >> Actually Mark..I just did an HP DL320 (the system I believe you were
> > >> talking about) with 9TB (12x750 SATA) for a client for under $8K.
> > >> This include $400 or so for three years of on site services, before
> > >> requesting any additional (special) discount. So your $11K estimate
> > >> is more than a bit high.
> > >
> > > well, it was 14 disks (it's a model that mounts two more disks in the back),
> > > and the price came straight off HP's website... (I may have included an IB
> > > card, too.)
> > >
> > > if anyone from HP is listening, a model with two quad-core-capable
> > > sockets would be a good move.
think of it as the perfect map-reduce > > > brick ;) > > > > I'd prefer they put their effort into supplying a raid controller that > > can actually detect corrupt raid sets. (cf. the recent thread "Big storage") > > > > -- > > Leif Nixon - Systems expert > > ------------------------------------------------------------ > > National Supercomputer Centre - Linkoping University > > ------------------------------------------------------------ > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > From hahn at mcmaster.ca Fri Oct 5 07:52:50 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> Message-ID: > I'd prefer they put their effort into supplying a raid controller that > can actually detect corrupt raid sets. (cf. the recent thread "Big storage") not me. raid is too important to be trusted to hardware - have you tried MD's check/scrub features? though afaik it doesn't have a way to switch on verification during normal reads (or, for that matter, verify-after-write.) in any case, I suspect that the trend is away from block-level raid... regards, mark hahn. From geoff at galitz.org Fri Oct 5 08:07:49 2007 From: geoff at galitz.org (Geoff Galitz) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: Message-ID: <200710051508.l95F7xvL030704@b.mail.sonic.net> Why do you automatically distrust hardware raid? -geoff -----Original Message----- not me. raid is too important to be trusted to hardware - have you tried MD's check/scrub features? though afaik it doesn't have a way to switch on verification during normal reads (or, for that matter, verify-after-write.) 
in any case, I suspect that the trend is away from block-level raid...

From hahn at mcmaster.ca Fri Oct 5 09:08:25 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <200710051508.l95F7xvL030704@b.mail.sonic.net> References: <200710051508.l95F7xvL030704@b.mail.sonic.net> Message-ID:

> Why do you automatically distrust hardware raid?

it's that I trust software raid more. upgradability, ability to audit source code, portability to other machines, etc. there are certainly still some cases where the host cpu is scarce enough to want to avoid the processing overhead. and in principle, I believe HW raid _could_ be faster if the PCI* bus (or memory) is the bottleneck.

let me put it this way: do you know what goes on in your HW raid's firmware? personally, I like sausage, but often avoid eating it because I hate to notice the bits of unidentified clearly-not-meat foo in it.

imagine trying to verify that a HW raid controller actually does perform online parity verification. it would involve moving disks out of the array to be carefully corrupted on another machine before being returned to test. with SW raid, I can audit the code, even add counters and diagnostics, stop the raid, corrupt a parity block, and start it again to verify it all works...

mostly though, it seems clear that a lot of the hardening aspects of ZFS are increasingly important. it's also a "doh!" kind of realization that raid should be done on a per-file basis. (consider the benefits of writing as raid1, which later transparently transforms to raid5 or some form of FEC. also consider that it's rare for user-level codes to choose the right blocksize and alignment to permit full-stripe raid5 writes, but easy to arrange if you're doing it in the filesystem.)

regards, mark hahn.

> -----Original Message----- > > > not me. raid is too important to be trusted to hardware - have you > tried MD's check/scrub features?
though afaik it doesn't have a way > to switch on verification during normal reads (or, for that matter, > verify-after-write.) > > in any case, I suspect that the trend is away from block-level raid...

From buccaneer at rocketmail.com Fri Oct 5 09:12:40 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <47056FAC.5090405@scalableinformatics.com> Message-ID: <391984.76010.qm@web30612.mail.mud.yahoo.com>

Please see below for my comments

--- Joe Landman wrote: > Mark Hahn wrote: > > [...] > A terabyte here, a terabyte there, and soon you are talking about real > storage ... :) > > Sun uses Opteron 280's, which are dual core socket 940 processors. They > are limited to 4 cores and 16 GB ram. Again, FWIW, our JackRabbit can > hit 16 cores and 64 GB ram.

We tested a 24TB x4500 using the same methodology we use to test all of our storage and we were pleasantly surprised.

From xingqiuyuan at gmail.com Thu Oct 4 16:48:53 2007 From: xingqiuyuan at gmail.com (xingqiu yuan) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <47056FAC.5090405@scalableinformatics.com> References: <47056FAC.5090405@scalableinformatics.com> Message-ID:

Indeed, we understand that our question is too general, and it is really very hard to say what the price is. But we really don't have any good idea about which type of cluster structure is suitable for us. We work on computational plasma physics, basically doing some magnetohydrodynamic (MHD) and particle simulations for fusion and space plasmas.
Most of our codes were written in C++ and Fortran using MPI, so from this point of view they can run on any type of supercomputer, but the codes use a lot of memory and output huge amounts of data. We want to buy a small cluster to fill the gap between our lab and the supercomputer center.

On 10/5/07, Joe Landman wrote: > > Mark Hahn wrote: > > [...] > > > if you're really taking the extreme of high disk-to-cpu ratio, > > HP has a product which puts 14xSATA disks in 2U with a single socket > > for about $11k (US list). sun's thumper is 48x750 in 4U, I think; I don't > > know what kind of cpu it has locally. > > ... FWIW our JackRabbit unit supports 48 x 1TB (1000GB) drives in 5U ... > > A terabyte here, a terabyte there, and soon you are talking about real > storage ... :) > > Sun uses Opteron 280's, which are dual core socket 940 processors. They > are limited to 4 cores and 16 GB ram. Again, FWIW, our JackRabbit can > hit 16 cores and 64 GB ram. > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 > fax : +1 866 888 3112 > cell : +1 734 612 4615 >

From jdvw at tticluster.com Fri Oct 5 04:12:07 2007 From: jdvw at tticluster.com (John Van Workum) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: References: <47056FAC.5090405@scalableinformatics.com> <200710050116.l951GBpq011916@zafron7.UTS.McMaster.CA> Message-ID:

We are pricing out a modest storage system too. We are looking at using Areca (http://www.areca.com.tw/products/pcie341.htm). Their new Intel 800MHz IOP341 I/O processor based RAID cards with 2GB cache look impressive. Anyone have experience with these?
Seems 3Ware are more popular on this list.

Regards, John

On 10/5/07, andrew holway wrote: > > I just did a quote for 10TB storage. > > 3Ware Escalade 9650SE + battery backup (16 port) > 12 x Hitachi 1TB Enterprise SATAII in raid 6 > 2 x Intel Xeon Woodcrest 5160 - 3GHz > 8 x 2048MB DDRII-667 > > All for £6,886.00 > > Which I realise would buy a small house in the USA at the moment :) > > Ta > > Andy > > > > On 05/10/2007, Leif Nixon wrote: > > Mark Hahn writes: > > > > >> Actually Mark..I just did an HP DL320 (the system I believe you were > > >> talking about) with 9TB (12x750 SATA) for a client for under $8K. > > >> This include $400 or so for three years of on site services, before > > >> requesting any additional (special) discount. So your $11K estimate > > >> is more than a bit high. > > > > > > well, it was 14 disks (it's a model that mounts two more disks in the back), > > > and the price came straight off HP's website... (I may have included an IB > > > card, too.) > > > > > > if anyone from HP is listening, a model with two quad-core-capable > > > sockets would be a good move. think of it as the perfect map-reduce > > > brick ;) > > > > I'd prefer they put their effort into supplying a raid controller that > > can actually detect corrupt raid sets. (cf. the recent thread "Big storage") > > > > -- > > Leif Nixon - Systems expert > > ------------------------------------------------------------ > > National Supercomputer Centre - Linkoping University > > ------------------------------------------------------------

From becker at scyld.com Fri Oct 5 11:42:19 2007 From: becker at scyld.com (Donald Becker) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement and sponsorship opportunity Message-ID:

Ninth Annual Beowulf Bash

We are putting together our annual Beowulf Bash. It will take place at its traditional place and time: during the week of the IEEE Supercomputing Conference. This year SC07 is in Reno NV during the week of Nov 10-15.

We are trying to keep a casual event with broad sponsorship. The attraction is the other attendees, rather than high-end food or entertainment. Read that as "beer, snacks, and people you always wanted to talk to."

We are tentatively planning the party for Tuesday or Wednesday evening 6-8pm in the downtown Reno area. Probably Third Street Blues (formerly The Blue Note) or a similar low-key bar. (BTW, if any reader knows of other vendor events that we should avoid, please let me know.)

If your company (or even you as an individual) would like to help sponsor the event, please contact me, becker@beowulf.org

Sponsors
- get their name up in lights (Errrmm, or maybe just vinyl signs... you bring a sign, there will be some ambient light.)
- are part of the brief greeting in the middle of the party.
- have the opportunity for technical, hands-on demos at the bash
- have their logos on the beowulf.org site through the end of SC07, and on the 2007 yearbook page after the event.

-- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA

From becker at scyld.com Fri Oct 5 12:26:22 2007 From: becker at scyld.com (Donald Becker) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] BWBUG / Beowulf Users Group meeting Oct 9 2007 at Georgetown Message-ID:

Baltimore Washington Beowulf User Group Meeting Date: 9 Oct 2007 at 2:30 pm - 5:00pm.
Location: Georgetown University at Whitehaven Street 3300 Whitehaven Street, Washington DC 20007 Speaker: Donald Becker, CTO of Scyld Software and Penguin Computing Host: Michael Fitzmaurice Here is the announcement from Mike: We are pleased to announce that Donald Becker the founder of the Beowulf Project at NASA, the founder of Scyld the world leader in Cluster Management Software and the CTO of Penguin Computing will be our special guest speaker. October 9th, 2007 at Georgetown University from 2:30 to 5:00 PM. Please join us for a very informative talk on High Performance Computing. Meeting Location: 3300 Whitehaven Street, Washington DC 20007. (This is NOT at the main Georgetown U campus. This is an off campus building one block from Wisconsin Avenue). If you have suggestions regarding speakers for the next meeting or meetings in the future please send them to beowulfUG@aol.com -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From nixon at nsc.liu.se Fri Oct 5 13:17:59 2007 From: nixon at nsc.liu.se (Leif Nixon) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <200710051508.l95F7xvL030704@b.mail.sonic.net> (Geoff Galitz's message of "Fri, 5 Oct 2007 17:07:49 +0200") References: <200710051508.l95F7xvL030704@b.mail.sonic.net> Message-ID: "Geoff Galitz" writes: > Why do you automatically distrust hardware raid? To some extent I share Mark's sentiment. I certainly trust the Linux kernel more than the firmware in a cheap raid controller. 
-- Leif Nixon - Systems expert ------------------------------------------------------------ National Supercomputer Centre - Linkoping University ------------------------------------------------------------

From diep at xs4all.nl Sat Oct 6 15:24:09 2007 From: diep at xs4all.nl (Vincent Diepeveen) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement References: Message-ID: <001001c80867$a1f48730$0900a8c0@objection>

Any chance of us in Europe seeing something of this online in a live webcam broadcast?

Thanks, Vincent

----- Original Message ----- From: "Donald Becker" To: "Beowulf Mailing List" ; "Beowulf Announcement mailing list" Sent: Friday, October 05, 2007 8:42 PM Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement and sponsorship opportunity
> [...]

From john.leidel at gmail.com Sat Oct 6 15:55:08 2007 From: john.leidel at gmail.com (John Leidel) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement In-Reply-To: <001001c80867$a1f48730$0900a8c0@objection> References: <001001c80867$a1f48730$0900a8c0@objection> Message-ID: <1191711308.697.75.camel@e521.site>

Don't know if they have an internet connection at the bar... doesn't look like many wifi hotspots within a reasonable distance :

http://www.wi-fihotspotlist.com/search.php?street=125+W+3rd+St&city=Reno&state=NV&zip=89501&country=US&network=&proximity=1&submit.x=100&submit.y=12&submit=Show+HotSpots

On Sun, 2007-10-07 at 00:24 +0200, Vincent Diepeveen wrote: > Any chance of us in Europe seeing something of this online in a live webcam > broadcast? > > Thanks, > Vincent
> [...]

From James.P.Lux at jpl.nasa.gov Sat Oct 6 17:20:10 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement In-Reply-To: <1191711308.697.75.camel@e521.site> References: <001001c80867$a1f48730$0900a8c0@objection> <1191711308.697.75.camel@e521.site> Message-ID: <6.2.3.4.2.20071006170958.03118da0@mail.jpl.nasa.gov>

At 03:55 PM 10/6/2007, John Leidel wrote: >Don't know if they have an internet connection at the bar... doesn't >look like many wifi hotspots within a reasonable distance : > >http://www.wi-fihotspotlist.com/search.php?street=125+W+3rd+St&city=Reno&state=NV&zip=89501&country=US&network=&proximity=1&submit.x=100&submit.y=12&submit=Show+HotSpots

Not to mention that most restaurants and bars, or businesses for that matter, do not allow any sort of photography or video from their premises. The reasons are varied, and not always legitimate, but anything other than an obvious "taking a picture of my friends at the table" will typically result in the manager asking you to stop (or worse).

see, e.g., http://lessig.org/blog/2003/05/dear_starbucks_say_it_aint_tru.html

>On Sun, 2007-10-07 at 00:24 +0200, Vincent Diepeveen wrote: > > Any chance of us in Europe seeing something of this online in a live webcam > > broadcast? > > > > Thanks, > > Vincent > > ----- Original Message ----- > > From: "Donald Becker" > > To: "Beowulf Mailing List" ; "Beowulf Announcement mailing list" > > Sent: Friday, October 05, 2007 8:42 PM > > Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement and sponsorship opportunity

From gerry.creager at tamu.edu Sat Oct 6 20:07:06 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 9th Annual Beowulf Bash: Announcement In-Reply-To: <1191711308.697.75.camel@e521.site> References: <001001c80867$a1f48730$0900a8c0@objection> <1191711308.697.75.camel@e521.site> Message-ID: <47084D5A.1030702@tamu.edu>

I can bring my Sprint 3G card...

John Leidel wrote: > Don't know if they have an internet connection at the bar... doesn't > look like many wifi hotspots within a reasonable distance : > > On Sun, 2007-10-07 at 00:24 +0200, Vincent Diepeveen wrote: >> Any chance of us in Europe seeing something of this online in a live webcam >> broadcast? >> >> Thanks, >> Vincent
>> [...]

-- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From wrankin at ee.duke.edu Sun Oct 7 08:47:59 2007 From: wrankin at ee.duke.edu (Bill Rankin) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: References: <200710051508.l95F7xvL030704@b.mail.sonic.net> Message-ID: <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu>

On Oct 5, 2007, at 4:17 PM, Leif Nixon wrote: > "Geoff Galitz" writes: > >> Why do you automatically distrust hardware raid? > > To some extent I share Mark's sentiment. I certainly trust the > Linux kernel more than the firmware in a cheap raid controller.

Let me offer up a somewhat concrete example of a problem with hardware raid.

A local group around here kept some Very Important Data on a hardware raid array.
Due to several factors, a backup was not made of certain data. The device lost a drive and started an automagic rebuild on one of the hot spares. The sudden beating that the other drives took (because of the rebuild) caused a second hard drive to fail (always a concern with RAID5).

Since the data was not fully backed up, the drives were sent out for a Very Expensive Recovery. Most of the data was recovered but once the drives were reinstalled in the enclosure, the hardware raid could not be made to understand that all the drives were now okay. It essentially got itself into an unrecoverable state that could not be changed by us mere mortals (since data formats and such on hardware raid tend to be proprietary). So the entire array had to be sent out for another Even More Expensive Recovery to get the data back.

Now while this is kind of a "perfect storm" in terms of hardware and data failure, it does illustrate the extent of control that you give up when going with a hardware raid solution. I think that the higher end vendors (i.e. NetApp, EMC, et al) have their reliability up to the point where this is much less of a risk. But for the low-end beer budget cluster, software raid is probably still the way to go. As for the "mid-tier" vendors, I would be very cautious and pay close attention to the worst case data loss scenario.

Good luck, -bill

From joelja at bogus.com Sat Oct 6 12:09:17 2007 From: joelja at bogus.com (Joel Jaeggli) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux In-Reply-To: <4703BD51.9020001@sanger.ac.uk> References: <6f99ad660710030429h2dd12e4eicb2f999d2bb0531c@mail.gmail.com> <4703BD51.9020001@sanger.ac.uk> Message-ID: <4707DD5D.5080300@bogus.com>

Guy Coates wrote: > Luns over 2TB are a bad idea. There are just too many reasons why they might not > work, and trying to track down the right one is a pain. > > Your workaround to use the LVM to stripe 3x1TB luns together is the way to go.
> (You really want to use LVM anyhow, as trying to do SAN storage without it is a > pain.) > > You've covered the usual Gotchas with 2TB luns (dos partition tables and 32 bit > kernels), so I'd suspect your storage array. Check with Hitachi that the array > supports creating luns over 2TB. Many don't, and for maximum confusion, some > allow you to create >2TB luns, even though you can't subsequently use them.

I have at least two disk arrays that will assemble units greater than 2TB but have to export them as multiple luns... Some of these controller designs are quite old despite being in newish (fibre-channel-sata) enclosures. This all seems to stem internally from their using 32 bit signed integers for block count. (the more things change the more they stay the same)

My approach (we designed it this way from the outset) was to assemble 2TB raid-6 stripes into a single 12TB volume using LVM (this was back in the era of 500GB drives).

> Cheers, > > Guy >

From jmdavis1 at vcu.edu Sun Oct 7 09:54:42 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu> References: <200710051508.l95F7xvL030704@b.mail.sonic.net> <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu> Message-ID: <47090F52.7040402@vcu.edu>

And what would happen if 2 drives died on a software RAID5? The problem with the example is that it could happen whether one uses software or hardware RAID. The real issue is that important data was stored and not backed up. Bad things happen when you have a bad storage strategy.

I have run HW RAID for well over a decade. I've used units from manufacturers and integrators. I've had HW from Apple, EMC, Sun, DEC, IBM as well as from small shops like Partners Data. I've also run SW RAID primarily for less critical data. Regardless of which method I choose, making sure that there are regular reliable backups is important.
Controllers can have problems, but so can software.

Mike Davis

Bill Rankin wrote: > > On Oct 5, 2007, at 4:17 PM, Leif Nixon wrote: > >> "Geoff Galitz" writes: >> >>> Why do you automatically distrust hardware raid? >> >> >> To some extent I share Mark's sentiment. I certainly trust the >> Linux kernel more than the firmware in a cheap raid controller. > > > Let me offer up a somewhat concrete example of a problem with > hardware raid. > > A local group around here kept some Very Important Data on a hardware > raid array. Due to several factors, a backup was not made of certain > data. The device lost a drive and started an automagic rebuild on > one of the hot spares. The sudden beating that the other drives took > (because of the rebuild) caused a second hard drive to fail (always a > concern with RAID5). > > Since the data was not fully backed up, the drives were sent out for > a Very Expensive Recovery. Most of the data was recovered but once > the drives were reinstalled in the enclosure, the hardware raid could > not be made to understand that all the drives were now okay. It > essentially got itself into an unrecoverable state that could not be > changed by us mere mortals (since data formats and such on hardware > raid tend to be proprietary). So the entire array had to be sent out > for another Even More Expensive Recovery to get the data back. > > Now while this is kind of a "perfect storm" in terms of hardware and > data failure, it does illustrate the extent of control that you give > up when going with a hardware raid solution. I think that the higher > end vendors (i.e. NetApp, EMC, et al) have their reliability up to the > point where this is much less of a risk. But for the low-end beer > budget cluster, software raid is probably still the way to go. As > for the "mid-tier" vendors, I would be very cautious and pay close > attention to the worst case data loss scenario.
> > Good luck, > > -bill

From eugen at leitl.org Sun Oct 7 10:17:37 2007 From: eugen at leitl.org (Eugen Leitl) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <47090F52.7040402@vcu.edu> References: <200710051508.l95F7xvL030704@b.mail.sonic.net> <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu> <47090F52.7040402@vcu.edu> Message-ID: <20071007171737.GW4005@leitl.org>

On Sun, Oct 07, 2007 at 12:54:42PM -0400, Mike Davis wrote: > Controllers can have problems, but so can software.

The point is that you have to keep a hardware spare with hardware RAID. This can make things much more expensive. Also, software RAID is typically free, while a hardware RAID of similar performance costs several k$, and you have to double that if you have to keep a hardware spare.

A minus point is that software RAID can write out garbage from bit-flipped RAM when the power goes out. This is more of a problem with some file systems (xfs?) than with others. Not a problem with UPS, though.

-- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From bill at cse.ucdavis.edu Sun Oct 7 12:54:28 2007 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <200710051508.l95F7xvL030704@b.mail.sonic.net> References: <200710051508.l95F7xvL030704@b.mail.sonic.net> Message-ID: <47093974.6090204@cse.ucdavis.edu>

Geoff Galitz wrote: > > > Why do you automatically distrust hardware raid?

Because they are low volume parts designed to handle failure modes in very complicated environments.
If you buy a hardware RAID card you very well could have the only one on the planet with that exact config. Variables include raid controller, hardware revision of the controller, which drives you have (and their revision), the motherboard (and its BIOS version), etc. So when a drive fails in a strange way you might very well have a problem that nobody else on the planet has had. Additionally you have to gain expertise in the particular details, quirks, and bugs of that RAID controller.

The higher end RAID setups of course do not let you pick your own drives and some even change the BIOS of the drives so they can guarantee that they have tested the various failure modes. Of course the even higher end models put the disks in their own enclosure so they can control 100% of the environment of the drives including nasty little details like power quality, airflow/temp, controller, vibration, etc.

I've seen a significant number of quirks in 3ware, storage works, areca, and dell perc (lsi logic?) controllers. The related forums discuss the numerous landmines related to their huge variety of options. In one particular case I bought a 3ware 6800 (the then current high end 3ware) which was advertised as supporting RAID-5 and ended up losing a filesystem. I called support, they said upgrade the firmware. Which I did, and lost another filesystem. I called back, they said oh try a newer driver... which I did, and lost another filesystem. They then gave a nervous laugh and said "Yeah, they do that, we recommend you buy the new 7xxx series, the 6800 wasn't really intended to run raid-5." Software RAID worked fine.

Linux software RAID on the other hand is popular, free, robust, and has likely already encountered any strange and wacky behavior from your motherboard, revision of disks, brokenness from hardware. There's likely 1000 times as many software RAIDs out in production as there are any particular RAID card, RAID firmware, RAID driver, disk hardware, and disk firmware.
Additionally you have to buy TWO hardware raids, often you end up with significantly less performance, and often the following questions are rather hard to answer:
* Can I migrate a RAID to another machine?
* Can I split disks so that different partitions can be in different RAIDs?
* Can I be emailed when the RAID changes state?
* Can I migrate the RAID to larger disks gradually (i.e. 2 250GB disks to 2 500GB disks without having 4 slots/ports)?
* Can I control RAID rebuild speed?
* Enable ECC scrubbing on my schedule?
* Can I migrate the RAID to completely different hardware to debug if it's a RAID controller issue?
* Can I grow/shrink raid volumes as well as the disk used per drive?

Sure, they can be answered, but frankly it takes more time than I'm willing to invest in the flavor of the month card, firmware, and linux kernel driver. Especially since it's a small market there seem to be dramatic differences in price/performance among those trying to gain market share. Way back when it was adaptec, then 3ware was an upstart, then areca, and now it seems like adaptec is making a big push with their newer 16 port ish adaptec controllers.

I've found linux software raid almost always faster than hardware RAID, much more reliable, and pleasingly consistent. Uptimes on busy servers with UPS are often over 500 days, even back when the linux uptime counter still rolled.

During a disaster I'd much rather debug and troubleshoot a software RAID than try to find one of the few experts in the world on some particular hardware configuration.
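
[Editorial sketch, not part of the original message.] To make the "can I be told when the RAID changes state?" question concrete for Linux software RAID: the md driver reports array state in /proc/mdstat, where a status such as [4/3] [_UUU] means four devices expected and three active. The Python below is only a minimal sketch under stated assumptions: the sample text is made up for illustration rather than captured from a real machine, and the function name degraded_arrays is hypothetical. On a real system mdadm's monitor mode can send the mail for you; this just shows how little parsing is involved.

```python
import re

# Illustrative /proc/mdstat contents (an assumption, not output from a real host).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid5 sdd1[3] sdc1[2] sdb2[1] sda2[0](F)
      2929890816 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]

unused devices: <none>
"""

# "md0 : active raid1 ..." starts the description of one array.
ARRAY_LINE = re.compile(r"^(md\d+) : (\S+)")
# "[4/3] [_UUU]" means 4 devices expected, 3 currently active.
STATUS = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return (array, active, expected) for every array missing a device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded.append((current, active, expected))
            current = None  # status line consumed; wait for the next array
    return degraded


if __name__ == "__main__":
    for name, active, expected in degraded_arrays(SAMPLE_MDSTAT):
        print(f"{name}: {active} of {expected} devices active")
```

On a live machine one would read the text from /proc/mdstat instead of the sample string and run the check from cron; the rebuild-speed question in the list above is answered in a similar spirit by the kernel's /proc/sys/dev/raid/speed_limit_min and speed_limit_max settings.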
From landman at scalableinformatics.com Sun Oct 7 13:23:24 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu> References: <200710051508.l95F7xvL030704@b.mail.sonic.net> <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu> Message-ID: <4709403C.9090801@scalableinformatics.com> Bill Rankin wrote: > Let me offer up a somewhat concrete example of a problem with hardware > raid. > > A local group around here kept some Very Important Data on a hardware > raid array. Due to several factors, a backup was not made of certain > data. The device lost a drive and started an automagic rebuild on one Let me state the obvious here. And yes, I know I am likely "preaching to the choir" RAID is not a backup solution. Again, RAID is not a backup solution. If you run without a backup, we can pretty much guarantee that you are going to lose data at some point in time. Again, RAID is not a backup solution. I don't know if I mentioned it, but RAID is not a backup solution. Anyone who believes otherwise is begging for trouble. RAID is not a backup solution. Backing up your data is *ALWAYS* important, RAID or not. Even if it is just a mirror of the data. > of the hot spares. The sudden beating that the other drives took > (because of the rebuild) caused a second hard drive to fail (always a > concern with RAID5). [... anecdote elided ...] RAID is not a backup solution, anyone mistakenly using it as such *will* be burned. > Now while this is kind of a "perfect storm" in turns of hardware and > data failure, it does illustrate the extent of control that you give up > when going with a hardware raid solution. I think that the higher end Er... with all due respect, this wasn't a hardware issue. This was a policy issue. If your data is important, back it up. 
It doesn't matter if it is on a hardware or software RAID, you absolutely, positively must do a cost-benefit analysis of the value of the data and the time/effort/money it would cost to recover when (not if) something goes bump in the night.

RAID is not a backup solution. Not sure I mentioned this.

All hardware has failure modes. All software has bugs. Your choice is which set of problems is easier to deal with. We have seen crappy hardware, and abominable software. Bugs in the linux kernel (no, there couldn't be any, nah... impossible ...) could just as easily wreck your day as a misguided firmware/hardware bug.

Backups are a risk mitigation strategy. If you have important data, you need to back it up. Moreover, I argue that you need multiple modalities of backup/restore. Call this 20+ years of experience in losing data and thinking (naively) that the backup that I have will actually restore... properly.

> vendors (ie. NetApp, EMC, et al) have their reliability up to the point
> where this is much less of a risk. But for the low-end beer budget

Er... ah... ok.

All of them have similar issues. I occasionally hear how vendor X's (make the appropriate substitution for X) item, such as a network card, or disk drive is *obviously* much better than what is available in the mass market, which is why they charge so much more for it.

The last time a customer noted that about one of the above named vendors (network card as it turned out), I asked them to pull back the label on the card and see what was underneath it. Turns out it was a plain old mass market card with a (vendor X) label slapped on it.

I am sorry to report that for the vast majority of cases of which I am aware, they (the above named vendors X) use generally the same mass market stuff you and I do.

Don't mistake this, EMC, Netapp and others *do* offer value.
It just isn't in slapping a new label on something, charging 10x for it, and somehow convincing the people paying for it that it is magically special (that is, unless their label maker has some serious undocumented mojo in that label ...) Their value is in hyperactive support. > cluster, software raid is probably still the way to go. As for the > "mid-tier" vendors, I would be very cautious and pay close attention to > the worst case data lose scenario. What we tell all our customers (aside from RAID is not a backup solution) is that they want to minimize risk. Where is the risk? Well you can trace it out. There are many ways to mitigate risk, and reduce down time. RAIN is a great example. But you can build RAIN out of software RAID as easily as hardware RAID. Remember, all have bugs, your job is to figure out (or work with someone who does this for you) how to reduce the impact of potential bugs. RAID is not a backup, and if you run without one, well, ... > Good luck, ... yeah. > > -bill > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Sun Oct 7 15:10:30 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:31 2010 Subject: [Beowulf] [AMD64] Gentoo or Fedora In-Reply-To: <200709011646.37506.csamuel@vpac.org> References: <4d2c60b30708240810k5ecfec13t3edffec6e391e8f2@mail.gmail.com> <46D84850.4060106@sicortex.com> <20070831155440.6bd338d1@daggett> <200709011646.37506.csamuel@vpac.org> Message-ID: <20071007221030.GA5468@bx9.net> Sorry that this is a "late hit" on this topic, but every time someone mentions Gentoo, I 
have to count to 100,000 before I say anything.

From what I can tell, the dependency stuff in Gentoo mostly works. If you try to not update any packages unless they have a security issue, you will discover a steady trickle of things that don't work: packages that block themselves, for example. Or you (allegedly) have to update gcc/binutils/the kitchen sink in order to update tar. Or packages that won't rebuild because they really did need a newer gcc, but it wasn't in their dependencies. And so forth. Only by updating all the time will you stick with the herd, and then things mostly work.

At this very moment I'm having to emerge a ton of stuff on a server because openssl was updated for security reasons. Hm, elm doesn't compile anymore, I wonder if anyone will notice if I just delete it? This kind of thing just doesn't happen with Red Hat or SUSE.

The whole "you can build with compiler flags for your cpu so your system will be faster" thing doesn't ring true, either. First off, most computational clusters don't spend a ton of time executing code that's part of the system. They spend time executing your user code. Second, compiling the whole system with -O3 --ultra-fast-flags is just asking for bugs. Do you think that gcc is tested with -O3 --ultra-fast-flags, building itself and all of its test suites? Well, actually, no. Most gcc testing is done with the default flags that Red Hat and SUSE build their distros with. The rest is just luck. And with 100,000,000 lines of code, you'll find weird bugs popping up in the oddest places, if you look hard enough. And this isn't a special flaw of gcc, you won't find anyone anywhere on the planet building their system with any compiler with its super optimization flags turned on.

So: Benefit? None. Cost? Dependency nightmares, plus bug risk.

Friends don't let friends use Gentoo on clusters. Or production servers.
-- greg From lindahl at pbm.com Sun Oct 7 15:42:40 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] [AMD64] Gentoo or Fedora In-Reply-To: <20071007221030.GA5468@bx9.net> References: <4d2c60b30708240810k5ecfec13t3edffec6e391e8f2@mail.gmail.com> <46D84850.4060106@sicortex.com> <20070831155440.6bd338d1@daggett> <200709011646.37506.csamuel@vpac.org> <20071007221030.GA5468@bx9.net> Message-ID: <20071007224240.GA25696@bx9.net> On Sun, Oct 07, 2007 at 03:10:30PM -0700, Greg Lindahl wrote: > Hm, elm doesn't compile > anymore, I wonder if anyone will notice if I just delete it? Of course, my CEO noticed about 10 minutes later! I told him to use a real mailer, like mutt. ;-) -- greg From gerry.creager at tamu.edu Sun Oct 7 21:47:55 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] [AMD64] Gentoo or Fedora In-Reply-To: <20071007224240.GA25696@bx9.net> References: <4d2c60b30708240810k5ecfec13t3edffec6e391e8f2@mail.gmail.com> <46D84850.4060106@sicortex.com> <20070831155440.6bd338d1@daggett> <200709011646.37506.csamuel@vpac.org> <20071007221030.GA5468@bx9.net> <20071007224240.GA25696@bx9.net> Message-ID: <4709B67B.3030100@tamu.edu> Er, ah, Greg, don't hold back. How do you REALLY feel? Greg Lindahl wrote: > On Sun, Oct 07, 2007 at 03:10:30PM -0700, Greg Lindahl wrote: > >> Hm, elm doesn't compile >> anymore, I wonder if anyone will notice if I just delete it? > > Of course, my CEO noticed about 10 minutes later! > > I told him to use a real mailer, like mutt. 
;-) > > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From csamuel at vpac.org Sun Oct 7 22:24:50 2007 From: csamuel at vpac.org (Chris Samuel) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] Odd Infiniband scaling behaviour Message-ID: <200710081524.55374.csamuel@vpac.org> Hi fellow Beowulfers.. We're currently building an Opteron based IB cluster, and are seeing some rather peculiar behaviour that has had us puzzled for a while. If I take a CPU bound application, like NAMD, I can run an 8 CPU job on a single node and it pegs the CPUs at 100% (this is built using Charm++ configured as an MPI system and using MVAPICH 0.9.8p3 with the Portland Group Compilers). If I then run 2 x 4 CPU jobs of the *same* problem, they all run at 50% CPU. If I run 4 x 2 CPU jobs, again the same problem, they run at 25%.. ..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU! I then replicated the same problem with the example MPI cpi.c program, to rule out some odd behaviour in NAMD. What really surprised me was when testing CPI built using OpenMPI (which doesn't use IB on our system) the problem vanished and I could run 8 x 1 CPU jobs, each using 100%! So (at the moment) it looks like we're seeing some form of contention on the Infiniband adapter.. 
07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
        Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
        Flags: fast devsel, IRQ 19
        Memory at feb00000 (64-bit, non-prefetchable) [size=1M]
        Memory at fd800000 (64-bit, prefetchable) [size=8M]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Vital Product Data
        Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
        Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
        Capabilities: [60] Express Endpoint IRQ 0

We see this problem with the standard CentOS kernel, with the latest stable kernel (2.6.22.9) and with 2.6.23-rc9-git5 (which completely rips out and replaces the CPU scheduler with Ingo Molnar's CFS).

This is on a SuperMicro based system with AMD's Barcelona quad core CPU (1.9GHz), but I see the same behaviour (scaled down) on dual core Opterons too.

I've looked at what "modinfo ib_mthca" says are the tuneable options, but the few I've played with ("msi_x" and "tune_pci") haven't made any noticeable difference, sadly..

Has anyone else run into this or got any clues they could pass on, please?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
Url : http://www.scyld.com/pipermail/beowulf/attachments/20071008/2d47448a/attachment.bin

From geoff at galitz.org Mon Oct 8 00:38:41 2007
From: geoff at galitz.org (Geoff Galitz)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] 32 nodes cluster price
In-Reply-To: <53F2139C-1065-47BF-9CE0-B7EC147AA93C@ee.duke.edu>
Message-ID: <200710080738.l987cf9I012383@b.mail.sonic.net>

I would argue that the situation you describe is a result of that particular RAID adapter, or that the particular make and model is just inappropriate (no offense).

I have certainly seen lots of RAID arrays where multiple drives die at approx the same time, but I find that usually:
- multiple drives die at the same time from the same production batch, and my vendor replaces those drives with no questions asked
- the drives that died exceeded their production lifetime at approx the same time

I'm not sure how a software RAID solution would work around that.

It is clear that most of the folks on the list favor software raid, but I have worked with both hardware and software RAID and with different OS and hardware vendors. RAID setups are no different than any other component: they should be spec'd carefully for their intended target, including the disks.

I do favor hardware RAID, myself. I have never had any unexplained data corruption or unresolved performance or recovery issues on my watch. I tend to favor hardware RAID due to the fact I can rely on a level of conformity across an install base (if needed), more flexible admin tools, better support (I specifically choose adapters from trusted vendors) and lower admin overhead. I could certainly choose a cheaper hardware RAID adapter which would result in some of the problems I am trying to avoid... that is where doing the research comes in.

I am one of those guys who like to move as many operations closer to the hardware layer as possible.... but it does cost.
Having said that, I am also running software RAID in a medium scale environment now (Redhat Linux and FreeBSD) and it works just fine (along side our hardware RAID systems). I also observe that many vendors will fully populate a RAID-5 array and create no hot-spare (DELL) or only one hot-spare. I usually create two hot-spares in order to give myself the wiggle room to run a medium-large datacenter with a small staff. No need to rush in if a single disk craps out. That would also avoid any kind of "rebuild storm." The data component is so important that I have never had a problem recommending a cluster with such a configuration. Just my two cents. -geoff -------------------------------------- Let me offer up a somewhat concrete example of a problem with hardware raid. A local group around here kept some Very Important Data on a hardware raid array. Due to several factors, a backup was not made of certain data. The device lost a drive and started an automagic rebuild on one of the hot spares. The sudden beating that the other drives took (because of the rebuild) caused a second hard drive to fail (always a concern with RAID5). From wrankin at ee.duke.edu Mon Oct 8 05:05:22 2007 From: wrankin at ee.duke.edu (Bill Rankin) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] 32 nodes cluster price In-Reply-To: <200710080738.l987cf9I012383@b.mail.sonic.net> References: <200710080738.l987cf9I012383@b.mail.sonic.net> Message-ID: <478F1C96-9BAB-48B1-A0F3-DFC927ACE438@ee.duke.edu> On Oct 8, 2007, at 3:38 AM, Geoff Galitz wrote: > I would argue that the situation you describe is a result of that > particular RAID adapter or that particular make and model is just > inappropriate (no offense) None taken. I should have been clearer on the point I was trying to make. First the clarifications: I never meant to imply that RAID is a backup substitute. Treating it as such is foolish in a production environment for many obvious reasons. 
I mentioned the backup issue here (it was a failure of the existing backup system, not a standard policy) to explain why the group in question had to go to such lengths to restore the data that was on the disks.

The main point that I was trying to make was that the proprietary nature of the HW raid controller that they used made recovery from a double disk failure a much more lengthy and expensive process than it would have been with the software implementation (in this specific case).

For an inexpensive/small installation, I personally feel that software raid allows for better control and management of resources with a minimal (if any) performance hit.

That's all I really wanted to say.

-bill "Don't shoot me, I'm only the piano player" rankin

From wrankin at ee.duke.edu Mon Oct 8 05:08:27 2007
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] [AMD64] Gentoo or Fedora
In-Reply-To: <20071007224240.GA25696@bx9.net>
References: <4d2c60b30708240810k5ecfec13t3edffec6e391e8f2@mail.gmail.com> <46D84850.4060106@sicortex.com> <20070831155440.6bd338d1@daggett> <200709011646.37506.csamuel@vpac.org> <20071007221030.GA5468@bx9.net> <20071007224240.GA25696@bx9.net>
Message-ID:

On Oct 7, 2007, at 6:42 PM, Greg Lindahl wrote:

> On Sun, Oct 07, 2007 at 03:10:30PM -0700, Greg Lindahl wrote:
>
>> Hm, elm doesn't compile
>> anymore, I wonder if anyone will notice if I just delete it?
>
> Of course, my CEO noticed about 10 minutes later!
>
> I told him to use a real mailer, like mutt.
;-) cat /var/spool/userid | lpr ;-) -b From gerry.creager at tamu.edu Mon Oct 8 05:24:26 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] [AMD64] Gentoo or Fedora In-Reply-To: References: <4d2c60b30708240810k5ecfec13t3edffec6e391e8f2@mail.gmail.com> <46D84850.4060106@sicortex.com> <20070831155440.6bd338d1@daggett> <200709011646.37506.csamuel@vpac.org> <20071007221030.GA5468@bx9.net> <20071007224240.GA25696@bx9.net> Message-ID: <470A217A.2060808@tamu.edu> Let's see... what was the printer definition for that Centronics dot-matrix lump in the store room? Bill Rankin wrote: > > On Oct 7, 2007, at 6:42 PM, Greg Lindahl wrote: > >> On Sun, Oct 07, 2007 at 03:10:30PM -0700, Greg Lindahl wrote: >> >>> Hm, elm doesn't compile >>> anymore, I wonder if anyone will notice if I just delete it? >> >> Of course, my CEO noticed about 10 minutes later! >> >> I told him to use a real mailer, like mutt. ;-) > > cat /var/spool/userid | lpr > > ;-) > > -b > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From anajafi at mail.ipm.ir Mon Oct 8 05:51:08 2007 From: anajafi at mail.ipm.ir (Seyed Abouzar Najafi Shoshtari) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution Message-ID: <20071008124403.M2100@mail.ipm.ir> Dear beowulf experts, we are planning to build a beowulf cluster with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard, 18GB RAM). Which linux distribution would be ideal for our case? Thanks in advance for your help. 
Abouzar

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From gerry.creager at tamu.edu Mon Oct 8 06:23:23 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <20071008124403.M2100@mail.ipm.ir>
References: <20071008124403.M2100@mail.ipm.ir>
Message-ID: <470A2F4B.5050108@tamu.edu>

Fedora core 6 is where I'd start today. SuSE 10.x is a very good second choice. We've also tried ROCKS and haven't been too impressed. ROCKS installs easily and replicates to the nodes, but FC6 and kickstart is just too easy and offers a bit more usability in our experience.

gerry

Seyed Abouzar Najafi Shoshtari wrote:
> Dear beowulf experts,
>
> we are planning to build a beowulf cluster
> with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and
> a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard, 18GB
> RAM).
>
> Which linux distribution would be ideal for our case?
>
> Thanks in advance for your help.
>
> Abouzar
>

--
Gerry Creager -- gerry.creager@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From jakob at unthought.net Mon Oct 8 06:30:29 2007
From: jakob at unthought.net (Jakob Oestergaard)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <20071008124403.M2100@mail.ipm.ir>
References: <20071008124403.M2100@mail.ipm.ir>
Message-ID: <20071008133029.GK21977@unthought.net>

On Mon, Oct 08, 2007 at 05:21:08PM +0430, Seyed Abouzar Najafi Shoshtari wrote:
> Dear beowulf experts,
>
> we are planning to build a beowulf cluster
> with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and
> a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard, 18GB
> RAM).
>
> Which linux distribution would be ideal for our case?
Which is best - a cup of tea or a fighter jet?

I guess it depends on what you need most.

All reasonably modern distributions will support the hardware just fine.

What you need to find out is:
1) What software will you run, and which distributions support the software (or, which distributions are supported by the software)
2) How can you get support - various vendors or non-vendors give you different options depending on your needs
3) How do you get security updates in a timely fashion? Is this an issue at all?
4) What about non-security upgrades? Do you need a very stable platform, or would you prefer something that upgrades libraries and tools more frequently?

In other words: find out what you need - then pick what you need.

--

/ jakob

From behnia at gmail.com Mon Oct 8 06:46:20 2007
From: behnia at gmail.com (Farid Behnia)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <20071008133029.GK21977@unthought.net>
References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net>
Message-ID: <5789cc120710080646n40749be9qad4e5118aace341f@mail.gmail.com>

I agree with Jakob. You're asking a very broad question and you need to narrow it down by determining your requirements. What distributions have you worked with already? I've had experience with several solutions but I got down with FAI and Debian.

BTW it's good to see another fellow countryman here!

On 10/8/07, Jakob Oestergaard wrote:
>
> On Mon, Oct 08, 2007 at 05:21:08PM +0430, Seyed Abouzar Najafi Shoshtari
> wrote:
> > Dear beowulf experts,
> >
> > we are planning to build a beowulf cluster
> > with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and
> > a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard,
> 18GB
> > RAM).
> >
> > Which linux distribution would be ideal for our case?
>
> Which is best - a cup of tea or a fighter jet?
>
> I guess it depends on what you need most.
> > All reasonably modern distributions will support the hardware just fine. > > What you need to find out, is; > 1) What software will you run, and which distributions support the > software (or, > which distributions are supported by the software) > 2) How can you get support - various vendors or non-vedors give you > different > options depending on your needs > 3) How do you get security updates in a timely fashion? Is this an issue > at all? > 4) What about non-security upgrades? Do you need a very stable platform, > or would > you prefer something that upgrades libraries and tools more frequently? > > In other words; find out what you need - then pick what you need. > > -- > > / jakob > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071008/c86cc7ff/attachment.html From b.wagman at comcast.net Mon Oct 8 06:48:55 2007 From: b.wagman at comcast.net (Barnet Wagman) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <20071008133029.GK21977@unthought.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> Message-ID: <470A3547.2060606@comcast.net> Does any one use Centos on Beowulf nodes? Of course Centos is really just Redhat, but many people prefer it for use on servers. 
From tjrc at sanger.ac.uk Mon Oct 8 07:10:08 2007
From: tjrc at sanger.ac.uk (Tim Cutts)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <20071008124403.M2100@mail.ipm.ir>
References: <20071008124403.M2100@mail.ipm.ir>
Message-ID:

On 8 Oct 2007, at 1:51 pm, Seyed Abouzar Najafi Shoshtari wrote:

> Dear beowulf experts,
>
> we are planning to build a beowulf cluster
> with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and
> a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB
> Hard, 18GB
> RAM).
>
> Which linux distribution would be ideal for our case?

You are lighting the blue touchpaper.

Basically anything will work. There's much less difference between Linux distributions than people think. They basically differ in the way you install packages, and in some cases in the locations of configuration files. But that's about it.

Go with whatever distribution you or your admins are already familiar with.

Personally, I prefer Debian-derived distributions to Red Hat workalikes, but that's just me. Some people like the Gentoo build-it-all-yourself approach. Some like the Rocks do-all-the-clustering-for-me approach. But they can all do the job. It depends to a certain extent on how much you want (or need) to get your own hands dirty.

Tim

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
From wrankin at ee.duke.edu Mon Oct 8 07:48:45 2007
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <470A3547.2060606@comcast.net>
References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net>
Message-ID: <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu>

Yes, we use it with good effect on our 500+ node cluster at Duke. It's currently running Centos-4. I think that the only issue is that some of our developers require newer releases of a couple packages, but it's easy enough to maintain a local yum repository with those packages.

It's been a good stable platform for us.

-bill

On Oct 8, 2007, at 9:48 AM, Barnet Wagman wrote:

> Does any one use Centos on Beowulf nodes? Of course Centos is
> really just Redhat, but many people prefer it for use on servers.
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From jmdavis1 at vcu.edu Mon Oct 8 08:07:07 2007
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Mon Mar 15 01:06:32 2010
Subject: [Beowulf] best linux distribution
In-Reply-To: <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu>
References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu>
Message-ID: <470A479B.9030800@vcu.edu>

My experience is similar to Bill's. We've been using CentOS 3 and 4 for the past few years on our larger clusters. It is a good choice for stability, good performance, and, since it is RH, for SW compatibility.

Mike Davis

Bill Rankin wrote:
> Yes, we use it with good effect on our 500+ node cluster at Duke.
> It's currently running Centos-4.
I think that the only issue is that > some of our developers require newer releases of a couple packages, > but it's easy enough to maintain a local yum repository with those > packages. > > It's been a good stable platform for us. > > -bill > > > On Oct 8, 2007, at 9:48 AM, Barnet Wagman wrote: > >> Does any one use Centos on Beowulf nodes? Of course Centos is really >> just Redhat, but many people prefer it for use on servers. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Mon Oct 8 08:21:07 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <20071008124403.M2100@mail.ipm.ir> References: <20071008124403.M2100@mail.ipm.ir> Message-ID: > with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and > a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard, 18GB > RAM). > > Which linux distribution would be ideal for our case? the distribution has nothing to do with your hardware. just choose a distro that you are comfortable with - there cannot possibly be any general answer, since all extremes of personal/professional preference are represented. personally, I choose RH-ish distros (centos, fedora) mainly because they are fairly conventional in structure as visible to the admin. for instance, no gratuitous reinvention of sysvinit, widely used package format, etc. IMO, your master node is over-powered both cpu and memory-wise. if you have a single master node in a cluster, it performs three main duties: - fileserver. this requires practically no cpu or memory, just disks and net. 
if you have a cachable read-heavy workload, then more memory on a fileserver may not be wasted. - cluster management, such as syslogs, scheduler, etc. very low cpu or memory load here. - logins. this might be the place people run the compiler, etc. but unless they're doing heavy visualization, only a small fraction of a cpu per user is necessary, and not much memory. rather than spending a lot of money on a highly configured master node, I'd probably split admin/fileservice to a box "uncontaminated" by users. regards, mark hahn. From hahn at mcmaster.ca Mon Oct 8 08:25:31 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A3547.2060606@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> Message-ID: > Does any one use Centos on Beowulf nodes? Of course Centos is really just > Redhat, but many people prefer it for use on servers. sure. my organization is using centos wherever possible. we have some history with RH-like distros, and a large installed base of HP's XC, which is RHEL-based. what do you ask from a distro? I want it to be reasonably up-to-date, use a decent package system (like yum+rpm), not get in the way of NFS-root, etc. From tom.elken at qlogic.com Mon Oct 8 09:32:55 2007 From: tom.elken at qlogic.com (Tom Elken) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] Odd Infiniband scaling behaviour In-Reply-To: <200710081524.55374.csamuel@vpac.org> References: <200710081524.55374.csamuel@vpac.org> Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A015950D5@AVEXCH1.qlogic.org> > -----Original Message----- > [mailto:beowulf-bounces@beowulf.org] On Behalf Of Chris Samuel > Sent: Sunday, October 07, 2007 10:25 PM > To: beowulf@beowulf.org > Subject: [Beowulf] Odd Infiniband scaling behaviour > > Hi fellow Beowulfers.. 
> > We're currently building an Opteron based IB cluster, and are > seeing some rather peculiar behaviour that has had us puzzled > for a while. To give us more info about your "scaling" problem, can you tell us 1) the elapsed run-time of the four scenarios you mention (or relative run-times)? 2) how you measured the CPU usage? Thanks, Tom > > If I take a CPU bound application, like NAMD, I can run an 8 CPU job > on a single node and it pegs the CPUs at 100% (this is built using > Charm++ configured as an MPI system and using MVAPICH 0.9.8p3 > with the Portland Group Compilers). > > If I then run 2 x 4 CPU jobs of the *same* problem, they all > run at 50% CPU. > > If I run 4 x 2 CPU jobs, again the same problem, they run at 25%.. > > ..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU! > > I then replicated the same problem with the example MPI cpi.c > program, to rule out some odd behaviour in NAMD. > > What really surprised me was when testing CPI built using > OpenMPI (which doesn't use IB on our system) the problem > vanished and I could run 8 x 1 CPU jobs, each using 100%! > > So (at the moment) it looks like we're seeing some form of > contention on the Infiniband adapter.. > > 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost > III Lx HCA] (rev a0) > Subsystem: Mellanox Technologies MT25204 [InfiniHost > III Lx HCA] > Flags: fast devsel, IRQ 19 > Memory at feb00000 (64-bit, non-prefetchable) [size=1M] > Memory at fd800000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] Power Management version 2 > Capabilities: [48] Vital Product Data > Capabilities: [90] Message Signalled Interrupts: > 64bit+ Queue=0/5 Enable- > Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 > Capabilities: [60] Express Endpoint IRQ 0 > > We see this problem with the standard CentOS kernel, with the > latest stable kernel (2.6.22.9) and with 2.6.23-rc9-git5 > (which completely rips out and replaced the CPU scheduler > with Ingo Molnar's CFS). 
> > This is on a SuperMicro based system with AMD's Barcelona > quad core CPU (1.9GHz), but I see the same behaviour (scaled > down) on dual core Opterons too. > > I've looked at what "modinfo ib_mthca" says are the tuneable > options, but the few I've played with ("msi_x" and > "tune_pci") haven't made any noticeable difference, sadly.. > > Has anyone else run into this or got any clues they could > pass on please ? > > cheers, > Chris > -- > Christopher Samuel - (03) 9925 4751 - Systems Manager The > Victorian Partnership for Advanced Computing P.O. Box 201, > Carlton South, VIC 3053, Australia VPAC is a not-for-profit > Registered Research Agency > From tjrc at sanger.ac.uk Mon Oct 8 09:38:03 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> Message-ID: <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> On 8 Oct 2007, at 4:21 pm, Mark Hahn wrote: > the distribution has nothing to do with your hardware. > just choose a distro that you are comfortable with - there cannot > possibly be any general answer, since all extremes of personal/ > professional preference are represented. > > personally, I choose RH-ish distros (centos, fedora) mainly because > they are fairly conventional in structure as visible to the admin. > for instance, no gratuitous reinvention of sysvinit, widely used > package format, etc. > > IMO, your master node is over-powered both cpu and memory-wise. > if you have a single master node in a cluster, it performs three > main duties: > - fileserver. this requires practically no cpu or memory, just disks > and net. if you have a cachable read-heavy workload, then more > memory on a fileserver may not be wasted. > - cluster management, such as syslogs, scheduler, etc. very low > cpu or memory load here. That's not always true. It depends on the scheduler and the workload. 
LSF, for example, keeps its list of pending jobs in RAM as well as on disk. This can make the scheduler use a lot of memory if the number of pending jobs is large (and for embarrassingly parallel workloads that number can be very large indeed). Our scheduler node (with 8GB RAM) ran out of memory once when a particularly over- zealous user submitted 1.5 million jobs in one go... needless to say, the user in question had the cluebat vigorously applied, but to LSF's credit, it kept going. Just veeerrrrryyyy sllllooooowwwwlllllyyyyy. > - logins. this might be the place people run the compiler, etc. > but unless they're doing heavy visualization, only a small fraction > of a cpu per user is necessary, and not much memory. > > rather than spending a lot of money on a highly configured master > node, > I'd probably split admin/fileservice to a box "uncontaminated" by > users. But still, I agree with you, in general. Actually, these days, we keep the LSF master on a separate node from the login nodes, to protect it from wayward user action. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From ajt at rri.sari.ac.uk Mon Oct 8 09:40:21 2007 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> Message-ID: <470A5D75.8030903@rri.sari.ac.uk> Tim Cutts wrote: > [...] > You are lighting the blue touchpaper. Basically anything will work. > There's much less difference between Linux distributions than people > think. They basically differ in the way you install packages, and in > some cases in the locations of configuration files. But that's about > it. Go with whatever distribution you or your admins are already > familiar with. Hello, Tim. 
I agree with you in principle, but in practice many binary packages only work properly under the target release of the target Linux distribution because of their dependencies on particular shared libraries etc. At the source level, of course, none of this matters. However, what many people want (me included) is a team of dedicated volunteers doing the hard work of making sure a collection of binary packages all work well together. > Personally, I prefer Debian-derived distributions to Red Hat workalikes, > but that's just me. Some people like the Gentoo build-it-all-yourself > approach. Some like the Rocks do-all-the-clustering-for-me approach. > But they can all do the job. It depends to a certain extent on how much > you want (or need) to get your own hands dirty. I also prefer Debian-based distro's and still run the openMosix kernel under an Ubuntu 6.06.1 LTS server installation on our Beowulf cluster. What I like about APT (the Debian package manager) is the dependency checking and conflict resolution capabilities of "aptitude", which is more robust than the older "apt-get". I previously ran Red Hat 5.3->9 and I've used both "up2date" and "yum". Neither of these is as capable of resolving package conflicts and dependencies as APT. I used APT for RPM when I ran RH9 for exactly this reason. In my opinion, the package management system is a very important factor to take into account when choosing a distribution, as well as the range of tried and tested binary packages that are available. In that respect, Debian/Ubuntu has a lot to recommend it. Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. 
| fax:+44 (0)1224 716687 From john.hearns at streamline-computing.com Mon Oct 8 09:41:51 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A3547.2060606@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> Message-ID: <470A5DCF.20009@streamline-computing.com> Barnet Wagman wrote: > Does any one use Centos on Beowulf nodes? Of course Centos is really > just Redhat, but many people prefer it for use on servers. We have several sites using Scientific Linux, which is along the same lines as CentOS. From lindahl at pbm.com Mon Oct 8 10:07:53 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A3547.2060606@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> Message-ID: <20071008170753.GA27519@bx9.net> On Mon, Oct 08, 2007 at 08:48:55AM -0500, Barnet Wagman wrote: > Does any one use Centos on Beowulf nodes? Of course Centos is really > just Redhat, but many people prefer it for use on servers. From what I can tell, CentOS is the #1 distro for clusters. Most folks are familiar with Red Hat-style administration, and CentOS is free. -- greg From hahn at mcmaster.ca Mon Oct 8 10:09:36 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A5DCF.20009@streamline-computing.com> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: >> Does any one use Centos on Beowulf nodes? Of course Centos is really just >> Redhat, but many people prefer it for use on servers.
> > We have several sites using Scientific Linux, which is along the same lines > as CentOS. I was surprised how very much like centos - I had the impression SL was more of a stable-means-versions-from-5-years-ago distro, but it appears to be mostly a repackaging of RHEL, and reasonably up-to-date. from a quick glance at the SL-5.0 readme, the number of customizations is quite small, so I do wonder what the point is. (_not_ meant as a criticism!). From john.hearns at streamline-computing.com Mon Oct 8 10:21:21 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: <470A6711.8070309@streamline-computing.com> Mark Hahn wrote: >> > up-to-date. from a quick glance at the SL-5.0 readme, the number > of customizations is quite small, so I do wonder what the point is. > (_not_ meant as a criticism!). SL exists to populate the huge data centres at CERN and Fermilab, and as a consequence many, many HEP groups have adopted it. The point is similar to CentOS - the bottom line cost. However, there have been discussions on the SL list about future releases, and having SL as an 'overlay' on top of CentOS, i.e. an additional set of repositories with the Fermilab/CERN needed packages (for instance AFS support, Castor, Cernlibs...) I don't claim to know what stage these have reached, or if there is any seriousness in this. From rgb at phy.duke.edu Mon Oct 8 10:19:57 2007 From: rgb at phy.duke.edu (Robert G.
Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A479B.9030800@vcu.edu> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu> <470A479B.9030800@vcu.edu> Message-ID: On Mon, 8 Oct 2007, Mike Davis wrote: > My experience is similar to Bill's. We've been using CentOs 3,4 for the past > few years on our larger clusters. It is a good choice for stability, good > performance, and since it is RH for SW compatability. The only thing I'd comment on that is negative about it is one of its "advantages". There is a narrow line between stability and stagnation, and you have to figure out which side of that line your cluster will fall on. Specifically, the fact that Centos/RHEL is frozen for two year intervals has two disadvantages for some people: a) The hardware it supports is left behind by the real leading edge of hardware design. In some cases this doesn't really matter -- many motherboards and CPUs are sufficiently "generic" that Centos 4 will still work on them. In others, however, it just plain will not. This is a chipset by chipset, motherboard by motherboard, NIC by NIC sort of question and YMMV from "what's the fuss all about" to "showstopper". My own experience here is that Centos is useless for laptops for this very reason -- laptops evolve too fast and quickly leave Centos behind both at the device level and at the desktop/utility software level. It is of questionable utility on desktops -- I had showstopper problems on a number of AMD64 machines when first they came out, for example, that FC X eventually supported perfectly. This could well be a problem for you if your cluster were to be AMD64 based, right? Centos 4 was also the last release that required split UP and SMP kernels, which could create certain problems. 
b) The libraries it provides are left behind by the real leading edge of library development. Again, this can range in impact from "no big deal" to "showstopper", depending on just what libraries your code uses. A specific example of this historically was the GNU Scientific Library -- seems like the kind of thing likely to be useful in a cluster, no? Unfortunately, the version "frozen" in Centos was so far behind the STABLE release version by the end of the two-year cycle that a lot of code, including mine, that ran just fine using the stable release wouldn't run on a Centos cluster without going to the trouble of rebuilding the library and setting up a private repo for updating at least selected libraries more aggressively. This isn't THAT big a deal, but it is very definitely an added "cost" and needs to be considered when making the decision. It is worth noting that Fedora has actually proven remarkably stable if one simply adds a 3-6 month offset into the time you do the next upgrade, and that more and more cluster tools have been included with Fedora out of the box to the point where a very passable cluster node can be installed without ANY custom software builds, straight out of the updated repo base. It isn't on a par with Debian (where every known package is available somewhere, somehow) but it is a decent compromise between the surf-the-wave aspect of Debian and the stodginess of Centos. So a good way of putting it is that Centos is a good solution where it is a good solution, and a really terrible one otherwise. With conservative, known-to-be-supported hardware and little likelihood of adding bleeding edge stuff to your cluster over the next Centos cycle, with more or less frozen and acceptable library requirements you can install it and forget it, letting it yum update for the lifetime of the cluster nodes.
If you plan to add ever new and ever more exotic hardware, if you KNOW that the library you depend most on is under active development and you need to be able to track that development closely as bugs are fixed and features added, fedora or debian might well be better choices. I myself agree with Gerry -- Fedora whatever with PXE/kickstart is hard to beat for diskful nodes, and after tracking FC from 2 on I'm no longer worried about its "stability" or update cycle's finite lifetime. I appreciate the fact that it has a remarkably large set of packages that are built right into it and consistently maintained as a part of the distribution, and the fact that it is ALMOST rapidly varying enough to keep up with laptop hardware, maybe with a one-version delay. But this is very much a religious type thing, and as has already been remarked, you can use ANY linux distro and build a cluster out of it. The only real difference is how hard you have to work to do so -- how much of the work required is "done for you" in the prebuilt distribution and how much you have to do for yourself from the (after all!) open sources. rgb > > > Mike Davis > > > Bill Rankin wrote: >> Yes, we use it with good effect on our 500+ node cluster at Duke. It's >> currently running Centos-4. I think that the only issue is that some of >> our developers require newer releases of a couple packages, but it's easy >> enough to maintain a local yum repository with those packages. >> >> It's been a good stable platform for us. >> >> -bill >> >> >> On Oct 8, 2007, at 9:48 AM, Barnet Wagman wrote: >> >>> Does any one use Centos on Beowulf nodes? Of course Centos is really just >>> Redhat, but many people prefer it for use on servers. 
>>> _______________________________________________ >>> Beowulf mailing list, Beowulf@beowulf.org >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From gerry.creager at tamu.edu Mon Oct 8 10:27:55 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: <470A689B.5090507@tamu.edu> It's almost identical to CentOS, and the idea is to knock off the nameplate to allow non-proprietary distribution of the stable RHEL stuff. gc Mark Hahn wrote: >>> Does any one use Centos on Beowulf nodes? Of course Centos is really >>> just Redhat, but many people prefer it for use on servers. >> >> We have several sites using Scientific Linux, which is along the same >> lines as CentOS. > > I was surprised how very much like centos - I had the impression > SL was more of a stable-means-versions-from-5-years-ago distro, > but it appears to be mostly a repackaging of RHEL, and reasonably > up-to-date. from a quick glance at the SL-5.0 readme, the number > of customizations is quite small, so I do wonder what the point is. 
> (_not_ meant as a criticism!). > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From rgb at phy.duke.edu Mon Oct 8 10:29:23 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A5D75.8030903@rri.sari.ac.uk> References: <20071008124403.M2100@mail.ipm.ir> <470A5D75.8030903@rri.sari.ac.uk> Message-ID: On Mon, 8 Oct 2007, Tony Travis wrote: > What I like about APT (the Debian package manager) is the dependency checking > and conflict resolution capabilities of "aptitude", which is more robust than > the older "apt-get". I previously ran Red Hat 5.3->9 and I've used both > "up2date" and "yum". Neither of these is as capable of resolving package > conflicts and dependencies as APT. I used APT for RPM when I ran RH9 for > exactly this reason. > > In my opinion, the package management system is a very important factor to > take into account when choosing a distribution, as well as the range of tried > and tested binary packages that are available. In that respect, Debian/Ubuntu > has a lot to recommend it. It is worth noting that (while yes, up2date sucks and has always sucked) yum in FC 7 is a far, far cry from yum in RH 9. Dependency hell is always a bad thing, but very, very few people have experienced it with yum since maybe FC 4 or 5, if not earlier. With "enhanced" Fedora (the base distro, updates, and either "extras" for <7 and/or add-on repos like livna) one almost never encounters a package that doesn't just install, pulling dependencies as needed perfectly. 
When one DOES encounter a problem it is 9 times out of 10 the fault of the RPM-builder, not yum. Neither yum nor apt can actually control the builder of the packages they manage; they can only do their best with the dependencies those packages request. It is always possible to do stupid things like insert circular references or requirements for packages that aren't in the distro, and it isn't fair to blame the package manager when those packages fail. rgb > > Tony. > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Mon Oct 8 10:43:08 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: On Mon, 8 Oct 2007, Mark Hahn wrote: >>> Does any one use Centos on Beowulf nodes? Of course Centos is really just >>> Redhat, but many people prefer it for use on servers. >> >> We have several sites using Scientific Linux, which is along the same lines >> as CentOS. > > I was surprised how very much like centos - I had the impression > SL was more of a stable-means-versions-from-5-years-ago distro, > but it appears to be mostly a repackaging of RHEL, and reasonably > up-to-date. from a quick glance at the SL-5.0 readme, the number > of customizations is quite small, so I do wonder what the point is. > (_not_ meant as a criticism!). You've got me. At one time, it was a good way to get Centos plus a build of cernlib (of use to high energy physicists) plus a little of this and that. Nowadays, pretty much all RPM-based development goes into Fedora and migrates back into RHEL and thence to Centos and SL.
Since Fedora is anywhere from 0 to 2 years ahead of the current RHEL/Centos snap -- well, at this point cernlib is in F7 ready to roll, and I'd bet the same is true of just about any of the other enhancements. I think the main issues have already been laid out. RHEL/Centos are good where vendors require "binary compatibility" on closed source software, as the standard of said binary compatibility. It is bad where library currency and support for the latest hardware and having the latest software tools (including compilers, GUI tools, and so on) are an issue. The "cost" of Fedora's rapid development cycle turns out to have been seriously overestimated and it is pretty easy for even a single sysadmin in charge of a large cluster to ride the fedora wave and upgrade every year or so. Its instability has been overestimated as well -- by simply delaying adoption for the first 3 months or so post the release you plan to upgrade to you give plenty of time for most of the bugs to be squashed by the early implementers (and that's an issue for Centos/RHEL as well, as they are typically "identical" to Fedora every couple of years, bugs and all). Fedora 7 with modest enhancements appears to have some 8500 packages. Far short of Debian, but plenty big enough to include just about all mainstream useful packages for any cluster or LAN. rgb > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From jmdavis1 at vcu.edu Mon Oct 8 10:52:45 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu> <470A479B.9030800@vcu.edu> Message-ID: <470A6E6D.5050801@vcu.edu> Robert G. Brown wrote: > On Mon, 8 Oct 2007, Mike Davis wrote: > >> My experience is similar to Bill's. We've been using CentOs 3,4 for >> the past few years on our larger clusters. It is a good choice for >> stability, good performance, and since it is RH for SW compatability. > > The only thing I'd comment on that is negative about it is one of its > "advantages". There is a narrow line between stability and stagnation, > and you have to figure out which side of that line your cluster will > fall on. Specifically, the fact that Centos/RHEL is frozen for two year > intervals has two disadvantages for some people: > I don't see this as a problem in a production cluster. The fact is that I've been doing this stuff for a little over two decades and I can build anything that I need for an application. For me a manual library build for CentOs 3 is easier than trying to find support for FC4 or reinstalling FC 1x per year. My CentOs 3 nodes have had less than 2hours downtime in 2 years and that was due to a Power Upgrade at their location, that required a complete shutdown of all machines on the floor. Now I should say, that I don't use diskless nodes, each node has its own OS disk and most have a separate /tmp disk for scratch use. That is one reason that we differ on OS, I believe. Mike From buccaneer at rocketmail.com Mon Oct 8 11:15:03 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) 
Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <20071008170753.GA27519@bx9.net> Message-ID: <287707.47809.qm@web30605.mail.mud.yahoo.com> --- Greg Lindahl wrote: > On Mon, Oct 08, 2007 at 08:48:55AM -0500, Barnet > Wagman wrote: > > > Does any one use Centos on Beowulf nodes? Of > course Centos is really > > just Redhat, but many people prefer it for use on > servers. > > From what I can tell, CentOS is the #1 distro for > clusters. Most folks > are familiar with Red Hat-style administration, and > CentOS is free. *Sigh* The best distro is the one that gets the most of YOUR work done in a given amount of time. From gerry.creager at tamu.edu Mon Oct 8 11:16:37 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A6E6D.5050801@vcu.edu> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu> <470A479B.9030800@vcu.edu> <470A6E6D.5050801@vcu.edu> Message-ID: <470A7405.2040507@tamu.edu> Mike Davis wrote: > Robert G. Brown wrote: >> On Mon, 8 Oct 2007, Mike Davis wrote: >> >>> My experience is similar to Bill's. We've been using CentOs 3,4 for >>> the past few years on our larger clusters. It is a good choice for >>> stability, good performance, and since it is RH for SW compatability. >> >> The only thing I'd comment on that is negative about it is one of its >> "advantages". There is a narrow line between stability and stagnation, >> and you have to figure out which side of that line your cluster will >> fall on.
Specifically, the fact that Centos/RHEL is frozen for two year >> intervals has two disadvantages for some people: >> > > I don't see this as a problem in a production cluster. The fact is that > I've been doing this stuff for a little over two decades and I can build > anything that I need for an application. For me a manual library build > for CentOs 3 is easier than trying to find support for FC4 or > reinstalling FC 1x per year. My CentOs 3 nodes have had less than 2hours > downtime in 2 years and that was due to a Power Upgrade at their > location, that required a complete shutdown of all machines on the floor. > > Now I should say, that I don't use diskless nodes, each node has its own > OS disk and most have a separate /tmp disk for scratch use. That is one > reason that we differ on OS, I believe. Guess I've only been doing this for about 14 years. Thanks! I feel younger already. I prefer a disk on each node, with a backup OS and /tmp. We PXE boot all our nodes, and we update the node OS when we update the PXE stuff, too. I've had episodes where we lost a node and were able to salvage some degree of a run, so I think it's justified. It can make restarts of MM5 a little interesting, however. I prefer to not bring down the clusters for a rather random OS upgrade, so we, too, tend to run older stuff. Unlike RGB, my codes are pretty happy with the older libraries. I guess I'm lucky. Still, when we do upgrade, I'll be putting Fedora on both clusters. As RGB states, so much of the "extra" stuff is already integrated that my workload to compile by hand is reduced. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From buccaneer at rocketmail.com Mon Oct 8 11:34:52 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.)
Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A6E6D.5050801@vcu.edu> Message-ID: <697252.42214.qm@web30601.mail.mud.yahoo.com> --- Mike Davis wrote: > I don't see this as a problem in a production > cluster. The fact is that > I've been doing this stuff for a little over two > decades and I can build > anything that I need for an application. For me a > manual library build > for CentOs 3 is easier than trying to find support > for FC4 or > reinstalling FC 1x per year. My CentOs 3 nodes have > had less than 2hours > downtime in 2 years and that was due to a Power > Upgrade at their > location, that required a complete shutdown of all > machines on the floor. > > Now I should say, that I don't use diskless nodes, > each node has its own > OS disk and most have a separate /tmp disk for > scratch use. That is one > reason that we differ on OS, I believe. You should use what works best for you. But, building software on RHEL/CentOS is way more difficult for the most part than building software under Fedora. That's the difference between ~1200 programs and thousands of programs in a distro. From gerry.creager at tamu.edu Mon Oct 8 11:48:47 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <697252.42214.qm@web30601.mail.mud.yahoo.com> References: <697252.42214.qm@web30601.mail.mud.yahoo.com> Message-ID: <470A7B8F.8090601@tamu.edu> Buccaneer for Hire. wrote: > --- Mike Davis wrote: > >> I don't see this as a problem in a production >> cluster. The fact is that >> I've been doing this stuff for a little over two >> decades and I can build >> anything that I need for an application.
For me a >> manual library build >> for CentOs 3 is easier than trying to find support >> for FC4 or >> reinstalling FC 1x per year. My CentOs 3 nodes have >> had less than 2hours >> downtime in 2 years and that was due to a Power >> Upgrade at their >> location, that required a complete shutdown of all >> machines on the floor. >> >> Now I should say, that I don't use diskless nodes, >> each node has its own >> OS disk and most have a separate /tmp disk for >> scratch use. That is one >> reason that we differ on OS, I believe. > > You should use what works best for you. > > But, building software on RHEL/CentOS is way more > difficult for the most part than building software > under Fedora. That's the difference between ~1200 > programs and thousands of programs in a distro. I've had very little difficulty building programs under CentOS, all in all. In fact, lately, I've had problems with Fedora, as it is close to bleeding edge, but the Centos stuff just works. I may not be able to do it with native RPMs, but I still remember how to drive a compiler, too. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From buccaneer at rocketmail.com Mon Oct 8 11:59:49 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A7B8F.8090601@tamu.edu> Message-ID: <153180.13919.qm@web30608.mail.mud.yahoo.com> --- Gerry Creager wrote: > Buccaneer for Hire. wrote: > > --- Mike Davis wrote: > > > >> I don't see this as a problem in a production > >> cluster. The fact is that > >> I've been doing this stuff for a little over two > >> decades and I can build > >> anything that I need for an application. 
For me a > >> manual library build > >> for CentOs 3 is easier than trying to find > support > >> for FC4 or > >> reinstalling FC 1x per year. My CentOs 3 nodes > have > >> had less than 2hours > >> downtime in 2 years and that was due to a Power > >> Upgrade at their > >> location, that required a complete shutdown of > all > >> machines on the floor. > >> > >> Now I should say, that I don't use diskless > nodes, > >> each node has its own > >> OS disk and most have a separate /tmp disk for > >> scratch use. That is one > >> reason that we differ on OS, I believe. > > > > You should use what works best for you. > > > > But, building software on RHEL/CentOS is way more > > difficult for the most part than building software > > under Fedora. That's the difference between ~1200 > > programs and thousands of programs in a distro. > > I've had very little difficulty building programs > under CentOS, all in > all. In fact, lately, I've had problems with > Fedora, as it is close to > bleeding edge, but the Centos stuff just works. I > may not be able to do > it with native RPMs, but I still remember how to > drive a compiler, too. I build a lot of software for my projects here at work and for my non-profit, and although they have taken RHEL and done it better, the basic philosophy behind RHEL prevents them from moving too far afield. From jmdavis1 at vcu.edu Mon Oct 8 12:20:18 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <697252.42214.qm@web30601.mail.mud.yahoo.com> References: <697252.42214.qm@web30601.mail.mud.yahoo.com> Message-ID: <470A82F2.8090500@vcu.edu> Buccaneer for Hire. wrote: > > You should use what works best for you.
> > But, building software on RHEL/CentOS is way more > difficult for the most part than building software > under Fedora. That's the difference between ~1200 > programs and thousands of programs in a distro. > It might be a problem with RPMs here as well. But I build most of our scientific software (with only a couple of exceptions) from source. So my g03, vasp, Atlas, fftw2, fftw3, gamess, lammps, mpich-1.2.7 (intel and PGI), mpich2 (intel and PGI), mrbayes, openmpi (intel and PGI), MX, deMon, NRLMOL are all source builds. With the exception of VASP (which required tweaking of MPI and compilers) all were straightforward. Non-source builds are abaqus and adf, currently. Mike From landman at scalableinformatics.com Mon Oct 8 12:59:25 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <287707.47809.qm@web30605.mail.mud.yahoo.com> References: <287707.47809.qm@web30605.mail.mud.yahoo.com> Message-ID: <470A8C1D.3080108@scalableinformatics.com> Buccaneer for Hire. wrote: > *Sigh* The best distro is the one that gets the most > of YOUR work done in a given amount of time. ... without you pulling out your remaining hair (for we the follicly challenged/diminished) in order to be able to start doing your work in the first place. Distributions are just big collections of stuff. Some collections are better built/constructed than others. Some are just piss-poor, and require you to rebuild sections of the infrastructure in order to properly rebuild packages to be useful. Some come with everything including the kitchen sink. The distros I like the most for cluster building are OpenSuSE, Fedora, and Ubuntu. This tends to change over time. All have modern kernels, and despite efforts to the contrary on the part of some package builders (see if this sounds familiar: "we only support distro X"), stuff just mostly works on them without significant pain.
Well, some minor nits (4k stacks hrrumph!), but these distros tend to be pretty good (yeah, I do like Fedora). The issue when you are building the cluster is whether or not your hardware (nice, shiny, new) does in fact support the distro you would like to use. Chances are, if you are using "older" more established technology on the HW side, you have a reasonable shot at getting the more conservative distros (RHEL/Centos, SuSE, ...) to work. There are other issues, some of which are biting people now on other lists, whereby some of the distros' choices come back to haunt them with a fury. Such as supporting ext3 as their advanced file system. Doesn't work well when you have storage units larger than ext3's maximum file system size, or files larger than ext3 can handle. RHEL is famous for this: if you need to use large disks or large files, they have effectively precluded using that distro. You can use Fedora with xfs and jfs and not run into this issue, though there are others (4k stacks, SELinux, cough cough!) that can make a grown distro user cry. Luckily, those issues are all solvable. At the end of the day, it's about which pile-o-packages you want, and which will get you running as quickly as possible with the minimum pain possible. This is an important aspect, which often gets lost in the shuffle. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From buccaneer at rocketmail.com Mon Oct 8 13:09:00 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.)
Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A8C1D.3080108@scalableinformatics.com> Message-ID: <916617.57800.qm@web30609.mail.mud.yahoo.com> When you have spent multi-millions of dollars writing and maintaining internal code, your options become limited. Add to that the fact that we are in the middle of a software/technique morph (as I have mentioned in other posts) and you find you have to make trade-offs. For most of our cluster we stick with Fedora, with the exception of a number of high-availability Stratus boxes that require RHEL. These we look at as more like black boxes. Almost all of our disk nodes are Fedora 7 now, with the exception of a few that are RHEL, with the understanding that their performance will not compare (and it doesn't). Compute nodes are mostly Fedora also. New commercial needs mean I am going to get a rack of 64 RHEL4 boxes into production. My personal favorite? My laptop runs Fedora 7. From rgb at phy.duke.edu Mon Oct 8 13:35:00 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A6E6D.5050801@vcu.edu> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu> <470A479B.9030800@vcu.edu> <470A6E6D.5050801@vcu.edu> Message-ID: On Mon, 8 Oct 2007, Mike Davis wrote: > Robert G. Brown wrote: >> On Mon, 8 Oct 2007, Mike Davis wrote: >> >>> My experience is similar to Bill's. We've been using CentOs 3,4 for the >>> past few years on our larger clusters. It is a good choice for stability, >>> good performance, and since it is RH for SW compatibility.
>> >> The only thing I'd comment on that is negative about it is one of its >> "advantages". There is a narrow line between stability and stagnation, >> and you have to figure out which side of that line your cluster will >> fall on. Specifically, the fact that Centos/RHEL is frozen for two year >> intervals has two disadvantages for some people: >> > > I don't see this as a problem in a production cluster. The fact is that I've > been doing this stuff for a little over two decades and I can build anything > that I need for an application. For me a manual library build for CentOs 3 is > easier than trying to find support for FC4 or reinstalling FC 1x per year. My > CentOs 3 nodes have had less than 2hours downtime in 2 years and that was due > to a Power Upgrade at their location, that required a complete shutdown of > all machines on the floor. Which is perfectly reasonable, and I agree. The tradeoff is whether, and how much, you have to rebuild stuff (and how hard that stuff is to rebuild). At one time building e.g. cernlib was a cosmic pain in the buttocks -- I mean serious pain. Been there, done that. It was one of the major appeals of "Scientific Linux" when it first came out -- cernlib was prebuilt for it, although somehow the libraries it used were such that you couldn't just rpmbuild --rebuild its source RPM back on Centos or for that matter Fedora. But I've built (and tried to build) it from all the way back when tarballs were what you had to work with, and when you might expect to make twenty or thirty hacks to get through the build. Paaaaiiin. Now it IS in fedora (and may be in Centos for all that I know). Because it is in fedora, I'm reasonably sure that rpmbuild --rebuild is all that is needed to rebuild it under Centos anyway. But one's choice WAS to build it from scratch, unpackaged, or use fedora, hmmm...
In other words, if the issue is one or two libraries, and they are decently packaged and have simple dependencies, and if the Centos kernel WORKS for your hardware, then Centos is an excellent choice. I was just pointing out that there were some issues that one SHOULD think about while making the decision lest one plan to install Centos on brand new bleeding edge hardware only to learn that the kernel doesn't have support for its chipset, or that your users need some five-library constellation that is constantly being updated in Fedora (but which works decently from snap to snap) but that has to be built, then rebuilt, then re-rebuilt every three or four months under Centos. In either case you can make things work with either distro (or any other linux distro, really) -- they ONLY differ on how much work one has to do and what kind of work it is that must be done to make them work. If you want transparent access to the latest tools and libraries that are constantly being added to fedora, well, that favors fedora -- to enhance your cluster instead of building a library or application constellation you just add a yum install command to your nightly cluster update and poof, there it is the next day ready to use, auto updating. If you plan to update your cluster with brand spiffy new nodes with bleeding edge motherboards with integrated orange juicers and jolt cola dispensers, I'd argue that that favors Fedora as well, as chances are much better that your jolt cola dispenser will "just work" if you kickstart install the latest fedora on them instead of a two year old Centos, and the alternative is to not use your new systems until the latest Centos is released (and pray they work then or the next time they will is two MORE years later) or start dusting off those kernel building skills and figure out how to add a whole stack of libraries leading to the orange-juicer controls to your aging Centos install. 
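The "just add a yum install command to your nightly cluster update" idea above can be sketched roughly as follows. This is a minimal illustration and not anybody's actual update script: the function name `nightly_update_commands` and the package names are hypothetical, and it only builds the command lines (a real cron job would hand them to the shell or to `subprocess.run`). The only yum usage assumed is the standard `yum -y update` and `yum -y install <packages>`.

```python
def nightly_update_commands(extra_packages):
    """Build the yum command lines for one nightly node update.

    extra_packages: package names to add on top of the routine update.
    The names are placeholders; check what your release actually ships.
    """
    # Routine update of everything already installed on the node.
    commands = [["yum", "-y", "update"]]
    if extra_packages:
        # One extra line pulls in newly wanted libraries or tools;
        # duplicates are dropped and names sorted for a stable command line.
        commands.append(["yum", "-y", "install", *sorted(set(extra_packages))])
    return commands


if __name__ == "__main__":
    # Hypothetical package list, for illustration only.
    for cmd in nightly_update_commands(["openmpi", "ganglia"]):
        print(" ".join(cmd))
```

Keeping the package list as data (rather than editing the script each time) is just one way to do it; the point is that adding a library to every node is a one-line change that takes effect at the next nightly run.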
If your cluster really only runs one application (or a small and straightforward set of applications), and Centos supports your cluster hardware and that application's library requirements out of the box, well heck -- of course it is nice to be able to do an installation and then just forget the cluster for the rest of its operational lifetime. Done it myself, many times (though not with Centos per se). If you are comfortable inserting the requirement on future cluster hardware purchases "and must be able to boot and run Centos X normally out of the box" that's fine too.
If your primary applications REQUIRE RHEL/Centos libraries for binary compatibility, or if you use proprietary software that PROBABLY would work perfectly with Fedora but where they won't answer your hotline questions or requests for support unless you are using RHEL (they usually won't even accept "Centos" as a substitute in this case, you just have to tell a tiny fib) well, that makes Centos a really good choice!
How difficult is it to use Fedora instead of Centos in a production cluster? What is the cost tradeoff? Basically it is a day or so of work, per expected upgrade. The chores involved are typically:
a) Mirror the distribution repo from a suitable site onto your local install server. Time required: call it an hour to set up a script, then five minutes to hack it to point at the current distro per instance, plus a lag period where you actually do something else and the download completes.
b) Clone your PXE/DHCP targets and your kickstart files. Install vmlinuz and initrd from the new distro under tftpboot. Time required: call it an hour, although it probably won't take that long.
c) Pick a node -- any node -- to prototype with. Network boot the node into the cloned kickstart file. Monitor the progress, especially noting where packages are missing or weirdness occurs.
d) Look over the package list for the new release, look over what is missing, resolve conflicts, add stuff that looks like it would be useful that before you maybe had to build on your own (like cernlib or ganglia or openmpi). Test the node with your primary applications (or not). If they are "boring" and would run out of the box on Centos, they'll almost certainly run out of the box post rebuild under Fedora. If they use libraries that might have significantly changed, well, it's a good idea to test them. This process can take anywhere from a few minutes to hours or even days. In my case it would take minutes because my applications are boring and I know damn well they'll rebuild under Fedora whatever. If anything, they're likely to have trouble BACKporting to run under e.g. Centos 4 -- from my own personal point of view backporting libraries is a total pain and a process fraught with peril, especially if they or their dependencies have significantly evolved in the meantime. However, there are definite exceptions -- Cisco's vpnclient recently died on me in a mere UPDATE of the fedora kernel, because they changed a bunch of stuff in the skbuff stack. As this example demonstrates, many of those exceptions involve proprietary software that nobody maintains (really!). If they maintain it, they have to spend some of the money you pay them on testing and debugging, after all, and they hate that. Far easier to just insist that you freeze your environment in a version that is the last one for which they grudgingly were FORCED to make it work. Once you are satisfied, and have modified your kickstart file to match your new improved package list (which may involve no modification at all, mind you) and you've tested it successfully -- you basically set a toggle that causes all your nodes to reboot some dark evening and PXE-reinstall themselves from the kickstart file. You COULD even do this unattended, although it would probably be foolish to.
Either way the time required to initiate the upgrade is very small (per node) and if you did your node testing adequately and encounter no further problems you can read a novel or exercise or something while it occurs. Add it all up, and you find that doing a node upgrade might take you anywhere from half a day to a day and a half. You have to do this work at least one time roughly 18 months after initially installing the nodes, and again if you plan to run them after 3 years, although of course you can ELECT to update more often than that if Fedora has a fabulous new library added to it that halves the runtime of a lot of numerical code based on it, or if somebody adds a really spectacular batch job system to it and replaces the one you're paying for with something that works better and is easier to use and is free, or if you just need it to run the hardware you buy a year from now (or else you have to build a custom kernel, test it, and maintain an update stream for it by hand forever) and you'd rather run the same OS release on your entire cluster. This is what you are trading off against the labor required to instead keep a pool of libraries and/or the kernel more aggressively up to date. To my experience, it doesn't take many library or auxiliary tool rebuilds, and probably only ONE kernel rebuild, to compete with the work required to upgrade fedora one time over the course of a cluster's lifetime. Either way, you can also choose to run Centos on the servers for your cluster (a common enough decision, since server hardware tends to be less aggressively upgraded and since server "stability" is paramount, although as I said the INstability of Fedora is largely urban myth anytime sixty days or so post initial release) or you can follow through and bootstrap your servers to Fedora current every now and then as well. One LAST issue that should be addressed is what your USERS are running on their desktops. 
This is a nontrivial question for some cluster architectures. If they are running Fedora (so that they get the latest versions of X, support for their super-duper graphics adaptors and cameras, the best possible list of printers, so that their laptops have an even chance of working with their audio and network devices, so that they can get "flight of the amazon queen" on their desktop) AND if the cluster architecture is "flat, with NFS shared across the accessing LAN" -- so that basically the "cluster" is just a pile of headless workstations on the same LAN and mounting the same disk as the desktops and/or laptops -- then there is a STRONG incentive to use fedora on the cluster nodes. If you don't, users cannot do a build on their desktop and drop the resulting binaries into the execution queue for the cluster or otherwise arrange for them to be run in distributed fashion. It is difficult to assess the cost-benefit tradeoffs here because the costs are all to the sysadmin and the benefits are all to the user, but they are substantial, and almost certainly would strongly favor using the same thing everywhere, probably Fedora. After all, so it is a pain in the sysadmin's behind once every 18 months or so to run Fedora, but a) he/she's doing so anyway to support the desktops, and most of the steps above are already accomplished before he or she STARTS on the cluster nodes, and b) the users REALLY REALLY save a lot of time being able to build, test, debug, and even carry out small prototyping runs on their own desktops without needing to do all of that on "the cluster" or a special "build box" with the right library/distro set up, and there are probably a lot more of them than there are sysadmins. To re-summarize: People considering what distribution to use to build a cluster should think about the following: a) What are they familiar with?
You can build a cluster on top of any distribution, but if you are a Debian expert you're going to find building a Centos cluster painful and vice versa. I think most of us on list would advise "go with your strength" here unless you encounter a really good reason to do otherwise, at least until you have multiple strengths or unless you have no strengths at all and have to start from scratch. In that and all cases, continue asking yourself: b) How scalable is it? I personally think that scalability and automation are key elements in the decision, both for clusters and their closely related client/server LAN installs. Lots of people like FAI. I personally am fond of PXE/kickstart. Warewulf is pretty easy and scalable. Pick something where you do work once, then implement it across the cluster (or LAN) without having to do more per-node work than "booting it", possibly making a BIOS level choice as to boot target and a DHCP choice as to installation image. This is one of Linux's STRENGTHS, especially compared to e.g. Windows. If you have to do more work than that (per node) you probably need to reexamine your distro choice or learn new tools associated with the distribution. c) How well does it work with my preferred hardware (now and in the future)? Again, different distros have very different track records for staying hardware-current out of the box, for obvious reasons. If you want to track bleeding edge hardware, select a bleeding edge distro, not one that has a 2+ year release cycle and doesn't change much in between except for bug fixes. I WISH that this weren't the case -- I wish that Centos maintained the kernel and device list much more aggressively than they have in the past -- but they haven't and that's a simple fact. And yet, frequently, it doesn't matter (so don't interpret this as "Centos is bad"). 
d) How well does it provision libraries (or non-library packages, toolsets) I'm likely to need in my cluster's work, and how rapidly do those libraries vary? Note that this question can easily work EITHER WAY -- in some cases one wants NO variation so your proprietary applications get just the right library, period. In other cases your users will be sitting there growling at you and nagging you to build/add a current version of first this library, then that one as they use features that just aren't there in the library space of the long-lifetime distros after their first six months. Note that some libraries are rapidly varying and under continuous development with really important changes occurring with some frequency, while others have APIs that have varied little over years. Maybe your cluster applications need one type, maybe the other. Maybe (God help you) both. There is no set answer to this question, and it may even require a least of possible evils compromise. e) What do we use in our LAN for desktops and laptops? Again, this is almost a no brainer (and is a question that may not even occur to people who run an isolated "cluster compute center", rather than a cluster integrated with and immediately accessible to a departmental or research LAN). If all your LAN desktops are running Debian but your cluster is running two year old Centos, you're forcing them into a very unnatural and restrictive work model and you're going to be constantly pressured to backport from one to the other. You're also significantly increasing your work load as you try to cope with the differences between Debian management and Centos management. 
An "ideal" situation to craft is one where there is a smooth path from desktop to cluster, where a "make" of your source tree on one is pretty much guaranteed to produce run-ready binaries (including access to all required dynamic link libraries) for the other, within the boundaries of HARDWARE architecture, not distribution or binary compatibility model within a hardware type. In a lot of clusters it makes sense to completely flatten NFS space as well to further facilitate this -- project space that is commonly mounted across cluster and server and desktop so that even paths on the cluster match paths on your desktop. Yes, it is perfectly possible to get by without this, and some cluster architectures (especially "standalone" clusters, regular beowulf type clusters, computer center type clusters) work just fine without it, but there are lots of research clusters where this is the way they are set up, with little or no barrier between the desktop LAN and the cluster LAN. In the latter case, the time saved on development and testing and implementation vastly outweighs the time spent keeping e.g. Fedora up to date on cluster nodes, in part because you have to keep Fedora up to date on your desktops anyway, and the only difference between a cluster node and a desktop is the package selection in a kickstart file or post-install yum scripts. f) Sundry other issues, or "miscellaneous things to look at that I can't quite pin down here". For example, if you're building a cluster out of obsolete and underequipped systems (a thing I not infrequently advise e.g. high school students on offline) you may find that running the latest release of ANY distribution out of the box is quite difficult, but that you can get the linux on a rescue CD to boot quite nicely, or perhaps a really old version of some existing distribution. 
Or perhaps you're playing the "let's install a graphics card and use it to do computations" game, where the libraries and cross-compilers that enable it only exist pretty much for one distribution, maybe. Or your cluster is part of a grid, or part of a flat LAN or a standalone beowulf, or has to integrate with a Windows cluster where the best you can hope for for "compatibility" is cygwin-alikeness (so you pick the most cygwin-like distro you can). Every cluster is different, everybody's needs are different, and YMMV. Which is why none of the EXPERTS here are going to say "you should ALWAYS use Fedora 6, but only the version that was available three months ago before they screwed up the kernel and with openmpi built from fresh source as it isn't up to date in the form our users expect". The best choice isn't universal, it is dictated by your own degree of experience and knowledge, your design goals, your application mix, your cluster type, your cluster's general environment and support structure, and even then -- you can get ANY linux distro to WORK for your cluster even if you get one or more of these "wrong" in the sense that ex post facto they turn out to be suboptimal. Live and learn, in other words, and be prepared to experiment and change as it makes sense to do so. rgb > > Now I should say, that I don't use diskless nodes, each node has its own OS > disk and most have a separate /tmp disk for scratch use. That is one reason > that we differ on OS, I believe. > > > Mike > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From hahn at mcmaster.ca Mon Oct 8 14:00:52 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <866AF227-3044-433E-A301-1755D536A053@ee.duke.edu> <470A479B.9030800@vcu.edu> Message-ID: > "advantages". There is a narrow line between stability and stagnation, > and you have to figure out which side of that line your cluster will > fall on. Specifically, the fact that Centos/RHEL is frozen for two year > intervals has two disadvantages for some people: I think it's wise to always assume that you will be adding updated packages to your cluster, regardless of which distro you select. perhaps there are close-to-turn-key systems where this is not the case, but anything past a personal cluster is bound to require some fiddling. > a) The hardware it supports is left behind by the real leading edge of > hardware design. a valid point, though for the most part, it's really only the kernel that has to deal with edgy hardware issues. your version of glibc probably doesn't care, for instance. often, when people report HW issues with a distro, all they're really talking about is trouble booting the install kernel. of course, there's no need for cluster nodes to be using a distro's kernel at all, let alone the install-disk's one. > b) The libraries it provides are left behind by the real leading edge > of library development. Again, this can range in impact from "no big > deal" to "showstopper", depending on just what libraries your code uses. and whether you really care about what the distro does. IMO, any significant cluster should probably have its own versions of performance- and security-relevant libraries anyway.
the hardest part of having local versions is in deciding on a policy on when to update the versions and how to test. the actual download/patch/compile/install is a matter of a few minutes. > This isn't THAT big a deal, but it is very definitely an added "cost" > and needs to be considered when making the decision. right. IMO, if you're really trying to eliminate costs, just fix on some distro and freeze the config entirely. that means not updating hardware, but you're pinching pennies, right? it also means minimizing your exposure to security issues, which should mean aggressive firewalling, limitation of user access, and minimization of the number of installed packages. for instance, why let users log in directly? sure, if there's a privilege-elevation exploit, it's probably doable from a batch job, but it still helps. I always like to see not-cluster-relevant packages removed, as well - probably no need for a printing subsystem, for instance, or any desktop packages like evolution. regards, mark hahn. From rgb at phy.duke.edu Mon Oct 8 14:09:35 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <916617.57800.qm@web30609.mail.mud.yahoo.com> References: <916617.57800.qm@web30609.mail.mud.yahoo.com> Message-ID: On Mon, 8 Oct 2007, Buccaneer for Hire. wrote: > My personal favorite? My laptop runs Fedora 7. Yeah, mine too...;-) My own experience regarding back vs forward porting -- In many cases one simply cannot backport, because the libraries you need aren't there and ain't a-gonna be there unless you do WAY more work than you EVER want to do. There is a reason distributions are "distributions", after all... Forward porting, on the other hand, tends to be pretty easy. Distributions rarely lose CAPABILITY even when they have different header files or when one command/subroutine call is deprecated and eventually obsoleted in favor of another.
I maintain some truly ancient sources (e.g. jove), and I've had little difficulty getting those sources to build under FC-whatever even when I've had to go in and hack one particular subroutine call out in favor of another in five source files, or change the include headers so things are still findable at build time. And a LOT of sources -- pretty much any sources written in anything approximating what is laughably considered "portable" ANSI or Posix compatible C -- just build. Type "make", stand back, possibly after doing the ./configure --prefix=/usr thing, possibly not. But still, as I note in a previous response, there is no knee-jerk correct answer to the question "what is the best distro for a cluster". For most people, the best one is the one they are the most comfortable with, installing, managing, building software for. For people that know more than one distro (or are confident they can learn whatever they need to to run additional distros or change distros), well, there are all of those "questions" to ask yourself and try to objectively answer for the distros you wish to consider -- how scalable is its install, how well-provisioned is it in terms of libraries, how aggressively do you need to upgrade it (nothing MAKES you upgrade ANY cluster that is network isolated and fully functional as it is, even if it is running Fedora whatever and there is no longer any update stream for it), how well does it match your hardware requirements, what's the relative cost in work to BOTH manage it AND use it in comparison to alternatives, what kind of environment is it running in (and are there any hidden economies of scale there) and are there any one-of-a-kind gotchas to think about? rgb -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From ajt at rri.sari.ac.uk Mon Oct 8 14:25:07 2007 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <470A5D75.8030903@rri.sari.ac.uk> Message-ID: <470AA033.1050609@rri.sari.ac.uk> Mark Hahn wrote: >> What I like about APT (the Debian package manager) is the dependency checking and conflict resolution capabilities of "aptitude", which is more robust than > > I'm curious - how does a conflict happen, and how is it resolved? > I guess that this must have to do with packages which specify particular versions of packages they depend on, or perhaps minimal > versions of them. but a conflict seems to imply that you'd have a package which ultimately has conflicting dependencies. Hello, Mark. Debian has a very large range of packages, some of which are known to conflict a priori because they solve similar problems, but they are maintained by different people with different objectives. Each package maintainer has a responsibility to declare known conflicts with other packages. However, the permutations of packages that could be combined in an installation are also very large. APT tracks files that are used in all packages and automatically detects conflicts. The APT conflict resolver suggests strategies to solve problems like this, and also the dependencies that are caused by package upgrades. It's not a new idea, but it works very well. This statement is in the current Debian release notes: "2.1.1 Package management aptitude is the preferred program for package management from console. aptitude supports most command line operations of apt-get and has proven to be better at dependency resolution than apt-get. 
If you are still using dselect, you should switch to aptitude as the official frontend for package management. For etch an advanced conflict resolving mechanism has been implemented in aptitude that will try to find the best solution if conflicts are detected because of changes in dependencies between packages." http://www.debian.org/releases/stable/i386/release-notes/ch-whats-new.en.html > how often does this happen, and is it mainly the result of misconfiguration? It doesn't happen very often, but when it does "aptitude" has got me out of several deep holes... It's not because of misconfiguration. It's because the 'topological' map of package dependencies is complex, and not all possible interactions between all packages can be anticipated because the search space of package combinations is extremely large. > and resolution is simply having multiple version of some depended-on package installed, right? Sounds so easy, doesn't it ;-) >> the older "apt-get". I previously ran Red Hat 5.3->9 and I've used both "up2date" and "yum". Neither of these is as capable of resolving package conflicts and dependencies as APT. I used APT for RPM when I ran RH9 for exactly this reason. > > hmm, I've never had any problems with yum. Fair enough, but I found that it placed more of the burden of package management on me than I wanted. APT for RPM was wonderful by comparison. I used it for several years until the Fedora Legacy Archive stopped supporting RH9. At this point I had already been evaluating Debian on one of our Beowulf servers, but I decided to use Ubuntu instead. Best wishes, Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. 
| fax:+44 (0)1224 716687 From ajt at rri.sari.ac.uk Mon Oct 8 14:28:59 2007 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A725B.8040300@nada.kth.se> References: <20071008124403.M2100@mail.ipm.ir> <470A5D75.8030903@rri.sari.ac.uk> <470A725B.8040300@nada.kth.se> Message-ID: <470AA11B.5070807@rri.sari.ac.uk> Jon Tegner wrote: > Tony Travis wrote: >> >> I also prefer Debian-based distro's and still run the openMosix kernel >> under an Ubuntu 6.06.1 LTS server installation on our Beowulf cluster. >> >> What I like about APT (the Debian package manager) is the dependency >> checking and conflict resolution capabilities of "aptitude", which is >> more robust than the older "apt-get". I previously ran Red Hat 5.3->9 >> and I've used both "up2date" and "yum". Neither of these is as capable >> of resolving package conflicts and dependencies as APT. I used APT for >> RPM when I ran RH9 for exactly this reason. >> >> In my opinion, the package management system is a very important >> factor to take into account when choosing a distribution, as well as >> the range of tried and tested binary packages that are available. In >> that respect, Debian/Ubuntu has a lot to recommend it. >> >> Tony. > Hi, > > regarding ubuntu, has anyone tried their adoption of kickstart? Hello, Jon. I just PXE boot an NFSROOT linux-2.4.26-om1 kernel on the nodes, and use UNFS3 with ClusterNFS enabled to share the root filesystem of one of the servers with the nodes. Best wishes, Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. 
| fax:+44 (0)1224 716687 From ajt at rri.sari.ac.uk Mon Oct 8 15:43:39 2007 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <470A5D75.8030903@rri.sari.ac.uk> Message-ID: <470AB29B.5040506@rri.sari.ac.uk> Robert G. Brown wrote: > [...] > It is worth noting that (while yes, up2date sucks and has always sucked) > yum in FC 7 is a far, far cry from yum in RH 9. Dependency hell is > always a bad thing, but very, very few people have experienced it with > yum since maybe FC 4 or 5, if not earlier. Fair point - My experience of "yum" was in RH9. As I've mentioned here before I crashed and burned trying to upgrade RH9 to FC2. I had better experience with Debian, and I wanted to use NEBC's Bio-Linux packages: http://envgen.nox.ac.uk/biolinux.html These are Debian binaries - not the rpm biolinux repository: http://www.biolinux.org/wiki/index.php/Main_Page This set me off down the Debian/Ubuntu path, which I had already decided to explore anyway. I'm running an openMosix kernel (linux-2.4.26-om1) under 'biobuntu' (NEBC Bio-Linux + Ubuntu) on our Beowulf: http://bioinformatics.rri.sari.ac.uk We support VNC logins via SSH, and use lots of desktop applications. I realise this influences my view about what is the 'best' distribution, and why the package manager is so important. This is a small (92 node) cluster, not 'BIG' iron like many people on this list run! However, it's fairly typical of the sort of DIY cluster the discussion is about... Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687 From buccaneer at rocketmail.com Mon Oct 8 16:05:47 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) 
Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470AB29B.5040506@rri.sari.ac.uk> Message-ID: <425916.21533.qm@web30611.mail.mud.yahoo.com> --- Tony Travis wrote: > We support VNC logins via SSH, and use lots of > desktop applications. I > realise this influences my view about what is the > 'best' distribution, > and why the package manager is so important. This is > a small (92 node) > cluster, not 'BIG' iron like many people on this > list run! However, it's > fairly typical of the sort of DIY cluster the > discussion is about... I use packages as much as I can and will spend the time building RPMs if needed, because of all the benefits it brings to the table. YUM is great. The new versions of RHEL will now be using YUM as well. I think how you manage and maintain your cluster (whatever the size) determines if you go home at a reasonable hour and get to stay home on weekends. And it determines how much processing is performed, and that after all is why we have our jobs. From csamuel at vpac.org Mon Oct 8 17:20:55 2007 From: csamuel at vpac.org (Chris Samuel) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] Odd Infiniband scaling behaviour - *SOLVED* - MVAPICH2 problem In-Reply-To: <200710081524.55374.csamuel@vpac.org> References: <200710081524.55374.csamuel@vpac.org> Message-ID: <200710091020.55874.csamuel@vpac.org> On Mon, 8 Oct 2007, Chris Samuel wrote: > If I then run 2 x 4 CPU jobs of the *same* problem, they all run at > 50% CPU. With big thanks to Mark Hahn, this problem is solved. Infiniband is exonerated; it was the MPI stack that was the problem! Mark suggested that this sounded like a CPU affinity problem, and he was right.
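A symptom like this can be checked directly on a node. The following is only a rough sketch in Python, not something taken from the thread: `allowed_cpus` and `expected_cpu_share` are illustrative names, and the arithmetic assumes that every busy process shares the same affinity mask. `os.sched_getaffinity` is a standard-library call available on Linux.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids a process may run on (pid 0 = calling process, Linux only)."""
    return sorted(os.sched_getaffinity(pid))


def expected_cpu_share(num_processes, num_cores_in_mask):
    """Upper bound on per-process CPU share when busy processes share one affinity mask.

    Assumption: all processes are CPU-bound and pinned to the same set of cores.
    """
    if num_processes <= 0 or num_cores_in_mask <= 0:
        raise ValueError("both counts must be positive")
    return min(1.0, num_cores_in_mask / num_processes)


if __name__ == "__main__":
    # Which cores is this process allowed to use?
    print("allowed CPUs:", allowed_cpus())
    # 2 jobs x 4 ranks, all pinned to the same 4 cores: each rank gets at most half a core.
    print("8 ranks on 4 cores:", expected_cpu_share(8, 4))
```

If every rank of every job reports the same small set of allowed CPUs while other cores sit idle, an affinity mask is the likely cause of the reduced per-process utilisation.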
Turns out that when you build MVAPICH2 (in our case mvapich2-0.9.8p3) on an AMD64 or EM64T system it defaults to compiling in and enabling CPU affinity support. So if we take an example of 4 x 2 CPU jobs, it has the unfortunate effect of binding all those MPI processes to the first 2 cores in the system - hence why we see only 25% CPU utilisation per process (watched via top, and evident from the comparative run time). Fortunately though it does check the user's environment for the variable MV2_ENABLE_AFFINITY and if that is set to 0 then the affinity setting is bypassed. So simply modifying my PBS script to include: export MV2_ENABLE_AFFINITY=0 before using mpiexec [1] to launch the jobs results in a properly performing system again! I'm currently running 4 x 2 CPU NAMD jobs and they're back to properly consuming 100% CPU per process. Phew! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From john.leidel at gmail.com Mon Oct 8 19:51:45 2007 From: john.leidel at gmail.com (John Leidel) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] Supercomputing Conference Public Calendar Message-ID: <1191898305.5082.35.camel@e521.site> All, just wanted to inform the community that I've just posted a public Google calendar of events for SC07. I plan on updating it with the various public parties and gatherings during the week of Nov 10-16 out in Reno. [I always seem to miss the really fun events...]. I thought it might be easier to have a single point of reference for all the various goings-on of SC.
The only catch is, I need those in the community to send me details of their events/BOFs/gatherings/etc so I may post them in the public calendar. For details on posting events and links to the calendar, see the thread at insideHPC : http://insidehpc.com/2007/10/08/sc07-conference-public-events-calendar/ BTW, thanks to Donald Becker for jump-starting the idea with his post on the Beowulf event. cheers john From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Oct 9 03:36:15 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: On Mon, 8 Oct 2007, Robert G. Brown wrote: > RHEL/Centos are good where vendors require "binary compatibility" on > closed source software, as the standard of said binary > compatibility. What strikes me in this whole discussion is the idea of 'one distribution fits all' when applied to all nodes of a cluster and all applications that run on that cluster. In the days of PXE booting, with several solutions readily available for either building a node from scratch (like kickstart) or booting a prebuilt setup with NFS-root or ramdisk, what's so difficult about matching, on request, a node, an application and a distribution/custom setup? Real case: A quantum mechanics code that we bought some years ago was provided only as statically linked binaries. They worked fine on the distros current at that time and we used them successfully on CentOS-3 (2.4 kernel). However we discovered the hard way on the new CentOS-5 (2.6 kernel) that the statically linked binaries didn't work anymore as the kernel interfaces have changed - but, after a few lines were changed in the config files and the nodes rebooted, the binaries were again happily running in their required configuration.
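The matching of node, application and setup described above can be pictured as two lookup tables. This is a minimal sketch under stated assumptions: the application names, image names and node names are hypothetical, and nothing here reflects the actual configuration files or queueing system used at the poster's site.

```python
# Hypothetical tables for illustration only; not taken from any real site configuration.
APP_IMAGE = {
    "qm-static": "centos3",  # old statically linked binaries need the 2.4-kernel setup
    "namd": "centos5",
}

NODE_IMAGES = {
    "node01": {"centos3", "centos5"},
    "node02": {"centos5"},
    "node03": {"centos3", "centos5"},
}


def nodes_for(app, count):
    """Return (image, nodes): the image `app` requires and `count` nodes able to boot it."""
    image = APP_IMAGE[app]
    candidates = sorted(n for n, images in NODE_IMAGES.items() if image in images)
    if len(candidates) < count:
        raise RuntimeError(f"only {len(candidates)} nodes can boot {image!r}")
    return image, candidates[:count]


if __name__ == "__main__":
    print(nodes_for("qm-static", 2))
```

In a real setup the second table would be maintained by the admin, and the returned image name would select the PXE boot configuration for each chosen node.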
Of course, the admin is responsible for defining which distributions/custom setups can run on a certain node, based on the hardware of that node and the kernel of the distribution/custom setup. But after this is done, the user can limit his/her jobs to running on these nodes or ask the queueing system to set up a node according to the requirements of the job (I think the term is 'provisioning'). Sure, it helps in this case to run a distribution with long support (like RHEL/CentOS/SL, SLES or Ubuntu LTS) such that you don't have to waste too much time yourself with updates, especially security-related ones. > Far short of Debian, but plenty big enough to include just about all > mainstream useful packages for any cluster or LAN. I'm making sure that any cluster-related package that is part of the default distribution is not part of what the nodes get to run. Why? Because very often the common-ground options used for building the package (which is a good idea for a widely used distribution) don't fit _my_ setup. So, I take the fact that the distribution offers me all the needed tools as a fallback, but I'm always trying to match as well as possible all the components. And if you search the archives of the LAM/MPI mailing lists you'll see the larger picture... -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From deadline at eadline.org Tue Oct 9 05:32:19 2007 From: deadline at eadline.org (Douglas Eadline) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best Linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> Excellent point.
I have often thought that "diskless" provisioning opens up lots of opportunities to create custom node groups based on kernels or distributions. Throw in a virtualized head node and many ISV requirements could be handled this way e.g. a virtualized Suse environment running on top of Red Hat could request 32 Suse nodes from the scheduler (running under a Red Hat instance). The scheduler just provisions nodes as needed and sets them in a low power state when not being used. Going with fully virtualized nodes is another option provided the applications are still close to hardware. Note that diskless provisioning does not imply diskless nodes, if you need local drives, then you can still use them in a diskless booting scheme. Not nailing an OS to the hard drive on cluster nodes has lots of advantages. -- Doug From deadline at clustermonkey.net Tue Oct 9 05:37:44 2007 From: deadline at clustermonkey.net (Douglas Eadline) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] MorphMPI article on ClusterMonkey Message-ID: <52364.192.168.1.1.1191933464.squirrel@mail.eadline.org> Toon Knapen has written an article on Morph MPI on ClusterMonkey: Unleashing MPI: MorphMPI http://www.clustermonkey.net//content/view/213/32/ The concept was discussed a while back on the list. You might also be interested in the Microwulf article as well: Microwulf: Breaking the $100/GFLOP Barrier http://www.clustermonkey.net//content/view/211/33/ -- Doug From ajt at rri.sari.ac.uk Tue Oct 9 05:42:18 2007 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best linux distribution In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> Message-ID: <470B772A.5070208@rri.sari.ac.uk> Bogdan Costescu wrote: > On Mon, 8 Oct 2007, Robert G. Brown wrote: > >> RHEL/Centos are good where vendors require "binary compatibility" on >> closed source software, as the standard of said binary compatibility. > > What strikes me in this whole discussion is the ideea of 'one > distribution fits all' when applied to all nodes of a cluster and all > applications that run on that cluster.
In the days of PXE booting, with > several solutions readily available for either building a node from > scratch (like kickstart) or booting a prebuilt setup with NFS-root or > ramdisk, what's so difficult in matching on request a node, an > application and a distribution/custom setup ? > [...] Hello, Bogdan. I use a combination of PXE booting, NFSROOT and ClusterNFS to set up individual 'dataless' nodes the way I want - I'm using UNFS3, with ClusterNFS extensions: http://unfs3.sourceforge.net/ I know NFS gets a panning on this list, but it suits our needs! Best wishes, Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687 From rgb at phy.duke.edu Tue Oct 9 06:01:28 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] best Linux distribution In-Reply-To: <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> Message-ID: On Tue, 9 Oct 2007, Douglas Eadline wrote: > > Excellent point. I have often thought that "diskless" provisioning > opens up lots of opportunities to create custom node groups > based on kernels or distributions. Throw in a virtualized > head node and many ISV requirements could be handled this way > e.g. a virtualized Suse environment running on top of Red Hat > could request 32 Suse nodes from the scheduler (running under a > Red Hat instance). The scheduler just provisions nodes as needed > and sets them in a low power state when not being used. > Going with fully virtualized nodes is another option provided > the applications are still close to hardware. 
> > Note that diskless provisioning does not imply diskless nodes, > if you need local drives, then you can still use them in a > diskless booting scheme. Not nailing an OS to the hard drive > on cluster nodes has lots of advantages. This has been the subject for lots of Real Computer Science, some of it done by Jeff Chase and students here at Duke (including Greg Lindahl's ex-student, Justin Moore). See "Cluster On Demand" here: http://www.cs.duke.edu/nicl/cod/ (and there are various other links GIYF). COD is basically (as I understand it) a layer of automated "provisioning software" that takes a user's resource request (written IIRC in xmlish but I could easily be wrong), creates a one-time cluster boot image that satisfies it, allocates nodes from a large, generic pool, reboots them into the boot image, connects them with suitable workspace (part of the provisioning), and even starts up the user's job(s) on them. Or not. The provisioning is pretty much OS-neutral. Want a Windows cluster? No problem (licensing permitting, of course). Solaris? If it will run on the hardware and is supported, sure. Linux obviously -- any flavor, any size, licensed as needed or not. Ditto all the other free and open source OS's. If you have specific needs for libraries, tools, memory, processor count or cores, networking (and the needs can be met within the cluster pool) it will allocate nodes, provision them, and crank them up for you. One of several GOOD things about this is that your nodes GO AWAY after you are done with them. Doing top-secret work for NSA? Once you're done (especially with diskless provisioning) there isn't even a disk image left on the nodes to be reconstructed by means of advanced magnetic analysis... Provisioning really doesn't take very long any more. Diskless almost no time at all, but provisioning a full local boot image needn't take very long either. 
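The allocate, boot and give-back cycle described above can be modelled in a few lines. This is a toy sketch only: `NodePool` and its methods are invented for illustration and are not the Cluster On Demand interface; recording an image name stands in for actually rebooting a node into that image.

```python
class NodePool:
    """Toy model of a generic node pool; NOT the Cluster On Demand API."""

    def __init__(self, nodes):
        self.free = set(nodes)  # nodes nobody is using
        self.booted = {}        # node -> image it is currently running

    def allocate(self, image, count):
        """Take `count` free nodes and mark them as booted into `image`."""
        if count > len(self.free):
            raise RuntimeError("not enough free nodes in the pool")
        chosen = sorted(self.free)[:count]
        for node in chosen:
            self.free.remove(node)
            self.booted[node] = image  # stands in for "reboot into the boot image"
        return chosen

    def release(self, nodes):
        """Return nodes to the pool; their image is forgotten."""
        for node in nodes:
            del self.booted[node]
            self.free.add(node)


if __name__ == "__main__":
    pool = NodePool(["n1", "n2", "n3", "n4"])
    mine = pool.allocate("suse", 2)
    print("allocated:", mine, "free:", sorted(pool.free))
    pool.release(mine)
    print("after release, free:", sorted(pool.free))
```

The point of the model is that an image belongs to a request, not to a node: once the nodes are released nothing about the previous setup remains attached to them.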
I don't know the status of this project, but just wanted to point out that this is going on and that one day we may yet see a full open source solution built in to Linux (as Linux is a very reasonable choice for a toplevel platform to run this). All of this can obviously be done by hand, but it's the automation part that is interesting. And of course the advent of serious VM with processor level support means that we will shortly have even more options -- a whole second way of doing it NOW is to create and provision portable VMs and run them under e.g. VMware or whatever. rgb -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C.
27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From laytonjb at charter.net Tue Oct 9 06:32:05 2007 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Mon Mar 15 01:06:32 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> Message-ID: <470B82D5.9050002@charter.net> The recent emails from rgb and Doug lead me to a question. Has anyone tested codes running under a VM versus running them "natively" on the hardware (native isn't a good word and I hope everyone gets my meaning)? The last word I heard is that performance takes a substantial hit if you are running a code in a VM. Some of the reasons are that the code has only virtualized access to the hardware (particularly the NICs) and memory management is a bit more difficult (although Barcelona with nested page tables should help there). I do remember VirtualIron saying that they had IB drivers so that the VM had direct access to the hardware. Thanks! Jeff From landman at scalableinformatics.com Tue Oct 9 06:51:48 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Mon Mar 15 01:06:32 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B82D5.9050002@charter.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: <470B8774.6020401@scalableinformatics.com> Jeffrey B. Layton wrote: > The recent emails from rgb and Doug lead me to a question. 
Has anyone > tested codes running under a VM versus running them "natively" on > the hardware (native isn't a good word and I hope everyone gets my > meaning)? The last word I heard is that performance takes a substantial You have two major types, the heavyweight emulators (VMware, et al) and the lighter weight hypervisors. I have seen studies of various codes that show not a huge hit ... the caveat being that these codes spent very little time in system calls (usually thunked to the host somehow) and most of the time computing. > hit if you are running a code in a VM. Some of the reasons are that the > code has only virtualized access to the hardware (particularly the NICs) > and memory management is a bit more difficult (although Barcelona with > nested page tables should help there). I do remember VirtualIron saying > that they had IB drivers so that the VM had direct access to the hardware. The more layers you have to pass through, the lower your overall performance (of course). > > Thanks! 
> > Jeff > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From andrew at moonet.co.uk Tue Oct 9 07:48:03 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:32 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B8774.6020401@scalableinformatics.com> References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> <470B8774.6020401@scalableinformatics.com> Message-ID: Seems to be some excitement in Europe over Xen, a paravirtualisation package. Ta Andy On 09/10/2007, Joe Landman wrote: > > > Jeffrey B. Layton wrote: > > The recent emails from rgb and Doug lead me to a question. Has anyone > > tested codes running under a VM versus running them "natively" on > > the hardware (native isn't a good word and I hope everyone gets my > > meaning)? The last word I heard is that performance takes a substantial > > You have two major types, the heavyweight emulators (VMware, et al) and > the lighter weight hypervisors. I have seen studies of various codes > that show not a huge hit ... the caveat being that these codes spent > very little time in system calls (usually thunked to the host somehow) > and most of the time computing. > > > hit if you are running a code in a VM. 
Some of the reasons are that the > > code has only virtualized access to the hardware (particularly the NICs) > > and memory management is a bit more difficult (although Barcelona with > > nested page tables should help there). I do remember VirtualIron saying > > that they had IB drivers so that the VM had direct access to the hardware. > > The more layers you have to pass through, the lower your overall > performance (of course). > > > > > Thanks! > > > > Jeff > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From deadline at eadline.org Tue Oct 9 08:13:28 2007 From: deadline at eadline.org (Douglas Eadline) Date: Mon Mar 15 01:06:32 2010 Subject: [Beowulf] Writing Opportunities Message-ID: <50644.192.168.1.1.1191942808.squirrel@mail.eadline.org> If anyone is interested in doing some paid writing about clusters, I have some opportunities available. The content is published on a micro-site as part of Linux Magazine: http://www.linux-mag.com/launchpad/business-class-hpc/ I also have been tasked with generating some multi-core content as well. The multi-core topics will not necessarily be about HPC, but be more general in nature. This effort also has a budget for writers. Contact me directly if you are interested. -- Doug From rgb at phy.duke.edu Tue Oct 9 08:27:10 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Mon Mar 15 01:06:32 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B82D5.9050002@charter.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: On Tue, 9 Oct 2007, Jeffrey B. Layton wrote: > The recent emails from rgb and Doug lead me to a question. Has anyone > tested codes running under a VM versus running them "natively" on > the hardware (native isn't a good word and I hope everyone gets my > meaning)? The last word I heard is that performance takes a substantial > hit if you are running a code in a VM. Some of the reasons are that the > code has only virtualized access to the hardware (particularly the NICs) > and memory management is a bit more difficult (although Barcelona with > nested page tables should help there). I do remember VirtualIron saying > that they had IB drivers so that the VM had direct access to the hardware. > > Thanks! I'd actually expect that as always YMMV. For CPU bound code, I don't expect that it WOULD take a hit on modern CPUs with hardware support for virtualization, although yes you'll pay a penalty for adding what amounts to an additional superscheduler on top of the kernel scheduler per se. That is, the toplevel host OS has to get enough cycles that its scheduler doesn't starve because it may be multitasking the VM(s) with pretty much whatever you like. So there is the baseline overhead of a single instance of Linux as a host OS running the VM manager plus whatever tasks you choose to run there (which may be none at all). 
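[Editorial sketch, not part of the original message.] The baseline-overhead argument can be put in numbers. The short Python sketch below uses the figures rgb gives in the next paragraph (about 2100 CPU-minutes accumulated by the VM manager over roughly 150 days). The four-core figure is an assumption that "dual CPU, dual core" means 2 sockets x 2 cores; the function name is hypothetical.

```python
# Rough check of the VM-manager overhead estimate from the message.
# Inputs are the approximate figures quoted in the thread; nothing here is measured.

MINUTES_PER_DAY = 24 * 60


def overhead_fraction(cpu_minutes, wall_days, cores=1):
    """Fraction of the available CPU time consumed over the period."""
    available_minutes = wall_days * MINUTES_PER_DAY * cores
    return cpu_minutes / available_minutes


# Share of a single core, which is how the message states the result.
one_core = overhead_fraction(2100, 150)

# Share of the whole box, ASSUMING 2 sockets x 2 cores = 4 cores.
whole_box = overhead_fraction(2100, 150, cores=4)

print(f"CPU time used: {2100 / MINUTES_PER_DAY:.2f} days")
print(f"share of one core:   {one_core:.2%}")
print(f"share of four cores: {whole_box:.2%}")
```

Measured against one core the overhead comes out just under 1%, consistent with the estimate in the message; spread over all four cores it is about a quarter of that.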
To give you a very precise idea of the average load associated with the VM manager itself, VMware has been running on a server I help control since April 18th (without a system reboot, although it is due to come down tomorrow for a memory upgrade so you're lucky you asked this today:-). For most of the time from May on, it has been running a VM instance of Linux pretty much full time (the system itself is a dual CPU, dual core IBM server). In all that time, VMware has accumulated about 2100 minutes of CPU, or about a day and a half. Lessee, May, June, July, August, September -- call it 150 days, give or take -- and we see that it has cost around 1% in net overhead. Not, actually a huge burden. However, this system has idle cores at all times and hence does not thrash. CPU bound code, especially on a multicore system with a core to devote to the host OS, probably will not slow down "at all" in a VM partition, although I can probably test that if you give me a day or three (I'm working on a multicore laptop, but I don't have a working linux VM on the system at this particular moment). My other laptop has VM Workstation on it but only a single core. I can probably borrow cycles on a failover server to run some straight numerical benchmark tests but not before Thursday or Friday. Network bound code is a different matter altogether. VMs present the running guest OS with a "virtual" hardware layer that is wrapped up to look like a "standard" widely supported network adapter, one likely to have readily available drivers in any OS. In addition, it can insert itself between the VM's "network" -- an entirely artificial private internal network -- and the real world, acting as a NAT/gateway for the VM guest, or it can actually give over the guest interface to e.g. DHCP or static addressing on the host system's network. 
Clearly there are differing amounts of work being done by the intermediary host OS in these cases, but pretty much all of them will add LATENCY to any sort of connection, and may eat a bit of bandwidth as well. Again, I'm happy to run e.g. netpipe to test this, in a couple of days, but I'd predict 1.5-3x the native latency of ethernet and 95% of the bandwidth accessible to the (unloaded) host OS, lessened by contention and thrashing caused by multiple VMs sharing an interface. Disk performance can vary quite a bit, as VMs can mount native partitions or e.g. NFS partitions with performance that should be at least comparable to native performance or NFS performance (modified by network efficiency in the latter case), OR all VM disk can itself be virtual. In the case of VMware you have two distinct categories of the latter -- preallocated disk (the "disk" for the VM is basically a big, fixed size file) and "growable" disk (the "disk" is still a file, but it starts out at modest size and then dynamically grows as you fill it up to some maximum). In order this is fast (almost as fast as native), slow, and slower, for again fairly obvious reasons. Nevertheless, there are plenty of times one might want to run the VM in a relatively slow but growable file to minimize impact on your available disk resources while still in PRINCIPLE being able to create a big file without completely reprovisioning the entire VM. VMware (through its better toplevel interface) lets you reprovision things fairly freely with the exception of the fixed-resizable transition. You can add processors (or cores), memory (within sane bounds) and so on to any quiescent VM. It is truly a pretty awesome tool, although IMO it is too expensive by a factor of 3-4. It is the usual thing -- at $50 a seat full retail in single-seat quantities, $25 a seat academic for VMware Workstation, I think they'd sell a zillion seats because it is SO damn useful. 
Who wouldn't pay $50 to be able to never dual boot their box again? Pop Windows in right where it belongs as a Linux task, add a few other Linux VMs for e.g. code prototyping or node prototyping or putting up a webserver or FTP server that CANNOT compromise your host OS and your valuable data even if an exploit goes unpatched. But for $189 list and well over $100 either academic or bulk, lots and lots of people that might otherwise buy it don't, and VMware fails to take over the world. As I like to say, if Sun Microsystems had sold Unix for the Intel architecture for $50 full list, $25 academic back in 1988 (at which time it should be noted that they HAD a Unix that would run on the 386 and where these prices would have undercut Microsoft's prices for DOS), then Sun would be Microsoft and Microsoft would either be selling unix or out of business. Products that never would have been developed include Windows, OS/2, NT, and -- Linux. Maybe, just maybe, BSD would have still survived. VMware has a similar opportunity now, but the window (so to speak) is rapidly closing as Microsoft is due to co-opt the entire VM market any day now using their standard strategy. VMware has created and proven the market, chip manufacturers have moved to support it, so MS will now implement their own version of VM, ensure that it only works well for Windows VM guests (and may not work at all for other guests), change their licensing to make it more or less illegal to run Windows guests under other VMs (already underway), use their sales channels to insert their product cheaply everywhere, sow some judicious FUD about how their product is secure, reliable, and supported and its competitors are not (backed up with ominous rumblings about license violations, DRM, and legal action). 
In six months, a year tops, they own 70% of the market, and its open and closed competitors spiral down to gradual extinction as virtualization is a key component of modern server provisioning and failover and not even Linux can hold its own in the server room if Microsoft (say) makes it a license violation to run its Server products in a non-Microsoft VM manager. You heard it here first, folks -- pure crystal ball stuff determined by my awesome psychic powers. VMware's window to beat this absolutely standard operating procedure (used time and time again by Microsoft under precisely these conditions) is probably measured in months. If it weren't for the Vista debacle, it would probably be underway already, as Vista introduced the first of the necessary licensing changes, but until MS's clients are comfortable paying through the nose for Vista business class licenses or better in order to virtualize at all, and until the stink from Vista's amazingly poor performance goes away, they're laying low. VMware needs volume -- massive volume. Volume at the consumer level, to have a CHANCE of creating a bit of a consumer revolution, volume at the server level. Ubiquity. They need low margin high volume sales, not high margin low volume sales. It might already be too late, but then -- it might not! But Alas, nobody listens to an Oracle... rgb (...and yes, fully open source Xen and KVM are safe enough topically as open source products never really go away even if they DO stop being overtly profitable in 80-90% of all instances, but by the time the billion dollar antitrust lawsuit winds down and Microsoft loses (some 2-4 years from now) the issue really will be moot. And VMware doesn't have that luxury -- the dark side of closed licensing of a product that Microsoft "wants" is that they are trying to beat Microsoft at its own game, which is basically impossible. 
After all, it owns the playing field and the umpires and the balls and the bats, and the world series is over long before an appeal to the commissioners has a chance of being heard.) > > Jeff > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Oct 9 08:46:11 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B82D5.9050002@charter.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: On Tue, 9 Oct 2007, Jeffrey B. Layton wrote: > The last word I heard is that performance takes a substantial hit if > you are running a code in a VM. I attended a talk some years ago by the Bochs author and I remember him saying that one possibility of speeding up the virtual code is to cache the code after "processing" it the first time - this could work very well for a computational code if the performance sensitive inner loop is treated this way. This was especially important for cases where the virtual and native hardware were not the same architecture. > Some of the reasons are that the code has only virtualized access to > the hardware (particularly the NICs) It's the same old problem like user space drivers... You can have a domU Xen guest with direct access to a specific hardware device. 
Sure, you have to restrict the hardware device to a specific Xen guest, so that neither the host nor any other Xen guests can use that device anymore, but this allows for the "lucky" Xen guest the fast access to the hardware. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From tjrc at sanger.ac.uk Tue Oct 9 08:46:35 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> <470B8774.6020401@scalableinformatics.com> Message-ID: On 9 Oct 2007, at 3:48 pm, andrew holway wrote: > Seems to be some excitement in Europe over Xen, a > paravirtualisation package. I ran our bioinformatics benchmark suite on some of our old RLX blades running Xen, just to see what the performance hit was. It was only about 2%, running a mixture of BLAST, HMMER, genewise and exonerate, many of which are fairly I/O intensive. I was quite surprised it was as small a hit as that. That said, I'm still not using virtualisation on our production clusters. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From anajafi at mail.ipm.ir Mon Oct 8 07:39:18 2007 From: anajafi at mail.ipm.ir (Seyed Abouzar Najafi Shoshtari) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] best linux distribution - precisely! Message-ID: <20071008143817.M32781@mail.ipm.ir> Thank you all for your replies. 
Actually, the software we would like to use works fine on every Linux distribution. It is mostly astrophysical data reduction software and has been tested before on all kinds of distributions. But my question was actually about the choice of the hardware I listed. In other words, do you have any experience with Linux on this particular CPU, namely the Intel 6600 Quadcore 8MB, on an ASUS P5K motherboard? Abouzar From tim.carlson at pnl.gov Mon Oct 8 08:44:52 2007 From: tim.carlson at pnl.gov (Tim Carlson) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <200710081510.l98F9UWc021978@bluewest.scyld.com> References: <200710081510.l98F9UWc021978@bluewest.scyld.com> Message-ID: > From: Gerry Creager > Subject: Re: [Beowulf] best linux distribution > To: Seyed Abouzar Najafi Shoshtari > Cc: beowulf@beowulf.org > Message-ID: <470A2F4B.5050108@tamu.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > > Fedora core 6 is where I'd start today. SuSE 10.x is a very good second > choice. We've also tried ROCKS and haven't been too impressed. ROCKS > installs easily and replicates to the nodes but FC6 and kickstart is > just too easy and offers a bit more usability in our experience > > > gerry I'm not exactly sure what you mean by Rocks "replicates" to the nodes. All the nodes in a Rocks cluster are built using kickstart, and you can modify the package list, disk layout, etc. by editing XML files.
http://www.rocksclusters.org/rocks-documentation/4.3/customization.html Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson@pnl.gov Pacific Northwest National Laboratory HPCaNS: High Performance Computing and Networking Services From oplehto at csc.fi Mon Oct 8 08:55:58 2007 From: oplehto at csc.fi (Olli-Pekka Lehto) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Memory limit enforcement Message-ID: <470A530E.7090206@csc.fi> Hello, I'm interested in hearing some best practice solutions in fine-grained management memory resources on clusters of SMPs. How do you enforce real memory usage inside (RHEL-based) cluster nodes running multiple serial jobs simultaneously? More specifically, how to do this efficiently when some of the jobs map copious amounts of virtual memory but have only a fraction of it resident at any given time? As SMP systems keep getting constantly fatter (and the potential for users interfering with each others' jobs increasing) it would be great to have something like AIX's WLM (Workload Manager) on Linux to effectively manage intra-SMP resources. Olli-Pekka -- Olli-Pekka Lehto, Systems Specialist, Special Computing, CSC PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302 CSC is the Finnish IT Center for Science, www.csc.fi, e-mail: Olli-Pekka.Lehto@csc.fi From nvigier at mandriva.com Mon Oct 8 09:55:46 2007 From: nvigier at mandriva.com (Nicolas Vigier) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <20071008124403.M2100@mail.ipm.ir> References: <20071008124403.M2100@mail.ipm.ir> Message-ID: <20071008165546.GA29813@mandriva.com> On Mon, 08 Oct 2007, Seyed Abouzar Najafi Shoshtari wrote: > Dear beowulf experts, > > we are planning to build a beowulf cluster > with 8 nodes (8xCPUs, Intel 6600 Quadcore 8MB, 8GB RAM )and > a Dell-Server as the master node (2xCPU Xeon Quad Core 1.6GHz, 4TB Hard, 18GB > RAM). > > Which linux distribution would be ideal for our case? 
Hi, Any distribution which supports your hardware and has packages for MPI libraries and the other tools you want to use will work. But you might be interested in trying IGGI: http://wiki.mandriva.com/en/Releases/Mandriva/IGGI It's a distribution based on Mandriva 2006, which includes tools to ease the setup and deployment of a Beowulf cluster. The full documentation is available at this address: http://iggi.mandriva.com/ regards, Nicolas From b.wagman at comcast.net Mon Oct 8 10:17:15 2007 From: b.wagman at comcast.net (Barnet Wagman) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> Message-ID: <470A661B.1030904@comcast.net> I'm moving towards setting up a small cluster (my first), and am thinking about using Intel quad core processors. However, I'm a little concerned about memory contention. I'm (tentatively) going to have one processor per node (this appears to be the cheapest way to go), but I still wonder whether four cores will choke Intel's memory architecture. (AMD's Barcelona may be better in this regard, but it doesn't seem to be available yet, at least not through retail channels.) I'd like to hear any opinions on this issue. And if you've used quad core processors, I'd certainly like to hear about your experiences (including which processor you've used). thanks PS I couldn't find any discussion of Intel's quad core processors in the archive, but if I missed one, please point me to it.
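[Editorial sketch, not part of the original message.] One rough way to look at the memory-contention question on a given box is to time large memory copies with 1, 2 and 4 processes running at once and see whether the per-process rate drops. The Python sketch below is only an illustration under stated assumptions: the 32 MiB buffer size is an arbitrary choice meant to exceed the 8 MB cache mentioned earlier in the thread, the function name is hypothetical, and interpreter overhead means the absolute numbers will not match a proper benchmark such as STREAM.

```python
import time
from multiprocessing import Pool

# ASSUMPTION: 32 MiB per buffer, picked to be larger than an 8 MB cache.
BUF_BYTES = 32 * 1024 * 1024


def copy_bandwidth(n_bytes=BUF_BYTES, repeats=5):
    """Return the approximate copy rate in bytes copied per second for one process."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src  # one large memory-to-memory copy
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return n_bytes * repeats / elapsed


if __name__ == "__main__":
    # Run the same copy loop in 1, 2 and 4 processes at (roughly) the same time.
    for workers in (1, 2, 4):
        with Pool(workers) as pool:
            rates = pool.map(copy_bandwidth, [BUF_BYTES] * workers)
        per_proc = sum(rates) / workers / 1e9
        print(f"{workers} process(es): {per_proc:.2f} GB/s per process, "
              f"{sum(rates) / 1e9:.2f} GB/s aggregate")
```

If the aggregate rate stops growing as processes are added while the per-process rate falls, the cores are competing for memory bandwidth; whether that matters depends on how memory-bound the real application is.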
From kevin.ball at qlogic.com Mon Oct 8 10:18:42 2007 From: kevin.ball at qlogic.com (Kevin Ball) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Odd Infiniband scaling behaviour In-Reply-To: <200710081524.55374.csamuel@vpac.org> References: <200710081524.55374.csamuel@vpac.org> Message-ID: <1191863922.4868.317.camel@ammonite> Hi Chris, I'm not an expert on the Mellanox IB implementation, or MVAPICH, and I won't try to be. Gilead/someone else from Mellanox might be able to give you more specific information, or maybe the OSU guys if you email mvapich-discuss. I have two guesses, based on things I've seen on a range of networks. Guess 1) The numbers you cite look remarkably as though you are running on 8 core nodes with 8, 4, 2, and 1 cores active, and citing the % utilization of the entire node. If you run top without separating out into per-cpu load (via hitting '1' while top) you will, on some OS's (I think I've seen some variation of behavior) see utilization as a percentage of total available CPU. Thus 1 CPU on an 8 core node would show up as 12.5% of CPU. I'm not sure this is what you're seeing, especially since you're not seeing it with OpenMPI, but if the OpenMPI implementation you're using uses threads for various purposes, that might explain it. Guess 2) Particularly in network devices that offload communication from the CPU, if the MPI implementation uses an interrupt-driven communication approach you can get a lot of idle time while waiting for data to arrive. This can lead to very large amounts of idle time. An implementation that polls for data will not show this idle time, so you can see dramatic differences in CPU utilization even though with regards to the job at hand, the same amount of progress is being made. I think that the default configuration of MVAPICH does poll for data, so you would not see lots of idle CPU, but MVAPICH is configurable to the moon and back so how you have it built, I have no idea. Hope this helps! 
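[Editorial sketch, not part of the original message.] Guess 1 above is simple arithmetic: if a tool reports CPU use as a share of the whole node, a job that fully loads k of n cores shows up as k/n of 100%. The Python sketch below works that through for an 8-core node; the function name is hypothetical, and whether a given version of top reports this way is exactly the open question in the message.

```python
def node_share(busy_cores, total_cores, per_core_util=100.0):
    """CPU% a job would show if utilization is reported relative to the whole node."""
    return busy_cores * per_core_util / total_cores


# Each job in the thread uses 8, 4, 2 or 1 cores of an 8-core node.
for ranks in (8, 4, 2, 1):
    print(f"{ranks}-core job on an 8-core node: {node_share(ranks, 8):.1f}% of the node")
```

The resulting 100%, 50%, 25% and 12.5% line up with the figures reported for the 8, 4, 2 and 1 CPU jobs, which is why the per-CPU view in top is worth checking before suspecting the adapter.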
-Kevin On Sun, 2007-10-07 at 22:24, Chris Samuel wrote: > Hi fellow Beowulfers.. > > We're currently building an Opteron based IB cluster, and are seeing > some rather peculiar behaviour that has had us puzzled for a while. > > If I take a CPU bound application, like NAMD, I can run an 8 CPU job > on a single node and it pegs the CPUs at 100% (this is built using > Charm++ configured as an MPI system and using MVAPICH 0.9.8p3 > with the Portland Group Compilers). > > If I then run 2 x 4 CPU jobs of the *same* problem, they all run at > 50% CPU. > > If I run 4 x 2 CPU jobs, again the same problem, they run at 25%.. > > ..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU! > > I then replicated the same problem with the example MPI cpi.c program, > to rule out some odd behaviour in NAMD. > > What really surprised me was when testing CPI built using OpenMPI > (which doesn't use IB on our system) the problem vanished and I could > run 8 x 1 CPU jobs, each using 100%! > > So (at the moment) it looks like we're seeing some form of contention > on the Infiniband adapter.. > > 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0) > Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] > Flags: fast devsel, IRQ 19 > Memory at feb00000 (64-bit, non-prefetchable) [size=1M] > Memory at fd800000 (64-bit, prefetchable) [size=8M] > Capabilities: [40] Power Management version 2 > Capabilities: [48] Vital Product Data > Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- > Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 > Capabilities: [60] Express Endpoint IRQ 0 > > We see this problem with the standard CentOS kernel, with the latest > stable kernel (2.6.22.9) and with 2.6.23-rc9-git5 (which completely > rips out and replaced the CPU scheduler with Ingo Molnar's CFS). 
> > This is on a SuperMicro based system with AMD's Barcelona quad > core CPU (1.9GHz), but I see the same behaviour (scaled down) on dual > core Opterons too. > > I've looked at what "modinfo ib_mthca" says are the tuneable options, > but the few I've played with ("msi_x" and "tune_pci") haven't made > any noticeable difference, sadly.. > > Has anyone else run into this or got any clues they could pass on please ? > > cheers, > Chris From tegner at nada.kth.se Mon Oct 8 11:09:31 2007 From: tegner at nada.kth.se (Jon Tegner) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] best linux distribution In-Reply-To: <470A5D75.8030903@rri.sari.ac.uk> References: <20071008124403.M2100@mail.ipm.ir> <470A5D75.8030903@rri.sari.ac.uk> Message-ID: <470A725B.8040300@nada.kth.se> Tony Travis wrote: > > I also prefer Debian-based distro's and still run the openMosix kernel > under an Ubuntu 6.06.1 LTS server installation on our Beowulf cluster. > > What I like about APT (the Debian package manager) is the dependency > checking and conflict resolution capabilities of "aptitude", which is > more robust than the older "apt-get". I previously ran Red Hat 5.3->9 > and I've used both "up2date" and "yum". Neither of these is as capable > of resolving package conflicts and dependencies as APT. I used APT for > RPM when I ran RH9 for exactly this reason. > > In my opinion, the package management system is a very important > factor to take into account when choosing a distribution, as well as > the range of tried and tested binary packages that are available. In > that respect, Debian/Ubuntu has a lot to recommend it. > > Tony. Hi, regarding ubuntu, has anyone tried their adoption of kickstart? 
Regards, /jon From oplehto at csc.fi Tue Oct 9 07:15:09 2007 From: oplehto at csc.fi (Olli-Pekka Lehto) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B82D5.9050002@charter.net> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: <470B8CED.8040906@csc.fi> Jeffrey B. Layton wrote: > The recent emails from rgb and Doug lead me to a question. Has anyone > tested codes running under a VM versus running them "natively" on > the hardware (native isn't a good word and I hope everyone gets my > meaning)? The last word I heard is that performance takes a substantial > hit if you are running a code in a VM. Some of the reasons are that the > code has only virtualized access to the hardware (particularly the NICs) > and memory management is a bit more difficult (although Barcelona with > nested page tables should help there). I do remember VirtualIron saying > that they had IB drivers so that the VM had direct access to the hardware. > > Thanks! > > Jeff > I ran some comparisons this spring on a small 4*2P AMD Opteron cluster using VMWare Player and Workstation. The results seemed quite promising and I'd like to share them but the VMWare EULA restricts the publication of unvetted results: "You may use the Software to conduct internal performance testing and benchmarking studies, the results of which you (and not unauthorized third parties) may publish or publicly disseminate; provided that VMware has reviewed and approved of the methodology, assumptions and other parameters of the study. Please contact VMware at benchmark@vmware.com to request such review." I just sent a request to VMWare about this. Let's see what they say... 
-- Olli-Pekka Lehto, Systems Specialist, Special Computing, CSC PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302 CSC is the Finnish IT Center for Science, www.csc.fi, e-mail: Olli-Pekka.Lehto@csc.fi From wrankin at duke.edu Tue Oct 9 07:30:01 2007 From: wrankin at duke.edu (Bill Rankin) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B8774.6020401@scalableinformatics.com> References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> <470B8774.6020401@scalableinformatics.com> Message-ID: <62EA1D78-EE5F-490D-8595-FCB2AD1A32BB@duke.edu> On Oct 9, 2007, at 9:51 AM, Joe Landman wrote: > Jeffrey B. Layton wrote: >> The recent emails from rgb and Doug lead me to a question. Has anyone >> tested codes running under a VM versus running them "natively" on >> the hardware (native isn't a good word and I hope everyone gets my >> meaning)? The last word I heard is that performance takes a >> substantial > > You have two major types, the heavyweight emulators (VMware, et al) > and the lighter weight hypervisors. I have seen studies of various > codes that show not a huge hit ... the caveat being that these > codes spent very little time in system calls (usually thunked to > the host somehow) and most of the time computing. Yup. My experience (although a little dated) was looking at codes running under Xen (see previous post by RGB regarding the COD project here at Duke). Computationally, there can be very little impact if the system is set up right. With earlier versions of Xen, there was significant overhead in the networking layer which could impact MPI and other distributed codes that utilize the IP stack.
This was supposedly vastly improved in later versions of Xen, but I don't have any data at hand. I'm not sure how the user-space communication fabrics (OpenIB, etc.) would fare under a VM environment, but theoretically they could be quite good. It could make the VM overhead much more palatable in an HPC environment. -bill From michael.creel at uab.es Tue Oct 9 07:40:30 2007 From: michael.creel at uab.es (Michael Creel) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B82D5.9050002@charter.net> References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <470B82D5.9050002@charter.net> Message-ID: On 10/9/07, Jeffrey B. Layton wrote: > > The recent emails from rgb and Doug lead me to a question. Has anyone > tested codes running under a VM versus running them "natively" on > the hardware (native isn't a good word and I hope everyone gets my > meaning)? The last word I heard is that performance takes a substantial > hit if you are running a code in a VM. Some of the reasons are that the > code has only virtualized access to the hardware (particularly the NICs) > and memory management is a bit more difficult (although Barcelona with > nested page tables should help there). I do remember VirtualIron saying > that they had IB drivers so that the VM had direct access to the hardware. > > Thanks! > > Jeff I have done this with ParallelKnoppix, on a cluster of two 2x Xeon 64-bit machines. Running 4 MPI ranks on a 4-node, completely virtualized 32-bit PK cluster using VMware Server is about 70-80% as efficient as a real 2-node PK cluster running 4 MPI ranks.
That's one data point, so worthless by itself that I won't bother describing exactly what I did. Michael -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071009/538a8836/attachment.html From becker at scyld.com Tue Oct 9 09:23:01 2007 From: becker at scyld.com (Donald Becker) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Reminder BWBUG meeting today, Tuesday Oct 9 Message-ID: Note that this is not on the main Georgetown campus, but a location just off of Wisconsin Ave. Baltimore Washington Beowulf User Group Meeting Date: 9 Oct 2007 at 2:30 pm - 5:00pm. Location: Georgetown University at Whitehaven Street 3300 Whitehaven Street, Washington DC 20007 Speaker: Donald Becker, CTO of Scyld Software and Penguin Computing Host: Michael Fitzmaurice I'll be talking about name services and process creation mechanisms for clusters. -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From andrew at moonet.co.uk Tue Oct 9 09:27:34 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: > As I like to say, if Sun Microsystems had sold Unix for the Intel > architecture for $50 full list, $25 academic back in 1988 (at which time > it should be noted that they HAD a Unix that would run on the 386 and > where these prices would have undercut Microsoft's prices for DOS), then > Sun would be Microsoft and Microsoft would either be selling unix or out > of business. Products that never would have been developed include > Windows, OS/2, NT, and -- Linux. 
Maybe, just maybe, BSD would have > still survived. VMware has a similar opportunity now, but the window > (so to speak) is rapidly closing as Microsoft is due to co-opt the > entire VM market any day now using their standard strategy. > > VMware has created and proven the market, chip manufacturers have moved > to support it, so MS will now implement their own version of VM, ensure > that it only works well for Windows VM guests (and may not work at all > for other guests), change their licensing to make it more or less > illegal to run Windows guests under other VMs (already underway), use > their sales channels to insert their product cheaply everywhere, sow > some judicious FUD about how their product is secure, reliable, and > supported and its competitors are not (backed up with ominous rumblings > about license violations, DRM, and legal action). In six months, a year > tops, they own 70% of the market, and its open and closed competitors > spiral down to gradual extinction as virtualization is a key component > of modern server provisioning and failover and not even Linux can hold > its own in the server room if Microsoft (say) makes it a license > violation to run its Server products in a non-Microsoft VM manager. It seems that, in Europe at least, Microsoft is talking to the likes of Xen to get Windows into HPC. After the European community's cold reception of CCS they seem willing to float the next version of CCS on Unix. No one will trust MS with metal but the users sure do want it. The Longhorn kernel is a lot easier to paravirtualise than previous incarnations so maybe they are finally learning their place, as a Linux application. Ta Andy From i.kozin at dl.ac.uk Tue Oct 9 10:46:17 2007 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> Message-ID: Barnet, If you can afford to wait until next month then Harpertown is certainly worth waiting for.
But then again (as no doubt will be repeated over and over again) it really does depend on what you are going to run. Best, Igor -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Barnet Wagman Sent: 08 October 2007 18:17 Cc: beowulf@beowulf.org Subject: [Beowulf] Intel quad core nodes? I'm moving towards setting up a small cluster (my first), and am thinking about using Intel quad core processors. However, I'm a little concerned about memory contention. I'm (tentatively) going to have one processor per node (this appears to be the cheapest way to go), but I still wonder whether four cores will choke Intel's memory architecture. (AMD's Barcelona may be better in this regard, but it doesn't seem to be available yet, at least not through retail channels). I'd like to hear any opinions on this issue. And if you've used quad core processors, I'd certainly like to hear about your experiences (including which processor you've used). thanks PS I couldn't find any discussion of Intel's quad core processors in the archive, but if I missed one, please point me to it. From peter.st.john at gmail.com Tue Oct 9 11:32:07 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: <470B8CED.8040906@csc.fi> References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> <470B8CED.8040906@csc.fi> Message-ID: The restriction in the EULA seems overly broad to me. First, it seems unenforceable (but I'm not a lawyer).
Imagine a publisher who says "you may read this book but you may not publish a review of it without approval from us first" or GM saying "you may drive this car but you may not divulge your MPG or cruising speed" etc. Second, it seems highly undesirable for VMWare to enforce it. Would they want to bring more publicity to bad performance results by suing the whistleblower, who is only a customer, not an employee? Of course anyone would sue in the case of libel by fabricating or otherwise misstating poor results. I have nothing against VMWare but I dislike unenforceable, overly-broad restriction clauses. Peter (YJMV, your jurisdiction may vary :-) On 10/9/07, Olli-Pekka Lehto wrote: > > Jeffrey B. Layton wrote: > > The recent emails from rgb and Doug lead me to a question. Has anyone > > tested codes running under a VM versus running them "natively" on > > the hardware (native isn't a good word and I hope everyone gets my > > meaning)? The last word I heard is that performance takes a substantial > > hit if you are running a code in a VM. Some of the reasons are that the > > code has only virtualized access to the hardware (particularly the NICs) > > and memory management is a bit more difficult (although Barcelona with > > nested page tables should help there). I do remember VirtualIron saying > > that they had IB drivers so that the VM had direct access to the > hardware. > > > > Thanks! > > > > Jeff > > > > I ran some comparisons this spring on a small 4*2P AMD Opteron cluster > using VMWare Player and Workstation. The results seemed quite promising > and I'd like to share them but the VMWare EULA restricts the publication > of unvetted results: > > "You may use the Software to conduct internal performance testing and > benchmarking studies, the results of which you (and not unauthorized > third parties) may publish or publicly disseminate; provided that VMware > has reviewed and approved of the methodology, assumptions and other > parameters of the study.
Please contact VMware at benchmark@vmware.com > to request such review." > > I just sent a request to VMWare about this. Let's see what they say... > > -- > Olli-Pekka Lehto, Systems Specialist, Special Computing, CSC > PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302 > CSC is the Finnish IT Center for Science, www.csc.fi, > e-mail: Olli-Pekka.Lehto@csc.fi > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071009/de466b55/attachment.html From James.P.Lux at jpl.nasa.gov Tue Oct 9 13:31:48 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <20071008133029.GK21977@unthought.net> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: <6.2.3.4.2.20071009132818.03169070@mail.jpl.nasa.gov> At 08:46 AM 10/9/2007, Bogdan Costescu wrote: >On Tue, 9 Oct 2007, Jeffrey B. Layton wrote: > >>The last word I heard is that performance takes a substantial hit >>if you are running a code in a VM. > >I attended a talk some years ago by the Bochs author and I remember >him saying that one possibility of speeding up the virtual code is >to cache the code after "processing" it the first time - this could >work very well for a computational code if the performance sensitive >inner loop is treated this way. This was especially important for >cases where the virtual and native hardware were not the same architecture. I believe that this isn't particularly new. 
Back when folks were writing 1401 emulators to run on S/360 for instance, or PDP-11 emulators creating a virtual RSX-11/RT-11 environment on PCs that was a fairly standard strategy. Especially if the code is generated by a compiler and code is easily separable from data, it's sort of like building a specialized compiler from PDP-11 object to x86 object and then adding in a WINE- or Cygwin- like OS services layer. Self modifying code is, of course, somewhat of a challenge... James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From hahn at mcmaster.ca Tue Oct 9 14:07:40 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: > I'm moving towards setting up a small cluster (my first), and am thinking > about using Intel quad core processors. However, I'm a little concerned > about memory contention. very sensible. well, assuming you already know that your workload is cache-unfriendly - do you? > I'm (tentatively) going to have one processor per > node (this appears to be the cheapest way to go), 1 socket nodes are attractive mainly because you get to use lower-grade hardware. oops, that sounds bad - let's call it "higher price/performance". for instance, non-ECC dimms, for intel, even non-FB. "desktop" rather than "server" processors. motherboards which are in much higher levels of production. dispensing with 1U chassis (minitowers instead) saves even more. all these optimizations involve some compromise of the final reliability and/or features. which is not to say that they never make good sense! 
> but I still wonder whether > four cores will choke Intel's memory architecture. depends on the workload - specifically the application's cache miss rate. just the obvious: if misses are low, don't worry about ram too much... > (AMD's Barcelona may be > better in this regard, but it doesn't seem to be available yet, at least not > through retail channels). they do seem to be available, but not effusively. that's probably because the currently available stepping seems to be a bit of a compromise to release early and/or with low power consumption. in any case, the place where AMD really shines is scaling from 2-4 sockets. at a single socket, the K10 is good, but matchable by Intel. note also that the K10 isn't released yet in non-server versions > I'd like to hear any opinions on this issue. And if you've used quad core > processors, I'd certainly like to hear about your experiences (including > which processor you've used). no slight intended, but I don't think this query makes much sense. there's not really anything new about quad-cores, since over the years there have been chips at most any place along the flops-per-bandwidth ("balance" ala http://www.cs.virginia.edu/stream/analyses.html) continuum. any answer would depend on the particular cache working set of the benchmark - for an attempt at a more general but vague answer, your best bet would probably be to go straight to http://www.spec.org/cpu2006/results/. (hard to tell from your query whether you should pay attention first to int or fp, or even serial vs rate results...) > PS I couldn't find any discussion of Intel's quad core processors in the > archive, but if I missed one, please point me to it. a single intel quad-core chip should behave very similar to a pair of dual-core chips. cache sizes probably won't line up exactly, nor will minor version changes in SSE, etc, and iirc the earliest Core2 chips had just 1066 MHz FSB, vs 1333 now. 
you can probably come pretty close by populating just one node's dimms on a dual-socket AMD system. regards, mark hahn. From tjrc at sanger.ac.uk Tue Oct 9 14:44:02 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: On 8 Oct 2007, at 6:17 pm, Barnet Wagman wrote: > I'm moving towards setting up a small cluster (my first), and am > thinking about using Intel quad core processors. However, I'm a > little concerned about memory contention. I'm (tentatively) going > to have one processor per node (this appears to be the cheapest way > to go), but I still wonder whether four cores will choke Intel's > memory architecture. (AMD's Barcelona may be better in this > regard, but it doesn't seem to be available yet, at least not > through retail channels). > > I'd like to hear any opinions on this issue. And if you've used > quad core processors, I'd certainly like to hear about your > experiences (including which processor you've used). Well, whether it's an issue or not, it's something we're all going to have to deal with, since neither Intel nor AMD are offering us any choice in the matter in the years to come. We do have some Intel quad core machines in our cluster, which runs embarrassingly parallel single-threaded workloads. Given that workload, I was expecting to see serious memory contention effects. Our applications are by no means tightly written, and they are memory- intensive. But I haven't seen much of a problem with the Intel quad- core CPUs we have. They achieve very good throughput on these tasks. How much further this will go, I do not know. Once we start getting 8 cores and more, who knows. We'll hit a wall at some point, I just don't know when. 
For more traditional HPC workloads, I don't know how well they work. As with all these things, obtain a test machine or two, and benchmark your code... Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From andrew at moonet.co.uk Tue Oct 9 15:20:17 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: I was asking a similar question a couple of weeks ago and was unable to get a definite answer. I mostly gathered that amd's decision to build quad core from scratch rather than just bolt two duals together has meant that Barcelona is basically a bit naff. Lots of innovative stuff in there but not quite working properly yet. The benchmark boys that test these things have said that harpertown is worth the wait but I would imagine not as interesting Q2 2008 when AMD lick barcelona into something sexy. I have heard Q4 2008 will bring 32 & 64 core chips. :) ta Andy On 08/10/2007, Barnet Wagman wrote: > I'm moving towards setting up a small cluster (my first), and am > thinking about using Intel quad core processors. However, I'm a little > concerned about memory contention. I'm (tentatively) going to have one > processor per node (this appears to be the cheapest way to go), but I > still wonder whether four cores will choke Intel's memory architecture. > (AMD's Barcelona may be better in this regard, but it doesn't seem to be > available yet, at least not through retail channels). > > I'd like to hear any opinions on this issue. 
And if you've used quad > core processors, I'd certainly like to hear about your experiences > (including which processor you've used). > > thanks > > > PS I couldn't find any discussion of Intel's quad core processors in the > archive, but if I missed one, please point me to it. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From ken at kschuster.org Tue Oct 9 10:28:09 2007 From: ken at kschuster.org (Ken Schuster) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] "Cloud Computing" Message-ID: <270428.3669.qm@web56410.mail.re3.yahoo.com> IBM and Google disclosed an initiative the two technology heavyweights hope will help computer science students and researchers learn more about and gain experience in what has come to be known as "cloud computing." Ultimately, the goal is to create a new generation of software developers able to write software that will fully take advantage of the capabilities of parallel computing. Google is "excited to partner with IBM to provide resources which will better equip students and researchers to address today's developing computational challenges," said the company's CEO Eric Schmidt. Read Story on TechNewsWorld Item of interest, I thought Ken Schuster Added Value Services From jan.heichler at gmx.net Tue Oct 9 10:38:25 2007 From: jan.heichler at gmx.net (Jan Heichler) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: <7710371527.20071009193825@gmx.net> Hello Barnet, On Monday, 8
October 2007, you wrote: BW> I'm moving towards setting up a small cluster (my first), and am BW> thinking about using Intel quad core processors. However, I'm a little BW> concerned about memory contention. I'm (tentatively) going to have one BW> processor per node (this appears to be the cheapest way to go), but I BW> still wonder whether four cores will choke Intel's memory architecture. BW> (AMD's Barcelona may be better in this regard, but it doesn't seem to be BW> available yet, at least not through retail channels). Yes, Intel does have a memory bandwidth problem... BW> I'd like to hear any opinions on this issue. And if you've used quad BW> core processors, I'd certainly like to hear about your experiences BW> (including which processor you've used). ... but it does not seem to affect performance as badly as one might think. How big the impact is depends mostly on your application. I have seen code scale very well (HPL for example ;-) ) and other code where you get just 10% more out of using 8 instead of 4 cores (on a dual socket quadcore system). Another aspect that you should think about is the I/O intensity of your application. Remember that 4 cores in a single socket system will share the same interconnect port. That could hit performance pretty badly. A high speed interconnect could be a solution to that problem but will cost you money. Conclusion: the only good benchmark is your own application. The Intel DualCore performs quite well on a wide range of applications. The QuadCore performs well on some of them - but as you said: it seems to be the only quadcore available (through retail channels) on the market right now...
Regards Jan From camm at enhanced.com Tue Oct 9 14:00:33 2007 From: camm at enhanced.com (Camm Maguire) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] New beowulf recommendations In-Reply-To: References: <54642ms0sx.fsf_-_@intech19.enhanced.com> Message-ID: <54odf89f66.fsf@intech19.enhanced.com> Greetings, and thank you so much for your very helpful replies here! As you can tell, alas, I have less time for correspondence than I used to! Mark Hahn writes: > > 1) There is an onboard Gigabit NIC which pushes the computational load > > onto the CPU. > > I doubt it. it's fairly easy for nics to perform stateless offload, > and afaik even cheap ones do. the result is that any nic will > give nearly the same CPU overhead. I expect the only "onboard"-ness > here is that the nic is part of the chipset. this matters very little, > since a gigabit nic not going to push the limits of any current bus. > Confirmed! The on board nic has measurably lower latency without inducing any discernible cpu load. There are a few oddities as a function of matrix size, and it does not support jumbo frames (irrelevant in the latency debate), so on balance it looks superior. We've decided to go with an additional card anyway so that we might play with GAMMA someday (requires a particular model here). > > Our vendor states that a server card in the > > PCIExpress slot would have better latency. True? Significant? > > peculiar claim, since pcie actually _adds_ a small latency cost; > they're basing this on offload-type arguments? > This appears to have been a bogus claim, as you suspected. > > 2) We're considering either a Layer 2 or Layer 3 Netgear 48 port > > switch. The backplane bandwiths are 96Gb and 196Gb respectively, > > and the latencies are 20us and 2us. I don't understand how the > > additional bandwidth can be used, > > I'm guessing the l3 switch is GSM7352s, and the l2 is GSM7248.
> while 20 vs 2 us is a big difference, your observed Gb latency > is still going to be ~50 us, so it's not a huge big deal. > 57 us measured across a cross-over cable. Somewhat counter-intuitively, 59 us or so through a reasonably old switch. (A very old switch gave 270 us). So by some feat of engineering opaque to me at least, the switch adds virtually no latency in spite of a spec-quoted 'latency' >= 20 us. > if I'm right on the switch models, I think the difference is more > generational and features. the 7248 seems like an older-gen switch, > and lacks not just the L3 stuff but also the 10G options. Once again, you are right -- the lower quoted l3 latency would appear irrelevant for our net latency between nodes. > > it's the 10G options that let them claim 196 Gbps for the GSM7352s, > since besides the 48 normal ports, it's got 8x SFP's and bays for 4x > 10G stacking ports (which btw only adds up to 192 Gbps for me...) > > I'd consider the GSM7352s mainly if I wanted to use the 10G ports > (you might verify that the ports can be used for 10G in general, > rather than only for stacking...) > I don't suppose the 10G impacts latency in any discernible way .... > > but the latency gain seems reason > > enough for the Layer 3. Is it worth an extra $3k? > > I would guess not unless you want the additional features (routing and > 10G.) > > > We are network > > latency bound on our existing 16 node cluster, but I do not know > > how much latency is due to the switch, nor how to find out. > > well, the simplest test is to connect two nodes back-to-back and run a > latency test. compare versus plugged into the switch. > (gigabit ports are all auto-mdi, so you don't need a special crossover > cable for this test.) > > I would guess that your current switch is about the same latency as > the GSM7248, but that you'll measure something like 50 us > back-to-back.
so dropping 18 us will not make a dramatic difference: > ie, 70 us vs 52 - that's 25%, but it's still no where near a "real" > interconnect (myrinet, infiniband, 10G, quadrics). 25% might be interesting, but we can't see 'the fat to cut' even using a switch several years old (D-link). > > you should also verify that your nics aren't currently doing some sort > of interrupt mitigation/coalescing, since that will hurt your latency. > Thanks, will look into this. Modern kernels don't seem to have the io and irq module options that the older ones did, and I haven't kept up with the latest means of controlling device interrupts. > if you are truely small-packet latency-bound, and unwilling to consider > a higher-performance interconnect, I think you should contemplate putting > more cores in each box. going from 2 cores per box to 8 or 16 will make a > big difference for smallish jobs that use a small number of nodes > (even if you stick to plain old gigabit). > Alas, we are memory bandwidth bound in this case, our algorithm having as rate limiting step essentially L2 blas calls. What is your favorite switch, 48 port Gigabit? Are they all essentially equivalent? Serial console management would be nice. Take care, and thanks so much again! > > -- Camm Maguire camm@enhanced.com ========================================================================== "The earth is but one country, and mankind its citizens." -- Baha'u'llah From jdvw at tticluster.com Tue Oct 9 19:15:36 2007 From: jdvw at tticluster.com (John Van Workum) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: I found this article helpful. The author compares the new AMD 2300 quad core with current Intel quad core systems. 
http://techreport.com/articles.x/13176/1 Regards, John On 10/9/07, andrew holway wrote: > > I was asking a similar question a couple of weeks ago and was unable > to get a definite answer. > > I mostly gathered that amd's decision to build quad core from scratch > rather than just bolt two duals together has meant that Barcelona is > basically a bit naff. Lots of innovative stuff in there but not quite > working properly yet. > > The benchmark boys that test these things have said that harpertown is > worth the wait but I would imagine not as interesting Q2 2008 when AMD > lick barcelona into something sexy. > > I have heard Q4 2008 will bring 32 & 64 core chips. :) > > ta > > Andy > > On 08/10/2007, Barnet Wagman wrote: > > I'm moving towards setting up a small cluster (my first), and am > > thinking about using Intel quad core processors. However, I'm a little > > concerned about memory contention. I'm (tentatively) going to have one > > processor per node (this appears to be the cheapest way to go), but I > > still wonder whether four cores will choke Intel's memory architecture. > > (AMD's Barcelona may be better in this regard, but it doesn't seem to be > > available yet, at least not through retail channels). > > > > I'd like to hear any opinions on this issue. And if you've used quad > > core processors, I'd certainly like to hear about your experiences > > (including which processor you've used). > > > > thanks > > > > > > PS I couldn't find any discussion of Intel's quad core processors in the > > archive, but if I missed one, please point me to it. 
> > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071009/ea375cd8/attachment.html From gdjacobs at gmail.com Tue Oct 9 19:47:49 2007 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] best linux distribution - precisely! In-Reply-To: <20071008143817.M32781@mail.ipm.ir> References: <20071008143817.M32781@mail.ipm.ir> Message-ID: <470C3D55.8010804@gmail.com> Seyed Abouzar Najafi Shoshtari wrote: > Thank you all for your replies. > > Actually the software we would like to use > work fine on every linux distribution. They are > mostly astrophysical data reduction softwares and have > been tested before on all kind of distributions. > > But, my question was actually related to the choice > of the hardwares, I listed. > In other word, have you any experience with linux on > this special CPU, namely > Intel 6600 Quadcore 8MB on a ASUS P5K motherboard? > > Abouzar This is common hardware, and it will work with any modern distro. -- Geoffrey D. 
Jacobs To have no errors would be life without meaning No struggle, no joy From kewley at gps.caltech.edu Tue Oct 9 20:48:55 2007 From: kewley at gps.caltech.edu (David Kewley) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Memory limit enforcement In-Reply-To: <470A530E.7090206@csc.fi> References: <470A530E.7090206@csc.fi> Message-ID: <200710091748.55939.kewley@gps.caltech.edu> On Monday 08 October 2007, Olli-Pekka Lehto wrote: > Hello, > > I'm interested in hearing some best practice solutions in fine-grained > management memory resources on clusters of SMPs. How do you enforce real > memory usage inside (RHEL-based) cluster nodes running multiple serial > jobs simultaneously? More specifically, how to do this efficiently when > some of the jobs map copious amounts of virtual memory but have only a > fraction of it resident at any given time? > > As SMP systems keep getting constantly fatter (and the potential for > users interfering with each others' jobs increasing) it would be great > to have something like AIX's WLM (Workload Manager) on Linux to > effectively manage intra-SMP resources. > > Olli-Pekka Ah, a family of issues near to my heart. :) I'll ask a broader question: How do you enforce real memory usage in modern Linux *at all*? We were interested in this because we were having user jobs regularly cause nodes to go into an Out Of Memory (OOM) state, triggering the kernel's oom_killer. The oom_killer would sometimes kill system processes, which sometimes caused subsequent jobs to die. Even if subsequent jobs didn't die, recovery required that we manually close the node, reboot it when running jobs finished, then reopen it. This gets to be pretty dreary after a while. Our problem is somewhat different from your interests, but some of the same issues come into play. See below for the partially satisfying solution that we put in place for our OOM woes. First a review of the problem landscape as I understand it.
You can try to enforce memory limits with a daemon, but you risk missing important events, including a badly behaved process suddenly using a whole lot of memory all at once. If that happens, your daemon is nearly useless since swapping and/or oom_killer will be running, and not your daemon. Your node may lock up for a while, which was what the daemon was supposed to prevent. I think you really want to do it in the kernel, so that badly behaved requests for memory (allocation and/or writing) can be cut off before they affect anyone else. But the kernel doesn't really enforce anything useful. It doesn't enforce a resident set size (RSS) limit, even though setrlimit() will let you request such a limit. As I understand it, modern Linux doesn't even try to track RSS, because semantics of RSS are unclear given modern memory management methods. RSS probably isn't even what you want -- you probably want to limit the amount of physical memory used, keeping the sum of the limits around the amount of total RAM, to avoid swapping. There is no way to communicate this limit to the kernel; I suspect it doesn't even track it except globally. The kernel *is* able to enforce the amount of virtual memory allocated per process (set with setrlimit()), but as you noted, that is of limited value when different applications can have very different overcommit percentages (virtual memory allocated beyond the amount actually used). But take a step back from considering the limits you can place on a given process. You probably want a policy that limits memory use at the job level, not at the process level, regardless of whether you have one job or multiple jobs running on a node. There is no kernel mechanism for that either. Seems your best bet might be to write a daemon, and hope that actual use patterns don't cause swapping or OOM before the daemon can act. To end our OOM problems, we took a different route. 
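As a minimal sketch of the per-process virtual memory limit that setrlimit() can enforce (this is not the LSF mechanism itself), the Python below starts a child process with a capped address space via RLIMIT_AS. The helper name run_with_vm_limit, the 1 GiB cap, and the 2 GiB test allocation are illustrative assumptions, not anything from the cluster described here. In CPython a refused allocation normally surfaces as MemoryError rather than a raw ENOMEM.

```python
import resource
import subprocess
import sys


def run_with_vm_limit(argv, limit_bytes):
    """Run argv in a child whose virtual address space has a soft cap."""

    def apply_limit():
        # Runs in the child between fork() and exec().
        # Only the soft limit is lowered; the hard limit is left untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)


# Hypothetical child program: request 2 GiB in a single allocation
# and report whether the request was granted.
CHILD_CODE = """
try:
    block = bytearray(2 * 1024**3)
    print("allocated")
except MemoryError:
    print("MemoryError")
"""

# Cap the child at 1 GiB of address space (an arbitrary example value).
result = run_with_vm_limit([sys.executable, "-c", CHILD_CODE], 1024**3)
print(result.stdout.strip())
```

Because the cap applies to address space rather than to memory actually touched, a program that maps far more than it uses would hit the limit early, which is the overcommit weakness of this approach.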
The job launch mechanism (via LSF) sets the per-process virtual-memory-allocation limit on each user job process. We can prevent OOM this way, unless a job both uses non-standard job launch methods and has runaway memory use (which is rare in our experience). Other weaknesses of our method include: * It does not prevent heavy swapping (which would be nice to have, but at least the user suffers the consequences most). * It can prevent a job from using all available RAM if the job has a larger overcommit than our algorithm assumes. * When the VM allocation limit is reached, the errors are often cryptic. Nothing appears in syslog (unlike segfaults, which are logged at least on x86_64) -- the kernel patch to enable logging seems likely pretty trivial, but stock kernels don't do it. A malloc() will return ENOMEM, which many programs and libraries don't handle properly (or indeed handle at all -- how many programmers omit checking the return value or errno?), so the user doesn't get a useful error message. A failed stack expansion will cause a segfault (as I recall), which is also cryptic to the user. At least segfaults get logged... I'd love to hear other approaches to this family of problems. David From jmdavis1 at vcu.edu Tue Oct 9 21:47:39 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Memory limit enforcement In-Reply-To: <200710091748.55939.kewley@gps.caltech.edu> References: <470A530E.7090206@csc.fi> <200710091748.55939.kewley@gps.caltech.edu> Message-ID: <470C596B.3010307@vcu.edu> David Kewley wrote: >Ah, a family of issues near to my heart. :) > >I'll ask a broader question: How do you enforce real memory usage in modern >Linux *at all*? > >We were interested in this because we were having user jobs regularly cause >nodes to go into an Out Of Memory (OOM) state, triggering the kernel's >oom_killer. The oom_killer sometime would kill system processes, which >sometimes caused subsequent jobs to die. 
Even if subsequent jobs didn't >die, recovery required that we manually close the node, reboot it when >running jobs finished, then reopen it. This gets to be pretty dreary after >a while. > >Our problem is somewhat different from your interests, but some of the same >issues come into play. See below for the partially satisfying solution >that we put in place for our OOM woes. First a review of the problem >landscape as I understand it. > >You can try to enforce memory limits with a daemon, but you risk missing >important events, including a badly behaved process suddenly using a whole >lot of memory all at once. If that happens, your daemon is nearly useless >since swapping and/or oom_killer will be running, and not your daemon. >Your node may lock up for a while, which was what the daemon was supposed >to prevent. > >I think you really want to do it in the kernel, so that badly behaved >requests for memory (allocation and/or writing) can be cut off before they >affect anyone else. > >But the kernel doesn't really enforce anything useful. It doesn't enforce a >resident set size (RSS) limit, even though setrlimit() will let you request >such a limit. As I understand it, modern Linux doesn't even try to track >RSS, because semantics of RSS are unclear given modern memory management >methods. > >RSS probably isn't even what you want -- you probably want to limit the >amount of physical memory used, keeping the sum of the limits around the >amount of total RAM, to avoid swapping. There is no way to communicate >this limit to the kernel; I suspect it doesn't even track it except >globally. > >The kernel *is* able to enforce the amount of virtual memory allocated per >process (set with setrlimit()), but as you noted, that is of limited value >when different applications can have very different overcommit percentages >(virtual memory allocated beyond the amount actually used). > >But take a step back from considering the limits you can place on a given >process. 
You probably want a policy that limits memory use at the job >level, not at the process level, regardless of whether you have one job or >multiple jobs running on a node. There is no kernel mechanism for that >either. > >Seems your best bet might be to write a daemon, and hope that actual use >patterns don't cause swapping or OOM before the daemon can act. > >To end our OOM problems, we took a different route. The job launch >mechanism (via LSF) sets the per-process virtual-memory-allocation limit on >each user job process. We can prevent OOM this way, unless a job both uses >non-standard job launch methods and has runaway memory use (which is rare >in our experience). > >Other weaknesses of our method include: > >* It does not prevent heavy swapping (which would be nice to have, but at >least the user suffers the consequences most). > >* It can prevent a job from using all available RAM if the job has a larger >overcommit than our algorithm assumes. > >* When the VM allocation limit is reached, the errors are often cryptic. >Nothing appears in syslog (unlike segfaults, which are logged at least on >x86_64) -- the kernel patch to enable logging seems likely pretty trivial, >but stock kernels don't do it. A malloc() will return ENOMEM, which many >programs and libraries don't handle properly (or indeed handle at all -- >how many programmers omit checking the return value or errno?), so the user >doesn't get a useful error message. A failed stack expansion will cause a >segfault (as I recall), which is also cryptic to the user. At least >segfaults get logged... > >I'd love to hear other approaches to this family of problems. > > We have been dealing with similar problems on one of our clusters. The solution that we're coming to is that we need a non-standard solution. With Sun Grid Engine, one could build a memory consumable and then have jobs request memory. One could even require jobs to request memory. 
The problem is that many times a user will not know how much memory to request. We have been experimenting with using SGE 6's suspend feature with a Free RAM limit to stop (suspend) jobs that are going over the preset limit. The problem with this particular solution is that the reporting feature has a default timing of once every 40 seconds. This means that there will be some lag and that could cause problems with jobs that allocate RAM very quickly. In testing, we used some graphics jobs that will take up RAM fast. With these jobs and a limit of 1GB of free RAM, we were able to get the last 1.5GB job submitted to suspend when the system memory reached a gig on a 4GB machine (3 jobs). When the first job finished, the 3rd then restarted and completed. I won't say that there are not issues with this solution. But I believe that it can work. I still believe that the best solution is to make users aware of the memory requirements for their jobs and then have them use memory requests and common sense to get their work done. If anyone is interested in more info, please let me know and I will put you in contact with the programmer that we have working on this. Mike Davis From zahir.tari at rmit.edu.au Tue Oct 9 20:05:36 2007 From: zahir.tari at rmit.edu.au (Zahir Tari) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] OTM 2007 Call for Participation Message-ID: <86F803D5-2B1E-475B-94EA-46F127367891@rmit.edu.au> *** Our apologies if you have received this more than once *** This is to let you know that the selection process for the OTM 2007 Federated event just finished. 
This event will be held in Algarve (Portugal) during the period of 25-30 November 2007, and it involves the following major conferences: - International Symposium on Distributed Objects and Applications (DOA'07) - International Conference on Cooperative Information Systems (CoopIS'07) - International Symposium on Grid computing, high-performAnce and Distributed Applications (GADA'07) - International Conference on Ontologies, Databases and Applications of Semantics (ODBASE'07) - International Symposium on Information Security (IS'07) We are proud to announce you the four exceptional keynote speakers for the OTM 2007 event: - Donald Ferguson (Microsoft) on "The Internet Service Bus" - Mark Little (Red Hat) on "Transaction Processing in a Service Oriented Architecture" - York Sure (SAP) on "Towards the next Generation Value Networks" - Dennis Gannon (Indiana University) on "A Service Architecture for eScience Grid Gateways" - Whitfield Diffie (Sun Microsystems) on "Cryptography: Past, Present and Future" Details about the list of accepted papers can be found at http://www.cs.rmit.edu.au/fedconf/index.html?page=accepted In additional to the conferences, OTM 2007 is also running workshops covering various topics (e.g. agents in web services, privacy, context-aware mobile services...). Hope to see you at OTM 2007 in Portugal! For any information, please send us an e-mail to fedconf@cs.rmit.edu.au Regards, Zahir [Tari] & Robert [Meersman] -- OTM 2007 General Co-Chairs -- http://www.cs.rmit.edu.au/fedconf/ From tjrc at sanger.ac.uk Wed Oct 10 00:23:14 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Memory limit enforcement In-Reply-To: <470C596B.3010307@vcu.edu> References: <470A530E.7090206@csc.fi> <200710091748.55939.kewley@gps.caltech.edu> <470C596B.3010307@vcu.edu> Message-ID: On 10 Oct 2007, at 5:47 am, Mike Davis wrote: > We have been dealing with similar problems on one of our clusters. 
> The solution that we're coming to is that we need a non-standard > solution. With Sun Grid Engine, one could build a memory consumable > and then have jobs request memory. One could even require jobs to > request memory. The problem is that many times a user will not know > how much memory to request. If the memory requirements of the application are not known, then all bets are off, and there's basically nothing you can do to stop either the application being killed by an arbitrarily low memory limit that you set, or at the other extreme running out of memory. We do exactly what you suggest, but under LSF, which has resource reservation for memory out of the box. Of course, it's not real reservation, but it's reservation as far as the scheduler is concerned. We then have a default memory limit on the queues which is really very low indeed (1.9 GB, typically, because we have 2 GB RAM per core on our nodes). If the user wants more memory, they have to set a new higher limit themselves. When they do that, we have supplied LSF with an esub script which then checks that the user has supplied both the new memory, and a suitable resource selection and reservation option. If they have not, the job is rejected. So for example, if the user asks for a 6 GB memory limit, the esub will check that they have requested a machine with at least 6GB of free memory, and then reserve that memory with the scheduler. For example: -M6000000 -R"select[mem>6000] rusage[mem=6000]" On our beowulf cluster, this has been fairly effective in reducing the frequency with which nodes run out of memory - the jobs are usually killed first. It's not 100% effective though. > We have been experimenting with using SGE 6's suspend feature with > a Free RAM limit to stop (suspend) jobs that are going over the > preset limit. The problem with this particular solution is that the > reporting feature has a default timing of once every 40 seconds.
> This means that there will be some lag and that could cause > problems with jobs that allocate RAM very quickly. This is a problem with the LSF solution too. I don't think there's a great deal that can be done about it, as others have said. The other problem is that simply stopping the jobs then results in a node with suspended processes on it that are often deadlocked; you can't resume the job without running out of memory. So you might as well have simply killed the job in the first place. > > I still believe that the best solution is to make users aware of > the memory requirements for their jobs and then have them use > memory requests and common sense to get their work done. Absolutely. If the user doesn't understand their application, all bets are off. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From Daniel.Pfenniger at obs.unige.ch Wed Oct 10 01:38:13 2007 From: Daniel.Pfenniger at obs.unige.ch (Daniel Pfenniger) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? In-Reply-To: <470A661B.1030904@comcast.net> References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: <470C8F75.8040907@obs.unige.ch> Barnet Wagman wrote: > I'm moving towards setting up a small cluster (my first), and am > thinking about using Intel quad core processors. However, I'm a little > concerned about memory contention. I'm (tentatively) going to have one > processor per node (this appears to be the cheapest way to go), but I > still wonder whether four cores will choke Intel's memory architecture. > (AMD's Barcelona may be better in this regard, but it doesn't seem to be > available yet, at least not through retail channels). > > I'd like to hear any opinions on this issue. 
And if you've used quad > core processors, I'd certainly like to hear about your experiences > (including which processor you've used). Most of the benchmarks consider speed, some consider speed/Watt, but none I have seen consider speed/monetary cost. The reason is partly that the component cost is a highly variable parameter and not always easy to obtain. But while configuring a cluster remember that you can get a better deal (=speed/cost for your application) by choosing cheaper CPUs in larger quantities. Dan From rgb at phy.duke.edu Wed Oct 10 02:17:56 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:33 2010 Subject: VM and performance (was Re: [Beowulf] best Linux distribution) In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <470A3547.2060606@comcast.net> <470A5DCF.20009@streamline-computing.com> <48078.192.168.1.1.1191933139.squirrel@mail.eadline.org> <470B82D5.9050002@charter.net> Message-ID: On Tue, 9 Oct 2007, andrew holway wrote: > It seems in europe at least Microsoft are talking to the likes of Xen > to get windows into hpc. After the European community's cold reception > of CCS they seem willing to float the next version of CCS on unix. No > one will trust MS with metal but the users sure do want it. The > Longhorn kernel is a lot easier to paravirtualise than previous > incarnations so maybe they are finally learning their place, as a > linux application. . > > Ta I will believe it when fish learn to dance on the surface of the sun... rgb -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From b.wagman at comcast.net Wed Oct 10 07:22:54 2007 From: b.wagman at comcast.net (Barnet Wagman) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Intel quad core nodes? 
In-Reply-To: References: <20071008124403.M2100@mail.ipm.ir> <1BBA6091-708C-4633-A892-FCAD72134366@sanger.ac.uk> <470A661B.1030904@comcast.net> Message-ID: <470CE03E.4010205@comcast.net> > minor version changes in SSE, etc, and iirc the earliest Core2 chips had > just 1066 MHz FSB, vs 1333 now. you can probably come pretty close by > populating just one node's dimms on a dual-socket AMD system. Intel's non-xeon quad processors still have only 1066 MHz FSB (except for the very pricey QX6850). This leads to a tricky trade-off. I presume that contention can be reduced by (i) getting a processor with a 1333 MHz FSB and (ii) building single (quad) processor nodes. But to get a 1333 FSB, you have to go to a 5000 series xeon. These processors are considerably more expensive than non-xeons (at a similar clock speed). Furthermore, as far as I can tell, there are no single socket boards that support the 5000 series. So the question is, which has less contention, a single quad processor system with 1066 FSB or a dual processor system with 1333 FSB? My guess is the former, if all the cores are kept busy. My app is embarrassingly parallel, so it's likely they will be. Of course I'd rather wait for AMD's quad but that's not an option (I doubt that they'll be readily available until next year). So I'm leaning towards the low cost per node solution - one quad processor (probably a Q6600) per node. From lindahl at pbm.com Wed Oct 10 12:04:43 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] New beowulf recommendations In-Reply-To: <54odf89f66.fsf@intech19.enhanced.com> References: <54642ms0sx.fsf_-_@intech19.enhanced.com> <54odf89f66.fsf@intech19.enhanced.com> Message-ID: <20071010190443.GA32248@bx9.net> On Tue, Oct 09, 2007 at 05:00:33PM -0400, Camm Maguire wrote: > Confirmed! The on board nic has measurably lower latency without > inducing any discernible cpu load. Offload doesn't buy anything? Wow. I would never have guessed that...
-- greg From mathog at caltech.edu Wed Oct 10 12:18:54 2007 From: mathog at caltech.edu (David Mathog) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Re: Memory limit enforcement Message-ID: David Kewley wrote: > But the kernel doesn't really enforce anything useful. I agree, the kernel should be able to enforce these sorts of limits on all processes of a user at once. Write Linus or whichever kernel developer you think is most likely to know how to implement this and request it as a new feature, and explain the situation. I'm thinking that it shouldn't be too difficult (but what do I know about kernel hacking?) to allow a process to request: 1. for any child processes that may be created after the request 2. for any malloc() type operation 3. have the child share the parent's memory statistics/limits as maintained by the kernel. (Not the complete memory map, just the sum of physical and virtual memory allocated.) The end result would be malloc/calloc failing in some child process once the parent process's counters hit up against the set limit(s). This would happen no matter which child ate all the memory, which is the desired behavior. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From kilian at stanford.edu Wed Oct 10 12:24:12 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Memory limit enforcement In-Reply-To: References: <470A530E.7090206@csc.fi> <470C596B.3010307@vcu.edu> Message-ID: <200710101224.13307.kilian@stanford.edu> On Wednesday 10 October 2007 12:23:14 am Tim Cutts wrote: > We then have a default memory limit on the queues which > is really very low indeed (1.9 GB, typically, because we have 2 GB > RAM per core on our nodes). If the user wants more memory, they have > to set a new higher limit themselves. I'm also relying on LSF's LSB_MEMLIMIT_ENFORCE option to take care of memory-greedy jobs.
Before that, I tried to modify the default VM overcommit behavior on individual nodes, playing with the vm.overcommit_memory and vm.overcommit_ratio sysctl values. By setting overcommit_memory=2 and an appropriate overcommit_ratio, you can basically prevent any swapping. The result is that processes' malloc()s going beyond the limits are denied. This is cool from the sysadmin standpoint, since the greedy applications are killed before bringing the machine to its knees. But it may as well happen that an application trying to use the last few available MBs gets killed, while another one has already allocated several GBs, which is not especially fair. And on top of that, most scientific applications are not very careful about checking errors. So our users were beginning to complain that their applications were crashing without any reason when they were reaching the overcommit limits. Which made me realize that this solution was probably not that optimal. So LSF per-job memory limits enforcement did the trick for us: an esub script to check that users can't request funny limits, and jobs using more than requested get killed. That's good for serial jobs. But parallel (read MPI) jobs are a different can of worms. Say you have 2 dual-cpu nodes, with 4GB each. A user can submit a job using 4 CPUs and 6GB of memory without any problem as long as those 6GB are equally balanced between the two nodes. But since LSF's conception of the memory limits is *per job*, it means that, for this specific job, we need to set -M6000000 if we want it to run. And this limit won't prevent a process from this job from using more than 4GB on the first node, making it unusable... So anyway, no solution is perfect. I guess that what the Linux kernel really misses are memory quotas. Per user. Exactly like disk quotas. That would be *really* neat and solve a whole range of problems.
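For reference, a small sketch of how the strict-overcommit ceiling follows from the two settings mentioned in this message, assuming the formula given in the kernel's overcommit accounting documentation: CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap. The function names and the 4 GB RAM / 2 GB swap node are illustrative assumptions, not values taken from the clusters in this thread.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, hugetlb_kb=0):
    """Ceiling on committed memory (in kB) when vm.overcommit_memory=2."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


def ratio_for_limit(ram_kb, swap_kb, target_kb):
    """Largest integer overcommit_ratio whose ceiling stays at or below
    target_kb (huge pages ignored; never goes below 0)."""
    return max(0, (target_kb - swap_kb) * 100 // ram_kb)


GB = 1024 * 1024  # kB per GB

# Assumed example node: 4 GB of RAM and 2 GB of swap.
ram, swap = 4 * GB, 2 * GB

# Ceiling with the kernel's default overcommit_ratio of 50.
print(commit_limit_kb(ram, swap, 50) // GB, "GB commit limit at ratio 50")

# Ratio that keeps the total commit ceiling at physical RAM.
print("ratio for a RAM-sized ceiling:", ratio_for_limit(ram, swap, ram))
```

The ceiling applies to the node as a whole, which is why the last process to ask can be the one refused even if another process holds most of the memory.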
> When they do that, we have > supplied LSF with an esub script which then checks that the user has > supplied both the new memory, and a suitable resource selection and > reservation option. If they have not, the job is rejected. So for > example, if the user asks for a 6 GB memory limit, the esub will > check that they have requested a machine with at least 6GB of free > memory, and then reserve that memory with the scheduler. For > example: > > -M6000000 -R"select[mem>6000] rusage[mem=6000]" I'm not 100% certain here, but I would have assumed that it would be the scheduler's job to select a host with enough resources to run the job. So from my understanding, specifying -R"rusage[mem=6000]" would be sufficient to select a machine with 6GB available. But I may have missed some LSF subtleties. :) Cheers, -- Kilian From hahn at mcmaster.ca Wed Oct 10 15:05:58 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? Message-ID: I've got a situation where a 20-25M IB cable would be very handy, and as far as I can tell, such cables exist. What's not clear to me is how they work - I think they all have some form of active components. some appear to be copper; others fiber, but all seem to draw a few Watts. I guess they draw from both ends, but do most IB NICs and switches support this? in my case, I have the usual mellanox infinihost III cards and Voltaire ISR 9024D-M. alternatively, do you have any experience with media-converters? the long story is that I would like to drive some display hardware in an adjoining room from a parallel rendering cluster at one end of a machineroom (the wrong end, naturally). it seems quite expensive to extend dual-link DVI for the total distance (35M or so), so I was hoping to extend the DDR IB about 20M to the edge of the machineroom, then use plain old 15M DL-DVI cables for the rest.
mixed display targets, but including one of the new quad-HD panels (3840x2160, which needs 2x dual-link or 4x single-link.) thanks, mark hahn. From hahn at mcmaster.ca Wed Oct 10 15:28:50 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784FBCCF89@mtiexch01.mti.com> References: <9FA59C95FFCBB34EA5E42C1A8573784FBCCF89@mtiexch01.mti.com> Message-ID: > mention have support for those active or fiber cables already. Feel free > to let me know if you need more info, or contacts for the cable vendors. thanks very much for the info; I'm going to follow up offlist in more detail. regards, mark hahn. From richard.walsh at comcast.net Wed Oct 10 16:03:43 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? Message-ID: <101020072303.13547.470D5A4F00031E7E000034EB2200748184089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: Mark Hahn > I've got a situation where a 20-25M IB cable would be very handy, > and as far as I can tell, such cables exist. What's not clear to me > is how they work - I think they all have some form of active components. > some appear to be copper; others fiber, but all seem to draw a few Watts. I can say a few words about optical active cable (OAC) choices. The current in production choice is from Intel, their Connects Cable. This is a VCSEL laser multi mode fiber design with a CX4 connector that runs at DDR speeds out to 100m. A 25m cable is going to run you about $300 (US). Power consumption at each end is 1.1w. It is plug compatible with CX4 NICs that have power. Has the small bend radius advantages that you can expect from fiber and other fiber attributes.
There is an excellent study published by ORNL covering their plug-and-play ability in the field, performance, and 1000x lower BER (important for full bandwidth utilization). Another choice, although they have not fully ramped up production (Q1 08??), is the Blazar OAC from Luxtera. This cable is based on a DFB laser and single mode fiber. It has only three discrete components limited by their very interesting on-chip splitters and wave guides which are used to bend the normally edge-emitted DFB laser light vertically and into the smaller single mode fiber. The connector is QSFP, not really that common yet. Power draw is about the same as Intel Connects VCSEL. Range is 300m. Bandwidth is QDR. No real in-the-field test data that I know of, but the technology has gotten a lot of notice and I think has been selected for special attention at SC07. Rumor has it that it is already in at least one HPC vendor's interconnect technology. The transceiver-on-a-chip technology is supposed to be a cost reducer, but that has not been proven. SMF fiber is much harder to align .... I don't know as much about 10GBASE-T, but it requires cat 6a or 7 cable, is supposed to have a 100m range, but the power consumption at both ends is like 5w until they get the transceiver chips down to 65nm. The signal correction technology is elaborate and may neutralize the "cheap-transceiver" advantage that copper has had over fiber historically, especially as the above fiber alternatives ramp up production and begin to compete with each other. Of course, per media foot copper has always (?) been more expensive than fiber. Hope that helps ... rbw PS There are some very good white papers and studies on both the Intel and Luxtera web sites. -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 Phone #: 612-382-4620 -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071010/5064db93/attachment.html From hahn at mcmaster.ca Wed Oct 10 20:46:58 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? In-Reply-To: <101020072303.13547.470D5A4F00031E7E000034EB2200748184089C040E99D20B9D0E080C079D@comcast.net> References: <101020072303.13547.470D5A4F00031E7E000034EB2200748184089C040E99D20B9D0E080C079D@comcast.net> Message-ID: > I can say a few words about optical active cable (OAC) choices. The > current in production choice is from Intel, their Connects Cable. This is are they shipping? I checked their website a couple weeks ago and they were talking 1q08 availability. > speeds out to 100m. A 25m cable is going to run you about $300 (US). Power similar to Gore's one (which is copper I think). > Rumor has it that it is already in at least one HPC vendor's interconnect > technology. what does that mean? are there HPC vendors who don't just sell you anything, including their sister, for a large enough customer? ;) to me, I don't think this market is all that huge - there just aren't lots of clusters with that kind of diameter, which means that long IB will be a useful tool for adding a few uplinks/etc across a machineroom or around the corner... > compete with each other. Of course, per media foot copper has always (?) > been more expensive than fiber. sure - once >=10G optics become dirt cheap, everyone will toss a few onto every motherboard ;) hmm, I should patent an augmented 10G/IB optical transceiver that makes a cool blue glow when nothing's plugged in. or maybe when something is... > PS There are some very good white papers and studies on both the Intel and >Luxtera web sites. thanks, but whitepapers can't address the main (only) interesting issue: whether it becomes mass-market-cheap... thanks, mark hahn.
From jcownie at cantab.net Thu Oct 11 06:58:52 2007 From: jcownie at cantab.net (James Cownie) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? In-Reply-To: References: <101020072303.13547.470D5A4F00031E7E000034EB2200748184089C040E99D20B9D0E080C079D@comcast.net> Message-ID: Maybe these would do ? http://www.intel.com/design/network/products/optical/cables/ Enabling larger clusters and reduced installation and maintenance costs Intel® Connects Cables, with 20 Gbps data rates at distances up to 100 meters, deliver the performance you need to build large, reliable computer clusters. With their low weight and bulk these high-performance active optical cables can reduce installation and maintenance costs while improving airflow. Designed for InfiniBand* 4x SDR/DDR and 10 Gb Ethernet (GbE) installations, Intel Connects Cables are priced competitively with 24 AWG copper cables they are designed to replace. FWIW I work for Intel, but on software... -- -- Jim -- James Cownie From richard.walsh at comcast.net Thu Oct 11 07:20:39 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? Message-ID: <101120071420.28811.470E3137000A84B10000708B2207020653089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: Mark Hahn > > I can say a few words about optical active cable (OAC) choices. The > > current in production choice is from Intel, their Connects Cable. This is > > are they shipping? I checked their website a couple weeks ago > and they were talking 1q08 availability. You must not have looked thoroughly enough ... ;-) ... and you tend to be very thorough, but it was after regular business hours. Intel is selling them directly from their webstore and they offer a couple of other vendors. That is where I got the pricing. > > speeds out to 100m. A 25m cable is going to run you about $300 (US).
Power > > similar to Gore's one (which is copper I think). Mmm ... their website specs them out only to 25m (with asterisk) and that is a cable with a .9cm diameter! I wonder what the bend radius is on that? Even their photo of the cable looks like a cobra ... ;-) ... might be OK if you only need one I guess. Did not see the pricing, but if it is the same as Intel's fiber why buy the snake? The claimed power draw is lower than I expected though ... anyone actually tested/used this cable? > > > Rumor has it that it is already in at least one HPC vendor's interconnect > > technology. > > what does that mean? are there HPC vendors who don't just sell you anything, > including their sister, for a large enough customer? ;) What I meant is that it is part of some high performance IB interconnects that are currently being installed. I think this is somewhat validating, but ... > to me, I don't think this market is all that huge - there just aren't lots > of clusters with that kind of diameter, which means that long IB will be > a useful tool for adding a few uplinks/etc across a machineroom or > around the corner... It is a larger system product for now, but as bandwidth demands go up on smaller systems it won't be the length limitations that kill copper, but weight, diameter, added power draw, BERs, and equalizing total link costs. > > > PS There are some very good white papers and studies on both the Intel and > >Luxtera web sites. > > thanks, but whitepapers can't address the main (only) interesting issue: > whether it becomes mass-market-cheap... Intel's cable has been tested and is for sale today, like I said. If you only need a few for one long haul application, then I am not sure you need to wait around for mass market cheap. Hey, the Canadian dollar is at an all time high ain't it? Take care, rbw -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071011/e9c38e7e/attachment.html

From maurice at harddata.com Wed Oct 10 14:47:18 2007
From: maurice at harddata.com (Maurice Hilarius)
Date: Mon Mar 15 01:06:33 2010
Subject: [Beowulf] Re: Intel quad core nodes?
In-Reply-To: <200710101901.l9AJ06mK026026@bluewest.scyld.com>
References: <200710101901.l9AJ06mK026026@bluewest.scyld.com>
Message-ID: <470D4866.5080604@harddata.com>

Daniel Pfenniger wrote:
"But while configuring a cluster remember that you can get a better deal (=speed/cost for your application) by choosing cheaper CPUs in larger quantities."

I disagree. Sometimes this is true, and sometimes it is not. The CPU cost is a fractional element of the total cost. The total cost can vary quite a bit.

Let's do a fictional example, with 60 nodes, one rack, GbE networking:

Quad core Opteron blades, in a 10 blade 8U chassis.
6 chassis per rack, 60 blades, 120 CPUs, 480 cores

1 Motherboard Opteron S1207x2/ split rail power, newer chipset, video/GbE*2
2 AMD Opteron 2347 Quad Core 1.9GHz
8 2GB DDR2 667MHz ECC REG DIMM
Assuming 16GB RAM (2GB per core)
1 160 GB 7200rpm HDD
Total per node: $2,600
Blade frame/chassis/power cost per node: $300
Rack, PDU, GbE network switch, cables, etc., etc., per rack of 60 nodes: $6,000
Cost per node: $100
Total: $180,000

So, each node costs roughly $3,000 with 1.9GHz CPUs (2 per node).
Each CPU is about $400/$3,000 of the cost, or 13.3%.

Upgrade to:
2.0GHz: Each node is $3,200, CPU $500 15.6%
2.3GHz: Each node is $4,200, CPU $1000 23.8%
2.5GHz: Each node is $5,150, CPU $1475 28.6%

This assumes performance is pretty well linear with clock speed in your applications. Yes, a big assumption, but one that holds true for most, unless you are limited by memory bandwidth or network performance.

Using the above figures, we can see that:
Upgrade from 1.9 to 2GHz: 5.3% performance gain, 6.7% cost increase.
Upgrade from 1.9 to 2.3GHz: 21.1% performance gain, 40% cost increase.
Upgrade from 1.9 to 2.5GHz: 31.6% performance gain, 72% cost increase.

When one factors in power consumption and cooling, the cost/performance curve certainly gets steeper. But the jump from 1.9 to 2.0 is a reasonable one. Going from 1.9 to 2.0GHz is an example that disagrees with your statement.

Going from 1.9 to 2.3GHz it is different. If we factor in the performance gain (again, assuming even scaling by simply adding nodes):

Add a second rack and 13 more nodes, to gain 21.7% more performance (roughly equivalent to 1.9 vs 2.3GHz):
2 more 10 blade chassis, 16U, more network, another rack, PDU, etc.: $8,000
13 nodes @ $3,000 = $39,000
$180,000 + $47,000 = $227,000
227/180 = 1.261, i.e. 26.1% more cost for a 21.7% performance gain.

It seems that the case for more CPUs versus faster CPUs holds true for the jump to 2.3GHz. With Quad core Opterons it is more cost effective to add nodes than to buy faster CPUs when comparing 2.3GHz to 1.9GHz. Of course you are supporting a second rack, more power, more cooling to do so. And you are maintaining more nodes, which is more work, more risk of failure, etc.

-----------------------------------
Barnet Wagman wrote:
"Of course I'd rather wait for AMD's quad but that's not an option (I doubt that they'll be readily available until next year). So I'm leaning towards the low cost per node solution - one quad processor (probably a Q6600) per node."

Huh? What does "readily available" mean to you? How many machines would you like with these? Give me a call.

--
With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice@harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
T5X 1Y3
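The arithmetic in the cost comparison above can be checked with a few lines of Python. This is only a sketch: the prices and clock speeds are the ones quoted in the message, performance is assumed to scale linearly with clock speed and with node count (the message's own assumption), and the function and variable names are made up for illustration.

```python
# Figures are taken from the message above. Linear scaling of performance
# with clock speed and with node count is the message's own assumption.
BASE_NODES = 60
BASE_CLOCK_GHZ = 1.9
NODE_COST = {1.9: 3000, 2.0: 3200, 2.3: 4200, 2.5: 5150}  # USD per node


def pct_gain(new, old):
    """Percentage increase of new over old."""
    return 100.0 * (new - old) / old


def faster_cpus(clock_ghz):
    """(performance gain %, cost increase %) for swapping in faster CPUs."""
    perf = pct_gain(clock_ghz, BASE_CLOCK_GHZ)
    cost = pct_gain(NODE_COST[clock_ghz], NODE_COST[BASE_CLOCK_GHZ])
    return perf, cost


def more_nodes(extra_nodes, extra_infrastructure):
    """(performance gain %, cost increase %) for adding 1.9GHz nodes."""
    base_total = BASE_NODES * NODE_COST[BASE_CLOCK_GHZ]
    added = extra_nodes * NODE_COST[BASE_CLOCK_GHZ] + extra_infrastructure
    perf = pct_gain(BASE_NODES + extra_nodes, BASE_NODES)
    return perf, 100.0 * added / base_total


for clock in (2.0, 2.3, 2.5):
    perf, cost = faster_cpus(clock)
    print(f"{clock}GHz CPUs: +{perf:.1f}% performance, +{cost:.1f}% cost")

perf, cost = more_nodes(13, 8000)
print(f"13 extra nodes: +{perf:.1f}% performance, +{cost:.1f}% cost")
```

Under these assumptions the 13 extra nodes give about 21.7% more throughput for about 26.1% more money, against 21.1% more throughput for 40% more money with 2.3GHz CPUs. Power, cooling and administration costs are not modelled.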
From Shainer at mellanox.com Wed Oct 10 15:22:15 2007 From: Shainer at mellanox.com (Gilad Shainer) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? In-Reply-To: Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784FBCCF89@mtiexch01.mti.com> There are fiber cables that go 100m+ with IB DDR, and they should not be expensive. They have active components, and the adapters and switch you mention already support those active or fiber cables. Feel free to let me know if you need more info, or contacts for the cable vendors. Gilad. -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Mark Hahn Sent: Wednesday, October 10, 2007 3:06 PM To: Beowulf Mailing List Subject: [Beowulf] using extend-reach IB? I've got a situation where a 20-25m IB cable would be very handy, and as far as I can tell, such cables exist. What's not clear to me is how they work - I think they all have some form of active components. some appear to be copper; others fiber, but all seem to draw a few Watts. I guess they draw from both ends, but do most IB NICs and switches support this? in my case, I have the usual mellanox infinihost III cards and Voltaire ISR 9024D-M. alternatively, do you have any experience with media-converters? the long story is that I would like to drive some display hardware in an adjoining room from a parallel rendering cluster at one end of a machineroom (the wrong end, naturally). it seems quite expensive to extend dual-link DVI for the total distance (35m or so), so I was hoping to extend the DDR IB about 20m to the edge of the machineroom, then use plain old 15m DL-DVI cables for the rest. mixed display targets, but including one of the new quad-HD panels (3840x2160, which needs 2x dual-link or 4x single-link.) thanks, mark hahn.
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From scheinin at crs4.it Thu Oct 11 08:38:29 2007 From: scheinin at crs4.it (Alan Louis Scheinine) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Re: Intel quad core nodes? In-Reply-To: <470D4866.5080604@harddata.com> References: <200710101901.l9AJ06mK026026@bluewest.scyld.com> <470D4866.5080604@harddata.com> Message-ID: <470E4375.4020205@crs4.it> When dual-core came out we had a debate here about the cost of dual-core versus single-core. There was not a cost savings if one considered just the CPU, but there was a cost savings if one considered the entire box. On the other hand, using benchmark figures we saw that the performance scaled a little less than simple frequency multiplier. In the end, at the time of introduction of dual-core, the choice with regard to "cost-effectiveness" was a toss-up (either choice equally cost-effectiveness). We can assume that AMD and Intel have marketing specialists that are almost as intelligent as the HPC staff at CRS4. So it is not surprising that at the moment of the introduction of quad-core, the actual cost-effectiveness taking into account everything is a function that meets the dual-core cost effectiveness. A kind of C1 continuity. best regards, Alan Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna Center for Advanced Studies, Research, and Development in Sardinia Postal Address: | Physical Address for FedEx, UPS, DHL: --------------- | ------------------------------------- Alan Scheinine | Alan Scheinine c/o CRS4 | c/o CRS4 C.P. n. 25 | Loc. 
Pixina Manna Edificio 1 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy Email: scheinin@crs4.it Phone: 070 9250 238 [+39 070 9250 238] Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220] Operator at reception: 070 9250 1 [+39 070 9250 1] Mobile phone: 347 7990472 [+39 347 7990472] From hahn at mcmaster.ca Thu Oct 11 09:15:52 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] using extend-reach IB? In-Reply-To: <101120071420.28811.470E3137000A84B10000708B2207020653089C040E99D20B9D0E080C079D@comcast.net> References: <101120071420.28811.470E3137000A84B10000708B2207020653089C040E99D20B9D0E080C079D@comcast.net> Message-ID: >> are they shipping? I checked their website a couple weeks ago >> and they were talking 1q08 availability. > You must not have looked thoroughly enough ... ;-) ... and you tend to be > very thorough, but it was after regular business hours. Intel is selling > them directly from there webstore and they offer a couple of other > vendors. That is where I got the pricing. I think you mean this page http://shop.intel.com/shop/category.aspx?category_id=171 right? to me, it looks like they're all still listed as "Out of stock: available January 2008" like this: http://shop.intel.com/shop/product.aspx?pid=SICC0007&pfid=171&pindex=1 perhaps you found a page of other products? or is their stock warning bogus? >>> speeds out to100m. A 25m cable is going to run you about $300 (US). Power >> >> similar to Gore's one (which is copper I think). > Mmm ... their website specs them out only to 25m (with asterisk) and that >is a cable with a .9cm diameter! I wonder what the bend radius is on that? >Even their photo of the cable looks like a cobra ... ;-) ... might be OK if >you only need one I guess. Did not see the pricing, but if it is the same >as Intel's fiber why buy the snake? The claimed power draw is lower than I >expected though ... anyone actually tested/used this cable? 
my machineroom currently has something like 3.6 tons of quadrics cables, which are all about 1 cm dia. I don't find that the bend radius is much of a concern - space certainly is, since getting 38 quadrics out of a rack is hard enough, not to mention switch racks which have 256 cables. I believe even the thickest IB is slimmer than quadrics, but yes, I'm hoping for optical in the next genreation. though in general, I think the cluster layout is actually more important than worrying about the cable. for instance, distributing leaf switches among racks is probably a good idea, and if you do that, you almost certainly want to use copper for short/local interconnect. > It is a larger system product for now, but as bandwidth demands go up on >smaller systems it won't be the length limitations that kill copper, but >weight, diameter, added power draw, BERs, and equalizing total links costs. I'm skeptical of the proposition that bandwidth is changing that much - reeks of the great inet bubble. within a cluster, sure, there are some apps which do really want more bw. (though the most common bw-user I hear of is weather codes which seem to do all-to-all purely out of laziness.) bw out of clusters is probably growing, but at modest rates - have you priced a full-on 1Gb ISP link, let alone 10Gb? regards, mark hahn. From rgb at phy.duke.edu Wed Oct 10 21:34:15 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Re: Intel quad core nodes? In-Reply-To: <470E4375.4020205@crs4.it> References: <200710101901.l9AJ06mK026026@bluewest.scyld.com> <470D4866.5080604@harddata.com> <470E4375.4020205@crs4.it> Message-ID: On Thu, 11 Oct 2007, Alan Louis Scheinine wrote: > When dual-core came out we had a debate here > about the cost of dual-core versus single-core. > There was not a cost savings if one considered > just the CPU, but there was a cost savings if > one considered the entire box. 
On the other > hand, using benchmark figures we saw that > the performance scaled a little less than > simple frequency multiplier. In the end, at > the time of introduction of dual-core, the choice > with regard to "cost-effectiveness" was a toss-up > (either choice equally cost-effectiveness). This is a very old cost benefit computation that goes all the way back to SMP Pentium Pros. Over the years it has USUALLY been the case that for people with CPU bound code SMP systems (which I would argue continues to include modern multiprocessor multicores) would realize a cost-benefit advantage compared to UP systems because of not having to buy so many chassis and needlessly replicating and paying for case, power supply, disk, and network interface. In the earliest days (2.0.x kernels) there was only a single kernel lock on for interrupts so all interrupt processing was effectively single threaded, which caused systems-bound code to scale poorly. For many processor generations now that hasn't been the case, though. For code that was memory bound things have been less consistent. The ability of memory to keep up with multiple cores has varied quite a lot. In some years -- for certain motherboards, processors, chipsets, memory bus speeds, cache sizes -- two processors could run straight to memory with little to no binding, in others you'd drop to maybe 1.4x UP speed as they collided on the bus. For code that was network bound (or bound in several dimensions) things got even more complicated. Two processors sharing a single network channel could easily be bound -- or not. Adding a second NIC could unbind the processes -- or not. In other words, it has always been YMMV, but well worth considering especially for mostly-CPU bound code. I think that this is still likely to be the case. For my personal applications four cores on two processors in one box might run 8x as fast as UP (per box) with little to no binding. 
I doubt that parallel stream will, though, and forcing IPCs for 8 processors through just 1-2 gigabit channels is likely to produce problems as well. rgb > > We can assume that AMD and Intel have marketing > specialists that are almost as intelligent as the > HPC staff at CRS4. So it is not surprising that at > the moment of the introduction of quad-core, the actual > cost-effectiveness taking into account everything is a > function that meets the dual-core cost effectiveness. > A kind of C1 continuity. > > best regards, > Alan > > > Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna > Center for Advanced Studies, Research, and Development in Sardinia > > Postal Address: | Physical Address for FedEx, UPS, DHL: > --------------- | ------------------------------------- > Alan Scheinine | Alan Scheinine > c/o CRS4 | c/o CRS4 > C.P. n. 25 | Loc. Pixina Manna Edificio 1 > 09010 Pula (Cagliari), Italy | 09010 Pula (Cagliari), Italy > > Email: scheinin@crs4.it > > Phone: 070 9250 238 [+39 070 9250 238] > Fax: 070 9250 216 or 220 [+39 070 9250 216 or +39 070 9250 220] > Operator at reception: 070 9250 1 [+39 070 9250 1] > Mobile phone: 347 7990472 [+39 347 7990472] > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From lindahl at pbm.com Thu Oct 11 15:13:16 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Channel bonding, again Message-ID: <20071011221316.GA18222@bx9.net> I'm thinking about using "balance-alb" channel bonding on a medium-to-large Linux cluster; does anyone have experience with this? 
It seems that it might generate a lot of arp replies if a switch fails. -- greg From henning.fehrmann at aei.mpg.de Fri Oct 12 04:57:50 2007 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <20071011221316.GA18222@bx9.net> References: <20071011221316.GA18222@bx9.net> Message-ID: <20071012115750.GA4993@gretchen.aei.uni-hannover.de> Hello Greg, On Thu, Oct 11, 2007 at 03:13:16PM -0700, Greg Lindahl wrote: > I'm thinking about using "balance-alb" channel bonding on a > medium-to-large Linux cluster; does anyone have experience with this? > It seems that it might generate a lot of arp replies if a switch > fails. > We did some experiments with channel bonding. The bond device inherits the MAC address of the first interface, meaning it comes with a single MAC address and a single IP address. The bonding mode determines how the transmitted traffic is load-balanced. On the other hand, one needs to trunk (HP calls it trunking) the ports on the switches. We tried it on HP and Cisco switches. The switch collects the packets of the trunked ports and redistributes them according to a layer 2 or layer 3 hash policy. Here starts the problem: the packets are coming with the same IP and MAC address. The switch gets confused and, furthermore, the switch wants to do the work which is already done by the node. One of our students found a workaround: Say you want to bond 4 interfaces. Put each interface in a distinct VLAN (VLAN 1-4). The VLAN tag increases the Ethernet frame by 4 bytes. Subsequently, you bond the four interfaces. The ports on the switch can also be configured to be members of the corresponding VLANs. The load balancing is done by the node; the switch transmits the packets according to the VLAN id and does not care about trunking.
In order to establish communication between a node with 4 network cards and nodes with 2 or 1 network cards you have to create 4 virtual interfaces on each node, put each one into a distinct VLAN and bond them together. The ports on the switch can also be configured to be members of multiple VLANs. Cheers Henning From kilian at stanford.edu Fri Oct 12 09:53:06 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <20071012115750.GA4993@gretchen.aei.uni-hannover.de> References: <20071011221316.GA18222@bx9.net> <20071012115750.GA4993@gretchen.aei.uni-hannover.de> Message-ID: <200710120953.07470.kilian@stanford.edu> Hi all, On Friday 12 October 2007 04:57:50 am Henning Fehrmann wrote: > On the other hand, one needs to trunk (HP calls it trunking) the > ports on the switches. We tried it on HP and Cisco switches. > The switch collects the packets of the trunked ports and > redistributes them according to a layer 2 or layer 3 hash policy. > Here starts the problem: the packets are coming with the same IP > and MAC address. The switch gets confused and, furthermore, the > switch wants to do the work which is already done by the node. I think LACP (802.3ad) [1] is supposed to address these issues: it allows switches to negotiate an automatic configuration of individual ports by exchanging LACP packets with the peer. From the switch standpoint, an LACP port group is considered as a single logical port, which kind of alleviates those multiple-ports-same-address problems, while preserving the advantages of increased bandwidth and failover if one individual link fails. LACP is supported by Cisco [2] and Dell [3] switches, and for the peer side, by the Linux bonding module (mode 4) [4], the FreeBSD lagg(4) driver (>= 6.2-STABLE) [5] and by NetBSD agr(4) [6].
[1] http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol [2] http://www.cisco.com/en/US/products/ps6566/products_feature_guide09186a008071860b.html [3] http://www.dell.com/downloads/global/power/ps1q06-20050254-Holmes.pdf [4] http://www.linuxhorizon.ro/bonding.html [5] http://www.freebsd.org/cgi/man.cgi?query=lagg&apropos=0&sektion=4&manpath=FreeBSD+7-current&format=html [6] http://www.daemon-systems.org/man/agr.4.html Cheers, -- Kilian From kus at free.net Fri Oct 12 11:36:24 2007 From: kus at free.net (Mikhail Kuzminsky) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ? Message-ID: I found 1st AMD quad core (Opteron 2347/1.9 Ghz) SPECfp2006 results (at www.spec.org) obtained by IBM: 11.2/10.7 for peak/base values. I'll say about 1 core only, i.e. for results w/Autoparallel=NO. Let me look to other x86-64 microarchitecture w/same 4*64 bit FP results per cycle, i.e. Intel Core. For close frequency (1.86 Ghz, Xeon 5120) we may find close performance (10.9/10.7, for Bull SAS NovaScale R460 - for example). Let me now forget about cache sizes and memory throughput differences for AMD Barcelona and Intel Core microarchitectures, and their corresponding influence to performance. Then I may say that in some sense the "efficiency" (in the sense of performance, OK - SPECfp2006 performance - per 1 Hz) of both microarchitectures are close. But if I'll compare SPECfp2006 results w/x86-64 microarchitecture w/2*64 bit FP results per cycle - previous Opteron generation - I'll see some strange (IMHO) result. So, for Opteron 2222SE/3 Ghz, AMD SPECfp2006 values are 15.2/14.3. But Xeon 5160, having 4 FP results per cycle, w/same 3.0 Ghz gives very close values - 15.6/15.4 ! This means that 2 additional FP results per cycle in microarchitecture gives only about 7% of performance increase :-( The question is - should we wait some better results for new incoming optimizing compilers versions ? 
Or is it the reality that 2 additional FP results per cycle give (on average) a relatively small performance increase ? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow From agshew at gmail.com Fri Oct 12 12:14:31 2007 From: agshew at gmail.com (Andrew Shewmaker) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] Re: Memory limit enforcement In-Reply-To: References: Message-ID: On 10/10/07, David Mathog wrote: > David Kewley wrote: > > > > But the kernel doesn't really enforce anything useful. > > I agree, the kernel should be able to enforce these sorts of limits > on all processes of a user at once. > > Write Linus or whichever kernel developer you think is most likely to > know how to implement this and request it as a new feature, and explain > the situation. The kernel developer Matt Mackall has been working on making the question of how much memory an app is using more easily answered. http://lwn.net/Articles/230975/ According to the Linux Weather Forecast, we might see these patches included in 2.6.25 http://www.linux-foundation.org/en/Linux_Weather_Forecast Like Kilian, I've used the sysctl overcommit_memory and overcommit_ratio settings, partly because on some RHEL4 diskless nodes (4-socket Opterons with 32GB RAM) the OOM killer didn't actually work and the watchdog would panic the kernel. I also saw bad behavior under RHEL4 for diskful systems with 8GB RAM and plenty of swap. Swapping seemed to hang machines when it should have only slowed down the system and then recovered. -- Andrew Shewmaker From hahn at mcmaster.ca Fri Oct 12 13:09:05 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ?
In-Reply-To: References: Message-ID: > This means that 2 additional FP results per cycle in microarchitecture gives > only about 7% of performance increase :-( the 4 flops/cycle is really for linpack-like code: it assumes you are executing packed double SIMD. > The question is - should we wait some better results for new incoming > optimizing compilers versions ? Or it is the reality - that 2 additional FP > results per cycle gives (in average) relative small performance increase ? just that not all FP is SIMD-friendly, I think. if your code spends a lot of time in blas/lapack functions, I would expect it to see good speedup. regards, mark hahn. From richard.walsh at comcast.net Fri Oct 12 13:50:08 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:33 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ? Message-ID: <101220072050.21835.470FDE00000B9CA10000554B2207021553089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: "Mikhail Kuzminsky" > > But if I'll compare SPECfp2006 results w/x86-64 microarchitecture > w/2*64 bit FP results per cycle - previous Opteron generation - I'll > see some strange (IMHO) result. So, for Opteron 2222SE/3 Ghz, AMD > SPECfp2006 values are 15.2/14.3. But Xeon 5160, having 4 FP results > per cycle, w/same 3.0 Ghz gives very close values - 15.6/15.4 ! > This means that 2 additional FP results per cycle in microarchitecture > gives only about 7% of performance increase :-( > Mikhail, I am not sure I fully understand what you are presenting here, but I might say that yes, at the FPU unit level the 2222 series AMD Opteron/Barcelona and the Intel Core2/Clovertown (and also Harpertown at 45 nm) are now more largely equivalent -- that is they both can execute 2, double-wide (2x64 bit) floats in certain FMA situtations simultaneously and/or in a pipeline. 
And this feature could be used to compute clock x 4 64-bit flop peaks if you work in the marketing department. This was not true with the earlier Opteron which had to serialize each 64-bit piece of the 128-bit floating point operation. You might therefore conclude that from registers the two processors at the same clock should perform equally, but there are other issues. One big one is instruction width and issue rate. The Opterons (both 940 and 1207) are only three-wide processors while the Core2 is four-wide, giving it a wider aperture through which to schedule two 128-bit SSEs side-by-side. Different compilers or older revs could also make a difference as you suggest. The philosophical message is of course that there are no two apples alike, or even more radically that the concept of identity is fundamentally flawed ... ;-) ... Regards, rbw -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 Phone #: 612-382-4620 From kus at free.net Sat Oct 13 07:57:05 2007 From: kus at free.net (Mikhail Kuzminsky) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ? In-Reply-To: Message-ID: In message from Mark Hahn (Fri, 12 Oct 2007 16:09:05 -0400 (EDT)): >> This means that 2 additional FP results per cycle in >>microarchitecture gives >> only about 7% of performance increase :-( > >the 4 flops/cycle is really for linpack-like code: it assumes you are >executing packed double SIMD. Yes, but AFAIK most of the modern optimizing F9x compilers for x86 can generate code w/SSEx instructions (instead of x87). And I assume that many real world codes, including some from the SPECfp2006 set, include work w/floating point vectors.
It's not necessary to have very long vectors - taking into account that 64 bit SSE vectors have length=2. Such things may gives theoretically 2x speedup ! >just that not all FP is SIMD-friendly, I think. Yes, I agree w/"not all". But 7% speedup means, I beleive, "very seldom FP codes" ? Yours Mikhail > if your code spends >a lot of time in blas/lapack functions, I would expect it to see good >speedup. > >regards, mark hahn. From kus at free.net Sat Oct 13 08:10:14 2007 From: kus at free.net (Mikhail Kuzminsky) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ? In-Reply-To: <101220072050.21835.470FDE00000B9CA10000554B2207021553089C040E99D20B9D0E080C079D@comcast.net> Message-ID: In message from richard.walsh@comcast.net (Fri, 12 Oct 2007 20:50:08 +0000): >Mikhail, >I am not sure I fully understand what you are presenting here, but I >might say that yes, at the FPU unit level the 2222 series AMD >Opteron/Barcelona and the Intel Core2/Clovertown (and also Harpertown >at 45 nm) are now more largely equivalent -- that is they both can >execute 2, double-wide (2x64 bit) floats in certain FMA situtations >simultaneously and/or in a pipeline. .... >Regards, >rbw >-- Sorry, now I'm misunderstanding :-) I thought that Opteron 2222 don't have Barcelona microarchitecture (all the Barcelona's are 23xx or 83xx) and therefore can't perform 4*64 bit FP results per cycle. Am I wrong ? Mikhail From richard.walsh at comcast.net Sat Oct 13 10:28:43 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] quad-core SPECfp2006: where are 4 FPresults/cycle ? 
Message-ID: <101320071728.28799.4711004B00047CB90000707F2207002953089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: "Mikhail Kuzminsky" > In message from richard.walsh@comcast.net (Fri, 12 Oct 2007 20:50:08 > +0000): > >Mikhail, > >I am not sure I fully understand what you are presenting here, but I > >might say that yes, at the FPU unit level the 2222 series AMD > >Opteron/Barcelona and the Intel Core2/Clovertown (and also Harpertown > >at 45 nm) are now more largely equivalent -- that is they both can > >execute 2, double-wide (2x64 bit) floats in certain FMA situations > >simultaneously and/or in a pipeline. > .... > >Regards, > >rbw > >-- > Sorry, now I'm misunderstanding :-) > I thought that Opteron 2222 don't have Barcelona microarchitecture > (all the Barcelona's are 23xx or 83xx) and therefore can't perform > 4*64 bit FP results per cycle. Am I wrong ? Mikhail, Whoops ... ;-( ... I should have said 23XX. That is the Barcelona generation. If you were talking about generation two, then what I said about the serialized 64-bit staging of the 128-bit wide SSE FPU operations applies. Sorry if I caused confusion. My comments regarding the near equivalency of the floating-point functional units should be confined to comparing the 23XX series Opteron with the Clovertown/Harpertown processors from Intel. The point about 3-wide versus 4-wide applies to all Opteron generations. Dobray Utra, rbw
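One rough way to read the SPECfp2006 numbers quoted earlier in this thread is through Amdahl's law: if doubling the FP results per cycle speeds up only a fraction f of the run time by 2x, the observed speedup gives an estimate of f. The sketch below makes a strong simplifying assumption, namely that the whole difference between the Opteron 2222SE (15.2/14.3) and the Xeon 5160 (15.6/15.4) comes from the wider FP units; the thread itself points out that cache, memory, issue width and compilers also differ, so the result is an illustration rather than a measurement. The function name is invented for the example.

```python
def simd_fraction(observed_speedup: float, unit_speedup: float = 2.0) -> float:
    """Invert Amdahl's law for the accelerated fraction of the original run time.

    observed = 1 / ((1 - f) + f / unit)
    =>  f = (1 - 1 / observed) / (1 - 1 / unit)
    """
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / unit_speedup)


# SPECfp2006 peak/base values quoted in the thread, both parts at 3.0 GHz.
opteron_2222se = {"peak": 15.2, "base": 14.3}
xeon_5160 = {"peak": 15.6, "base": 15.4}

for kind in ("base", "peak"):
    speedup = xeon_5160[kind] / opteron_2222se[kind]
    f = simd_fraction(speedup)
    print(f"{kind}: speedup {speedup:.3f}, implied 2x-accelerated fraction {f:.1%}")
```

For the base scores this gives a speedup of about 1.077 and an implied fraction of about 14%, which is consistent with the remark that not all FP code is SIMD-friendly.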
From carsten.aulbert at aei.mpg.de Sun Oct 14 09:19:33 2007 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <200710120953.07470.kilian@stanford.edu> References: <20071011221316.GA18222@bx9.net> <20071012115750.GA4993@gretchen.aei.uni-hannover.de> <200710120953.07470.kilian@stanford.edu> Message-ID: <47124195.8090500@aei.mpg.de> Hi all, Kilian CAVALOTTI wrote: > I think LACP (802.3ad) [1] is supposed to address these issues: it > allows switches to negotiate an automatic configuration of individual > ports by exchanging LACP packets with the peer. From the switch > standpoint, an LACP port group is considered as a single logical port, > which kind of alleviates those multiple-ports-same-address problems, > while preserving the advantages of increased bandwidth and failover if > one individual link fails. > > LACP is supported by Cisco [2] and Dell [3] switches, and for the peer > side, by the Linux bonding module (mode 4) [4], the FreeBSD lagg(4) > driver (>= 6.2-STABLE) [5] and by NetBSD agr(4) [6]. > I don't know off the top of my head which versions we have used, but our problem with LACP was that the switch (mostly ProCurve 2900; I think the Cisco 4948 behaved similarly, but I would need to cross-check that) was using only one of the available two or four lines to the node for a single connection. Thus a node could handle two different 1 Gb/s connections at the same time, reaching almost 2 Gb/s in total, but we never saw a single connection using all the available bandwidth. That was the reason our student came up with this VLAN trick.
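A toy model may help show why a hash-based aggregate pins one connection to one link while round-robin striping does not. It is a sketch only: the hash (XOR of the last byte of each MAC address, modulo the number of links) is a simplification in the spirit of a layer 2 policy, not what any particular switch or driver does, and the MAC addresses and function names are invented for the example.

```python
from collections import Counter
from itertools import cycle


def hashed_link(src_mac: bytes, dst_mac: bytes, n_links: int) -> int:
    """Pick a link from the address pair (simplified layer-2-style hash).

    Assumption for illustration: XOR the last byte of each MAC address and
    reduce modulo the number of links. Real devices use other inputs too.
    """
    return (src_mac[-1] ^ dst_mac[-1]) % n_links


def send_hashed(flows, n_links):
    """Count packets per link when each flow is pinned to its hashed link."""
    load = Counter()
    for src, dst, packets in flows:
        load[hashed_link(src, dst, n_links)] += packets
    return load


def send_round_robin(flows, n_links):
    """Count packets per link when every packet goes to the next link in turn."""
    load = Counter()
    links = cycle(range(n_links))
    for _src, _dst, packets in flows:
        for _ in range(packets):
            load[next(links)] += 1
    return load


node_a = bytes.fromhex("001122334401")  # hypothetical MAC addresses
node_b = bytes.fromhex("001122334402")
single_flow = [(node_a, node_b, 1000)]  # one connection, 1000 packets

print(dict(send_hashed(single_flow, 2)))       # {1: 1000}: one link carries it all
print(dict(send_round_robin(single_flow, 2)))  # {0: 500, 1: 500}: both links used
```

With hashing, more links only help when there are several address pairs to spread across them. Round-robin uses every link for a single connection, at the price of possible packet reordering at the receiver.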
Cheers Carsten From henning.fehrmann at aei.mpg.de Mon Oct 15 05:37:13 2007 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <865578781.20071012204237@gmx.net> References: <20071011221316.GA18222@bx9.net> <20071012115750.GA4993@gretchen.aei.uni-hannover.de> <865578781.20071012204237@gmx.net> Message-ID: <20071015123713.GA6659@gretchen.aei.uni-hannover.de> Hello Jan, > That is not 100% correct. There are at least 6 modes that the bonding > device of linux supports. balance-alb and balance-tlb are not > assigning the same MAC to the interfaces in the bond. Therefore you > don't need a switch that supports trunks or something similar. Hmm, we tried balance-tlb. After a few seconds either the bonding device was down with a bunch of kernel log messages or the node hung. Maybe somebody can report on this problem. > > The problem is that the performance of the bonding device under linux > is far away from being optimal (as far as i saw). The round-robin or > ad modes do not bring more than 140 to 150 MB/s out of 2 Gigabit > links. > The load balancing modes allow multiple connections to be fast but are > not speeding up a single connection (each connection is limited to the > speed of a single link in the bond). Exactly. Using the VLAN trick the student reported a transmission rate of 240MB/s using round-robin, NFS (reading from a ram-disk) and 2 Gigabit links, even when only a single connection is established. We don't know yet how it scales in a cluster. There might be a problem with reordering of the packets. Regards Henning From jpenney at advancedclustering.com Thu Oct 11 08:16:30 2007 From: jpenney at advancedclustering.com (Justin Penney) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Re: Intel quad core nodes?
In-Reply-To: <200710101901.l9AJ06mK026026@bluewest.scyld.com> Message-ID: <1219028424.9852571192115790668.JavaMail.root@venus.corp.advancedclustering.com> I recently ran a customer's weather modeling code on a variety of machines. This program is very sensitive to memory. The following are the run times while running 4 processes. Please note that the Xeon, Opteron and Barcelona numbers are from dual socket machines. Core 2 Quad 2.4 GHz - 3h16m Xeon Quad 2.0 GHz - 2h51m Barcelona 2.0 GHz - 1h40m Opteron 2220SE 2.8 GHz - 2h24m At 8 processes: Xeon Quad 2.0 GHz - 2h52m Barcelona 2.0 GHz - 1h22m Barnet Wagman wrote: > I'm moving towards setting up a small cluster (my first), and am > thinking about using Intel quad core processors. However, I'm a little > concerned about memory contention. I'm (tentatively) going to have one > processor per node (this appears to be the cheapest way to go), but I > still wonder whether four cores will choke Intel's memory architecture. > (AMD's Barcelona may be better in this regard, but it doesn't seem to be > available yet, at least not through retail channels). > > I'd like to hear any opinions on this issue. And if you've used quad > core processors, I'd certainly like to hear about your experiences > (including which processor you've used). -- justin penney email: jpenney@advancedclustering.com phone: 913.643.0300 option 2 im: http://livehelp.advancedclustering.com/ From daniels at mkem.uu.se Thu Oct 11 09:49:18 2007 From: daniels at mkem.uu.se (=?iso-8859-1?Q?Daniel_Sp=E5ngberg?=) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Re: Memory limit enforcement In-Reply-To: References: Message-ID: On Wed, 10 Oct 2007 21:18:54 +0200, David Mathog wrote: > > I'm thinking that it shouldn't be too difficult (but what do > I know about kernel hacking?) to allow a process to request: > > 1. for any child processes that may be created after the request > 2. for any malloc() type operation > 3. 
have the child share the parent's memory statistics/limits > as maintained by the kernel. (Not the complete memory map, just > the sum of physical and virtual memory allocated.) > > The end result would be malloc/calloc failing in some > child process once the parent process's counters hit up against > the set limit(s). This would happen no matter which child ate all the > memory, which is the desired behavior. > That's probably a good start, but wouldn't help with applications which use a lot of shared memory. I currently have a resource limit problem with an application which uses *a lot* of SYSV shared memory. Essentially, on a 4-CPU machine with 4 GB memory it starts four processes, creates four 1 GB shared memory segments, one per process, and then attaches all four segments to all processes. So the virtual memory per process is about 4 GB and when it actually uses the memory the RSS of each process also comes close to 4 GB. We currently use a daemon which kills processes which have a larger RSS than their rlimits, which obviously is quite bad for this kind of application. Daniel Spångberg UPPMAX Uppsala University Sweden From rfinch at water.ca.gov Fri Oct 12 09:46:35 2007 From: rfinch at water.ca.gov (Finch, Ralph) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Tilera to Introduce 64-Core Processor References: <101120071420.28811.470E3137000A84B10000708B2207020653089C040E99D20B9D0E080C079D@comcast.net> Message-ID: <3EE436BDAFE0044A8D6997B74EC3FAB1028F6B8D@sacex2.ad.water.ca.gov> [I know nothing! Just copy-and-paste from a Usenet group] Subject: Tilera to Introduce 64-Core Processor Newsgroups: comp.arch, comp.arch.embedded, comp.sys.intel, alt.comp.hardware.amd.x86-64, comp.sys.ibm.pc.hardware.chips Date: Thu, 11 Oct 2007 11:02:14 -0700 Tilera to Introduce 64-Core Processor By Andy Patrizio An MIT-inspired startup will introduce a new multi-core chip today at the annual Hot Chips conference at Stanford University.
The TILE64 boasts a "clean sheet" design, unencumbered by any legacy compatibility concerns, that Tilera says will provide a huge leap in multithreaded performance. Tilera was founded in 2004 to bring to market the multi-core processor designs of MIT researcher Anant Agarwal. Agarwal created what he called a "mesh" multi-core architecture, where the cores are all interconnected rather than going through a frontside bus, as Intel's multi-core chips do. Agarwal first created this multi-core architecture in 1996, long before Intel and AMD were anywhere close to doing it. The project received funding from the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation, the agency that managed the Internet for decades. Tilera holds 40-plus patents for its multi-core design. The TILE64 will be the first in a series of processors built around massively multi-core chips. The TILE64 processor contains 64 full-featured, programmable cores that Tilera claims can perform 500 billion operations per second and deliver ten times the performance and thirty times the performance-per-watt of the Intel dual-core Xeon. Agarwal said the company can make these performance leaps because it doesn't use any legacy technologies or designs. "The real problem with scale is existing multi-core architectures use a bus. In that architecture, the bus is a central switch and all the cores are connected to the single central switch. A packet has to go through it no matter what, which is fine for one, two or four cores, but it does not scale," he told internetnews.com. Tilera uses a mesh architecture, where the cores are laid out in a checkerboard-like grid, all connected through high-speed interconnects. "In architectures of this sort, you can keep growing and you won't have any serious congestion," said Agarwal. Intel has promised to dispense with the frontside bus with the Nehalem architecture, due late next year.
AMD does not have a frontside bus in the Opteron, but it's also using four cores at the most, while Tilera is at 64. The TILE family can scale up to even more, or down to a two-core design for the smallest of designs, such as a cell phone. Its power consumption is a few hundred milliwatts per core, Agarwal said. Its clock speed will range from 600MHz to 1GHz. But there's a lot more on the chip than just cores. It has a pair of 10 gigabit Ethernet ports directly on the chip for high speed networking, as well as on-board I/O and peripheral controllers. Its integrated memory controllers allow for up to 200 gigabits of memory bandwidth within the chip. That's what made the TILE64 chip so appealing to Top Layer, a developer of network security and intrusion detection appliances. The company had built its own processors but now plans to switch to Tilera's chips, according to Chief Strategy Officer Mike Paquette. "Our software is a multi-core design, and we were able to map out functionality almost 1 for 1 for each process to a core in a Tilera chip," he said. "The performance we expect in our estimates exceeds what we could have gotten from any silicon providers." Top Layer decided to license processors for future products rather than bear the expense of building any more, and no other processors had the scalability. "Because the movement of data is so much of what we do, we needed a multi-core chip that was optimized for what we were doing rather than something optimized for general purpose computing. Tilera has capabilities for network capabilities that are far ahead of what you can get from [x86] processors," said Paquette. Tilera will ship a full development toolkit, called the Multicore Development Environment (MDE), for building applications. It's an Eclipse-based Integrated Development Environment (IDE) with an ANSI standard C compiler, an application level library and tools for debugging and profiling multi-core processors.
Wisely, Tilera is not taking on Intel and AMD right out of the gate, as Transmeta did. It's going for the embedded market. "We're focused on embedded because we are a startup and want to go into a space where there is massive demand for performance like ours. We can focus on a couple of markets and do really well in those markets by addressing customer demands squarely and don't have to go up against a dominant competitor," said Agarwal. Tilera expects to sell the TILE64 processor for $435 in lots of 10,000 units. The company is also planning a 36-core and 120-core processor for the near future. http://www.internetnews.com/ent-news/article.php/3695116 From jan.heichler at gmx.net Fri Oct 12 11:42:37 2007 From: jan.heichler at gmx.net (Jan Heichler) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <20071012115750.GA4993@gretchen.aei.uni-hannover.de> References: <20071011221316.GA18222@bx9.net> <20071012115750.GA4993@gretchen.aei.uni-hannover.de> Message-ID: <865578781.20071012204237@gmx.net> Hello Henning, Friday, 12 October 2007, you wrote: HF> Hello Greg, HF> On Thu, Oct 11, 2007 at 03:13:16PM -0700, Greg Lindahl wrote: >> I'm thinking about using "balance-alb" channel bonding on a >> medium-to-large Linux cluster; does anyone have experience with this? >> It seems that it might generate a lot of arp replies if a switch >> fails. >> HF> We did some experiments with channel bonding. HF> The bond device inherits the MAC-address of the first HF> interface, meaning, it comes with a single MAC- and a single HF> IP-address. The bonding mode determines the load-balance of the HF> transmission. That is not 100% correct. There are at least 6 modes that the bonding device of Linux supports. balance-alb and balance-tlb are not assigning the same MAC to the interfaces in the bond. Therefore you don't need a switch that supports trunks or something similar.
The problem is that the performance of the bonding device under Linux is far away from being optimal (as far as I saw). The round-robin or ad modes do not bring more than 140 to 150 MB/s out of 2 Gigabit links. The load balancing modes allow multiple connections to be fast but do not speed up a single connection (each connection is limited to the speed of a single link in the bond). As a second point the balancing modes create a heavy load on the nodes and can interfere with a parallel computation. Regards Jan From richard.walsh at comcast.net Mon Oct 15 07:38:35 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Tilera to Introduce 64-Core Processor Message-ID: <101520071438.15899.47137B6B0007A42100003E1B2207022933089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: "Finch, Ralph" > [I know nothing! Just copy-and-paste from a Usenet group] > > Subject: Tilera to Introduce 64-Core Processor > Newsgroups: comp.arch, comp.arch.embedded, comp.sys.intel, > alt.comp.hardware.amd.x86-64, comp.sys.ibm.pc.hardware.chips > Date: Thu, 11 Oct 2007 11:02:14 -0700 > It is cool, but not as cool as it could be ... it has no FPUs like the development version or its academic predecessor, the RAW chip (or Intel's Polaris, for that matter). But the on-chip interconnection network is interesting and programmable, as is the pooled L3 cache. Eventually, as Moore's Law scaling drives up core counts, this kind of programmable mesh interconnect will replace what Intel and AMD currently provide. It is a preview of the "many core" future. rbw -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 Phone #: 612-382-4620 -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071015/1e7eb05d/attachment.html From kilian at stanford.edu Mon Oct 15 10:24:06 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Channel bonding, again In-Reply-To: <47124195.8090500@aei.mpg.de> References: <20071011221316.GA18222@bx9.net> <200710120953.07470.kilian@stanford.edu> <47124195.8090500@aei.mpg.de> Message-ID: <200710151024.08139.kilian@stanford.edu> Hi Carsten, On Sunday 14 October 2007 09:19:33 am Carsten Aulbert wrote: > I don't know off the top of my head which versions we have used, but > our problem with LACP was that the switch (mostly ProCurve 2900, but > I think the Cisco 4948 behaved similarly, but that I would need to > cross-check) was using only one of the available two or four > lines to the node for a single connection. Thus a node could handle > two different 1 Gb/s connections at the same time and reach almost > 2 Gb/s in total, but we never saw a single connection using all the > available bandwidth. > > That was the reason our student came up with this VLAN trick. Indeed, with a trunked LACP link, a single connection will only go over one link. But you can have up to your-number-of-trunk-lines transfers going wire-speed at the same time. I guess it all depends on what you need. :) We're using LACP to aggregate links between our users and our cluster firewall, like:

user \
      \            trunk               trunk
user -- | switch | ====== | firewall | ===== cluster
      /
user /

So in that setup, each user's individual connection is limited by its own NIC (and often disk i/o), which is at most GigE. Our point is letting more than one user transfer data at Gbps speeds. I guess that in our case, the VLAN trick couldn't really work, since, if I understood correctly, the switch has to be the receiving end for the aggregation to work.
For instance, the trunking host can send data using several links, but it can only receive using one, because the switch can't load balance and has to choose one interface/VLAN to send data through, is that right? I'm quite surprised balance-tlb could crash a node too, but I didn't try recently. Cheers, -- Kilian From atp at piskorski.com Mon Oct 15 15:32:26 2007 From: atp at piskorski.com (Andrew Piskorski) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Tilera to Introduce 64-Core Processor In-Reply-To: <3EE436BDAFE0044A8D6997B74EC3FAB1028F6B8D@sacex2.ad.water.ca.gov> References: <3EE436BDAFE0044A8D6997B74EC3FAB1028F6B8D@sacex2.ad.water.ca.gov> Message-ID: <20071015223226.GA15488@tehun.pair.com> On Fri, Oct 12, 2007 at 09:46:35AM -0700, Finch, Ralph wrote: > [I know nothing! Just copy-and-paste from a Usenet group] > Tilera to Introduce 64-Core Processor > By Andy Patrizio > http://www.internetnews.com/ent-news/article.php/3695116 This is a commercialization of MIT's RAW chip? There is/was another academic project with an even more interesting sounding design than RAW, but now I can't remember what it's called. Their papers did cite RAW, and explained how they were similar/different. -- Andrew Piskorski http://www.piskorski.com/ From richard.walsh at comcast.net Mon Oct 15 19:48:58 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Tilera to Introduce 64-Core Processor Message-ID: <101620070248.4875.4714269A0004F2AD0000130B2207002953089C040E99D20B9D0E080C079D@comcast.net> -------------- Original message -------------- From: Andrew Piskorski > On Fri, Oct 12, 2007 at 09:46:35AM -0700, Finch, Ralph wrote: > > [I know nothing! Just copy-and-paste from a Usenet group] > > > Tilera to Introduce 64-Core Processor > > By Andy Patrizio > > http://www.internetnews.com/ent-news/article.php/3695116 > > This is a commercialization of MIT's RAW chip? 
> > There is/was another academic project with an even more interesting > sounding design than RAW, but now I can't remember what it's called. > Their papers did cite RAW, and explained how they were > similar/different. > Perhaps you are referring to the TRIPS Polymorphic Processor from U of Texas which can be configured to favor ILP, DLP, or TLP application types. I think I sent out this reference a while back. It should be in the archives. rbw -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 Phone #: 612-382-4620 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071016/6cf95558/attachment.html From jpenney at advancedclustering.com Mon Oct 15 11:56:50 2007 From: jpenney at advancedclustering.com (Justin Penney) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Re: Intel quad core nodes? In-Reply-To: <2011880023.10484331192467730910.JavaMail.root@venus.corp.advancedclustering.com> Message-ID: <1126330224.10491961192474610622.JavaMail.root@venus.corp.advancedclustering.com> > I assume these are all one-system results? > MPI or OpenMP? > Looks like the Xeon Quad and Barcelona results must have been run with 2 > processes per socket? > Was it one executable? Compiled/optimized on what platform? > > Very interesting data, you may just want to provide the list with some > additional details, so they know what to make of the data. > > -Tom These are single system results with MVAPICH. They will run on InfiniBand, which is why MVAPICH was used. The quad core runs were 2 processes per socket. The executables were created with the latest Portland Group f90 and the optimisation flags were used for each chip. The Barcelona binary used "-tp barcelona-64" and the Xeon binary used "-tp core2-64". As soon as I am able I will be running this test on Harpertown processors.
I should then have permission to publish the details of the testing. >> I recently ran a customer's weather modeling code on a >> variety of machines. This program is very sensitive to >> memory. The following are the run times while running 4 >> processes. Please note that the Xeon, Opteron and Barcelona >> numbers are from dual socket machines. >> >> Core 2 Quad 2.4 GHz - 3h16m >> Xeon Quad 2.0 GHz - 2h51m >> Barcelona 2.0 GHz - 1h40m >> Opteron 2220SE 2.8 GHz - 2h24m >> >> At 8 processes: >> >> Xeon Quad 2.0 GHz - 2h52m >> Barcelona 2.0 GHz - 1h22m -- justin penney email: jpenney@advancedclustering.com phone: 913.643.0300 option 2 im: http://livehelp.advancedclustering.com/ From andrew at moonet.co.uk Tue Oct 16 06:07:47 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools Message-ID: And the winner of the 2007 Parallel Development Tools Award is....... From landman at scalableinformatics.com Tue Oct 16 06:48:04 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <4714C114.9060409@scalableinformatics.com> andrew holway wrote: > And the winner of the 2007 Parallel Development Tools Award is....... make -j16 ...
(ducks and runs away) -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From john.leidel at gmail.com Tue Oct 16 07:02:40 2007 From: john.leidel at gmail.com (John Leidel) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714C114.9060409@scalableinformatics.com> References: <4714C114.9060409@scalableinformatics.com> Message-ID: <1192543360.4558.35.camel@e521.site> `vi` :-P On Tue, 2007-10-16 at 09:48 -0400, Joe Landman wrote: > andrew holway wrote: > > And the winner of the 2007 Parallel Development Tools Award is....... > > make -j16 ... > > (ducks and runs away) > > > From rgb at phy.duke.edu Tue Oct 16 07:35:56 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <1192543360.4558.35.camel@e521.site> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> Message-ID: On Tue, 16 Oct 2007, John Leidel wrote: > `vi` > > :-P Yeah, you'd BETTER duck and run away right after Joe after that one. ;-) rgb > > On Tue, 2007-10-16 at 09:48 -0400, Joe Landman wrote: >> andrew holway wrote: >>> And the winner of the 2007 Parallel Development Tools Award is....... >> >> make -j16 ... >> >> (ducks and runs away) >> >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From landman at scalableinformatics.com Tue Oct 16 07:39:08 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> Message-ID: <4714CD0C.8060804@scalableinformatics.com> Robert G. Brown wrote: > On Tue, 16 Oct 2007, John Leidel wrote: > >> `vi` >> >> :-P > > Yeah, you'd BETTER duck and run away right after Joe after that one. > > ;-) I have heard (or am spreading) the rumor that the 1-18-08 movie is not really a monster movie, but the final epic battle between vi and emacs ... -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From rgb at phy.duke.edu Tue Oct 16 08:34:25 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714CD0C.8060804@scalableinformatics.com> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> Message-ID: On Tue, 16 Oct 2007, Joe Landman wrote: > Robert G. Brown wrote: >> On Tue, 16 Oct 2007, John Leidel wrote: >> >>> `vi` >>> >>> :-P >> >> Yeah, you'd BETTER duck and run away right after Joe after that one. >> >> ;-) > > I have heard (or am spreading) the rumor that the 1-18-08 movie is not really > a monster movie, but the final epic battle between vi and emacs ... Ooooo, now you're REALLY in trouble. You're bad-mouthing emacs. 
And here I am typing in this reply using jove, which is (as all really Old People know) Jonathan's Own Version of Emacs, which in fact I use for all code development, all email, all writing of fictional novels and poetry. Not only I, but all my rgbbots type into jove or they don't type at all. It is 'leventy million times faster than vi, seven thousand times smaller than emacs, and you never have to actually use a mouse or remove your fingers from the keys making it much better than any possible GUI editor. I do realize (*ahem*) that I'm one of three living humans that still use jove, and I've had to adopt it and maintain its rpm all by myself just so I can still install it as generations of Linux and its libraries pass me by, but it is still a damn good tool! tty4ever! rgb > > > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From James.P.Lux at jpl.nasa.gov Tue Oct 16 10:35:43 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714CD0C.8060804@scalableinformatics.com> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> Message-ID: <6.2.3.4.2.20071016100925.0332a890@mail.jpl.nasa.gov> At 07:39 AM 10/16/2007, Joe Landman wrote: >Robert G. Brown wrote: >>On Tue, 16 Oct 2007, John Leidel wrote: >> >>>`vi` >>> >>>:-P >>Yeah, you'd BETTER duck and run away right after Joe after that one. >>;-) > >I have heard (or am spreading) the rumor that the 1-18-08 movie is >not really a monster movie, but the final epic battle between vi and emacs ... Why not TECO? Much more cryptic and arcane than either of those two later wimpy editors. 
If you MUST have a sort of WYSIWYG editor, there *is* a TECO macro that does it, but that would imply you have some form of glass teletype upon which to view it, and, as we all know, real developers use paper: tape, tab cards, or, if they must, teletype rolls. No mouse required. And, there's a command to tell you the phase of the moon, as I recall. (google is my friend... EG).. which wound up being the Emacs calendar command M. vi doesn't provide this incredibly useful and valuable capability (nor does MS Word or Excel, but I DO have a VBA addon, if someone needs it..) And, I happen to know that one can do the equivalent of make from inside TECO (although it's not without problems.. my roommate in the 80s cursed it regularly late into the night) (TECO was used by Stallman to create the first version of Emacs, apparently) And, I never realized it at the time, but I used to go backpacking with one of the keepers of the TECO flame: Pete Siemsen who wrote TECOC (i.e. a port into C of TECO from the original MIDAS). Had I but known, I could have saved some relics of this great man. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From mathog at caltech.edu Tue Oct 16 11:20:51 2007 From: mathog at caltech.edu (David Mathog) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools Message-ID: Jim Lux wrote > and, as we all know, real developers > use paper: tape, tab cards, or, if they must, teletype rolls. You forgot paper tape. (Most people who used it probably wish they could forget it too!) Anyway, all of the tools you mentioned are for wimps - real programmers load code directly into memory using the toggle switches on the front of the computer.
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From john.leidel at gmail.com Tue Oct 16 11:41:53 2007 From: john.leidel at gmail.com (John Leidel) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <1192560113.4558.53.camel@e521.site> Friends don't let friends play tic-tac-toe using punchcards :-) On Tue, 2007-10-16 at 11:20 -0700, David Mathog wrote: > Jim Lux wrote > > > and, as we all know, real developers > > use paper: tape, tab cards, or, if they must, teletype rolls. > > You forgot paper tape. (Most people who used it probably wish they > could forget it too!) > > Anyway, all of the tools you mentioned are for wimps - real > programmers load code directly into memory using the toggle > switches on the front of the computer. > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dnlombar at ichips.intel.com Tue Oct 16 12:11:55 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <20071016191155.GA21674@nlxdcldnl2.cl.intel.com> On Tue, Oct 16, 2007 at 11:20:51AM -0700, David Mathog wrote: > Jim Lux wrote > > > and, as we all know, real developers > > use paper: tape, tab cards, or, if they must, teletype rolls. > > You forgot paper tape. (Most people who used it probably wish they > could forget it too!) Let's not get sloppy here, there is (was) 5-hole (Telex) and 7-hole (TWX, later renamed Telex II) paper tape. 
Back in the early eighties, I had to hunt down an owner of an actual Telex unit to transliterate a spool of 5-hole to text, so that I could have one of the secretaries punch that onto 7-hole tape. Early--manual--transcoding :) -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From peter.st.john at gmail.com Tue Oct 16 12:30:51 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: **real** programmers somehow get large numbers of thralls to hoist huge boulders into precise positions. Poor GFLOPS/$, though. Peter ... Anyway, all of the tools you mentioned are for wimps - real > programmers load code directly into memory using the toggle > switches on the front of the computer. > > Regards, > > David Mathog > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071016/d7314036/attachment.html From lusk at mcs.anl.gov Tue Oct 16 12:40:44 2007 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <19DAE557-271A-48BD-9301-08FE4218F2D5@mcs.anl.gov> On Oct 16, 2007, at 1:20 PM, David Mathog wrote: > Jim Lux wrote > >> and, as we all know, real developers >> use paper: tape, tab cards, or, if they must, teletype rolls. > > You forgot paper tape. (Most people who used it probably wish they > could forget it too!) > > Anyway, all of the tools you mentioned are for wimps... This exchange reminds me of a non-wimp I once worked with. Larry Wos, one of the inventors of the field of automated theorem proving, was blind. At Argonne National Laboratory in the early 70's, he programmed in assembly language (of course) using a Model 33 teletype, the kind with the paper tape output on the side. 
He had it modified to make bumps instead of punching holes. He could read the 5-bit Baudot code on that tape with his thumb as fast as I could read the printed output with my eyes. Rusty Lusk From James.P.Lux at jpl.nasa.gov Tue Oct 16 13:05:15 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <6.2.3.4.2.20071016130350.02d5ac88@mail.jpl.nasa.gov> At 11:20 AM 10/16/2007, David Mathog wrote: >Jim Lux wrote > > > and, as we all know, real developers > > use paper: tape, tab cards, or, if they must, teletype rolls. > >You forgot paper tape. (Most people who used it probably wish they >could forget it too!) That's what I meant Paper paper tape paper tab cards paper teletype rolls I've heard there's such a thing as rust on scotch tape or something like that... h From jlforrest at berkeley.edu Tue Oct 16 13:52:16 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <6.2.3.4.2.20071016100925.0332a890@mail.jpl.nasa.gov> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <6.2.3.4.2.20071016100925.0332a890@mail.jpl.nasa.gov> Message-ID: <47152480.4070204@berkeley.edu> Jim Lux wrote: > Why not TECO? Indeed. One of the great features of TECO is that no matter what your name was, you could always enter it as a TECO command, and it would do something. Of course, as other people recognized long ago, most complicated TECO macros closely resembled transmission noise in a character stream. -- Jon Forrest Unix Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From rgb at phy.duke.edu Tue Oct 16 14:19:05 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714F82D.10707@nada.kth.se> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: On Tue, 16 Oct 2007, Jon Tegner wrote: > You should switch to a .deb-system, to save you some trouble: > > $ apt-cache search jove > jove - Jonathan's Own Version of Emacs - a compact, powerful editor > > Sorry, couldn't resist ;-) Hey, it's ok. I'm actually trisystemal. FC 6 on top (soon to jump to 8, but in no hurry), VMware, then debian and XP Pro VM. And yes, it was a good thing debian already had jove as I still don't really know how to build debian packages, and manage to get myself confused by apt tools (I'm too used to yum). But there is no doubt: a) Debian is a perfectly useful, fully functional variety of linux, and I have been painfully taught to bow down before its selection of available packages, which is for all practical purposes inexhaustible. In fact, you need a search engine with powerful features even to go shopping amongst them. b) It works great as a VM under Fedora. Beyond that, well, Fedora is also a perfectly useful, fully functional variety of linux. XP Pro (outfitted with cygwin and living in a VM where it belongs) isn't a bad version of Windows, for that matter -- it sucks less than any version of Windows I ever used (which is still plenty, mind you, but it is workable and reasonably stable). jove even builds under cygwin. rgb > > /jon > > > Robert G. Brown wrote: >> >> >> I do realize (*ahem*) that I'm one of three living humans that still use >> jove, and I've had to adopt it and maintain its rpm all by myself just >> so I can still install it as generations of Linux and its libraries pass >> me by, but it is still a damn good tool! >> >> tty4ever! >> >> rgb >> >>> >>> >>> >> > -- Robert G. Brown Duke University Dept. 
of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From gerry.creager at tamu.edu Tue Oct 16 14:21:16 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <20071016191155.GA21674@nlxdcldnl2.cl.intel.com> References: <20071016191155.GA21674@nlxdcldnl2.cl.intel.com> Message-ID: <47152B4C.2060706@tamu.edu> Lombard, David N wrote: > On Tue, Oct 16, 2007 at 11:20:51AM -0700, David Mathog wrote: >> Jim Lux wrote >> >>> and, as we all know, real developers >>> use paper: tape, tab cards, or, if they must, teletype rolls. >> You forgot paper tape. (Most people who used it probably wish they >> could forget it too!) > > Let's not get sloppy here, there is (was) 5-hole (Telex) and 7-hole (TWX, > later renamed Telex II) paper tape. > > Back in the early eighties, I had to hunt down an owner of an actual Telex > unit to transliterate a spool of 5-hole to text, so that I could have one > of the secretaries punch that onto 7-hole tape. Early--manual--transcoding :) You mean you didn't know how to read the tape by sight? -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From rgb at phy.duke.edu Tue Oct 16 14:28:45 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: On Tue, 16 Oct 2007, David Mathog wrote: > Jim Lux wrote > >> and, as we all know, real developers >> use paper: tape, tab cards, or, if they must, teletype rolls. > > You forgot paper tape. (Most people who used it probably wish they > could forget it too!) 
> > Anyway, all of the tools you mentioned are for wimps - real > programmers load code directly into memory using the toggle > switches on the front of the computer. Been there, done that! Both of them (a PDP 1 "with a few bad bits", booted from paper tape once you set the sense switches). Coded on it, too... Maybe we should form an Old Guy club or something...;-) > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From gerry.creager at tamu.edu Tue Oct 16 15:34:18 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <1192560113.4558.53.camel@e521.site> References: <1192560113.4558.53.camel@e521.site> Message-ID: <47153C6A.8010803@tamu.edu> Didn't you have a tic-tac-toe game on punch cards written in PL-1? John Leidel wrote: > Friends don't let friends play tic-tac-toe using punchcards :-) > > On Tue, 2007-10-16 at 11:20 -0700, David Mathog wrote: >> Jim Lux wrote >> >>> and, as we all know, real developers >>> use paper: tape, tab cards, or, if they must, teletype rolls. >> You forgot paper tape. (Most people who used it probably wish they >> could forget it too!) >> >> Anyway, all of the tools you mentioned are for wimps - real >> programmers load code directly into memory using the toggle >> switches on the front of the computer. 
>> >> Regards, >> >> David Mathog >> mathog@caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Tue Oct 16 17:34:38 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <949399.18114.qm@web37911.mail.mud.yahoo.com> References: <949399.18114.qm@web37911.mail.mud.yahoo.com> Message-ID: <4715589E.608@tamu.edu> Quote... "Three things in life a man must do, Before his days are done. Write two lines of APL... And make the sucker run." OK, so it's not PL-I but APL was another interesting beast. A friend had written an entire StarTrek game, including a 3d universe, in APL and we wasted cycles waiting for long jobs on the Amdahl 470v6 to complete that way... Ellis Wilson wrote: > Wow, PL-I, I'm learning about that in my language design class. While > it brought a bunch of new items to the computing field, can't say I'm > upset I didn't code in it :). > > Sorry guys, I came into existence just about the time the internet was > opened up from just NSF to commercial interest, so punch cards are a > little out of my league. I must say though, this certainly beats the > heck out of a history of computing languages class any day! 
> > Ellis > > */Gerry Creager /* wrote: > > Didn't you have a tic-tac-toe game on punch cards written in PL-1? > > John Leidel wrote: > > Friends don't let friends play tic-tac-toe using punchcards :-) > > > > On Tue, 2007-10-16 at 11:20 -0700, David Mathog wrote: > >> Jim Lux wrote > >> > >>> and, as we all know, real developers > >>> use paper: tape, tab cards, or, if they must, teletype rolls. > >> You forgot paper tape. (Most people who used it probably wish they > >> could forget it too!) > >> > >> Anyway, all of the tools you mentioned are for wimps - real > >> programmers load code directly into memory using the toggle > >> switches on the front of the computer. > >> > >> Regards, > >> > >> David Mathog > >> mathog@caltech.edu > >> Manager, Sequence Analysis Facility, Biology Division, Caltech > > -- > Gerry Creager -- gerry.creager@tamu.edu > Texas Mesonet -- AATLT, Texas A&M University > Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
-- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From James.P.Lux at jpl.nasa.gov Tue Oct 16 21:04:48 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4715589E.608@tamu.edu> References: <949399.18114.qm@web37911.mail.mud.yahoo.com> <4715589E.608@tamu.edu> Message-ID: <6.2.3.4.2.20071016204804.03341218@mail.jpl.nasa.gov> At 05:34 PM 10/16/2007, Gerry Creager wrote: >Quote... >"Three things in life a man must do, >Before his days are done. >Write two lines of APL... >And make the sucker run." > >OK, so it's not PL-I but APL was another interesting beast. A >friend had written an entire StarTrek game, including a 3d universe, >in APL and we wasted cycles waiting for long jobs on the Amdahl >470v6 to complete that way... Lest we forget... the first portable personal computer, the IBM 5100 (predecessor of the more familiar IBM 5150 Personal Computer) ran APL and BASIC as its two native languages. And had core memory, to boot, so if you just powered off instead of running the shutdown, it would remember where you were. I think IBM's first real "personal computer", as in one intended to be used by a single person at a time, sitting at the typewriter console, was probably the 1130 (although I understand that it really was a version of their industrial 1800 machine) >Ellis Wilson wrote: >>Wow, PL-I, I'm learning about that in my language design >>class. While it brought a bunch of new items to the computing >>field, can't say I'm upset I didn't code in it :). >>Sorry guys, I came into existence just about the time the internet >>was opened up from just NSF to commercial interest, so punch cards >>are a little out of my league. 
I must say though, this certainly >>beats the heck out of a history of computing languages class any day! >>Ellis You know... I pitched my last box of cards into the trash probably in the early-to-mid 80s. So there I am, working for a special effects company in 1997, and we get hired by the producer to do effects for an Intel commercial. They want punched cards in the shot sort of floating in the background on wires (a "practical" effect, as opposed to CGI)... So one of the artists gets out some card stock and an X-acto knife and starts to cut out a tab card AND HOLES... I walk by after de-lidding some Pentiums for another shot, and say, surely there's some place in Los Angeles where we can just buy a box of cards and use a keypunch. A day of phone calls later... So yes, one CAN buy 80 column tab cards still (or 10 years ago you could). Minimum order is a case, 5 boxes (10,000 cards). But, even better, I found a place in the San Fernando Valley that did key-to-disk and service bureau processing on legacy stuff. They said, oh yeah, we have a keypunch to do JCL cards and such for various card oriented processing jobs. (Interestingly, Ventura County still used tab cards for ballots in elections until last year.) I went over there, and they gave me a bunch of cards and let me use their 029 keypunch back in the corner. Talk about a weird experience. I hadn't even seen one since, probably, 1980. I sat down, reached under to turn on the power, loaded the cards in the hopper, flipped the switch that did the auto feed, and started punching. There's all those sounds of the card coming down, moving across, etc., and then the chunk, chunk, chunk as you punch. Talk about instinctive motor memory... I think the guy at the service center was confident I knew what I was doing when I knew how to turn on the power without looking. Thank the gods I didn't have to build a drum card. But now, I've fully confirmed I'm an old codger, and I'm not even 50 yet. James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From john.hearns at streamline-computing.com Wed Oct 17 00:18:56 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: Message-ID: <4715B760.4020302@streamline-computing.com> Peter St. John wrote: > **real** programmers somehow get large numbers of thralls to hoist huge > boulders into precise positions. s/boulders/19 inch racks/ From tjrc at sanger.ac.uk Wed Oct 17 01:32:07 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: On 16 Oct 2007, at 10:19 pm, Robert G. Brown wrote: > On Tue, 16 Oct 2007, Jon Tegner wrote: > >> You should switch to a .deb-system, to save you some trouble: >> >> $ apt-cache search jove >> jove - Jonathan's Own Version of Emacs - a compact, powerful editor >> >> Sorry, couldn't resist ;-) > > Hey, it's ok. I'm actually trisystemal. FC 6 on top (soon to jump to > 8, but in no hurry), VMware, then debian and XP Pro VM. And yes, > it was > a good thing debian already had jove as I still don't really know > how to > build debian packages, If you want a good introduction to debian packages and how they work, then I recommend Martin Krafft's book "The Debian System". I've been a Debian Developer for ten years, and that book still teaches me useful stuff about Debian on a regular basis. 
The chapter on packaging is superb; it teaches you how to make packages from the ground up, so you really understand how they work, starting with the basic fact that fundamentally a debian binary package is an ar archive which contains two tarballs. One, data.tar.gz contains the files belonging to the package. The other, control.tar.gz, contains the scripts and information about the package used by the packaging tools, and at a minimum this contains two files: DEBIAN/control, which contains the information about the package (description, dependencies and whatnot) and DEBIAN/md5sums which is, as you'd expect, a list of md5sums of all the plain files in the package. Once he's shown you how to build a Debian package manually like that, he then shows you how to do it the more normal way using the various wrapper scripts that Debian provides for the purpose to make life a bit easier (and to help enforce the Debian policy on packages) Debian doesn't really have a source package idea like Red Hat - instead, when you use "apt-get source" to download the source for a package you get three files; the upstream tarball, which is completely unmodified from upstream. You also get a gzipped patch, and a description file containing md5sums for the patch and the tarball, amongst other things. Typically, the patch creates a debian directory within the upstream source directory, and inside that debian directory is a file called "rules". This is just a normal makefile, containing all the instructions for configuring, compiling and packaging the software on a Debian system. Once you have one of these things, building the .debs is just a matter of typing: dpkg-buildpackage -rfakeroot or something similar. There are still fancier things available for doing this by keeping the sources and debian/* files in a CVS, subversion or other revision control repository. 
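For anyone who wants to poke at the ar-plus-tarballs layout described above without a Debian box to hand, here is a minimal Python sketch. It is an illustration under stated assumptions, not how dpkg itself does it: the helper names (make_tarball, ar_pack, ar_members), the "hello" package and the file names inside the tarballs are all made up for the example. Real .deb files also carry a small debian-binary member holding the format version ahead of the two tarballs, so the sketch includes one.

```python
import io
import tarfile

AR_MAGIC = b"!<arch>\n"  # global header of the common ar format


def make_tarball(files):
    """Return gzip-compressed tar bytes for a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def ar_pack(members):
    """Pack (name, bytes) pairs into an ar archive (hypothetical helper)."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, data in members:
        # 60-byte member header: name, mtime, uid, gid, mode, size, end magic
        header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
            name, 0, 0, 0, "100644", len(data))
        out.write(header.encode("ascii"))
        out.write(data)
        if len(data) % 2:  # member data is padded to an even length
            out.write(b"\n")
    return out.getvalue()


def ar_members(blob):
    """Return {name: bytes} for every member of an ar archive."""
    if blob[:8] != AR_MAGIC:
        raise ValueError("not an ar archive")
    pos = 8
    members = {}
    while pos + 60 <= len(blob):
        header = blob[pos:pos + 60]
        name = header[:16].decode("ascii").strip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        pos += 60
        members[name] = blob[pos:pos + size]
        pos += size + (size % 2)  # skip the padding byte, if any
    return members


# Build a toy package in memory; contents are illustrative only.
control = b"Package: hello\nVersion: 1.0\n"
deb = ar_pack([
    ("debian-binary", b"2.0\n"),
    ("control.tar.gz", make_tarball({"control": control, "md5sums": b""})),
    ("data.tar.gz", make_tarball({"usr/bin/hello": b"#!/bin/sh\necho hello\n"})),
])

parts = ar_members(deb)
print(list(parts))  # ['debian-binary', 'control.tar.gz', 'data.tar.gz']

with tarfile.open(fileobj=io.BytesIO(parts["control.tar.gz"]), mode="r:gz") as tar:
    names = tar.getnames()
print(names)  # ['control', 'md5sums']
```

Feeding ar_members the bytes of a real package file should show the same outer member names, although newer packages may use a different compression suffix on the tarballs.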
I use these in my own package management activities to be able to go back and build previous releases when users report bugs against them. > and manage to get myself confused by apt tools I can sympathise. I've only started using aptitude since etch came out, and it's taken me some time to get used to, but now that I am, I quite like it, for the most part. Especially the etch version, the version of it in sarge had some really annoying behaviour under certain circumstances. > (I'm too used to yum). But there is no doubt: > > a) Debian is a perfectly useful, fully functional variety of linux, > and I have been painfully taught to bow down before its selection of > available packages, which is for all practical purposes inexhaustible. > In fact, you need a search engine with powerful features even to go > shopping amongst them. ... which fortunately it provides for you. It's called apt-cache. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From andrew at moonet.co.uk Wed Oct 17 02:30:51 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: Apt-cache with a bit of grep is a powerful tool indeed. $apt-cache search foo | grep bar everyone I work with however prefers yum. They regard Debian as being a bit backward. On 17/10/2007, Tim Cutts wrote: > > On 16 Oct 2007, at 10:19 pm, Robert G. 
Brown wrote: > > > On Tue, 16 Oct 2007, Jon Tegner wrote: > > > >> You should switch to a .deb-system, to save you some trouble: > >> > >> $ apt-cache search jove > >> jove - Jonathan's Own Version of Emacs - a compact, powerful editor > >> > >> Sorry, couldn't resist ;-) > > > > Hey, it's ok. I'm actually trisystemal. FC 6 on top (soon to jump to > > 8, but in no hurry), VMware, then debian and XP Pro VM. And yes, > > it was > > a good thing debian already had jove as I still don't really know > > how to > > build debian packages, > > If you want a good introduction to debian packages and how they work, > then I recommend Martin Krafft's book "The Debian System". I've been > a Debian Developer for ten years, and that book still teaches me > useful stuff about Debian on a regular basis. > > The chapter on packaging is superb; it teaches you how to make > packages from the ground up, so you really understand how they work, > starting with the basic fact that fundamentally a debian binary > package is an ar archive which contains two tarballs. One, > data.tar.gz contains the files belonging to the package. The other, > control.tar.gz, contains the scripts and information about the > package used by the packaging tools, and at a minimum this contains > two files: DEBIAN/control, which contains the information about the > package (description, dependencies and whatnot) and DEBIAN/md5sums > which is, as you'd expect, a list of md5sums of all the plain files > in the package. 
> > Once he's shown you how to build a Debian package manually like that, > he then shows you how to do it the more normal way using the various > wrapper scripts that Debian provides for the purpose to make life a > bit easier (and to help enforce the Debian policy on packages) > > Debian doesn't really have a source package idea like Red Hat - > instead, when you use "apt-get source" to download the source for a > package you get three files; the upstream tarball, which is > completely unmodified from upstream. You also get a gzipped patch, > and a description file containing md5sums for the patch and the > tarball, amongst other things. Typically, the patch creates a debian > directory within the upstream source directory, and inside that > debian directory is a file called "rules". This is just a normal > makefile, containing all the instructions for configuring, compiling > and packaging the software on a Debian system. Once you have one of > these things, building the .debs is just a matter of typing: > > dpkg-buildpackage -rfakeroot > > or something similar. There are still fancier things available for > doing this by keeping the sources and debian/* files in a CVS, > subversion or other revision control repository. I use these in my > own package management activities to be able to go back and build > previous releases when users report bugs against them. > > > and manage to get myself confused by apt tools > > I can sympathise. I've only started using aptitude since etch came > out, and it's taken me some time to get used to, but now that I am, I > quite like it, for the most part. Especially the etch version, the > version of it in sarge had some really annoying behaviour under > certain circumstances. > > > (I'm too used to yum). 
But there is no doubt: > > > > a) Debian is a perfectly useful, fully functional variety of linux, > > and I have been painfully taught to bow down before its selection of > > available packages, which is for all practical purposes inexhaustible. > > In fact, you need a search engine with powerful features even to go > > shopping amongst them. > > ... which fortunately it provides for you. It's called apt-cache. > > Regards, > > Tim > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From tjrc at sanger.ac.uk Wed Oct 17 03:08:13 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> On 17 Oct 2007, at 10:30 am, andrew holway wrote: > Apt-cache with a bit of grep is a powerful tool indeed. > > $apt-cache search foo | grep bar > > everyone I work with however prefers yum. They regard Debian as being > a bit backward. I'm not familiar with yum, so I can't really comment. However, apt- cache is of course the "previous" generation of package management tools, and these days aptitude is recommended instead. I guess you could say that apt-{get,cache} are to aptitude as apt-rpm was to yum. 
I guess these things are always likely to leap-frog each other, as each distro sees what's good and bad in the current tools available with other distros, and copies/improves those features as appropriate. I've been reading the aptitude documentation this morning, as it happens, and there's all sorts of useful stuff it can do above what apt-cache could manage, and it's solved a number of long-standing problems I've had... in particular its search functionality is *much* improved over apt-cache. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From andrew at moonet.co.uk Wed Oct 17 03:50:31 2007 From: andrew at moonet.co.uk (andrew holway) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> References: <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> Message-ID: Not a fan of aptitude, Like command line me :) On 17/10/2007, Tim Cutts wrote: > > On 17 Oct 2007, at 10:30 am, andrew holway wrote: > > > Apt-cache with a bit of grep is a powerful tool indeed. > > > > $apt-cache search foo | grep bar > > > > everyone I work with however prefers yum. They regard Debian as being > > a bit backward. > > I'm not familiar with yum, so I can't really comment. However, apt- > cache is of course the "previous" generation of package management > tools, and these days aptitude is recommended instead. I guess you > could say that apt-{get,cache} are to aptitude as apt-rpm was to > yum. 
I guess these things are always likely to leap-frog each other, > as each distro sees what's good and bad in the current tools > available with other distros, and copies/improves those features as > appropriate. > > I've been reading the aptitude documentation this morning, as it > happens, and there's all sorts of useful stuff it can do above what > apt-cache could manage, and it's solved a number of long-standing > problems I've had... in particular its search functionality is > *much* improved over apt-cache. > > Tim > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > From tjrc at sanger.ac.uk Wed Oct 17 05:15:40 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4715F9B9.5050402@aplpi.com> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> <4715F9B9.5050402@aplpi.com> Message-ID: <0FFB2FC4-A168-4FC3-ACC7-5221153193FF@sanger.ac.uk> On 17 Oct 2007, at 1:02 pm, stephen mulcahy wrote: > > > Tim Cutts wrote: >> I've been reading the aptitude documentation this morning, as it >> happens, and there's all sorts of useful stuff it can do above >> what apt-cache could manage, and it's solved a number of long- >> standing problems I've had... in particular its search >> functionality is *much* improved over apt-cache. > > Interesting, I should rtfm. I've been using aptitude since > migrating to etch but still find myself using apt-cache search over > aptitude search because it seemed to give me more hits. As do I. 
But there's gold in them thar hills: http://people.debian.org/~dburrows/aptitude-doc/en/ch02s03.html Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From gdjacobs at gmail.com Wed Oct 17 07:24:18 2007 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> Message-ID: <47161B12.1060206@gmail.com> andrew holway wrote: > Not a fan of aptitude, Like command line me :) > > On 17/10/2007, Tim Cutts wrote: >> On 17 Oct 2007, at 10:30 am, andrew holway wrote: >> >>> Apt-cache with a bit of grep is a powerful tool indeed. >>> >>> $apt-cache search foo | grep bar >>> >>> everyone I work with however prefers yum. They regard Debian as being >>> a bit backward. >> I'm not familiar with yum, so I can't really comment. However, apt- >> cache is of course the "previous" generation of package management >> tools, and these days aptitude is recommended instead. I guess you >> could say that apt-{get,cache} are to aptitude as apt-rpm was to >> yum. I guess these things are always likely to leap-frog each other, >> as each distro sees what's good and bad in the current tools >> available with other distros, and copies/improves those features as >> appropriate. >> >> I've been reading the aptitude documentation this morning, as it >> happens, and there's all sorts of useful stuff it can do above what >> apt-cache could manage, and it's solved a number of long-standing >> problems I've had... in particular its search functionality is >> *much* improved over apt-cache. >> >> Tim $ aptitude help It's all there. -- Geoffrey D. 
Jacobs To have no errors would be life without meaning No struggle, no joy From herbert.fruchtl at st-andrews.ac.uk Wed Oct 17 07:45:07 2007 From: herbert.fruchtl at st-andrews.ac.uk (Herbert Fruchtl) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Barcelona availability Message-ID: <47161FF3.3020408@st-andrews.ac.uk> We are planning to extend an existing cluster (dual-core Opterons) with some Barcelona nodes. What the salesman tells me is that you can get 1.9GHz chips now, up to 2.3GHz in December and 2.4/2.6GHz in January. Does this sound realistic? Have faster Barcelonas been seen in the channel? Are there official (or unofficial) announcements from AMD? Thanks in advance, Herbert -- Herbert Fruchtl EaStCHEM Fellow School of Chemistry University of St Andrews From rgb at phy.duke.edu Wed Oct 17 08:20:26 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4715589E.608@tamu.edu> References: <949399.18114.qm@web37911.mail.mud.yahoo.com> <4715589E.608@tamu.edu> Message-ID: On Tue, 16 Oct 2007, Gerry Creager wrote: > Quote... > "Three things in life a man must do, > Before his days are done. > Write two lines of APL... > And make the sucker run." > > OK, so it's not PL-I but APL was another interesting beast. A friend had > written an entire StarTrek game, including a 3d universe, in APL and we > wasted cycles waiting for long jobs on the Amdahl 470v6 to complete that > way... Great little poem. I didn't write tic-tac-toe -- zero sum games that always can be drawn being a bit on the boring side -- but I did write MasterMind (the code-breaking game) on an IBM 5100 in APL. 
I lived to regret it, too, as by strange chance John Titor claimed to have come back from the future to collect an IBM 5100 (see, sigh, the wikipedia article on Titor if you're VERY bored today) and a group of fairly active Titor-busters homed in on me as a potential Titor, because I mentioned this fact on this very forum a decade or so ago and google revealed this awful truth. It took more than two lines of APL, though... rgb > > Ellis Wilson wrote: >> Wow, PL-I, I'm learning about that in my language design class. While it >> brought a bunch of new items to the computing field, can't say I'm upset I >> didn't code in it :). >> >> Sorry guys, I came into existence just about the time the internet was >> opened up from just NSF to commercial interest, so punch cards are a little >> out of my league. I must say though, this certainly beats the heck out of >> a history of computing languages class any day! >> >> Ellis >> >> */Gerry Creager /* wrote: >> >> Didn't you have a tic-tac-toe game on punch cards written in PL-1? >> >> John Leidel wrote: >> > Friends don't let friends play tic-tac-toe using punchcards :-) >> > >> > On Tue, 2007-10-16 at 11:20 -0700, David Mathog wrote: >> >> Jim Lux wrote >> >> >> >>> and, as we all know, real developers >> >>> use paper: tape, tab cards, or, if they must, teletype rolls. >> >> You forgot paper tape. (Most people who used it probably wish they >> >> could forget it too!) >> >> >> >> Anyway, all of the tools you mentioned are for wimps - real >> >> programmers load code directly into memory using the toggle >> >> switches on the front of the computer. 
>> >> >> >> Regards, >> >> >> >> David Mathog >> >> mathog@caltech.edu >> >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> >> _______________________________________________ >> >> Beowulf mailing list, Beowulf@beowulf.org >> >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- Gerry Creager -- gerry.creager@tamu.edu >> Texas Mesonet -- AATLT, Texas A&M University >> Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 >> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 > > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From peter.st.john at gmail.com Wed Oct 17 09:47:39 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4715589E.608@tamu.edu> References: <949399.18114.qm@web37911.mail.mud.yahoo.com> <4715589E.608@tamu.edu> Message-ID: I implemented "Chomp" (a nim-like game from Martin Gardner's description) in basic with paper-punch around '74.
At that time I had no concepts for "OS" or "development environment" but I learned "acoustic coupler" and "teletype" and was confused by "duplex". Later I briefly used a keypunch to learn FORTRAN. But my dad tells a droll story. He was sitting at a keypunch machine distraught about the random number generation algorithms they used for monte carlo simulation for neutrons that thought they were in a pinball machine (this would have been at Savannah River). Hanging his head in frustration, he noticed the wastebasket that every keypunch machine has in the same place, to collect the huge pile of little punched out bits of paper with ...numerals. He thought, "hmmm". Nothing came of it. Years ..a generation ...later, I was frustrated by the random number library on the Slackware on my 486, and rolled up my own (replaced days later by porting to SunOS which had a better library). But I was reminded of the wastebasket and thought, "hmmm". Nothing came of that either :-) If someone had thought of a way to queue up and read tiny bits of paper science would have advanced a decade :-) Peter ... > > Sorry guys, I came into existence just about the time the internet was > > opened up from just NSF to commercial interest, so punch cards are a > > little out of my league. I must say though, this certainly beats the > > heck out of a history of computing languages class any day! > > > > Ellis > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20071017/bca4eabd/attachment.html From rgb at phy.duke.edu Wed Oct 17 09:55:29 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Mon Mar 15 01:06:34 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: On Wed, 17 Oct 2007, Tim Cutts wrote: > > If you want a good introduction to debian packages and how they work, then I > recommend Martin Krafft's book "The Debian System". I've been a Debian > Developer for ten years, and that book still teaches me useful stuff about > Debian on a regular basis. Very useful information, actually, and I will get the book in my next Amazon load. For better or worse, I'm maintaining a couple of packages that are living under FC and going into Debian, and the only sane way to do that is to learn them both (since both dieharder and wulfware have libraries of their own and nontrivial builds). rgb > > The chapter on packaging is superb; it teaches you how to make packages from > the ground up, so you really understand how they work, starting with the > basic fact that fundamentally a debian binary package is an ar archive which > contains two tarballs. One, data.tar.gz contains the files belonging to the > package. The other, control.tar.gz, contains the scripts and information > about the package used by the packaging tools, and at a minimum this contains > two files: DEBIAN/control, which contains the information about the package > (description, dependencies and whatnot) and DEBIAN/md5sums which is, as you'd > expect, a list of md5sums of all the plain files in the package. 
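To make the "a .deb is just an ar archive" point concrete, here is a minimal sketch that lists the members of an ar archive by walking its headers. It assumes the common ar layout (an 8-byte magic string, then a 60-byte header per member with the size in bytes 48-57 and data padded to an even length). The function name list_ar_members is made up for illustration and is not part of any Debian tool. On a real package you would also expect a small debian-binary member alongside the two tarballs.

```python
AR_MAGIC = b"!<arch>\n"   # global header that starts every ar archive
HEADER_LEN = 60           # fixed size of each per-member header


def list_ar_members(blob):
    """Return [(name, size)] for each member of an ar archive held in bytes.

    Hypothetical helper for illustration; it only reads the headers and
    never unpacks the member data.
    """
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(blob):
        header = blob[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header at offset %d" % pos)
        # Names are space padded; some ar variants also append a "/".
        name = header[0:16].decode("ascii").rstrip(" /")
        size = int(header[48:58].decode("ascii"))
        members.append((name, size))
        # Member data is padded with one byte when its size is odd.
        pos += HEADER_LEN + size + (size % 2)
    return members
```

Typical use would be something like list_ar_members(open("some-package.deb", "rb").read()), where the file name is a placeholder. The tarballs themselves can then be inspected with the standard tarfile module.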
> > Once he's shown you how to build a Debian package manually like that, he then > shows you how to do it the more normal way using the various wrapper scripts > that Debian provides for the purpose to make life a bit easier (and to help > enforce the Debian policy on packages) > > Debian doesn't really have a source package idea like Red Hat - instead, when > you use "apt-get source" to download the source for a package you get three > files; the upstream tarball, which is completely unmodified from upstream. > You also get a gzipped patch, and a description file containing md5sums for > the patch and the tarball, amongst other things. Typically, the patch > creates a debian directory within the upstream source directory, and inside > that debian directory is a file called "rules". This is just a normal > makefile, containing all the instructions for configuring, compiling and > packaging the software on a Debian system. Once you have one of these > things, building the .debs is just a matter of typing: > > dpkg-buildpackage -rfakeroot > > or something similar. There are still fancier things available for doing > this by keeping the sources and debian/* files in a CVS, subversion or other > revision control repository. I use these in my own package management > activities to be able to go back and build previous releases when users > report bugs against them. > >> and manage to get myself confused by apt tools > > I can sympathise. I've only started using aptitude since etch came out, and > it's taken me some time to get used to, but now that I am, I quite like it, > for the most part. Especially the etch version, the version of it in sarge > had some really annoying behaviour under certain circumstances. > >> (I'm too used to yum). 
But there is no doubt: >> >> a) Debian is a perfectly useful, fully functional variety of linux, >> and I have been painfully taught to bow down before its selection of >> available packages, which is for all practical purposes inexhaustible. >> In fact, you need a search engine with powerful features even to go >> shopping amongst them. > > ... which fortunately it provides for you. It's called apt-cache. > > Regards, > > Tim > > > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Wed Oct 17 10:08:25 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: On Wed, 17 Oct 2007, andrew holway wrote: > Apt-cache with a bit of grep is a powerful tool indeed. > > $apt-cache search foo | grep bar > > everyone I work with however prefers yum. They regard Debian as being > a bit backward. I don't know if Debian is backward, but I will affirm that yum is pretty fashion-forward. I'm probably biased, of course, since I wrote the original HOWTO and since Seth developed it about seventy feet from where I'm sitting right now. I've never seriously contributed to its development but I've been a user for years, and it continues to improve even now. There are lots of ways to search out info using yum, some faster or more complex than others, but the most common ones, also the most useful and fastest, are as simple as "yum list \*whatever\*, e.g. 
-- rgb@ganesh|B:1011>yum list \*pvm\* Available Packages pvm.x86_64 3.4.5-7.fc6.1 fedora pvm-gui.x86_64 3.4.5-7.fc6.1 fedora It also supports yum search foo (likely to get more than you really want), yum info, yum deplist, etc. most of which I rarely use. It's nice to do "yum info \* > /tmp/yum.list" once right after an install or upgrade, though, as this pretty much gives you a catalog of all the available packages and their synopses, so that you can find almost anything efficiently and interactively with less /tmp/yum.list and its built in find tools, or via grep , or by just plain browsing through it. Of course with 20K or so packages, browsing through Debian is a bit more tedious I understand...;-) rgb > > On 17/10/2007, Tim Cutts wrote: >> >> On 16 Oct 2007, at 10:19 pm, Robert G. Brown wrote: >> >>> On Tue, 16 Oct 2007, Jon Tegner wrote: >>> >>>> You should switch to a .deb-system, to save you some trouble: >>>> >>>> $ apt-cache search jove >>>> jove - Jonathan's Own Version of Emacs - a compact, powerful editor >>>> >>>> Sorry, couldn't resist ;-) >>> >>> Hey, it's ok. I'm actually trisystemal. FC 6 on top (soon to jump to >>> 8, but in no hurry), VMware, then debian and XP Pro VM. And yes, >>> it was >>> a good thing debian already had jove as I still don't really know >>> how to >>> build debian packages, >> >> If you want a good introduction to debian packages and how they work, >> then I recommend Martin Krafft's book "The Debian System". I've been >> a Debian Developer for ten years, and that book still teaches me >> useful stuff about Debian on a regular basis. >> >> The chapter on packaging is superb; it teaches you how to make >> packages from the ground up, so you really understand how they work, >> starting with the basic fact that fundamentally a debian binary >> package is an ar archive which contains two tarballs. One, >> data.tar.gz contains the files belonging to the package. 
The other, >> control.tar.gz, contains the scripts and information about the >> package used by the packaging tools, and at a minimum this contains >> two files: DEBIAN/control, which contains the information about the >> package (description, dependencies and whatnot) and DEBIAN/md5sums >> which is, as you'd expect, a list of md5sums of all the plain files >> in the package. >> >> Once he's shown you how to build a Debian package manually like that, >> he then shows you how to do it the more normal way using the various >> wrapper scripts that Debian provides for the purpose to make life a >> bit easier (and to help enforce the Debian policy on packages) >> >> Debian doesn't really have a source package idea like Red Hat - >> instead, when you use "apt-get source" to download the source for a >> package you get three files; the upstream tarball, which is >> completely unmodified from upstream. You also get a gzipped patch, >> and a description file containing md5sums for the patch and the >> tarball, amongst other things. Typically, the patch creates a debian >> directory within the upstream source directory, and inside that >> debian directory is a file called "rules". This is just a normal >> makefile, containing all the instructions for configuring, compiling >> and packaging the software on a Debian system. Once you have one of >> these things, building the .debs is just a matter of typing: >> >> dpkg-buildpackage -rfakeroot >> >> or something similar. There are still fancier things available for >> doing this by keeping the sources and debian/* files in a CVS, >> subversion or other revision control repository. I use these in my >> own package management activities to be able to go back and build >> previous releases when users report bugs against them. >> >>> and manage to get myself confused by apt tools >> >> I can sympathise. 
I've only started using aptitude since etch came >> out, and it's taken me some time to get used to, but now that I am, I >> quite like it, for the most part. Especially the etch version, the >> version of it in sarge had some really annoying behaviour under >> certain circumstances. >> >>> (I'm too used to yum). But there is no doubt: >>> >>> a) Debian is a perfectly useful, fully functional variety of linux, >>> and I have been painfully taught to bow down before its selection of >>> available packages, which is for all practical purposes inexhaustible. >>> In fact, you need a search engine with powerful features even to go >>> shopping amongst them. >> >> ... which fortunately it provides for you. It's called apt-cache. >> >> Regards, >> >> Tim >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by Genome Research >> Limited, a charity registered in England with number 1021457 and a >> company registered in England with number 2742969, whose registered >> office is 215 Euston Road, London, NW1 2BE. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Wed Oct 17 10:45:01 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <949399.18114.qm@web37911.mail.mud.yahoo.com> <4715589E.608@tamu.edu> Message-ID: On Wed, 17 Oct 2007, Peter St. John wrote: > If someone had thought of a way to queue up and read tiny bits of paper > science would have advanced a decade :-) Ahh, but but but... 
Let us grant that a bucket full of such dots can be shaken to where the order that they are drawn is unpredictable (note well that I don't say "random" as I'm not convinced that the word means anything beyond an abstraction in this Universe). Not unlike the little bingo or lottery ball machines, they get all mixed up and after enough shuffling or shaking one can attain a high degree of mixing that makes them unpredictable and may make them "random" within testable resolution on the source. However, is there any guarantee that 0-9 are uniformly distributed in the original shuffled sample? There is not. If you punched out letters drawn from (say) a dictionary, would they be uniformly distributed? In no way. Would the results of using shuffled strings of either one be likely to produce acceptable digits in a uniform random distribution? No. And in any event, the "randomness" comes from the shuffling, not the source per se. So even if one deliberately punched all the numbers out of many cards and ensured uniform populations of each digit in the shuffled population (and drew from that population with replacement and additional shuffling, so that one doesn't immediately introduce bias after the first digit is drawn) it is the shuffling that matters. If it is good, then you don't need "a population" -- you just need one each of the ten digits and a good shuffler, as you draw, replace, shuffle, draw. So one is then back to -- how to make a good shuffler? Physically it isn't too easy, actually -- there having many balls gives one the ability to average over the subtle differences between balls that might produce slight deviations from uniformity in the shuffle/draw. Numerically you're right back where you started, because a good shuffle requires a good random number generator (or at least a good source of unpredictability/entropy). This is a non-trivial problem, actually.
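As a rough illustration of the draw, replace, shuffle, draw idea, here is a small sketch. It assumes Python's standard random module as a stand-in for the "good shuffler", which of course begs exactly the question raised above, since that shuffler is itself a numerical generator. The function names draw_digits and chi_square_uniform are invented for this example. The second function computes a plain Pearson chi-square statistic against a uniform expectation, which is only one of many tests one would want to run.

```python
import random
from collections import Counter

DIGITS = "0123456789"


def draw_digits(n, rng):
    """Shuffle one copy of each digit, draw the top one, replace, repeat."""
    pool = list(DIGITS)
    drawn = []
    for _ in range(n):
        rng.shuffle(pool)       # all of the unpredictability comes from here
        drawn.append(pool[0])   # draw; the pool itself stays complete
    return drawn


def chi_square_uniform(sample, categories=DIGITS):
    """Pearson chi-square statistic of sample against a uniform expectation."""
    counts = Counter(sample)
    expected = len(sample) / len(categories)
    return sum((counts.get(c, 0) - expected) ** 2 / expected
               for c in categories)


rng = random.Random(12345)          # seeded stand-in for the shuffler
sample = draw_digits(10000, rng)
statistic = chi_square_uniform(sample)
```

With ten categories there are nine degrees of freedom, so a statistic far above roughly 16.9 (the 5% critical value) would be a reason to distrust the shuffler. A badly skewed population, such as letters punched from a dictionary, fails this kind of check immediately.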
There are numerous physical sources of "randomness" or "entropy" out there in the world, but many of them produce not random bits with an equal probability of 0 and 1 but "random" bits with some unequal probability of 0 and 1. Some of them have autocorrelation times associated with the drawing process. Some of them have long term occult periodicities in the signals. Even with physical RNGs, about all one can really say is ex post facto either the strings of random bits they produce pass various statistical tests for randomness, or they don't. Throw in Shannon's theorem and some of its consequences -- entropy theorems applied to code -- and "random" number generation (oxymoron that it is) is one of the most interesting subjects on the planet, as is testing and their various applications. rgb -- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From kyron at mekkreations.com Tue Oct 16 07:05:35 2007 From: kyron at mekkreations.com (Eric Thibodeau) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714C114.9060409@scalableinformatics.com> References: <4714C114.9060409@scalableinformatics.com> Message-ID: <4714C52F.5020301@mekkreations.com> Of course, this is valid given so many people use the kernel build time as a benchmark ;) Heck! It's even used as a filesystem performance benchmark! Eric Joe Landman wrote: > andrew holway wrote: >> And the winner of the 2007 Parallel Development Tools Award is....... > > make -j16 ... 
> > (ducks and runs away) From SeanWard at msn.com Tue Oct 16 07:08:14 2007 From: SeanWard at msn.com (Sean Ward) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Reliable Job Queueing and Notification Message-ID: I've started work on a web service which contains several potentially long running processing steps (molecular dynamics), which are perfect to farm out to the fairly large (90 node) Beowulf I have access to. The primary issue is translating requests from the event driven web service, to job queues, and back again upon completion. Specifically, the major queuing systems I have immediate access to (Sun Grid Engine and Condor) only support e-mail based notification of job completion. Starting jobs isn't an issue, as my service can simply ssh over and execute shell scripts as needed to start things up, the problem is reliably being informed when the jobs fail or complete, via any programmatic method (such as executing a shell script, calling a web service via SOAP/etc, or an asynchronous message library). My other problem, ensuring that these web service requests don't starve in house jobs on the Beowulf is easily handled via the priority levels built into all the various job managers, although being able to checkpoint a long running job would be a plus (such as is supported by Condor). I am currently investigating modifications to either Condor (more complex to update, but checkpoint is useful) or Ruby Queue (very easy to update for reliable notification) to solve this issue, but wanted to be sure I wasn't overlooking any existing solutions to programmatic based queuing and receiving notifications on jobs in a Beowulf environment... 
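One queue-agnostic way to get programmatic notification, sketched here under some loud assumptions: the job submitted to the queue is a thin wrapper around the real command, and the compute nodes and the web service host share a filesystem. The wrapper records the exit status in a small JSON file that the service can poll. The names run_and_record and wait_for_status, and the status file layout, are hypothetical. Nothing here uses a Sun Grid Engine or Condor API.

```python
import json
import os
import subprocess
import tempfile
import time


def run_and_record(cmd, status_path):
    """Run cmd (a list of arguments) and record its outcome in status_path."""
    started = time.time()
    try:
        returncode = subprocess.call(cmd)
    except OSError:
        returncode = 127   # command could not be started (shell convention)
    record = {
        "cmd": cmd,
        "returncode": returncode,
        "started": started,
        "finished": time.time(),
    }
    # Write to a temporary file and rename it, so that a reader never
    # sees a half-written status record.
    status_dir = os.path.dirname(os.path.abspath(status_path))
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    os.replace(tmp_path, status_path)
    return returncode


def wait_for_status(status_path, poll_seconds=5.0, timeout=None):
    """Poll until the status file appears; return its record, or None on timeout."""
    deadline = None if timeout is None else time.time() + timeout
    while True:
        try:
            with open(status_path) as handle:
                return json.load(handle)
        except FileNotFoundError:
            pass
        if deadline is not None and time.time() >= deadline:
            return None
        time.sleep(poll_seconds)
```

This does not cover the case where the node dies before the wrapper can write anything, so the service side would still want a timeout and a fallback query to the queue manager. The final rename could just as well be replaced by a call back to the web service.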
-Sean From tegner at nada.kth.se Tue Oct 16 10:43:09 2007 From: tegner at nada.kth.se (Jon Tegner) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> Message-ID: <4714F82D.10707@nada.kth.se> You should switch to a .deb-system, to save you some trouble: $ apt-cache search jove jove - Jonathan's Own Version of Emacs - a compact, powerful editor Sorry, couldn't resist ;-) /jon Robert G. Brown wrote: > > > I do realize (*ahem*) that I'm one of three living humans that still use > jove, and I've had to adopt it and maintain its rpm all by myself just > so I can still install it as generations of Linux and its libraries pass > me by, but it is still a damn good tool! > > tty4ever! > > rgb > >> >> >> > From rodmur at maybe.org Tue Oct 16 11:20:42 2007 From: rodmur at maybe.org (Dale Harris) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <4714CD0C.8060804@scalableinformatics.com> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> Message-ID: <20071016182042.GM17802@maybe.org> On 2007-10-16 at 10:39, Joe Landman elucidated: > > I have heard (or am spreading) the rumor that the 1-18-08 movie is not > really a monster movie, but the final epic battle between vi and emacs ... > An old coworker of mine would probably like me to put in good and sarcastic word for ed, the only real line editor. :) -- Dale Harris rodmur@maybe.org rodmur@gmail.com /.-) From xclski at yahoo.com Tue Oct 16 16:36:37 2007 From: xclski at yahoo.com (Ellis Wilson) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <47153C6A.8010803@tamu.edu> Message-ID: <949399.18114.qm@web37911.mail.mud.yahoo.com> Wow, PL-I, I'm learning about that in my language design class. 
While it brought a bunch of new items to the computing field, can't say I'm upset I didn't code in it :). Sorry guys, I came into existence just about the time the internet was opened up from just NSF to commercial interest, so punch cards are a little out of my league. I must say though, this certainly beats the heck out of a history of computing languages class any day! Ellis Gerry Creager wrote: Didn't you have a tic-tac-toe game on punch cards written in PL-1? John Leidel wrote: > Friends don't let friends play tic-tac-toe using punchcards :-) > > On Tue, 2007-10-16 at 11:20 -0700, David Mathog wrote: >> Jim Lux wrote >> >>> and, as we all know, real developers >>> use paper: tape, tab cards, or, if they must, teletype rolls. >> You forgot paper tape. (Most people who used it probably wish they >> could forget it too!) >> >> Anyway, all of the tools you mentioned are for wimps - real >> programmers load code directly into memory using the toggle >> switches on the front of the computer. 
>> >> Regards, >> >> David Mathog >> mathog@caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20071016/86dc2c78/attachment.html From smulcahy at aplpi.com Wed Oct 17 05:02:01 2007 From: smulcahy at aplpi.com (stephen mulcahy) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> <193A590B-0FF9-4041-BCE3-101F48AB2525@sanger.ac.uk> Message-ID: <4715F9B9.5050402@aplpi.com> Tim Cutts wrote: > I've been reading the aptitude documentation this morning, as it > happens, and there's all sorts of useful stuff it can do above what > apt-cache could manage, and it's solved a number of long-standing > problems I've had... in particular its search functionality is *much* > improved over apt-cache. Interesting, I should rtfm. I've been using aptitude since migrating to etch but still find myself using apt-cache search over aptitude search because it seemed to give me more hits. -stephen -- Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center, GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway) From jan.heichler at gmx.net Wed Oct 17 08:54:56 2007 From: jan.heichler at gmx.net (Jan Heichler) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Barcelona availability In-Reply-To: <47161FF3.3020408@st-andrews.ac.uk> References: <47161FF3.3020408@st-andrews.ac.uk> Message-ID: <8010202503.20071017175456@gmx.net> Hello Herbert, on Wednesday, 17 October 2007, you wrote: HF> We are planning to extend an existing cluster (dual-core Opterons) with HF> some Barcelona nodes. What the salesman tells me is that you can get HF> 1.9GHz chips now, up to 2.3GHz in December and 2.4/2.6GHz in January. HF> Does this sound realistic? Have faster Barcelonas been seen in the HF> channel?
Are there official (or unofficial) announcements from AMD? As far as I know, you should not expect 1.9 GHz parts in volume before the end of November... worst case: Christmas. Higher clock speeds are not to be expected this year. Regards, Jan From tegner at nada.kth.se Wed Oct 17 10:50:17 2007 From: tegner at nada.kth.se (Jon Tegner) Date: Mon Mar 15 01:06:35 2010 Subject: [Beowulf] Parallel Development Tools In-Reply-To: References: <4714C114.9060409@scalableinformatics.com> <1192543360.4558.35.camel@e521.site> <4714CD0C.8060804@scalableinformatics.com> <4714F82D.10707@nada.kth.se> Message-ID: <47164B59.4020102@nada.kth.se> Drifting off a bit further, but as I see it, the biggest advantage of FC over Debian/Ubuntu is kickstart. Or??? /jon Robert G. Brown wrote: > On Wed, 17 Oct 2007, andrew holway wrote: > >> Apt-cache with a bit of grep is a powerful tool indeed. >> >> $apt-cache search foo | grep bar >> >> everyone I work with however prefers yum. They regard Debian as being >> a bit backward. > > I don't know if Debian is backward, but I will affirm that yum is pretty > fashion-forward. I'm probably biased, of course, since I wrote the > original HOWTO and since Seth developed it about seventy feet from where > I'm sitting right now. I've never seriously contributed to its > development but I've been a user for years, and it continues to improve > even now. > > There are lots of ways to search out info using yum, some faster or more > complex than others, but the most common ones, also the most useful and > fastest, are as simple as "yum list \*whatever\*, e.g. -- > > rgb@ganesh|B:1011>yum list \*pvm\* > Available Packages > pvm.x86_64 3.4.5-7.fc6.1 fedora > pvm-gui.x86_64 3.4.5-7.fc6.1 fedora > > It also supports yum search foo (likely to get more than you really > want), yum info, yum deplist, etc. most of which I rarely use.
It's > nice to do "yum info \* > /tmp/yum.list" once right after an install or > upgrade, though, as this pretty much gives you a catalog of all the > available packages and their synopses, so that you can find almost > anything efficiently and interactively with less /tmp/yum.list and its > built in find tools, or via grep , or by just plain browsing through