From rgb at phy.duke.edu Fri Jun 1 09:06:22 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] HDTV video file sizes In-Reply-To: <1932e3120705291017t4f11eed9gcc36cd120697e216@mail.gmail.com> References: <1088659434.1180453767761.JavaMail.root@fepweb03> <1932e3120705291017t4f11eed9gcc36cd120697e216@mail.gmail.com> Message-ID: On Tue, 29 May 2007, Jim Windle wrote: > So if Netflix isn't lying when they say they have shipped over a billion > movies that means they have moved roughly 5 exabytes of data via the US > mail. I wonder how that compares the amount moved over the internet during > the same time period? > > compressed data rates appear to be 20-50 Mbps (lower than 20 Oh, there's little doubt about this sort of thing. With a DSL bottleneck, it's MUCH faster for me to drive to Duke and do an install from its mirrors via a 1 Gbps local network than it is to wait at home for the data to squeeze through my little pipe. And every time I drive to and from Duke carrying my laptop, I move 10 GB/minute between locations which (at a GB/six seconds) is slightly HIGHER bandwidth than the campus Gbps backbone. If you want to move terabytes at high bandwidth, box up some portable multi-terabyte RAIDS and fly them there. However, I can transfer data home while doing other things. I cannot drive and do other things. Network transfers are often parallelizable in a classic sense and can complete while other things are happening (as they are now on my laptop as I type this). rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From mathog at caltech.edu Sat Jun 2 17:39:46 2007
From: mathog at caltech.edu (David Mathog)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
Message-ID:

I can't quite wrap my head around a recent nettee result; perhaps one of the network gurus here can explain it. The tests were these:

A. Sustained write to disk:

   sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat; \
   sync; accudate

   (accudate is a little utility of mine which is like date but gives times to milliseconds. Subtract the times and calculate the sustained write rate to disk.)

B. Transfer of 512 MB from one node to another:

   first node:  dd if=/dev/zero bs=512 count=1000000 | \
                nettee -in - -next secondnode -v 63
   second node: nettee -out test.dat

C. Same as B, but buffer the nettee output:

   second node: nettee -out - | mbuffer -m 4000000 >test.dat

D. Calculated transfer rate if the read from the network and the write to disk are strictly sequential (alternating read, write):

   1/(1/11.7 + 1/(speed from A))

E. Ratio: observed (B) / expected (D)

F. Pipe speed (lowest of 5 consecutive tests; it varies a lot, probably because of other activity on the nodes, even though they were quiescent; the highest was around 970 MB/s for both platforms):

   dd if=/dev/zero bs=512 count=1000000 >/dev/null

G. Raw network speed (move the data, then throw it out):

   first node:  dd if=/dev/zero bs=512 count=1000000 | \
                nettee -in - -next secondnode -v 63
   second node: nettee -out /dev/null

This was carried out on two different sets of hardware, both with 100BaseT networks (different switches though):

Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512 MB RAM
New: Athlon64 3700+ CPU, ASUS A8N5X mobo, 2.6.21.1 kernel, 1 GB RAM

Here are the results, all in megabytes/sec:

        OLD      NEW
   A    17       40
   B    7.4      10.47
   C    7.4      11.43
   D    6.9      9.05
   E    1.07     1.16
   F    743      603
   G    11.77    11.71

Start with G: in both cases the hardware could push data across the network at almost exactly the same speed. From A we see that the disks on the older machines are considerably slower than the ones on the newer machines (hdparm showed the same values for OLD/NEW, so it isn't an obvious misconfiguration). From D we expect OLD to be slower than NEW, and B shows that that is indeed the case. It's a little better than pure sequential because there's some parallelism in the read part of the network transfer, giving ratios greater than 1 (E). There's plenty of pipe bandwidth (F). Yet when we put mbuffer in (C) there is no speed up AT ALL on OLD, and a nice one (as expected) on NEW.

Everything is as it should be for NEW, but why isn't mbuffer doing its thing on the OLD machines?

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From Wally-Edmondson at utc.edu Fri Jun 1 08:24:40 2007
From: Wally-Edmondson at utc.edu (Wally Edmondson)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] IBRIX Experiences
Message-ID: <46603A38.3060006@utc.edu>

On Thu, 10 May 2007, Ian Reynolds wrote:

> Hey all -- we're considering IBRIX for a parallel storage cluster
> solution with an EMC Clarion CX3-20 at the center, as well as a handful
> of storage servers -- total of roughly 40 client servers, mix of 32 and
> 64 bit OSs.
>
> Can anyone offer their experiences with IBRIX, good or bad?
We have
> worked with gpfs extensively, so any comparisons would also be helpful.

It looks like you aren't getting many answers to your question, Ian. I'll quickly share my IBRIX experiences. I have been running IBRIX since late 2004 on around 540 diskless clients and 50 regular servers and workstations with 8 segment servers and a Fusion Manager connected to a DDN S2A 3000 couplet with 20TB of usable storage. The storage is 1Gb FibreChannel to the Segment Servers and it's non-bonded GigE for everything else.

I'll start with the bad, I guess. We had our share of problems with the 1.x version of the software in the early days. I suppose all parallel filesystems with 600 clients are going to hit bumps. That's what CFS said back then, anyways. Stability wasn't a problem, but occasionally a file wouldn't be readable and to fix it you had to copy the file, stuff like that. This was no longer an issue beginning with version 2.0. You have to get a new build of the software if you want to change kernels. There are two RPMs, one generic for the major kernel number and the other specific to your kernel containing some modules. They only support RHEL/CentOS and SLES as far as I know, and SLES was only recently added. I asked about Ubuntu and they don't yet support it, which sucks because I would like to use it on some workstations. Oh, and make sure that the segment servers can always see each other. Use at least two links through different switches. We had some bad switch ports that caused the segment servers to miss heartbeats. This caused automatic failovers to segment servers that also couldn't be seen. This is a disaster. I thought it was IBRIX's fault the whole time. Turned out to be intermittent switch port problems. It was avoidable with a little bit more planning and a better understanding of how the whole thing worked. Redundancy is set up with buddies rather than globally, so you tell it that one server should watch some other server's back. It works, but it could be a problem if a failing server's buddy is down or a server goes down while it owns a failed segment. In either case, some percentage of your files won't be accessible until one of the servers is fixed. It hasn't happened to me, but it is a possibility. I can bring down four of my eight servers without a problem, for instance, but it needs to be the right four. Servers have failed and it has never been a problem for me. The running jobs never know the difference.

Support has been top-notch. Last year, we had a catastrophic storage controller failure following a scheduled power outage, major corruption, the works. A guy at IBRIX stayed with me all weekend on the phone and AIM. He logged in and remotely restored all the files he could (tens of thousands). Apparently he could have restored more if I had already been running 2.0 or higher. They know their product very well. I'm not sure if I am the right person to compare it to GPFS or Lustre since I looked into those products back in 2004 and haven't really researched them since. My setup is simple, too, so I only use the basics. The performance is fine, using nearly all of my GigE pipes. With more segment servers and faster storage you could get some pretty amazing speeds. I don't use the quotas or multiple interfaces. Their GUI looks nice at first but you really don't need it because their command-line tools make sense and have excellent help output if you forget something. Adding new clients is a breeze. There is a Windows client now but I haven't used it. I use CIFS exports and it works just fine. I also use NFS exports for my few remaining Solaris clients. Everything is very customizable and the documentation seems pretty thorough. You can put any storage you like behind it, which is nice. I think I could use USB keys if I felt like it. I have been very pleased with IBRIX overall, especially since we upgraded out of 1.x land.
It's usually the last thing on my mind, so I guess that's a good thing.

That's all I have time for right now. Let me know if you have any specific questions.

Wally

From ruhollah.mb at gmail.com Fri Jun 1 13:41:05 2007
From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] ssh connection problem
In-Reply-To:
References: <1bef2ce30705270147k430800b5x303e56410aba640b@mail.gmail.com>
Message-ID: <1bef2ce30706011341g3f93fe9bo3a13121efa00d678@mail.gmail.com>

Hi,

Thank you for your answers. Please ignore the 'links' in my earlier post; I didn't mean to send them. I had googled our cluster's error, 'Disconnecting: ...', and when I couldn't find a proper solution that way I posted to Beowulf, copying and pasting the 'Disconnecting: ...' text into Gmail. That's why links appear in my email.

Returning to our problem, please note that:

a) I use Cat 6 cable.
b) Electrical noise is very unlikely.
c) The head node has two NICs. eth0 is for the internal zone, i.e. the computing nodes, and runs with no problem. eth1 is for the external zone, i.e. our users connecting via ssh. This is the one with the disconnection problem.
d) There doesn't seem to be any SW/router problem, because another machine on the same network is reached by users via ssh with no problem.

The results of 'netstat -i' and 'netstat -s', respectively, are as follows.
___________________________________________________________________

[root@node01 ~]# netstat -i
Kernel Interface table
Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 586745989      0      0      0 598858710      0      0      0 BMRU
eth1   1500   0    701868      0      0      0    325542      0      0      0 BMRU
lo    16436   0      1959      0      0      0      1959      0      0      0 LRU

[root@node01 ~]# netstat -s
Ip:
    585891011 total packets received
    0 forwarded
    0 incoming packets discarded
    585887228 incoming packets delivered
    597668214 requests sent out
Icmp:
    34 ICMP messages received
    21 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 25
        timeout in transit: 5
        echo requests: 4
    601 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 597
        echo replies: 4
Tcp:
    78 active connections openings
    360 passive connection openings
    0 failed connection attempts
    18 connection resets received
    8 connections established
    585798178 segments received
    597666644 segments send out
    16197 segments retransmited
    94 bad segments received.
    1682 resets sent
Udp:
    1005 packets received
    596 packets to unknown port received.
    0 packet receive errors
    1019 packets sent
TcpExt:
    2 resets received for embryonic SYN_RECV sockets
    26 packets pruned from receive queue because of socket buffer overrun
    ArpFilter: 0
    60 TCP sockets finished time wait in fast timer
    1 packets rejects in established connections because of timestamp
    734435 delayed acks sent
    127 delayed acks further delayed because of locked socket
    Quick ack mode was activated 7963 times
    724 packets directly queued to recvmsg prequeue.
    6030 packets directly received from backlog
    164431 packets directly received from prequeue
    571897537 packets header predicted
    138 packets header predicted and directly queued to user
    TCPPureAcks: 44870
    TCPHPAcks: 458279645
    TCPRenoRecovery: 0
    TCPSackRecovery: 2875
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 1
    TCPLossUndo: 7099
    TCPLoss: 626
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 1635
    TCPLossFailures: 169
    TCPFastRetrans: 4294
    TCPForwardRetrans: 23
    TCPSlowStartRetrans: 1130
    TCPTimeouts: 8329
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 279
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 2731
    TCPDSACKOldSent: 8194
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 7125
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 28
    TCPAbortOnClose: 8
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 12
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0
___________________________________________________________________

--
Best,
Ruhollah Moussavi Baygi

On 5/29/07, Robert G. Brown wrote:
>
> On Sun, 27 May 2007, Ruhollah Moussavi Baygi wrote:
>
> > Hi everybody at Beowulf,
> >
> > I have a serious problem with ssh connection to our cluster. Every
> > hint/help/suggestion, which can help me to solve it, is highly appreciated.
> >
> > Most of the time, when users want to connect and run their programs from
> > their own PCs, the ssh connection failed, especially during transfer files
> > from/to head-node. Our user's PCs are mainly WindowsXP, so they use packages
> > like SSH Secure Shell for connection and file transfer, or Putty for
> > connection and WinSCP for file transfer.
> >
> >
> > The error massage is as follows:
> >
> > 'Disconnecting: Corrupted MAC on input'
>
> This sounds to me like hardware problems. What does your physical
> network look like? Is it built with the right cables, within spec, with
> decent switches? Do you see other evidence of network packet
> corruption?
> > > < > http://www.google.com/history/url?url=http://ubuntuforums.org/showthread.php%3Ft%3D202076&ei=wkJZRsGfHZf-0gTehKXrDQ&sig2=lIzQGYq3zN0Tz2EC8b4dAw&zx=JGkABbsjtaA&ct=w > > > > > > or > > > > 'Disconnecting: bad packet > > Yes, sounds like bad hardware. Perhaps your cables aren't cat 5? > Perhaps your electrical power has noise? Perhaps your switch(es) are > broken or have been taken over by trolls? This sounds like you're > failing packet checksum tests or experiencing pretty serious TCP > collision problems. > > What do the network statistics look like on the interfaces in question? > > rgb > > > length...< > http://www.google.com/search?q=disconnecting:+bad+packet+length+from+windows+to+linux+machine&hl=en > >', > > followed by a long integer. > > > > > > This problem has practically made our cluster unusable. So, I would be > > thankful for any coming advice. > > > > -- > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > -- Best, Ruhollah Moussavi Baygi -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070602/11a5e345/attachment.html From ctierney at hypermall.net Sat Jun 2 21:11:10 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] tftp permission denied In-Reply-To: References: Message-ID: <46623F5E.4010201@hypermall.net> fahad saeed wrote: > Hello All, > > I am trying to install Fedora Core 6 using network(since i only have 1 > cd rom installed on the head node and no cdrom/flopy drive on the slave > node...)so > > I used this how-to to configure my tftp server and all seems to go well... 
>
> http://www.opensourcehowto.org/how-t...a-install.html
>
>
>
> Now the problem is that when i boot my slave node and 'command' it boot
> from the network (using Intel boot Boot Agent 1.1.07) I get this error
>
> PXE -T00 permission denied
> PXE -E36 error received from tftp server
>
> Although the slave node does recognises the master node and its ip etc....
>
>
>
> Any Help would be highly appreciable as I have no idea what to do next...
>
> Thanks in advance...and please help !!
>
> Fahad
>

Have you tried to copy the file via tftp from the server node:

# tftp localhost
# get "blah"

See if that works.

I was seeing something similar to this on RHEL5 this past week. I haven't got an answer yet, but it seemed that I could only transfer files that ended in .bin. I wonder if it is a security or selinux issue, but I haven't tracked it down yet.

Craig

> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From john.hearns at streamline-computing.com Sun Jun 3 01:39:54 2007
From: john.hearns at streamline-computing.com (John Hearns)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] tftp permission denied
In-Reply-To: <46623F5E.4010201@hypermall.net>
References: <46623F5E.4010201@hypermall.net>
Message-ID: <46627E5A.7070801@streamline-computing.com>

Craig Tierney wrote:
> fahad saeed wrote:
>> Now the problem is that when i boot my slave node and 'command' it
>> boot from the network (using Intel boot Boot Agent 1.1.07) I get this
>> error
>>
>> PXE -T00 permission denied
>> PXE -E36 error received from tftp server

> # tftp localhost
> # get "blah"

I'll add another debugging tip to that one: stop the tftp daemon service, then start it on the command line (as root) with the following flags:

-l -vv -s /path/to/your/tftpdirectory

Then try Craig's tip, i.e. check whether you can transfer a file by hand from 'localhost', then reboot a
compute node and follow the tftp request > See if that works. > > I was seeing something similar to this on RHEL5 this past week. > I haven't got an answer yet, but it seemed that I could only > transfer files that ended in .bin. I wonder if it is a security > or selinux issue, but I haven't tracked it down yet. > > Craig From hahn at mcmaster.ca Sun Jun 3 11:25:13 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] tftp permission denied In-Reply-To: <46627E5A.7070801@streamline-computing.com> References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: > stop the tftp daemon service, then start it on the command line (as root) > with the following flags: > > -l -vv -s /path/to/your/tftpdirectory yes, definitely. this sort of problem calls for debugging on the server side - verbose server settings is probably enough, but I wouldn't shy away from running the server under strace to see what it's really doing... From Bogdan.Costescu at iwr.uni-heidelberg.de Mon Jun 4 05:07:36 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] network transfer issue to disk, old versus new hardware In-Reply-To: References: Message-ID: On Sat, 2 Jun 2007, David Mathog wrote: > I can't quite wrap my head around a recent nettee result, perhaps > one of the network gurus here can explain it. IMHO, it's not a network issue, as is shown by your G results. > sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat; All your tests use bs=512 - why ? This makes unnecessary trips to kernel code and back which result in an increased number of context switches and significant slowdown. My guess is that this (high number of context switches) plus a high interrupt rate (disk and network simultaneously) is the reason for your results. 
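
To put a rough number on the block-size point, here is a minimal Python sketch (not from the thread) that counts how many write() calls it takes to move the same 512,000,000 bytes at different block sizes, and times writing them to /dev/null. The function names and the chosen block sizes are illustrative assumptions, and the timings only reflect system-call overhead on whatever machine runs it; they say nothing by themselves about the disks or network in David's tests.

```python
import os
import time


def writes_needed(total_bytes: int, block_size: int) -> int:
    """Number of write() calls needed to move total_bytes in block_size chunks."""
    return -(-total_bytes // block_size)  # ceiling division


def time_null_writes(total_bytes: int, block_size: int) -> float:
    """Write total_bytes of zeros to the null device in block_size chunks.

    Returns the elapsed wall-clock time in seconds.
    """
    block = bytes(block_size)
    remaining = total_bytes
    fd = os.open(os.devnull, os.O_WRONLY)
    start = time.perf_counter()
    try:
        while remaining > 0:
            chunk = block if remaining >= block_size else block[:remaining]
            # os.write returns the number of bytes actually written
            remaining -= os.write(fd, chunk)
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    total = 512 * 1_000_000  # same volume as dd bs=512 count=1000000
    for bs in (512, 4096, 65536, 1 << 20):
        secs = time_null_writes(total, bs)
        print(f"bs={bs:>8}  writes={writes_needed(total, bs):>8}  {secs:.3f} s")
```

With bs=512 the transfer needs a million write calls; with a 1 MiB block it needs a few hundred. Whether that difference matters in practice depends on what else the node is doing at the time.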
> Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512Mb RAM I used to have the exact same hardware as cluster nodes (but with dual CPU, whether you also have duals is not clear from your post) and tried to convert 2 of them to small file-servers - same problem of disk + network simultaneous activity. After benchmarking, I gave up - this was almost 2 years ago and I don't have the exact numbers anymore, but a single PIV 3GHz on a consumer-grade mainboard was able to provide significantly better performance for the same task. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From mathog at caltech.edu Mon Jun 4 10:02:13 2007 From: mathog at caltech.edu (David Mathog) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] network transfer issue to disk, old versus new hardware Message-ID: Bogdan Costescu wrote: > On Sat, 2 Jun 2007, David Mathog wrote: > > > I can't quite wrap my head around a recent nettee result, perhaps > > one of the network gurus here can explain it. > > IMHO, it's not a network issue, as is shown by your G results. > > > sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat; > > All your tests use bs=512 - why ? This makes unnecessary trips to > kernel code and back which result in an increased number of context > switches and significant slowdown. It's a convenient number, it may slow things down slightly but clearly it isn't rate limiting since piping that straight to /dev/null gives rates of 650Mb/sec or higher. In any case, I figured the problem out. The issue was that the distro (Mandriva 2007.0) installed a while back on the older machines turns on "athcool". 
Athcool does cut the idle temperatures of the nodes considerably, but apparently also prevents them from performing this sort of transfer at full speed, whether or not buffer is used. When I turned athcool off, on just the receiving node, the transfer rate for: sender: dd if=/dev/zero bs=512 count=1000000 | \ nettee -in - -v 63 -next next_node receiver: nettee -out test.dat jumped from 7.7Mb/sec to 11.6Mb/sec. So apparently athcool gets in the way by preventing rapid shifts from disk to network IO, no matter which process is doing them. Which is interesting because it didn't have any measurable effect on CPU bound processes. I had thought it would shut itself off and get out of the way when the CPU rate was high, but apparently not. When imaging nodes athcool isn't running, but I'll have to keep this in mind when doing routine transfers of data across the nodes. On the newer machines cpufreq runs instead of athcool, and it didn't make very much difference if that was running or not. Apparently this power saver does a much better job of detecting higher CPU load and "getting out of the way" when it's present. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Mon Jun 4 10:39:52 2007 From: mathog at caltech.edu (David Mathog) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] network transfer issue to disk, old versus new hardware Message-ID: Bogdan Costescu wrote: > > Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512Mb RAM > > I used to have the exact same hardware as cluster nodes (but with dual > CPU, whether you also have duals is not clear from your post) These are single CPU machines. > and > tried to convert 2 of them to small file-servers - same problem of > disk + network simultaneous activity. 
After benchmarking, I gave up -
> this was almost 2 years ago and I don't have the exact numbers
> anymore, but a single PIV 3GHz on a consumer-grade mainboard was able
> to provide significantly better performance for the same task.

Was athcool running on these?

I've done some more benchmarking with athcool on/off, and it changed the write speed for the dd generated 512MB file from just under 18MB/sec to 31 MB/sec. Even with that change, there is clearly something else going on in the network + disk department, since the "expected sequential" rate only changes from 7.1 to 8.5MB/sec. The "hdparm -tT" results were around 520MB/sec cached reads in either case, but the timed buffered disk reads went from 24MB/sec to 44MB/sec (both with large variances, but not THAT large.)

Thankfully this is entirely irrelevant to those of you who have long since retired these older Tyan systems.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From atp at piskorski.com Mon Jun 4 13:38:07 2007
From: atp at piskorski.com (Andrew Piskorski)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] cheap SMC 8524T gigabit switches, and performance of
In-Reply-To:
References:
Message-ID: <20070604203805.GA72069@tehun.pair.com>

FYI, some vendor called Unity Electronics is currently selling a bunch of 24 port SMC 8524T gigabit switches for c. $120 each on eBay:

http://search.ebay.com/search/search.dll?satitle=SMC+gigabit&sass=unityelectronics.com
http://unityelectronics.com/product-product_id/3942/m/SMC/p/SMC8524T
http://unityelectronics.com/product-product_id/3941/m/SMC/p/SMC8516T

I haven't actually tried using it yet, but the one I received is part number 751.7398, and appears to be new in box as advertised.

And that reminded me of the interesting thread from April, below, on performance testing of some (small) SMC gigabit switches:

http://www.beowulf.org/archive/2007-April/017924.html

On Mon, Apr 02, 2007 at 03:58:06PM -0500, Bruce Allen wrote:
> Subject: Re: [Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

> Just for kicks have a look at these figures:
> http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html

> Here are some more testing results from different edge switches:
> http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/switching.html

Bruce, it's interesting how your bandwidth tests show the SMC 8508T 721.0154 switch started out with true wire-speed and 9k jumbo frame performance, 721.8129 was worse, and then 722.8486 was yet worse again. Compared to the previous part number, each subsequent revision of the supposedly "same" SMC8508T model degraded performance! And your tests were with only 2 of the 8 ports on each switch, so I wonder how much worse they'd be when using all ports at once.

It's also interesting that all 3 part numbers showed the same performance for the 2 kB MTU. The iterative cheapening of the hardware seems to have only broken the large frame sizes.

However, I'm confused by part of your results: Some of your crossover cable and 5 port switch results show a big bandwidth advantage when using jumbo frames - bandwidth takes a huge jump up from around 125 MB/s with a 2k MTU to 225 with 4k. But your 8508T results, on the other hand, are much better at 2k, around 200 MB/s, and then gradually move up to about the same 225 at 4k. Any idea why you saw those different behaviors?
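
One way to frame that question is a back-of-the-envelope Python sketch (not from Bruce's pages) of the wire-level ceiling on TCP payload rate over gigabit Ethernet as a function of MTU. It assumes IPv4 and TCP headers without options (40 bytes) and 38 bytes of per-frame Ethernet overhead (preamble and start delimiter, MAC header, FCS, inter-frame gap); the function name is made up for illustration, and the figures are per direction.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, MAC header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP, assuming no header options
LINE_RATE_MB_S = 125.0           # 1 Gbit/s = 125 MB/s in one direction


def tcp_payload_ceiling(mtu: int) -> float:
    """Upper bound on TCP payload rate (MB/s, one direction) for a given MTU."""
    payload = mtu - IP_TCP_HEADERS   # bytes of user data per frame
    on_wire = mtu + ETH_OVERHEAD     # bytes of wire time consumed per frame
    return LINE_RATE_MB_S * payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 2000, 4000, 9000):
        print(f"MTU {mtu:>5}: {tcp_payload_ceiling(mtu):6.1f} MB/s")
```

Under these assumptions the ceiling rises by only about 2% between a 2k and a 4k MTU, so framing overhead alone would not seem to account for a jump from around 125 to 225; per-packet costs on the hosts or in the switch look like a more plausible place to search.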
-- Andrew Piskorski http://www.piskorski.com/ From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Jun 5 07:03:38 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] network transfer issue to disk, old versus new hardware In-Reply-To: References: Message-ID: On Mon, 4 Jun 2007, David Mathog wrote: > Athcool does cut the idle temperatures of the nodes considerably, > but apparently also prevents them from performing this sort of > transfer at full speed, whether or not buffer is used. Well, near the top of the athcool website there is a warning and one the listed items is 'a slowdown in harddisk performance' - so nothing new here ;-) > Which is interesting because it didn't have any measurable effect on > CPU bound processes. I had thought it would shut itself off and get > out of the way when the CPU rate was high, but apparently not. CPU bound and I/O bound processes use the processor in different ways... When doing only I/O, the processor is often waiting for the hardware, so the load on the processor is low. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Jun 5 07:24:49 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] network transfer issue to disk, old versus new hardware In-Reply-To: References: Message-ID: On Mon, 4 Jun 2007, David Mathog wrote: > Was athcool running on these? No. Given that our cluster nodes are in use most of the time, it makes no sense to think much about idling... 
And when it's known that some cluster nodes will not be used for some time (like one day or more), I prefer to just turn them off - most of them are built from consumer-grade components, so this brings them closer to their typical life-cycle. ;-)

> Thankfully this is entirely irrelevant to those of you who have long
> since retired these older Tyan systems.

... and if you didn't have any of those to begin with, then consider yourself lucky :-)

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

From hahn at mcmaster.ca Tue Jun 5 09:40:22 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue May 13 01:06:09 2008
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
In-Reply-To:
References:
Message-ID:

>> Athcool does cut the idle temperatures of the nodes considerably, but
>> apparently also prevents them from performing this sort of transfer at full
>> speed, whether or not buffer is used.
>
> Well, near the top of the athcool website there is a warning and one the
> listed items is 'a slowdown in harddisk performance' - so nothing new here
> ;-)

athcool works by putting the cpu-northbridge interface into a low-power mode. the difficulty people had with it was that this sort of down-clocking was new at the time, and not well-handled by all chips, probably on both the chipset and cpu sides. errata centered on how long it took to stabilize the PLLs involved.

things are quite different nowadays - AMD put the northbridge entirely on-cpu, so it has full control, and can modulate clocks extensively and differentially. I don't know how common (or effective) it is to modulate HT power, but such features show up prominently in recent HT revs.

it's interesting to speculate about Intel - mostly it solved this by dominating the chipset market for its own CPUs.
I'm guessing Intel will fall somewhat behind AMD in system-wide power savings, at least until CSI. even then, I'm a little unclear how good Intel's initial implementation will be - the fact that they've chosen to not simply adopt HT indicates to me that Intel will be re-learning AMD's lessons. >> Which is interesting because it didn't have any measurable effect on CPU >> bound processes. I had thought it would shut itself off and get out of the I'd expect athcool to not affect a cache-friendly cpu-bound process, but to hurt pretty badly if you have cache misses. networking (using the normal network stack) count as memory-bound, I think, rather than kinds of IO which might be more DMA-intensive. that is, if a disk is streaming many MB into memory, the CPU's northbridge interface should be able to go low-power (though most disk transfers are only in the 64K range...) regards, mark hahn. From naveed at caltech.edu Mon Jun 4 15:08:07 2007 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] Re: IBRIX Experiences (Wally Edmondson) In-Reply-To: <200706030347.l533l4Mo009549@bluewest.scyld.com> References: <200706030347.l533l4Mo009549@bluewest.scyld.com> Message-ID: <1180994887.5878.21.camel@aeolis.gps.caltech.edu> On Sat, 2007-06-02 at 20:47 -0700, beowulf-request@beowulf.org wrote: > Date: Fri, 01 Jun 2007 11:24:40 -0400 > From: Wally Edmondson > Subject: Re: [Beowulf] IBRIX Experiences > To: beowulf@beowulf.org > Message-ID: <46603A38.3060006@utc.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > On Thu, 10 May 2007, Ian Reynolds wrote: > > > Hey all -- we're considering IBRIX for a parallel storage cluster > > solution with an EMC Clarion CX3-20 at the center, as well as a handful > > of storage servers -- total of roughly 40 client servers, mix of 32 and > > 64 bit OSs. > > > > Can anyone offer their experiences with IBRIX, good or bad? 
We have > > worked with gpfs extensively, so any comparisons would also be helpful. > > It looks like you aren't getting many answers your question, Ian. I'll quickly share > my IBRIX experiences. I have been running IBRIX since late 2004 on around 540 > diskless clients and 50 regular servers and workstations with 8 segment servers and a > Fusion Manager connected to a DDN S2A 3000 couplet with 20TB of usable storage. The > storage is 1Gb FibreChannel to the Segment Servers and it's non-bonded GigE for > everything else. > > I'll start with the bad, I guess. We had our share of problems with the 1.x version > of the software in the early days. I suppose all parallel filesystems with 600 > clients are going to hit bumps. That's what CFS said back then, anyways. Stability > wasn't a problem, but occasionally a file wouldn't be readable and to fix it you had > to copy the file, stuff like that. This was no longer an issue beginning with > version 2.0. You have to get a new build of the software if you want to change > kernels. Their are two RPMS, one generic for the major kernel number and the other > specific to your kernel containing some modules. They only support RHEL/CENTOS and > SLES as far as I know, and SLES was only recently added. I asked about Ubuntu and > they don't yet support it, which sucks because I would like to use it on some > workstations. Oh, and make sure that the segment servers can always see each other. > Use at least two links through different switches. We had some bad switch ports > that caused the segment servers to miss heartbeats. This caused automatic failovers > to segment servers that also couldn't be seen. This is a disaster. I thought it was > IBRIX's fault the whole time. Turned out to be intermittent switch port problems. > It was avoidable with a little bit more planning and a better understanding of how > the whole thing worked. 
Redundancy is set up with buddies rather than globally, so > you tell it that one server should watch some other server's back. It works, but it > could be a problem if a failing server's buddy is down or a server goes down while it > owns a failed segment. In either case, some percentage of your files won't be > accessible until one of the servers is fixed. It hasn't happened to me, but it is a > possibility. I can bring down four of my eight servers without a problem, for > instance, but it needs to be the right four. Servers have failed and it has never > been a problem for me. The running jobs never know the difference. > > Support has been top-notch. Last year, we had a catastrophic storage controller > failure following a scheduled power outage, major corruption, the works. A guy at > IBRIX stayed with me all weekend on the phone and AIM. He logged in and remotely > restored all the files he could (tens of thousands). Apparently he could have > restored more if I had already been running 2.0 or higher. They know their product > very well. I'm not sure if I am the right person to compare it to GPFS or Lustre > since I looked into those products back in 2004 and haven't really researched them > since. My setup is simple, too, so I only use the basics. The performance is fine, > using nearly all of my GigE pipes. With more segment servers and faster storage you > could get some pretty amazing speeds. I don't use the quotas or multiple interfaces. > Their GUI looks nice at first but you really don't need it because their > command-line tools make sense and have excellent help output if you forget something. > Adding new clients is a breeze. There is a Windows client now but I haven't used > it. I use CIFS exports and it works just fine. I also use NFS exports for my few > remaining Solaris clients. Everything is very customizable and the documentation > seems pretty thorough. You can put any storage you like behind it, which is nice. 
I > think I could use USB keys if I felt like it. I have been very pleased with IBRIX > overall, especially since we upgraded out of 1.x land. It's usually the last thing > on my mind, so I guess that's a good thing. That's all I have time for right now. > Let me know if you have any specific questions. > > Wally > I would agree with some of this. The support is indeed top notch, but our switch to 2.x wasn't as smooth. We have had some problems with files not writing and some performance issues. This is being used on 520 nodes. For us, a lot of our (recent) problems have been related to ibrix. Ibrix has been very good about helping fix things. I have had the same experience with ibrix being there when I needed them. When I have a problem, they work on it until fixed regardless of whether it is nighttime or weekends. At this point, I think we are stable and you probably would not have the same issues on a new system. -- Naveed Near-Ansari California Institute of Technology Division of Geology and Planetary Science From vaughanc at gmail.com Tue Jun 5 06:51:32 2007 From: vaughanc at gmail.com (Chris Vaughan) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] tftp permission denied In-Reply-To: References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: <216ee070706050651s4ba40c10o165f11c8f08be0b4@mail.gmail.com> Hi, What do your default.cfg and dhcp.conf files look like? I remember having this issue before and I fixed it in one of those two files. On 6/3/07, Mark Hahn wrote: > > stop the tftp daemon service, then start it on the command line (as root) > > with the following flags: > > > > -l -vv -s /path/to/your/tftpdirectory > > yes, definitely. this sort of problem calls for debugging on the server side > - verbose server settings is probably enough, but I wouldn't shy away from > running the server under strace to see what it's really doing...
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ------------------------------ Christopher Vaughan From aohara at haverford.edu Tue Jun 5 11:47:28 2007 From: aohara at haverford.edu (aohara@haverford.edu) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] xhpl and HPL.dat directory Message-ID: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> Hi, I'm working on benchmarking a recently installed cluster at Haverford College and we've been using the hpl benchmark. Currently, I've been testing the performance of each individual node blade in an attempt to look at bottlenecking in accessing the memory. Since we have several identical nodes, it would be nice to have a different set of parameters running on each node. However, xhpl (installed in my home directory under my account) will only look for the HPL.dat file in the top directory (i.e. /n/home/aohara) and not in the same directory as a copy of xhpl (for example I put a submission script, xhpl, and HPL.dat in the folder /n/home/aohara/newrun, but it runs the parameters of the file /n/home/aohara/HPL.dat instead). If anybody knows of a way to give a directive about the location of HPL.dat to xhpl, that would be extremely helpful.
Thank you very much, Andrew O'Hara '09 Haverford College Physics Department From lindahl at pbm.com Wed Jun 6 10:06:13 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] xhpl and HPL.dat directory In-Reply-To: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> References: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> Message-ID: <20070606170613.GA9230@bx9.net> On Tue, Jun 05, 2007 at 02:47:28PM -0400, aohara@haverford.edu wrote: > However, xhpl (installed in my home directory under > my account) will only look for the HPL.dat file in the top directory Use the Source, Luke. -- greg From mitch48 at sbcglobal.net Wed Jun 6 14:54:58 2007 From: mitch48 at sbcglobal.net (Tom Mitchell) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] tftp permission denied In-Reply-To: References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: <20070606215458.GB11062@xtl1.xtl.tenegg.com> On Sun, Jun 03, 2007 at 02:25:13PM -0400, Mark Hahn wrote: > Date: Sun, 3 Jun 2007 14:25:13 -0400 (EDT) > From: Mark Hahn > To: Beowulf Mailing List > Subject: Re: [Beowulf] tftp permission denied > > >stop the tftp daemon service, then start it on the command line (as root) > >with the following flags: > > > >-l -vv -s /path/to/your/tftpdirectory > > yes, definitely. this sort of problem calls for debugging on the server > side > - verbose server settings is probably enough, but I wouldn't shy away from > running the server under strace to see what it's really doing... All the previous and above plus. Check /etc/xinetd.d/tftp, /etc/hosts.allow, /etc/hosts.deny Then check the ipfilter and security setting. TFTP is at a different port than FTP. If you are using the GUI Security Level Configuration tool you will have to enable TFTP under "Other ports". If ip filtering is blocking packets into the server 'verbose' flags will have nothing to be verbose about. 
The quick test is to disable filtering and test. ftp 21/tcp ftp 21/udp fsp fspd tftp 69/tcp tftp 69/udp sftp 115/tcp sftp 115/udp Both ftp and tftp get used by bad boys out on the Internet so watch the ownership, permissions, settings and logs. Most system admins will want to restrict TFTP access to your local hosts/networks. For the network programmers interested in historic bugs out there give this a quick read. http://en.wikipedia.org/wiki/Sorcerer's_Apprentice_Syndrome Later, mitch -- T o m M i t c h e l l Found me a new place to hang my hat :-) Now it got bought. From jlb17 at duke.edu Thu Jun 7 11:50:39 2007 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] Intel ESB2/82563EB NICs and RHEL/CentOS Message-ID: I have 6 dual Xeon 5160 compute nodes with Supermicro X7DVL-E motherboards . These boards have onboard Intel 82563EB NICs (PCI ID 8086:1096) and the systems are all running CentOS 4. When I first installed them, I was running CentOS 4.4 (kernel 2.6.9-42.0.x), which included version e1000-7.0.39. The network interfaces were very unreliable -- they would randomly stop and then re-start passing traffic. I downloaded version e1000-7.3.20 from intel.com, and they worked just fine. With the release of CentOS 4.5 (kernel 2.6.9-55) and its inclusion of e1000-7.2.7, I decided to give the stock driver a try again, but it had the same issues. Again, upgrading to a more recent version from intel.com (e1000-7.5.5 in this case) fixed the problem. I'm planning on moving these systems to CentOS 5 shortly (kernel kernel-2.6.18-8.x), but it too includes e1000-7.2.7. Has anybody else seen this issue? I'm wondering whether it is motherboard specific or if it's an issue with the NIC itself. Thanks! 
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From hahn at mcmaster.ca Fri Jun 8 09:11:10 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces Message-ID: I had a user grumble about how it was not trivial to get a basic backtrace on our clusters. his jobs tend to be 32-128p, and run for a week, so it's not ideal to run them under the debugger. turns out to be fairly simple to produce a backtrace.so which can be LD_PRELOAD'ed - it contains a constructor which registers a signal handler, which obtains the backtrace and translates and prints the corresponding file:func:line. does this sound like something of interest to other HPC sites? regards, mark hahn. From toon.knapen at fft.be Sun Jun 10 22:32:35 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces In-Reply-To: References: Message-ID: <466CDE73.7020901@fft.be> Interesting indeed. On which platform is this backtrace.so available (obtaining backtraces is higly platform dependent AFAIK) ? toon Mark Hahn wrote: > I had a user grumble about how it was not trivial to get a basic > backtrace on our clusters. his jobs tend to be 32-128p, > and run for a week, so it's not ideal to run them under the debugger. > > turns out to be fairly simple to produce a backtrace.so which can > be LD_PRELOAD'ed - it contains a constructor which registers a signal > handler, which obtains the backtrace and translates and prints the > corresponding file:func:line. > > does this sound like something of interest to other HPC sites? > > regards, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From bencer at cauterized.net Mon Jun 11 04:33:13 2007 From: bencer at cauterized.net (Jorge Salamero Sanz) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] MPI performance gain with jumbo frames Message-ID: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> hi all, new to this list, so don't know if this is offtopic. i'd like to know experiences about MPI performance gain with jumbo frames. i manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit ethernet where fluent, openfoam and other mpi apps are run. with NFS i'm sure wich kind of gain i would have, but with MPI apps i'm worried about after seeing this page http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html regards From ctierney at hypermall.net Mon Jun 11 08:27:46 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces In-Reply-To: <466CDE73.7020901@fft.be> References: <466CDE73.7020901@fft.be> Message-ID: <466D69F2.60005@hypermall.net> Toon Knapen wrote: > Interesting indeed. On which platform is this backtrace.so available > (obtaining backtraces is higly platform dependent AFAIK) ? > The Intel Compiler provides backtraces. I think (from memory) that you compile with -g --traceback. Craig > toon > > Mark Hahn wrote: >> I had a user grumble about how it was not trivial to get a basic >> backtrace on our clusters. his jobs tend to be 32-128p, >> and run for a week, so it's not ideal to run them under the debugger. >> >> turns out to be fairly simple to produce a backtrace.so which can >> be LD_PRELOAD'ed - it contains a constructor which registers a signal >> handler, which obtains the backtrace and translates and prints the >> corresponding file:func:line. 
>> >> does this sound like something of interest to other HPC sites? >> >> regards, mark hahn. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From deadline at eadline.org Mon Jun 11 08:43:02 2007 From: deadline at eadline.org (Douglas Eadline) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> Message-ID: <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> 1) The results you reference are rather old. Does this reflect your hardware? 2) To support Jumbo Frames you need both NICs and a switch that support them. 3) It is possible to achieve wire speed from GigE, you need something other then 32 bit PCI connections, however. (PCIe, PCI-X) 4) While Jumbo Frames can help NFS, the effect on MPI can vary by application. Have you run any tests to see exactly what your network performance is? (i.e. Netpipe) You may find these articles helpful: http://www.clustermonkey.net//content/view/38/34/ http://www.clustermonkey.net//content/view/39/34/ -- Doug > hi all, > > new to this list, so don't know if this is offtopic. > > i'd like to know experiences about MPI performance gain with jumbo frames. > i > manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit > ethernet > where fluent, openfoam and other mpi apps are run. 
> > with NFS i'm sure wich kind of gain i would have, but with MPI apps i'm > worried about after seeing this page > http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html > > regards > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > !DSPAM:466d3377130762071360113! > -- Doug From laytonjb at charter.net Mon Jun 11 08:57:09 2007 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> Message-ID: <466D70D5.5050701@charter.net> Doug brings up some good points. If you want to try Jumbo Frames to improve MPI performance you might have to tweak the TCP buffers as well. There are some links around the web on this. Sometimes it helps performance, sometimes it doesn't. Your mileage may vary. Jeff > 1) The results you reference are rather old. Does this > reflect your hardware? > > 2) To support Jumbo Frames you need both NICs and a switch > that support them. > > 3) It is possible to achieve wire speed from > GigE, you need something other then 32 bit PCI > connections, however. (PCIe, PCI-X) > > 4) While Jumbo Frames can help NFS, the effect on MPI > can vary by application. Have you run any tests to > see exactly what your network performance is? > (i.e. Netpipe) > > You may find these articles helpful: > > http://www.clustermonkey.net//content/view/38/34/ > > http://www.clustermonkey.net//content/view/39/34/ > > -- > Doug > > > >> hi all, >> >> new to this list, so don't know if this is offtopic. >> >> i'd like to know experiences about MPI performance gain with jumbo frames. 
>> i >> manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit >> ethernet >> where fluent, openfoam and other mpi apps are run. >> >> with NFS i'm sure wich kind of gain i would have, but with MPI apps i'm >> worried about after seeing this page >> http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html >> >> regards >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> !DSPAM:466d3377130762071360113! >> >> > > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From toon.knapen at fft.be Mon Jun 11 12:12:46 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces In-Reply-To: <466D69F2.60005@hypermall.net> References: <466CDE73.7020901@fft.be> <466D69F2.60005@hypermall.net> Message-ID: <466D9EAE.8010105@fft.be> > > The Intel Compiler provides backtraces. I think (from memory) that > you compile with -g --traceback. > Thanks. I had no idea. However from the man page at http://www.intel.com/software/products/compilers/docs/clin/icc_txt.htm I read: -[no]traceback Tell the compiler to generate [not generate] extra information in the object file to allow the display of source file trace- back information at run time when a severe error occurs. This is intended for use with C code that is to be linked into a Fortran program. I do not understand the last sentence though. 
I do not see how this can be specific to C code linked into a Fortran program (and thus linked against the fortran runtime library) t From toon.knapen at fft.be Mon Jun 11 12:15:37 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces In-Reply-To: <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> Message-ID: <466D9F59.7070901@fft.be> Ashley Pittman wrote: > It's highly dependant to implement but I should imagine most people who > need backtraces use a debugger, the libc backtrace() function or > libbacktrace which can be use from either inside or outside the target > process, these tend to be platform independent. > libbacktrace is AFAICT also gcc specific. Or do you any pointers to some more platform-info on libbacktrace ? thanks, t From lindahl at pbm.com Mon Jun 11 12:35:26 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:09 2008 Subject: [Beowulf] backtraces In-Reply-To: <466D9F59.7070901@fft.be> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: <20070611193526.GE6911@bx9.net> On Mon, Jun 11, 2007 at 09:15:37PM +0200, Toon Knapen wrote: > libbacktrace is AFAICT also gcc specific. That would be hard, given that the PathScale and Intel compilers are extremely gcc-compatible. By the way, some MPIs already offer backtraces: OpenMPI, PathScale MPI, perhaps others. 
-- greg From hahn at mcmaster.ca Mon Jun 11 12:43:12 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466D9F59.7070901@fft.be> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: >> It's highly dependant to implement but I should imagine most people who >> need backtraces use a debugger, suppose your program is running on a hundred nodes for a week before you hit the event you want the backtrace for... yes, debugger+coredump can be used, but for obvious reasons, we normally recommend users _not_ have them enabled. >> the libc backtrace() function or >> libbacktrace which can be used from either inside or outside the target >> process, these tend to be platform independent. I started with the libc backtrace function, but wanted something better than its backtrace_symbols() companion. > libbacktrace is AFAICT also gcc specific. Or do you any pointers to some more > platform-info on libbacktrace ? I believe it's binutils/libc-specific, not compiler-specific. at least "pathcc -O3 -fno-inline-functions -g" gave me a meaningful backtrace on an mpi tester. anyway, appended is my current version of backtrace.c - I think it's interesting and potentially useful, especially considering that it's not really complex:

/* print a backtrace.
   written by Mark Hahn, SHARCnet, 2007.

   gcc -fPIC backtrace.c /usr/lib64/libbfd-2.15.92.0.2.so -shared -o backtrace.so

   using -lbfd chokes on a symbol addressing issue with (static) libbfd.a
   on my system. your libbfd version number may differ.

   LD_PRELOAD=./backtrace.so ./tester
   signal(11)
   Obtained 9 stack frames.
   file: /home/hahn/private/tester.c, line: 10, func dosegv
   file: /home/hahn/private/tester.c, line: 14, func bar
   file: /home/hahn/private/tester.c, line: 17, func foo
   file: /home/hahn/private/tester.c, line: 29, func main

   all symbols (globals and functions) are static to avoid contamination.

   you need -g on the target program, and potentially something like
   -fno-inline-functions to dissuade the compiler from disappearing some
   functions.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <execinfo.h>
#include <bfd.h>

#define MAX_FRAMES (20)

/* globals retained across calls to resolve. */
static bfd* abfd = 0;
static asymbol **syms = 0;
static asection *text = 0;

/* translate one return address into file/line/function and print it. */
static void resolve(char *address) {
    if (!abfd) {
        char ename[1024];
        /* readlink does not NUL-terminate, so leave room for the terminator. */
        int l = readlink("/proc/self/exe",ename,sizeof(ename)-1);
        if (l == -1) {
            perror("failed to find executable");
            return;
        }
        ename[l] = 0;

        bfd_init();

        abfd = bfd_openr(ename, 0);
        if (!abfd) {
            perror("bfd_openr failed");
            return;
        }
        /* oddly, this is required for it to work... */
        bfd_check_format(abfd,bfd_object);

        unsigned storage_needed = bfd_get_symtab_upper_bound(abfd);
        syms = (asymbol **) malloc(storage_needed);
        unsigned cSymbols = bfd_canonicalize_symtab(abfd, syms);

        text = bfd_get_section_by_name(abfd, ".text");
    }
    /* without a .text section there is nothing to look addresses up in. */
    if (!text)
        return;
    long offset = ((long)address) - text->vma;
    if (offset > 0) {
        const char *file;
        const char *func;
        unsigned line;
        if (bfd_find_nearest_line(abfd, text, syms, offset, &file, &func, &line) && file)
            printf("file: %s, line: %u, func %s\n",file,line,func);
    }
}

static void print_trace() {
    void *array[MAX_FRAMES];
    size_t size;
    size_t i;
    /* crude cutoff: only resolve addresses likely to be in the executable. */
    void *approx_text_end = (void*) ((128+100) * 2<<20);

    size = backtrace (array, MAX_FRAMES);
    printf ("Obtained %zu stack frames.\n", size);
    for (i = 0; i < size; i++) {
        if (array[i] < approx_text_end) {
            resolve(array[i]);
        }
    }
}

static void handler(int sig) {
    printf("signal(%d)\n",sig);
    print_trace();
    /* _exit skips stdio cleanup, so flush by hand or a redirected
       stdout loses the trace. */
    fflush(stdout);
    _exit(1);
}

/* runs when the library is loaded (e.g. via LD_PRELOAD). */
static void __attribute__((constructor)) init() {
    static struct sigaction sa;
    sa.sa_handler = handler;
    sigaction(SIGABRT, &sa, 0);
    sigaction(SIGFPE, &sa, 0);
    sigaction(SIGSEGV, &sa, 0);
}

From ctierney at hypermall.net Mon Jun 11 13:58:59 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: <466DB793.1040903@hypermall.net> Mark Hahn wrote: >>> It's highly dependant to implement but I should imagine most people who >>> need backtraces use a debugger, > > suppose your program is running on a hundred nodes for a week before you > hit the event you want the backtrace for... > yes, debugger+coredump can be used, but for obvious reasons, > we normally recommend users _not_ have them enabled. Sorry to start a flame war....
Make sure that your code generates the exact same answer with debug/backtrace enabled and disabled, then you add user-level checkpointing so that you can restart where you want. Then you run up until the problem and restart with the last checkpoint. Run for a week without checkpointing? Just begging for trouble. Craig > >>> the libc backtrace() function or >>> libbacktrace which can be use from either inside or outside the target >>> process, these tend to be platform independent. > > I started with the libc backtrace function, but wanted something better > than its backtrace_symbols() companion. > >> libbacktrace is AFAICT also gcc specific. Or do you any pointers to >> some more platform-info on libbacktrace ? > > I believe it's binutils/libc-specific, not compiler-specific. at least > "pathcc -O3 -fno-inline-functions -g" gave me a meaningful backtrace on > an mpi tester. > > anyway, appended is my current version of backtrace.c - I think it's > interesting and potentially useful, especially considering that it's not > really complex: > > /* print a backtrace. > written by Mark Hahn, SHARCnet, 2007. > > gcc -fPIC backtrace.c /usr/lib64/libbfd-2.15.92.0.2.so -shared -o > backtrace.so > > using -lbfd chokes on a symbol addressing issue with (static) libbfd.a > on my system. your libbfd version number may differ. > > LD_PRELOAD=./backtrace.so ./tester > signal(11) > Obtained 9 stack frames. > file: /home/hahn/private/tester.c, line: 10, func dosegv > file: /home/hahn/private/tester.c, line: 14, func bar > file: /home/hahn/private/tester.c, line: 17, func foo > file: /home/hahn/private/tester.c, line: 29, func main > > all symbols (globals and functions) are static to avoid contamination. > > you need -g on the target program, and potentially something like > -fno-inline-functions to dissuade the compiler from disappearing some > functions. 
> */ > > #define _GNU_SOURCE > #include > #include > #include > #include > #include > #include > #include > > #define MAX_FRAMES (20) > > /* globals retained across calls to resolve. */ > static bfd* abfd = 0; > static asymbol **syms = 0; > static asection *text = 0; > > static void resolve(char *address) { > if (!abfd) { > char ename[1024]; > int l = readlink("/proc/self/exe",ename,sizeof(ename)); > if (l == -1) { > perror("failed to find executable\n"); > return; > } > ename[l] = 0; > > bfd_init(); > > abfd = bfd_openr(ename, 0); > if (!abfd) { > perror("bfd_openr failed: "); > return; > } > /* oddly, this is required for it to work... */ > bfd_check_format(abfd,bfd_object); > > unsigned storage_needed = bfd_get_symtab_upper_bound(abfd); > syms = (asymbol **) malloc(storage_needed); > unsigned cSymbols = bfd_canonicalize_symtab(abfd, syms); > > text = bfd_get_section_by_name(abfd, ".text"); > } > long offset = ((long)address) - text->vma; > if (offset > 0) { > const char *file; > const char *func; > unsigned line; > if (bfd_find_nearest_line(abfd, text, syms, offset, &file, > &func, &line) && file) > printf("file: %s, line: %u, func %s\n",file,line,func); > } > } > > static void print_trace() { > void *array[MAX_FRAMES]; > size_t size; > size_t i; > void *approx_text_end = (void*) ((128+100) * 2<<20); > > size = backtrace (array, MAX_FRAMES); > printf ("Obtained %zd stack frames.\n", size); > for (i = 0; i < size; i++) { > if (array[i] < approx_text_end) { > resolve(array[i]); > } > } > } > > static void handler(int sig) { > printf("signal(%d)\n",sig); > print_trace(); > _exit(1); > } > > static void __attribute__((constructor)) init() { > static struct sigaction sa; > sa.sa_handler = handler; > sigaction(SIGABRT, &sa, 0); > sigaction(SIGFPE, &sa, 0); > sigaction(SIGSEGV, &sa, 0); > } > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > 
http://www.beowulf.org/mailman/listinfo/beowulf > From hahn at mcmaster.ca Mon Jun 11 19:00:02 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466DB793.1040903@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: > Sorry to start a flame war.... what part do you think was inflamed? > Make sure that your code generates the exact same answer with debug/backtrace > enabled and disabled, part of the point of my very simple backtrace.so is that it has zero runtime overhead and doesn't require any special compilation. > then you add user-level checkpointing so that you can I'm most curious to hear people's experience with checkpointing. all our more serious, established codes do checkpointing, but it's extremely foreign to people writing newish codes. and, of course, it's a lot of extra work. I'm not arguing against checkpointing, just acknowledging that although we _require_ it, we don't actually demand "proof-of-checkpointability". > restart where you want. Then you > run up until the problem and restart with the last checkpoint. restarting from checkpoint is fine (the code in question could actually do it), but still means you have hours of running, presumably under a debugger. > Run for a week without checkpointing? Just begging for trouble. suppose you have 2k users, with ~300 active at any instant, and probably 200 unrelated codes running. while we do require checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many users never do. how do you check/validate/encourage/support checkpointing? 
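Since user-level checkpointing keeps coming up in this thread, a rough sketch of its simplest form may help newer code authors. Everything here is an assumption made for illustration: the struct state layout, the STATE_N size, and the checkpoint_save / checkpoint_load names are invented, and a real code would also need to handle MPI coordination and versioning of the file format.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical solver state: an iteration counter plus one data field. */
#define STATE_N 1024
struct state {
    long   step;
    double field[STATE_N];
};

/* Save the state to `path`. The data goes to a temporary file first and is
   renamed into place, so a crash in mid-write leaves the previous
   checkpoint intact. Returns 0 on success, -1 on failure. */
static int checkpoint_save(const char *path, const struct state *s)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
        return -1;                         /* path too long */
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    int ok = (fwrite(s, sizeof(*s), 1, f) == 1);
    ok = (fflush(f) == 0) && ok;
    ok = (fclose(f) == 0) && ok;           /* always close, even on error */
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load a previously saved state. Returns 0 on success, -1 if the file is
   missing or too short. */
static int checkpoint_load(const char *path, struct state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    int ok = (fread(s, sizeof(*s), 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A main loop would call checkpoint_load() once at startup, resume from s.step if it succeeds, and call checkpoint_save() every so many steps. The sketch does not call fsync, so it protects against a crashing job rather than a crashing node.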
part of the reason I got a kick out of this simple backtrace.so is indeed that it's quite possible to conceive of a checkpoint.so which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job of checkpointing at least serial codes non-intrusively. regards, mark hahn. From ctierney at hypermall.net Mon Jun 11 20:54:28 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: <466E18F4.90701@hypermall.net> Mark Hahn wrote: >> Sorry to start a flame war.... > > what part do you think was inflamed? It was when I was trying to say "Real codes have user-level checkpointing implemented and no code should ever run for 7 days." > >> Make sure that your code generates the exact same answer with >> debug/backtrace enabled and disabled, > > part of the point of my very simple backtrace.so is that it has zero > runtime overhead and doesn't require any special compilation. > Does the Intel version have overhead? I never measured it before, but I never thought it was much. >> then you add user-level checkpointing so that you can > > I'm most curious to hear people's experience with checkpointing. > all our more serious, established codes do checkpointing, but it's > extremely foreign to people writing newish codes. > and, of course, it's a lot of extra work. I'm not arguing against > checkpointing, just acknowledging that although we _require_ it, > we don't actually demand "proof-of-checkpointability". > I included checkpointing in an ocean-model once. It was very easy, but that was most likely because of how it was organized (Fortran 77, most data structures were shared). I don't think that it is foreign to people writing new codes. It is foreign to scientists. 
Software developers (who could be scientists) would think of this from the beginning (I hope). >> restart where you want. Then you >> run up until the problem and restart with the last checkpoint. > > restarting from checkpoint is fine (the code in question could > actually do it), but still means you have hours of running, > presumably under a debugger. > >> Run for a week without checkpointing? Just begging for trouble. > > suppose you have 2k users, with ~300 active at any instant, > and probably 200 unrelated codes running. while we do require > checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many > users never do. how do you check/validate/encourage/support > checkpointing? > Set your queue maximums to 6-8 hours. This prevents system hogging and encourages checkpointing for long runs. Make sure your IO system can support the checkpointing, because it can create a lot of load. > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job > of checkpointing at least serial codes non-intrusively. > BTW, I like your code. I had a script written for me in the past (by Greg Lindahl in a galaxy far, far away). The one modification I would make is to print out the MPI ID environment variable (MPI flavors vary in how it is set). Then when it crashes, you know which process actually died. 
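Printing the rank could look roughly like the sketch below. It is written in Python for brevity rather than as part of backtrace.so, and the list of variable names is an assumption: these are names used by some current launchers, and which one (if any) is set depends on the MPI flavor and version in use.

```python
import os

# Candidate rank variables, listed as assumptions: which one (if any) is set
# depends on the MPI implementation and launcher in use.
RANK_ENV_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-derived launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm's srun
)


def find_mpi_rank(environ=None):
    """Return (variable_name, value) for the first rank variable found, else None."""
    env = os.environ if environ is None else environ
    for name in RANK_ENV_VARS:
        value = env.get(name)
        if value is not None:
            return name, value
    return None


def crash_banner(environ=None):
    """One line identifying the dying process, suitable for the top of a backtrace."""
    found = find_mpi_rank(environ)
    if found is None:
        return f"pid {os.getpid()}: MPI rank unknown"
    name, value = found
    return f"pid {os.getpid()}: MPI rank {value} (from {name})"


print(crash_banner())
```

Trying several names in order keeps the handler independent of any one MPI flavor, at the cost of a list that has to be maintained by hand.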
Craig From lindahl at pbm.com Mon Jun 11 21:20:11 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466E18F4.90701@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> Message-ID: <20070612042011.GA759@bx9.net> On Mon, Jun 11, 2007 at 08:54:28PM -0700, Craig Tierney wrote: > I don't think that it is foreign to people writing new codes. > It is foreign to scientists. Most serious supercomputing scientists -- those who have finite cpu allotments in particular -- put in checkpointing when they realize it saves them valuable resources. Until they lose work or money, it's not a priority. > BTW, I like your code. I had a script written for me in the past > (by Greg Lindahl in a galaxy far-far away). Hey, and here I was avoiding saying "You guys don't remember me talking about easy backtrace in conferences in 2000 and 2001? I was pretty insufferably on the topic..." That implementation used gdb and had zero overhead other than the memory gdb took. But fewer processes is always better, and OpenMPI and Intel and PathScale MPI & compilers all use a library implementation somewhat like Mark's. -- greg From gerry.creager at tamu.edu Mon Jun 11 21:55:02 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466E18F4.90701@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> Message-ID: <466E2726.50407@tamu.edu> I've tried to stay out of this. Really, I have. Craig Tierney wrote: > Mark Hahn wrote: >>> Sorry to start a flame war.... >> >> what part do you think was inflamed? 
> > It was when I was trying to say "Real codes have user-level > checkpointing implemented and no code should ever run for 7 > days." A number of my climate simulations will run for 7-10 days to get century-long simulations to complete. I've run geodesy simulations that ran for up to 17 days in the past. I like to think that my codes are real enough! Real codes do have user-level checkpointing, though. And even better codes can be restarted without a lot of user intervention by invoking a run-time flag and going off for coffee. >>> Make sure that your code generates the exact same answer with >>> debug/backtrace enabled and disabled, >> >> part of the point of my very simple backtrace.so is that it has zero >> runtime overhead and doesn't require any special compilation. >> > > Does the Intel version have overhead? I never measured it before, > but I never thought it was much. Can't speak to the Intel compiler: given their terms of use, I've abandoned it and never tried its traceback or checkpointing capabilities. PGI, which I do use, and old IBM Fort-G and Fort-H did have overhead issues. The PGI compiler is what I tend to use almost all the time for my model compiling, so I'm not able to speak to most of this new-fangled language stuff you're talking about :-) >>> then you add user-level checkpointing so that you can >> >> I'm most curious to hear people's experience with checkpointing. >> all our more serious, established codes do checkpointing, but it's >> extremely foreign to people writing newish codes. >> and, of course, it's a lot of extra work. I'm not arguing against >> checkpointing, just acknowledging that although we _require_ it, >> we don't actually demand "proof-of-checkpointability". >> > > I included checkpointing in an ocean-model once. It was very easy, > but that was most likely because of how it was organized (Fortran 77, > most data structures were shared). > > I don't think that it is foreign to people writing new codes. 
> It is foreign to scientists. Software developers (who could be > scientists) would think of this from the beginning (I hope). Let's see. WRF and MM5 on the atmospheric front, support user-level checkpointing and restart capabilities. So does ADCIRC and Wave Watch-III. And ROMS. So, the oceans side is covered. The older *nix version of PAGES (geodesy) didn't but it was easily added. Most folks didn't use PAGES like I did, and thus checkpointing was pretty useless. I'm not dabbling in genomics or protein folding but most of the folks I know who are, are computer scientists who "followed the money" and are collaborating on projects with discipline scientists, implementing code to support the "real" work. So, I strongly suspect they're implementing checkpointing, too. >>> restart where you want. Then you >>> run up until the problem and restart with the last checkpoint. >> >> restarting from checkpoint is fine (the code in question could >> actually do it), but still means you have hours of running, >> presumably under a debugger. >> >>> Run for a week without checkpointing? Just begging for trouble. >> >> suppose you have 2k users, with ~300 active at any instant, >> and probably 200 unrelated codes running. while we do require >> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that >> many users never do. how do you check/validate/encourage/support >> checkpointing? >> > > Set your queue maximums to 6-8 hours. Prevents system hogging, > encourages checkpointing for long runs. Make sure your IO system > can support the checkpointing because it can create a lot of load. And how do you support my operational requirements with this policy during hurricane season? Let's see... "Stop that ensemble run now so the Monte Carlo chemists can play for awhile, then we'll let you back on. Don't worry about the timeliness of your simulations. No one needs a 35-member ensemble for statistical forecasting, anyway." Did I miss something? Yeah, we really do that. 
With boundary-condition munging we can run a statistical set of simulations and see what the probabilities are and where, for instance, maximum storm surge is likely to go. If we don't get sufficient membership in the ensemble, the statistical strength of the forecasting procedure decreases. Gerry >> part of the reason I got a kick out of this simple backtrace.so >> is indeed that it's quite possible to conceive of a checkpoint.so >> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >> job of checkpointing at least serial codes non-intrusively. >> > > BTW, I like your code. I had a script written for me in the past > (by Greg Lindahl in a galaxy far-far away). The one modification > I would make is to print out the MPI ID evnironment variable (MPI > flavors vary how it is set). Then when it crashes, you know which > process actually died. > > Craig > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From lindahl at pbm.com Mon Jun 11 22:49:34 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466E2726.50407@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> Message-ID: <20070612054934.GA8063@bx9.net> On Mon, Jun 11, 2007 at 11:55:02PM -0500, Gerry Creager wrote: > And how do you support my operational requirements with this policy > during hurricane season? 
By not over-generalizing from a general policy to a place where it doesn't apply? Craig has worked in weather forecasting, you know. You don't run your ensemble elements as separate jobs? Isn't that asking for disaster if something goes wrong? -- greg From Hakon.Bugge at scali.com Tue Jun 12 01:00:23 2007 From: Hakon.Bugge at scali.com (Håkon Bugge) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] Re: Beowulf Digest, Vol 40, Issue 9 In-Reply-To: <200706081900.l58J06gq014487@bluewest.scyld.com> References: <200706081900.l58J06gq014487@bluewest.scyld.com> Message-ID: <20070612080145.A55E235AA28@mail.scali.no> At 21:00 08.06.2007, Mark Hahn wrote: >Message: 1 >Date: Fri, 8 Jun 2007 12:11:10 -0400 (EDT) >From: Mark Hahn >Subject: [Beowulf] backtraces >To: Beowulf Mailing List > >I had a user grumble about how it was not trivial to get >a basic backtrace on our clusters. his jobs tend to be 32-128p, >and run for a week, so it's not ideal to run them under the debugger. Using Scali MPI Connect, you can easily install signal handlers. When a signal is caught, the application continues to run, the offending process writes out its registers, and you can conveniently attach to it with your favorite debugger. Regards, Håkon From gerry.creager at tamu.edu Tue Jun 12 05:11:57 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <20070612054934.GA8063@bx9.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <20070612054934.GA8063@bx9.net> Message-ID: <466E8D8D.80808@tamu.edu> Greg Lindahl wrote: > On Mon, Jun 11, 2007 at 11:55:02PM -0500, Gerry Creager wrote: > >> And how do you support my operational requirements with this policy >> during hurricane season? 
> > By not over-generalizing from a general policy to a place where it > doesn't apply? Craig has worked in weather forecasting, you know. Actually, the tone sounded like it was already over-generalized. I merely followed the trend. > You don't run your ensemble elements as separate jobs? Isn't that > asking for disaster if something goes wrong? Actually, it depends on what you call a "job". Apparently IBM's LoadLeveler (hardly a Beowulf implementation, but what I'm working with right now) thinks that the job-file defines the job. I can check-point, sleep or do quite a bit more within the normal job script but IBM wants to treat that as a "job". Most of my runs on that machine complete in a couple of clock hours for a single ensemble member, or less. The job, however, can take 8-12 hours with WRF, Holland winds, ADCIRC, WaveWatch, SWAN and ELCIRC in ensemble mode. Some of my WRF climate runs can go for days, however. Those are cycle hogs. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From rgb at phy.duke.edu Tue Jun 12 06:08:32 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: On Mon, 11 Jun 2007, Mark Hahn wrote: > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job of > checkpointing at least serial codes non-intrusively. IIRC, condor has just such a library that it uses both for serial job migration and checkpointing. rgb > > regards, mark hahn. 
-- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From dnlombar at ichips.intel.com Tue Jun 12 07:02:45 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: <20070612140245.GA14845@nlxdcldnl2.cl.intel.com> On Mon, Jun 11, 2007 at 10:00:02PM -0400, Mark Hahn wrote: > > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly > decent job of checkpointing at least serial codes non-intrusively. > Have you looked at Berkeley Lab Checkpoint/Restart (BLCR) at It goes far beyond serial codes; with proper support, it does MPI too... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From apittman at concurrent-thinking.com Mon Jun 11 08:36:45 2007 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466CDE73.7020901@fft.be> References: <466CDE73.7020901@fft.be> Message-ID: <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> On Mon, 2007-06-11 at 07:32 +0200, Toon Knapen wrote: > Interesting indeed. On which platform is this backtrace.so available > (obtaining backtraces is highly platform dependent AFAIK) ? 
It's highly platform dependent to implement, but I should imagine most people who need backtraces use a debugger, the libc backtrace() function, or libbacktrace, which can be used from either inside or outside the target process; these tend to be platform independent. > Mark Hahn wrote: > > I had a user grumble about how it was not trivial to get a basic > > backtrace on our clusters. his jobs tend to be 32-128p, > > and run for a week, so it's not ideal to run them under the debugger. It really shouldn't be that difficult; on a Quadrics cluster at least, you can use the command "padb -x -r " from anywhere in the cluster to see a backtrace from any given rank. Ashley, From arnoldg at ncsa.uiuc.edu Mon Jun 11 12:31:56 2007 From: arnoldg at ncsa.uiuc.edu (Galen Arnold) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466D9EAE.8010105@fft.be> References: <466CDE73.7020901@fft.be> <466D69F2.60005@hypermall.net> <466D9EAE.8010105@fft.be> Message-ID: >> >> The Intel Compiler provides backtraces. I think (from memory) that >> you compile with -g --traceback. ...only for Fortran source code [it's in icc in case you're linking with Fortran]. -Galen From tmalas at ee.bilkent.edu.tr Tue Jun 12 00:25:37 2007 From: tmalas at ee.bilkent.edu.tr (Tahir Malas) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process Message-ID: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Hi all, We have an HP cluster of 8 dual quad-core nodes connected via InfiniBand. We use Voltaire DDR cards and a 24-port switch. We also use OFED 1.1 and MVAPICH 0.9.7. We have two interesting problems that we have not been able to overcome yet: 1. In our test program, which mimics the communications in our code, the nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We perform one-to-one communications between these pairs of nodes simultaneously. We use blocking MPI send and receive commands to communicate an integer array of various sizes. 
In addition, we consider different numbers of processes:
(a) 1 process per node, 8 processes overall: one link is established between each pair of nodes.
(b) 2 processes per node, 16 processes overall: two links are established between each pair of nodes.
(c) 4 processes per node, 32 processes overall: four links are established between each pair of nodes.
(d) 8 processes per node, 64 processes overall: eight links are established between each pair of nodes.

We obtain reasonable timings, except for the following interesting comparison: for 32 processes (4 processes per node), the 512-byte arrays are communicated more slowly than the 4096-byte arrays. For both sizes, we send/receive 1,000,000 arrays and take the average to find the time per package. Only the package size changes. We have made many trials and confirmed that this abnormal case is persistent. More specifically, communication of 4-kbyte packages is 2 times faster than communication of 512-byte packages.

The OSU bandwidth and latency tests around these points show:

Bytes   MB/s
256     417.53
512     592.34
1024    691.02
2048    857.35
4096    906.04
8192    1022.52

Bytes   Time (usec)
256     4.79
512     5.48
1024    6.60
2048    8.30
4096    11.02

So this behavior does not seem reasonable to us.

2. SOMETIMES, after the test with 32 processes overall, one of the four processes at node3 hangs in the TASK_UNINTERRUPTIBLE "D" state. Hence, the test program shows a "done." and waits for some time. We can neither kill the process nor soft reboot the node. We have to wait for that process to terminate, which can take a long time.

Does anybody have any comments on these issues? 
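One way to read the two OSU tables above is to ask what bandwidth the latency figures alone would imply if each message had to complete before the next one started. The following is a minimal sketch in Python; the function name is illustrative, and treating the reported latency as a one-way time per message is an assumption.

```python
# Figures copied from the OSU tables above: message size in bytes mapped to
# streaming bandwidth (MB/s) and to latency (microseconds).
OSU_BANDWIDTH_MBS = {256: 417.53, 512: 592.34, 1024: 691.02, 2048: 857.35, 4096: 906.04}
OSU_LATENCY_USEC = {256: 4.79, 512: 5.48, 1024: 6.60, 2048: 8.30, 4096: 11.02}


def serialized_bandwidth_mbs(size_bytes, latency_usec):
    """Bandwidth in MB/s if exactly one message is sent per latency period.
    Bytes per microsecond equals 10^6 bytes per second, i.e. MB/s."""
    return size_bytes / latency_usec


for size in sorted(OSU_LATENCY_USEC):
    implied = serialized_bandwidth_mbs(size, OSU_LATENCY_USEC[size])
    streamed = OSU_BANDWIDTH_MBS[size]
    print(f"{size:5d} B  implied {implied:7.1f} MB/s  "
          f"streamed {streamed:7.1f} MB/s  ratio {streamed / implied:4.1f}")
```

Under that assumption the streaming figures exceed the serialized ones at every size listed, which suggests the bandwidth test keeps many messages in flight at once; a blocking send/receive loop would be expected to sit closer to the serialized figure.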
Thanks in advance, Tahir Malas Bilkent University Electrical and Electronics Engineering Department From hahn at mcmaster.ca Tue Jun 12 08:14:55 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: <01ae01c7acc2$dfa8e810$d80cb38b@bs> References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: > For 32 processes (4 process per node), the arrays with 512-Byte size are > communicated slower than the 4096-Byte size arrays. For both of them, we do you mean that this is not the case in other configurations? an interconnect _should_ have some steep rise in effective bandwidth as packet size is increased. it's a useful metric to know the packet size at which half-peak bandwidth is achieved, since this offers some "sense of scale" to programmers judging whether their own packet sizes are appropriate. > this abnormal case is persistent. More specifically, communication of > 4k-Byte packages are 2 times faster than the communication of 512-Byte > packages. perhaps I'm dense this morning, but what's unexpected about that? > The OSU bandwidth and latency test around these points shows: > Byte MB/s > 256 417.53 > 512 592.34 > 1024 691.02 > 2048 857.35 > 4096 906.04 > 8192 1022.52 the osu_bw test is a streaming, fire-and-forget one which strongly rewards message aggregation. (this is not necessarily deceptive - it's measuring a real communication pattern, though it's not the only way to quantify bandwidth.) you can see that it's aggregating because the reported bandwidth for small packets is much higher than you'd expect if each packet took the latency reported below. (unless my math is wrong, 256/(2*4.79e-6) = 26.7 MB/s) > Time (usec) > 256 4.79 > 512 5.48 > 1024 6.60 > 2048 8.30 > 4096 11.02 > So this behavior does not seem reasonable to us. > > 2. 
SOMETIMES, after the test with overall 32 processes, one of the four > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the test > program shows a "done." and waits for sometime. We can neither kill the > process nor soft reboot the node. We have to wait for that process to > terminate, which can last long. does /proc/$pid/wchan (on the 'D' state process) tell you anything? do all the ranks return from MPI_Finalize? regards, mark hahn. From ctierney at hypermall.net Tue Jun 12 08:34:22 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466E2726.50407@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> Message-ID: <466EBCFE.4020606@hypermall.net> Gerry Creager wrote: > I've tried to stay out of this. Really, I have. > > Craig Tierney wrote: >> Mark Hahn wrote: >>>> Sorry to start a flame war.... >>> >>> what part do you think was inflamed? >> >> It was when I was trying to say "Real codes have user-level >> checkpointing implemented and no code should ever run for 7 >> days." > > A number of my climate simulations will run for 7-10 days to get > century-long simulations to complete. I've run geodesy simulations that > ran for up to 17 days in the past. I like to think that my codes are > real enough! > NCAR and GFDL run climate simulations for weeks as well. What is the longest period of time any one job can run? It is 8-12 hours. I can verify these numbers if needed, but I can guarantee you that no one is allowed to put their job in for 17 days. With explicit permission they may get 24 hours, but that would be for unique situations. > Real codes do have user-level checkpointing, though. 
And even better > codes can be restarted without a lot of user intervention by invoking a > run-time flag and going off for coffee. > You mean there are people that bother to implement checkpointing and then don't make it code like: if (checkpoint files exist in my directory) then load checkpoint files else start from scratch end ???? >> Set your queue maximums to 6-8 hours. Prevents system hogging, >> encourages checkpointing for long runs. Make sure your IO system >> can support the checkpointing because it can create a lot of load. > > And how do you support my operational requirements with this policy > during hurricane season? Let's see... "Stop that ensemble run now so > the Monte Carlo chemists can play for awhile, then we'll let you back > on. Don't worry about the timeliness of your simulations. No one needs > a 35-member ensemble for statistical forecasting, anyway." Did I miss > something? > You kick-off the users that are not running operational codes because their work is (probably) not as time constrained. Also, if you take so long to get your answer in an operational mode that the answer doesn't matter anymore, you need a faster computer. I would think that if you cannot spit out a 12-hour hurricane forecast in a couple of hours I would be concerned how valuable the answer would be. Craig > Yeah, we really do that. With boundary-condition munging we can run a > statistical set of simulations and see what the probabilities are and > where, for instance, maximum storm surge is likely to go. If we don't > get sufficient membership in the ensemble, the statistical strength of > the forecasting procedure decreases. > > Gerry > >>> part of the reason I got a kick out of this simple backtrace.so >>> is indeed that it's quite possible to conceive of a checkpoint.so >>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >>> job of checkpointing at least serial codes non-intrusively. >>> >> >> BTW, I like your code. 
I had a script written for me in the past >> (by Greg Lindahl in a galaxy far-far away). The one modification >> I would make is to print out the MPI ID evnironment variable (MPI >> flavors vary how it is set). Then when it crashes, you know which >> process actually died. >> >> Craig >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > From gerry.creager at tamu.edu Tue Jun 12 09:19:53 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466EBCFE.4020606@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <466EBCFE.4020606@hypermall.net> Message-ID: <466EC7A9.4010609@tamu.edu> Craig Tierney wrote: > Gerry Creager wrote: >> I've tried to stay out of this. Really, I have. >> >> Craig Tierney wrote: >>> Mark Hahn wrote: >>>>> Sorry to start a flame war.... >>>> >>>> what part do you think was inflamed? >>> >>> It was when I was trying to say "Real codes have user-level >>> checkpointing implemented and no code should ever run for 7 >>> days." >> >> A number of my climate simulations will run for 7-10 days to get >> century-long simulations to complete. I've run geodesy simulations >> that ran for up to 17 days in the past. I like to think that my codes >> are real enough! >> > > NCAR and GFDL run climate simulations for weeks as well. How longest > period of time any one job can run? It is 8-12 hours. I can verify > these numbers if needed, but I can guarantee you that no one is allowed > to put their job in for 17 days. With explicit permission they may get > 24 hours, but that would be for unique situations. 
On the p575, we have similar constraints and I do work within those. In my lab, I can control access a bit more and have considerably fewer (and truly grateful) users, so if we need to run "forever" we can implement that. >> Real codes do have user-level checkpointing, though. And even better >> codes can be restarted without a lot of user intervention by invoking >> a run-time flag and going off for coffee. >> > > You mean there are people that bother to implement checkpointing and > then don't make it code like: > > if (checkpoint files exist in my directory) then > load checkpoint files > else > start from scratch > end > > ???? Yes, there are. No, I'm not one of them. My stuff does do a restart if it stops and finds evidence of a need to continue. However, I've seen this failure time and time again over the years. >>> Set your queue maximums to 6-8 hours. Prevents system hogging, >>> encourages checkpointing for long runs. Make sure your IO system >>> can support the checkpointing because it can create a lot of load. >> >> And how do you support my operational requirements with this policy >> during hurricane season? Let's see... "Stop that ensemble run now so >> the Monte Carlo chemists can play for awhile, then we'll let you back >> on. Don't worry about the timeliness of your simulations. No one >> needs a 35-member ensemble for statistical forecasting, anyway." Did >> I miss something? >> > > You kick-off the users that are not running operational codes because > their work is (probably) not as time constrained. Also, if you take > so long to get your answer in an operational mode that the answer > doesn't matter anymore, you need a faster computer. I would think that > if you cannot spit out a 12-hour hurricane forecast in a couple of > hours I would be concerned how valuable the answer would be. Several points in here. 1. Preemption is one approach I finally got the admin to buy into for forecasting codes. 2. 
MY operational codes for an individual simulation don't take long to run, save the fact that we don't do a 12 hr hurricane sim, but an 84 hour sim for the weather side (WRF). Saving grace here is that the nested grids are not too large so they can run to completion in a couple of wall-clock hours. 3. When one starts trying to twiddle initial conditions statistically to create an ensemble, one then has to run all the ensemble members. One usually starts with central cases first, especially if one "knows" which are central and which are peripheral. If one run takes 30 min on 128 processors, and one thinks one needs 57 members run, one exceeds a wall-clock day. And needs a bigger, faster computer, or at least a bigger queue reservation. If one does this without preemption, one gets all results back at the end of the hurricane season and declares success after 3 years of analysis instead of providing data in near real time. Part of this involves the social engineering required on my campus to get HPC efforts to work at all... Alas, nothing has to do with backtraces. gerry >> Yeah, we really do that. With boundary-condition munging we can run a >> statistical set of simulations and see what the probabilities are and >> where, for instance, maximum storm surge is likely to go. If we don't >> get sufficient membership in the ensemble, the statistical strength of >> the forecasting procedure decreases. >> >> Gerry >> >>>> part of the reason I got a kick out of this simple backtrace.so >>>> is indeed that it's quite possible to conceive of a checkpoint.so >>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >>>> job of checkpointing at least serial codes non-intrusively. >>>> >>> >>> BTW, I like your code. I had a script written for me in the past >>> (by Greg Lindahl in a galaxy far-far away). The one modification >>> I would make is to print out the MPI ID evnironment variable (MPI >>> flavors vary how it is set). 
Then when it crashes, you know which >>> process actually died. >>> >>> Craig >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf@beowulf.org >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From surs at cse.ohio-state.edu Tue Jun 12 08:09:01 2007 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] Re: [mvapich-discuss] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: <01ae01c7acc2$dfa8e810$d80cb38b@bs> References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: <466EB70D.2000306@cse.ohio-state.edu> Hi Tahir, Thanks for sharing this data and your observations. It is interesting. We have a more recent release, MVAPICH-0.9.9 which is available from our website (mvapich.cse.ohio-state.edu) as well as with OFED-1.2 distribution. Could you please try out our newer release and see if the results change/remain the same? Thanks, Sayantan. Tahir Malas wrote: > Hi all, > We have an 8 dual quad-core node HP cluster connected via Infiniband. We use > Voltaire DDR cards and 24-port switch. We also use OFED 1.1 and MVAPICH > 0.9.7. We have two interesting problems that we could not overcome yet: > > 1. In our test program which mimics the communications in our code, the > nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We > perform one to one communications between these pairs of nodes > simultaneously. We use blocking MPI send and receive commands to communicate > an integer array of various sizes. 
In addition, we consider different > numbers of processes: > (a) 1 process per node, 8 processes overall: One link is established between > the pairs of nodes. > (b) 2 process per node, 16 processes overall: Two links are established > between the pairs of nodes. > (c) 4 process per node, 32 processes overall: Four links are established > between the pairs of nodes. > (d) 8 process per node, 64 processes overall: Eight links are established > between the pairs of nodes. > > We obtain logical timings, except for the following interesting comparison: > > For 32 processes (4 process per node), the arrays with 512-Byte size are > communicated slower than the 4096-Byte size arrays. For both of them, we > send/receive 1,000,000 arrays and take the average to find the time per > package. Only package size changes. We have made many trials and confirmed > this abnormal case is persistent. More specifically, communication of > 4k-Byte packages are 2 times faster than the communication of 512-Byte > packages. > > The OSU bandwidth and latency test around these points shows: > Byte MB/s > 256 417.53 > 512 592.34 > 1024 691.02 > 2048 857.35 > 4096 906.04 > 8192 1022.52 > Time (usec) > 256 4.79 > 512 5.48 > 1024 6.60 > 2048 8.30 > 4096 11.02 > So this behavior does not seem reasonable to us. > > 2. SOMETIMES, after the test with overall 32 processes, one of the four > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the test > program shows a "done." and waits for sometime. We can neither kill the > process nor soft reboot the node. We have to wait for that process to > terminate, which can last long. > > Does anybody have some comments in these issues? 
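For what it is worth, the OSU latency figures quoted above already suggest which way the comparison ought to go. A rough Python sketch, assuming only the posted table values (the names `osu_latency_us`, `osu_bandwidth_mbs` and `expected_slowdown` are made up for illustration and are not part of the original test program):

```python
# One-way times in microseconds, copied from the OSU latency table quoted above.
osu_latency_us = {256: 4.79, 512: 5.48, 1024: 6.60, 2048: 8.30, 4096: 11.02}

# Bandwidths in MB/s, copied from the OSU bandwidth table quoted above.
osu_bandwidth_mbs = {256: 417.53, 512: 592.34, 1024: 691.02,
                     2048: 857.35, 4096: 906.04, 8192: 1022.52}


def expected_slowdown(small, large):
    """How many times longer one `large` message should take than one `small` one."""
    return osu_latency_us[large] / osu_latency_us[small]


def streaming_time_us(size):
    """Per-message time implied by the bandwidth table (1 MB/s = 1 byte/us)."""
    return size / osu_bandwidth_mbs[size]


ratio = expected_slowdown(512, 4096)
print(f"latency table: 4096 B should take about {ratio:.2f}x as long as 512 B")
print(f"bandwidth table: {streaming_time_us(512):.2f} us vs "
      f"{streaming_time_us(4096):.2f} us per message")
```

By either table a 4096-byte message should cost more than a 512-byte one, so a measured 2x advantage for the larger size is the opposite of what these microbenchmarks predict.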
> Thanks in advance, > Tahir Malas > Bilkent University > Electrical and Electronics Engineering Department > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > -- http://www.cse.ohio-state.edu/~surs From ctierney at hypermall.net Tue Jun 12 14:48:33 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] backtraces In-Reply-To: <466EC7A9.4010609@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <466EBCFE.4020606@hypermall.net> <466EC7A9.4010609@tamu.edu> Message-ID: <466F14B1.8070508@hypermall.net> > Several points in here. > 1. Preemption is one approach I finally got the admin to buy into for > forecasting codes. > 2. MY operational codes for an individual simulation don't take long to > run, save the fact that we don't do a 12 hr hurricane sim, but an 84 > hour sim for the weather side (WRF). Saving grace here is that the > nested grids are not too large so they can run to completion in a couple > of wall-clock hours. > 3. When one starts trying to twiddle initial conditions statistically > to create an ensemble, one then has to run all the ensemble members. One > usually starts with central cases first, especially if one "knows" which > are central and which are peripheral. If one run takes 30 min on 128 > processors, and one thinks one needs 57 members run, one exceeds a > wall-clock day. And needs a bigger, faster computer, or at least a > bigger queue reservation. If one does this without preemption, one gets > all results back at the end of the hurricane season and declares success > after 3 years of analysis instead of providing data in near real time. 
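The arithmetic behind point 3 can be sketched quickly in Python. This is only a back-of-the-envelope model: the function name and the `concurrent_runs` parameter are hypothetical, and it assumes every member takes the same 30 minutes and that each concurrent run gets its own 128 processors.

```python
import math


def ensemble_wall_clock_hours(members, minutes_per_run, concurrent_runs=1):
    """Wall-clock hours to finish all members, `concurrent_runs` at a time."""
    # Members are processed in waves; the last wave may be only partly full.
    waves = math.ceil(members / concurrent_runs)
    return waves * minutes_per_run / 60.0


serial_hours = ensemble_wall_clock_hours(57, 30)
print(f"one run at a time: {serial_hours} h")

for k in (2, 4, 8):
    hours = ensemble_wall_clock_hours(57, 30, concurrent_runs=k)
    print(f"{k} runs at a time ({k * 128} processors): {hours} h")
```

Run back to back, 57 members at 30 minutes each come to 28.5 hours, which is where the "exceeds a wall-clock day" figure comes from; the only ways down are more processors at once or fewer members.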
> So there are 57 jobs of 30 minutes each. Get your user to rewrite their scripts so it isn't one job. That shouldn't be too hard. > Part of this involves the social engineering required on my campus to > get HPC efforts to work at all... Alas, nothing has to do with backtraces. Very true (on both parts). Craig > > gerry > >>> Yeah, we really do that. With boundary-condition munging we can run >>> a statistical set of simulations and see what the probabilities are >>> and where, for instance, maximum storm surge is likely to go. If we >>> don't get sufficient membership in the ensemble, the statistical >>> strength of the forecasting procedure decreases. >>> >>> Gerry >>> >>>>> part of the reason I got a kick out of this simple backtrace.so >>>>> is indeed that it's quite possible to conceive of a checkpoint.so >>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly >>>>> decent job of checkpointing at least serial codes non-intrusively. >>>>> >>>> >>>> BTW, I like your code. I had a script written for me in the past >>>> (by Greg Lindahl in a galaxy far-far away). The one modification >>>> I would make is to print out the MPI ID evnironment variable (MPI >>>> flavors vary how it is set). Then when it crashes, you know which >>>> process actually died. 
>>>> >>>> Craig >>>> >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf@beowulf.org >>>> To change your subscription (digest mode or unsubscribe) visit >>>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> > From wrankin at ee.duke.edu Wed Jun 13 09:40:17 2007 From: wrankin at ee.duke.edu (Bill Rankin) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <466D70D5.5050701@charter.net> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> Message-ID: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Doug and Jeff have good points (and some good links). One thing to also pay attention to is the CPU utilization during the bandwidth and application testing. We found that on our cluster (various Dells with built-in GigE NICs), while we did not see huge differences in effective bandwidth, the CPU overhead was notably lower when using Jumbo Frames. Again, YMMV. Good luck, -bill On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > Doug brings up some good points. If you want to try Jumbo > Frames to improve MPI performance you might have to > tweak the TCP buffers as well. There are some links around > the web on this. Sometimes it helps performance, sometimes > it doesn't. Your mileage may vary.
> > Jeff From deadline at eadline.org Wed Jun 13 16:02:10 2007 From: deadline at eadline.org (Douglas Eadline) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Message-ID: <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> So this begs the question, if we are "core rich and packet small" do we care about packet size and overhead? In other words if we have plenty of cores when do we not care about communication overhead. Most GigE drivers have various interrupt coalescence strategies and of course Jumbo Frames to lessen the processor load, but if we have multi-core do we need to care about this as much ... any thoughts? -- Doug > Doug and Jeff have good points (and some good links). On thing to > also pay attention to is the CPU utilization during the bandwidth and > application testing. We found that on our cluster (various Dells > with built in GigE NICs) while we did not see huge differences in > effective bandwidth, the CPU overhead was notably less when using > Jumbo Frames. > > Again, YMMV. > > Good luck, > > -bill > > On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > >> Doug brings up some good points. If you want to try Jumbo >> Frames to improve MPI performance you might have to >> tweak the TCP buffers as well. There are some links around >> the web on this. Sometimes it helps performance, sometimes >> it doesn't. Your mileage may vary. >> >> Jeff > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > !DSPAM:467021b9234289691080364! 
> -- Doug From laytonjb at charter.net Wed Jun 13 16:30:16 2007 From: laytonjb at charter.net (laytonjb@charter.net) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames Message-ID: <1022887901.1181777416954.JavaMail.root@fepweb12> More questions: One of the purposes of interrupt coalescence is to reduce the load on the CPU by ganging interrupt requests together (sorry for all of the technical jargon there). In a multi-core situation, do the interrupts affect all of the cores or just one core? If the interrupts affect all of the cores, then interrupt coalescence might be a good thing (even if the latency is much higher). I think Doug has some benchmarks that show some strange things when running NPB on multi-core nodes. This might show us something about what's going on. > > So this begs the question, if we are "core rich and packet small" > do we care about packet size and overhead? In other words if we have > plenty of cores when do we not care about communication > overhead. Most GigE drivers have various interrupt coalescence > strategies and of course Jumbo Frames to lessen the processor > load, but if we have multi-core do we need to care about this > as much ... any thoughts? > > -- > Doug > > > > Doug and Jeff have good points (and some good links). On thing to > > also pay attention to is the CPU utilization during the bandwidth and > > application testing. We found that on our cluster (various Dells > > with built in GigE NICs) while we did not see huge differences in > > effective bandwidth, the CPU overhead was notably less when using > > Jumbo Frames. > > > > Again, YMMV. > > > > Good luck, > > > > -bill > > > > On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > > > >> Doug brings up some good points. If you want to try Jumbo > >> Frames to improve MPI performance you might have to > >> tweak the TCP buffers as well. There are some links around > >> the web on this. 
Sometimes it helps performance, sometimes > >> it doesn't. Your mileage may vary. > >> > >> Jeff > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > !DSPAM:467021b9234289691080364! > > > > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Jun 13 16:37:05 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> Message-ID: <20070613233705.GA14997@bx9.net> On Wed, Jun 13, 2007 at 07:02:10PM -0400, Douglas Eadline wrote: > So this begs the question, if we are "core rich and packet small" > do we care about packet size and overhead? That's not quite the question. In many programs, there is no possible overlap between communication and computation, so they don't care how high the overhead is, although for smaller messages lower overhead can mean higher bandwidth (that "message rate" thing, again.) If you can overlap, then you do care about overhead, especially for Ethernet, where the cpu overhead is often unequally distributed over your cores. 
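A toy model of the distinction above, in Python. Everything here is an illustrative assumption rather than a measurement: `comm_us` is the end-to-end time of the message exchange, `cpu_overhead_us` is the part of that time during which the host CPU is busy driving the transfer, and the numbers are invented.

```python
def step_time_us(compute_us, comm_us, cpu_overhead_us, overlap):
    """Toy per-iteration time for one compute phase plus one message exchange."""
    if not overlap:
        # The program just waits for the exchange, so only its total length
        # matters; how much of it is CPU overhead makes no difference.
        return compute_us + comm_us
    # With overlap the wire time can hide behind computation, but the CPU
    # overhead competes with the computation for the same core.
    return max(compute_us + cpu_overhead_us, comm_us)


for overhead in (5, 35):
    blocking = step_time_us(100, 40, overhead, overlap=False)
    overlapped = step_time_us(100, 40, overhead, overlap=True)
    print(f"overhead {overhead:2d} us: no overlap {blocking} us, "
          f"overlap {overlapped} us")
```

In this sketch the non-overlapping case costs 140 us whatever the overhead is, while the overlapping case goes from 105 us to 135 us as the overhead grows, which is the sense in which only codes that overlap care about overhead.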
-- greg From lindahl at pbm.com Wed Jun 13 16:50:17 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <1022887901.1181777416954.JavaMail.root@fepweb12> References: <1022887901.1181777416954.JavaMail.root@fepweb12> Message-ID: <20070613235017.GA16124@bx9.net> On Wed, Jun 13, 2007 at 04:30:16PM -0700, laytonjb@charter.net wrote: > In a multi-core situation, > do the interrupts affect all of the cores or just one core? One core gets each interrupt. cat /proc/interrupts to see how this works in your system. > I personally like the concept that Level 5 Networks used in conjunction > with their GigE cards - user space drivers. This is how everyone does their EtherNot devices: InfiniPath, Myrinet, Quadrics, yadda yadda. Then the next question is, why are you bothering with TCP? With EtherNot, you can avoid all of the interrupts. A typical InfiniPath system only has a couple of interrupts after running for weeks; bringing the link up causes a couple. MPI doesn't cause any. -- greg From jmack at wm7d.net Wed Jun 13 07:29:29 2007 From: jmack at wm7d.net (Joseph Mack NA3T) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] programming multicore clusters Message-ID: I've googled the internet and searched the Beowulf archives for "hybrid" || "multicore" and the only definitive statement I've found is by Greg Lindahl, 17 Dec 2004 "Most of the folks interested in hybrid models a few years ago have now given it up". I assume this was from the era of 2-way SMP nodes. Multicore CPUs are being projected for 15yrs into the future (statement by Pat Gelsinger, Intel's CTO, quoted in http://cook.rfe.org/grid.pdf) I expect the programming model will be a little different for single image machines like the Altix, than for beowulfs where each node has its own kernel (and which I assume will be running dual quadcore mobos). 
Still, if a flat, one-network model is used, all processes communicate through the off-board networking. Someone with a quadcore machine, running MPI on a flat network, told me that their application scales poorly to 4 processors. Instead, if processes on cores within a package were working on adjacent parts of the compute volume and communicated through the on-board networking, then for a quadcore machine, the off-board networking bandwidth requirement would drop by a factor of 4 and scaling would improve. In a quadcore machine, if 4 OMP/threads processes are started on each quadcore package, could they be rescheduled at the end of their timeslice on different cores, arriving at a cold cache? On a large single image machine, could a thread be scheduled on another node and have to communicate over the off-board network? In a single image machine (with a single address space) how does the OS know to malloc memory from the on-board memory, rather than some arbitrary location (on another board)? I expect everyone here knows all this. How is everyone going to program the quadcore machines? Thanks Joe -- Joseph Mack NA3T EME(B,D), FM05lw North Carolina jmack (at) wm7d (dot) net - azimuthal equidistant map generator at http://www.wm7d.net/azproj.shtml Homepage http://www.austintek.com/ It's GNU/Linux! From lfarkas at bppiac.hu Wed Jun 13 09:11:05 2007 From: lfarkas at bppiac.hu (Farkas Levente) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] network raid filesystem Message-ID: <46701719.1030605@bppiac.hu> hi, we have 10-20 servers on a LAN, each with 4 HDDs. we'd like to create one big filesystem across these servers' hard disks. we'd like to make it redundant, i.e.: - if one (or more) of the HDDs or servers fails, the whole filesystem is still usable and consistent. - any server in this farm can see the same storage. it's something like a big network RAID 5/6 storage where about 40-80 partitions are added to the same filesystem. and there is an fs over it.
which hide all internal network raid functionality. is there any such solution? i can't find any easy way to do this on our linux servers. thank you for your help in advance. -- Levente "Si vis pacem para bellum!" From pal at di.fct.unl.pt Wed Jun 13 16:00:33 2007 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Message-ID: <30836.89.26.129.109.1181775633.squirrel@www.di.fct.unl.pt> I can report a decrease of circa 10% CPU use per GbE link in an IBM x335 (dual Xeon 2.6GHz) with on-board Broadcom NICs and SMC switch, when going from standard 1500 to 9K frames on the netperf benchmark, at full bandwidth (circa 80MB/s). Best Regards, paulo > Doug and Jeff have good points (and some good links). On thing to > also pay attention to is the CPU utilization during the bandwidth and > application testing. We found that on our cluster (various Dells > with built in GigE NICs) while we did not see huge differences in > effective bandwidth, the CPU overhead was notably less when using > Jumbo Frames. > > Again, YMMV. 
> > Good luck, > > -bill > -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Informática | 294 8300 ext.10763 Faculdade de Ciências e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: pal@di.fct.unl.pt 2829-516 Caparica, PORTUGAL From tmalas at ee.bilkent.edu.tr Wed Jun 13 05:37:08 2007 From: tmalas at ee.bilkent.edu.tr (Tahir Malas) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: <000f01c7adb7$8ea0c5f0$d80cb38b@bs> > -----Original Message----- > From: Mark Hahn [mailto:hahn@mcmaster.ca] > Sent: Tuesday, June 12, 2007 6:15 PM > To: Tahir Malas > Cc: mvapich-discuss@cse.ohio-state.edu; beowulf@beowulf.org; > teoman.terzi@gmail.com; 'Ozgur Ergul' > Subject: Re: [Beowulf] Two problems related to slowness and > TASK_UNINTERRUPTABLE process > > > For 32 processes (4 process per node), the arrays with 512-Byte size are > > communicated slower than the 4096-Byte size arrays. For both of them, we > > do you mean that this is not the case in other configurations? > an interconnect _should_ have some steep rise in effective bandwidth > as packet size is increased. it's a useful metric to know the packet > size at which half-peak bandwidth is achieved, since this offers some > "sense of scale" to programmers judging whether their own packet sizes > are appropriate. > > > this abnormal case is persistent. More specifically, communication of > > 4k-Byte packages are 2 times faster than the communication of 512-Byte > > packages. > > perhaps I'm dense this morning, but what's unexpected about that? Considering the latency and bandwidth measurements, my expected communication times in us are: 512: 5.48 + 512/592.34 = 6.34 4096: 11.02 + 4096/906.04 = 15.54 Our test gives: 512: 29.434 4096: 16.209 So isn't the communication time for 512 bytes unexpectedly slow? > > > > 2.
SOMETIMES, after the test with overall 32 processes, one of the four > > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the > test > > program shows a "done." and waits for sometime. We can neither kill the > > process nor soft reboot the node. We have to wait for that process to > > terminate, which can last long. > > does /proc/$pid/wchan (on the 'D' state process) tell you anything? > do all the ranks return from MPI_Finalize? > The file tells "__lock_buffer". Yes, all ranks return; but I think, this problematic process (i.e. one of the processes on node3) returns always the latest. Thanks, and regards, Tahir. From lindahl at pbm.com Wed Jun 13 22:55:39 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Tue May 13 01:06:10 2008 Subject: [Beowulf] programming multicore clusters In-Reply-To: References: Message-ID: <20070614055539.GA26746@bx9.net> On Wed, Jun 13, 2007 at 07:29:29AM -0700, Joseph Mack NA3T wrote: > "Most of the folks interested in hybrid models a few years > ago have now given it up". > > I assume this was from the era of 2-way SMP nodes. No, the main place you saw that style was on IBM SPs with 8+ cores/node. > I expect the programming model will be a little different > for single image machines like the Altix, than for beowulfs > where each node has its own kernel (and which I assume will > be running dual quadcore mobos). Most Altixes spend most of their time running MPI programs. Or at least that was certainly the case with Origin. > Still if a flat, one network model is used, all processes > communicate through the off-board networking. No, the typical MPI implementation does not use off-board networking for messages to local ranks. You use the same MPI calls, but the underlying implementation uses shared memory when possible. > Someone with a > quadcore machine, running MPI on a flat network, told me > that their application scales poorly to 4 processors. 
Which could be because he's out of memory bandwidth, or network bandwidth, or message rate. There are a lot of potential reasons. > In a quadcore machine, if 4 OMP/threads processes are > started on each quadcore package, could they be rescheduled > at the end of their timeslice, on different cores arriving > at a cold cache? Most MPI and OpenMP implementations lock processes to cores for this very reason. > In a single image machine (with > a single address space) how does the OS know to malloc > memory from the on-board memory, rather than some arbitrary > location (on another board)? Generally the default is to always malloc memory local to the process. Linux grew this feature when it started being used on NUMA machines like the Altix and the Opteron. > I expect everyone here knows all this. How is everyone going > to program