[Beowulf] Multirail Clusters: need comments
Vincent Diepeveen diep at xs4all.nl
Mon Dec 5 16:17:47 PST 2005
At 11:33 5-12-2005 +0000, Ashley Pittman wrote:
>On Sun, 2005-12-04 at 13:13 -0500, Mark Hahn wrote:
>> so you're talking about 6 GB/s over quad-rail. it's hard to
>> imagine what you would do with that kind of bandwidth
>
>This is a very bold assertion...
>
>> only you can answer this. my experience is that very few applications
>> need anything like that much bandwidth - I don't think anyone ever saturated
>> 1x quadrics on our alpha/elan3 clusters (~300 MB/s), and even on
>> opteron+myri-d, latency seems like more of an issue.
>
>Your alpha/elan3 clusters would have been quad CPU machines.
>
>> > Note: I found just one line in a CLRC Daresbury lab presentation about
>> > quadrail Quadrics on Alpha (probably QsNet-1?) Any update on QsNet2/Eagle?
>>
>> at least several qsnet2/elan4 clusters ran with dual-rail. there seem to be
>> lots of dual-port IB cards out there, but I have no idea how many sites are
>> using both ports.
>>
>> as far as I can tell, dual-rail is quite a specialized thing, simply because
>> the native interconnects are pretty damn fast at 1x, and because when you
>> double the number of switch ports, you normally _more_ than double the cost.
>> this is mostly an issue when you hit certain thresholds related to the switch
>> granularity of the fabric. 128x for quadrics, for instance. once you start
>> federating switches, you're swimming in cables. it's often quite reasonable
>> to go with not-full-bisection fabrics at this scale, but if you're doing
>> multirail in the first place, that doesn't make sense!
>
>I can't speak for IB but with quadrics the second rail is *exactly* the
>same in terms of topology as the first one and hence the cost is double.
>not-full-bisection and federation relate to the size (number of ports)
>of the network, not the number of rails.
>Each rail needs it's own host
>bus to plug into however (the bus is the bottleneck, not the network) so
>you need to have the right machine in the first place which may cost
>more money.
>
>The one point you have missed however with multi-rail is that network
>bandwidth is per *node* whereas number of CPU's per node is for the
>large part increasing. 1Gb/s seems like a lot (or at least it did) but
>put it in a 16 CPU machine and all of a sudden you have *less* per CPU
>bandwidth than you had seven years ago in your alpha/elan3. Couple that
>with CPU's being n times faster to boot and all of a sudden multi-rail
>is starting to less pie-in-the-sky and more look like a good idea.
>
>It's true that it won't buy you latency, how could it? Bandwidth for
>the most part however does what you would expect, it increases linearly

For many applications that get parallelized now, latency is real
important. Not so much to ship a lot of data, but simply to start and
stop processors quickly. To parallelize many applications, it's
important to do some things a couple of hundred times a second. After
that, of course, a big bandwidth flow follows again, for example to do
matrix calculations or to multiply 2 big numbers.

We can definitely expect future processors to deliver huge numbers of
gflops. I would not be amazed if long before 2010 we have 1 teraflop
processors in many supercomputers. I would rather expect most
'supercomputers' to be clusters that can simply do calculations at large
scale, in short having enough bandwidth from node to node.

An additional important requirement is quick synchronization during
bandwidth streaming. With 16-32 processing cores per node or so, of
whatever type, you can expect that there are 16-32 streams to each core.
So that means that switch latency of network cards is important too. I
do realize that not a single manufacturer on this list likes to quote
that switch latency, as it is usually real UGLY.
However, such seemingly tiny details will become important. Just
calculate how much data a single core of, say, 350 gflop can deliver.
16 x 350 gflop = 5.6 teraflop. Of course that's just on paper. Let's
assume 2 teraflop effectively. If a programmer can achieve that in a
program he's a big hero of course :)

That is 2 * 10^12 calculations a second. I'll assume single precision
now by the way. Everyone here is always discussing double precision, but
reality is that single precision simply goes so much faster on the fast
cheapo processors. And any FFT you can make either in single precision
OR double precision. The extra overhead for single precision is not that
much. About factor 2. Yet it allows real cheapo processors that deliver
all together *huge* amounts of gflops.

For the bandwidth calculation, of course, single precision versus double
precision is not real interesting either. It's just a factor 2.

The thing is, 2 * 10^12 calculations a second, assuming efficient reuse
of the caches and RAM within 1 node, means that 1 node already has a
total *output* of 8 terabyte a second (4 bytes per single precision
result). That's far beyond what any network delivers currently. It will
become a major problem. Today's supercomputing simply isn't ready for
the kind of bandwidth that a single cell type processor can use there.

There are simple examples. Like from a 1 terabyte array I wanted to take
the md5sum. Of course only a single processor would take the md5sum. The
streaming from the i/o wasn't the problem there. All those arrays can
easily deliver speeds that are real big. But in practice, even on an
Origin 3800, the md5sum of that data went at far under 10MB/s because
the 500MHz processors couldn't calculate it faster...

Of course this was a single operation and I only had to do it one time,
but it was real pathetic that it took days of calculation time.

In general the problem with i/o is the bugs in the file system software
more than the speed of the i/o.
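The back-of-the-envelope numbers above can be redone in a few lines. This is only a sketch of the arithmetic in this post, not a measurement: it assumes one 4-byte (single precision) or 8-byte (double precision) result per calculation, decimal units (1 terabyte = 10^12 bytes), and it treats "far under 10MB/s" as an upper bound of exactly 10MB/s. The function names are made up for illustration.

```python
# Sketch of the arithmetic in the post; all names here are illustrative.
BYTES_SINGLE = 4  # bytes per single precision result (assumption)
BYTES_DOUBLE = 8  # bytes per double precision result (assumption)


def peak_flops(cores, flops_per_core):
    """Paper peak of one node: cores times per-core flops."""
    return cores * flops_per_core


def output_bytes_per_second(effective_flops, bytes_per_result):
    """Output rate if every calculation produces one result."""
    return effective_flops * bytes_per_result


def hashing_days(total_bytes, bytes_per_second):
    """Days needed to stream total_bytes through one hashing processor."""
    return total_bytes / bytes_per_second / 86400


peak = peak_flops(16, 350e9)                            # 16 x 350 gflop
single_out = output_bytes_per_second(2e12, BYTES_SINGLE)
double_out = output_bytes_per_second(2e12, BYTES_DOUBLE)
days = hashing_days(1e12, 10e6)                         # 1 TB at 10 MB/s

print(f"paper peak:        {peak / 1e12:.1f} teraflop")
print(f"single precision:  {single_out / 1e12:.0f} terabyte/s output")
print(f"double precision:  {double_out / 1e12:.0f} terabyte/s output")
print(f"md5sum of 1 TB:    at least {days:.2f} days at 10 MB/s")
```

At exactly 10MB/s the md5sum already takes more than a day, so "far under 10MB/s" is consistent with days of calculation time.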
Usually when doing big operations with processors, it's possible to
first do a lot of calculations before streaming it to disk. Yet the
network will be a big problem if the number of gflops per cpu advances
as much as it looks like it will now.

Add to that the hard fact that in the past we had networks that were
relatively tiny compared to the average cluster size of hundreds if not
thousands of nodes. The budget simply is several tens of millions for
the big systems. I'd say each big IT nation should have at least one
system of $20 million+ in future.

Just calculate then what total cpu power governments will be able to
afford in that respect. If you then calculate how many petaflops the
networks need to address, I am sure that you guys will find clever
solutions there to transport all that between nodes!

>with the number of rails. There are some cases where this doesn't
>appear to hold true, for example given a 16*16 machine average bandwidth
>between two CPUS won't quite double as you double the number of rails
>because 15/256 ranks are local to any given process so will get linear
>bandwidth independent of the number of rails. This however is simply a
>matter of understanding the topology of the machine. Another odd case
>is broadcast, assuming the network can deliver into a 16 CPU machine at
>2GB/s shepherding this data inside the the node 16 distinct memory
>locations within that node at 2GB/s *each* isn't possible and the
>network ends up waiting for the node.
>
>The greatest number of rails I've ever seen in one machine was seven
>however this was a old alpha test cluster and was done as a proof of
>concept rather than a actual product, it only had two nodes.
>
>Ashley,
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
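Ashley's 16*16 example quoted above can be put into a small model. This is a rough sketch, not anything from Quadrics: it assumes every rank talks to every other rank equally often, that bandwidth to a rank on the same node is a fixed figure independent of the rails, and that bandwidth to a remote rank scales linearly with the number of rails. The 2 GB/s local and 1 GB/s per-rail figures are hypothetical placeholders, and the function names are made up.

```python
# Toy model of average rank-to-rank bandwidth on a multirail cluster.
# All bandwidth figures below are hypothetical placeholders.


def local_fraction(nodes, cpus_per_node):
    """Fraction of all ranks on the same node as a given rank (15/256)."""
    return (cpus_per_node - 1) / (nodes * cpus_per_node)


def remote_fraction(nodes, cpus_per_node):
    """Fraction of all ranks that live on other nodes (240/256)."""
    return (nodes - 1) * cpus_per_node / (nodes * cpus_per_node)


def average_bandwidth(nodes, cpus_per_node, rails, rail_bw, local_bw):
    """Average over peers: local peers ignore rails, remote peers scale."""
    return (local_fraction(nodes, cpus_per_node) * local_bw
            + remote_fraction(nodes, cpus_per_node) * rails * rail_bw)


LOCAL_BW = 2e9  # bytes/s inside a node (assumed)
RAIL_BW = 1e9   # bytes/s per rail (assumed)

one_rail = average_bandwidth(16, 16, 1, RAIL_BW, LOCAL_BW)
two_rails = average_bandwidth(16, 16, 2, RAIL_BW, LOCAL_BW)
ratio = two_rails / one_rail

print(f"local fraction: {local_fraction(16, 16):.4f}")
print(f"speedup from doubling the rails: {ratio:.3f}x")
```

With these placeholder numbers the average comes out a bit under double, which is the "won't quite double" effect: the local share of the traffic gets nothing from the extra rail.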
