[Beowulf] Parallel memory
Vincent Diepeveen diep at xs4all.nl
Tue Oct 18 21:11:57 PDT 2005
- Previous message: [Beowulf] Parallel memory
- Next message: [Beowulf] VIA cluster server reference design
At 06:39 PM 10/18/2005 -0400, Mark Hahn wrote:
>> memory, if you do a memory access like that over openssi it will forward
>> in a very slow manner like 2048 bytes.
>
>of course the correct answer is page size, since that's the granularity

Which is not so trivial to change from 2KB to 64 bytes.

>at which a net-shm can hook app references. (actually, the segv handler
>*could* trap each faulting instruction and deliver the data in smaller
>pieces. it could even try to figure out when patching in a whole page
>would be more effective. some of this sort of thing is done in papers
>that involve "pointer swizzling" - various amounts of dynamic code patching,
>as well. but using pages means that further references have zero overhead,
>which is always nice.)
>
>> This is ok if you are streaming in some sort of way.
>
>or at least high locality within the pages you do touch.
>
>> Anyway you'll need to dramatically rewrite the application for it.
>
>not at all. the whole point of mmu-based net-shm is to present the app
>with the illusion that the pages are just *there*. you'll probably want
>to change your allocator so that you call netshm_alloc for a few big
>chunks of distributed memory, and leave most other allocs local.

Actually I meant to say that you need to dramatically rewrite
OpenSSI/OpenMosix to get a speedup > 1.0 out of it for software of the
Todd type.

An SGI 64 processor Itanium2 1.6 GHz machine is like $1 million. Its one
way pingpong latency is 3-4 us.

You could replace that with a 16 node dual Opteron dual core 2.2 GHz
system, which is priced way under $60k, with software such as pdsh being
free. Its one way pingpong latency is also 3-4 us.

If that Opteron machine doesn't deliver enough gflops compared to the
Itanium2 machine, you could take a 32 node dual Opteron dual core machine
for around $125k, which definitely delivers more gflops.

So the really big difference is that in this case the example SGI machine
is SSI and the pdsh cluster is not.
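To put the quoted prices side by side, here is a minimal sketch of the arithmetic. It uses only the figures given in the message; "way under $60k" is treated as an upper bound, and the names `price_ratio` and `cores` are made up for this illustration.

```python
# Prices quoted in the message (US dollars, 2005). The cluster figures
# are rough: "way under $60k" is used as an upper bound, "around $125k"
# as an estimate.
SGI_64P_ITANIUM2 = 1_000_000
OPTERON_16_NODES = 60_000
OPTERON_32_NODES = 125_000


def price_ratio(ssi_price: float, cluster_price: float) -> float:
    """Return how many times more the SSI machine costs than the cluster."""
    if cluster_price <= 0:
        raise ValueError("cluster_price must be positive")
    return ssi_price / cluster_price


def cores(nodes: int, sockets_per_node: int = 2, cores_per_socket: int = 2) -> int:
    """Core count for a cluster of dual-socket, dual-core nodes."""
    return nodes * sockets_per_node * cores_per_socket


if __name__ == "__main__":
    for label, nodes, price in [
        ("16 nodes", 16, OPTERON_16_NODES),
        ("32 nodes", 32, OPTERON_32_NODES),
    ]:
        ratio = price_ratio(SGI_64P_ITANIUM2, price)
        print(f"{label}: {cores(nodes)} cores, SGI costs {ratio:.1f}x as much")
```

The 16 node cluster has the same core count as the 64 processor SGI machine at roughly one sixteenth of the price or less; the sketch says nothing about delivered gflops, which the message only compares qualitatively.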
I'm not sure how many systems SGI would still sell if a good SSI
alternative were available. If SSI in itself loses 'just' a factor 2 in
performance compared to MPI, that would be very acceptable in such a case.
Just consider the price difference...

...and the ease of porting applications to run in parallel like they do
on PCs, without the overhead of all those nasty MPI calls.

SSI is not peanuts to make. It definitely isn't something hobbyists can
manage to do very well.

I have written a simple program that measures what you could call 'two
way pingpong times SSI'. Simply 2 times the one way pingpong latency for
8 byte messages compares very well to it. If you can run that on some
openssi/openmosix clusters and then compare with optimized
myri/quadrics/dolphin drivers built with specialized kernels for those
highend clusters, then I gladly volunteer to cooperate with those tests.

If this program of mine is no more than a factor 2 slower in such a test
than the MPI equivalent (which is 2 times the one way pingpong latency
for 8 byte messages), that would really kick some butt. Right now it's
more like a factor 20, if highend cards work at all with
openmosix/openssi, as they require special kernels, which do not work
for openmosix/openssi.

See the problem?

Vincent
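For readers unfamiliar with the measurement, the following is a rough sketch of an 8 byte round-trip ("two way pingpong") timing. It is not the program mentioned in the message: it bounces messages over a local socket pair between two threads, so it only shows the method, not cluster interconnect or SSI latency. All function names are hypothetical, and the numbers passed to `slowdown_vs_mpi` in the usage note are example inputs, not measurements.

```python
import socket
import threading
import time

MSG_SIZE = 8  # bytes, matching the 8 byte messages discussed above


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock: socket.socket, rounds: int) -> None:
    """Send every received message straight back (the 'pong' side)."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, MSG_SIZE))


def round_trip_seconds(rounds: int = 10_000) -> float:
    """Average time for one 8 byte message to go out and come back."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo, args=(b, rounds))
    worker.start()
    payload = b"\x00" * MSG_SIZE
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            a.sendall(payload)
            recv_exact(a, MSG_SIZE)
        elapsed = time.perf_counter() - start
    finally:
        a.close()
        worker.join()
        b.close()
    return elapsed / rounds


def slowdown_vs_mpi(ssi_round_trip: float, mpi_one_way: float) -> float:
    """Ratio of an SSI round trip to 2x the MPI one way latency."""
    return ssi_round_trip / (2.0 * mpi_one_way)


if __name__ == "__main__":
    print(f"local round trip: {round_trip_seconds() * 1e6:.1f} us")
```

As a usage example, `slowdown_vs_mpi(160e-6, 4e-6)` gives about 20, which is the kind of ratio the message describes for the current state, while a round trip of 16 us against the same 4 us one way latency would give the hoped-for factor 2.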
