[Beowulf] RE: programming multicore clusters
Mark Hahn hahn at mcmaster.ca
Fri Jun 15 05:46:49 PDT 2007
> Is running a program using OpenMP on a SMP/multi-core box more efficient than
> an MPI code with an implementation using localhost optimization?

beyond 2-4 processors, all machines are message passing. take a look at Intel's recent products: they put one or two dual-core chips in a package, but if you want dual sockets, you get two FSBs - partly for fanout/loading reasons, and partly because truly symmetric, flat SMP machines just don't scale.

OK, so once you accept that even shared-memory machines are actually passing messages, the question becomes: what kind of protocol and message size do you want?

on a typical message-passing SMP machine (multi-socket x86_64, even SGI Altix), the message size is a cache line (64 or 128B afaik). that's a pretty OK number, but to make effective use of it, you have to write your code to pack as much relevant data as possible into these appropriately aligned and sized chunks of memory, knowing that they'll implicitly become packets. you have to marshal your packets, if you will. gosh! same term is used in explicit msg-passing...

in other words, you have to adopt a message-passing methodology regardless of whether your packets are fixed-sized implicit things, or variable-sized, explicit ones. the main difference is in how your messages are addressed - by a simple flat memory address, or by something typically like <node,port,tag>.

in some cases, implicit, memory-based addressing is a real win - mainly if many of your remote one-sided references are to a space that can remain unsynchronized for an extended time (say per timestep). I don't think I've ever seen a paper that tried to quantify this directly, though it would be most interesting...

ccNUMA - provides automatic synchrony by tracking the state of each cache line. but limited by cache size, and perhaps this tracking is irrelevant given your access patterns. the level of consistency may also hurt you, since a naive programmer will waste major cpu time on false sharing or hot cache lines.

RDMA - similar to ccNUMA except with no 'O' or 'E' states, or tracking of states at all. no hardware-supported consistency guarantees, and also significantly higher latency.

explicit msg-passing - different addressing, explicit list of data, not purely what's in a cacheline, but also explicit synchronization, which may seem too rigid. latency not that much higher than RDMA.

for the classic example of one worker wanting to collect state from its grid neighbors, direct memory access seems the most natural. but MPI codes can handle this pretty successfully by either using a nonblocking irecv or by having a data-serving thread. either one is, admittedly, extra overhead.

unless most of your IPC is this kind of async, unsync, passive data reference, I wouldn't think twice: go MPI. the current media frenzy about multicore systems (nothing new!) doesn't change the picture much.

regards, mark hahn.
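
The point above about packing data into cache-line-sized chunks, and about false sharing, comes down to simple layout arithmetic. The sketch below only models that arithmetic; it does not control real memory placement. It assumes a 64-byte line (the post says 64 or 128B) and an array that starts on a line boundary. The record layout (two 64-bit counters plus a double) and all names in it are hypothetical, chosen for illustration.

```python
import struct

CACHE_LINE = 64  # bytes; assumed (the post cites 64 or 128)

# Hypothetical per-worker record: two 64-bit counters and one double,
# followed by 40 pad bytes so one record fills exactly one cache line.
PADDED_RECORD = struct.Struct("<qqd40x")
# The same fields with no padding: 24 bytes, so neighbouring records
# end up sharing (and sometimes straddling) cache lines.
PACKED_RECORD = struct.Struct("<qqd")


def lines_touched(index, record_size, line=CACHE_LINE):
    """Return the set of cache-line numbers occupied by record `index`,
    assuming the array itself starts on a line boundary."""
    first = (index * record_size) // line
    last = (index * record_size + record_size - 1) // line
    return set(range(first, last + 1))


def shared_pairs(n_workers, record_size, line=CACHE_LINE):
    """List worker pairs whose records overlap in at least one cache line,
    i.e. pairs that can suffer false sharing when both write."""
    touched = [lines_touched(i, record_size, line) for i in range(n_workers)]
    return [
        (a, b)
        for a in range(n_workers)
        for b in range(a + 1, n_workers)
        if touched[a] & touched[b]
    ]


if __name__ == "__main__":
    for rec in (PACKED_RECORD, PADDED_RECORD):
        print(rec.size, shared_pairs(4, rec.size))
```

With the unpadded layout, several workers' records land in the same line, so every write by one worker invalidates the line for the others. With the padded layout each record is its own implicit "packet", at the cost of 40 unused bytes per worker.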
