Beowulf: A theorical approach
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
David S. Greenberg dsg at super.orgMon Jun 26 09:06:34 PDT 2000
- Previous message: Beowulf: A theorical approach
- Next message: Beowulf: A theorical approach
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I'm at least partially responsible for coining the acronym SALC. I got tired of talking about "the T3E memory model" when I meant something more general such as Quadrics. Several of us (including some hardware architects, compiler writers, and applications writers) believe that it is important to populate the space in between pure distributed memory and pure shared memory models. I'll describe why we want something different by first summarizing our problems with the extrema models and then describing what we believe is necessary. "A pox on both your houses" The pure distributed memory models tend to require message passing, emphasize portability over all else, and please those whose applications are relatively loosely coupled -- in a word beowulfers. The pure shared memory models tend to push flat memory spaces, emphasize the need for global cache coherency, and please those whose applications have not been tailored to parallelism and data locality. The hardware costs (in design, scalability, etc.) of a globally shared and cache consistent memory can be huge. Most of the main stream shared memory systems (SGI, Sun, IBM, Compaq) are, in fact, patinas over distributed memory. Going beyond 64 processors gets difficult -- the systems tend to become unstable and not worth the increased design costs. I am convinced that if you want shared memory of this type then you are better off following more radical designs such as the Cray/Tera MTA. On the other hand, as has been pointed out by several people in this thread, classic ethernet-connected beowulfs can have limited applications scope. They are fantastic for some applications but quickly top out at 8 to 16 processors for other applications. Faster (higher bandwidth and lower latency) networks help somewhat but the processors (and nodes) are becoming more powerful faster than the interconnects are improving. PCI-X will be probably be a big step forward but almost certainly not enough. (I suspect that Infiniband like AGP will be more of a marketing coup than a technical break-through except that it will take all the credit for PCI-X). As an example of our plight consider that the ASCI red machine and Cray T3E machines had full featured, 400 MB/s interconnects four years ago with processing nodes which are anemic compared to today's nodes. PCI-X, when it appears, will maybe get us back to this level of bandwidth with many fewer nice features in early PCI-X NICS/systems than were in the MPP machines. We deserve better but are not greedy. Which is where the SALC model comes in. It is possible, Quadrics is a good example, to augment NICs with the ability to perform remote direct memory access. "Standards" such as VIA make this an option for hardware. I think we must make it a requirement. Furthermore, we need to, as a community, decide what attributes are necessary. I believe, based on my experience with Portals at Sandia and UPC here at CCS, that fairly simple hardware is sufficient. We do not need all the bells and whistles of Cray's e-registers (though they are very nice.) We do not need to support in hardware all the notification modes of MPI-2 one-sided (though every mode has its strong adherents.) We do not need the on NIC virtual-to-real address translation of Quadrics or Matt Welsh's myrinet control program (though they are certainly nice also.) Instead, we need the ability to register large chunks of contiguous physical memory with the NIC and associate them with a tag usable from user-space on all nodes. The registering is likely to occur in a library or be done by the compiler. Next we need the ability to issue remote loads and stores by issuing just a couple of machine instructions (ideally just two loads or a load and a store). These instructions should take advantage of all the processor machinery for outstanding loads and delayed writes. Lastly we need a relatively efficient synchronization primitive which allows us (again usually a library or the compiler) to create a fence guaranteeing that previous remote operations have completed. All this should be possible in a $100 NIC and require little or no increase in switch complexity. So please help us move forward. (1) Propose a better name than SALC. We need to keep the notion that addresses are somehow shared, i.e. you can address most if not all of the memory of remote nodes with simple efficient load/store type operations. We also need to convey the warning that once a datum is fetched there will be no automatic notification to the fetcher when someone else has written to the fetched location. (2) Think about the real needs of your codes. Can you save lots of space my not copying boundary regions? What are your "memes"? Do you use shared work queues? Do you use ghost cells? How do you load balance? How do you improve data locality and reuse? How do you synchronize? (3) Demand more from compiler writers, system architects, and machine purchasers? Beowulf's are nice but you can have more if you push for it. I think I've passed my "one-screen per posting" limit so I'll stop here but I'd love to continue the discussion one-on-one with anyone who cares to contact me. Or come to the Atlanta Linux Showcase and Conference in October. David Greg Lindahl wrote: > > > But it isn't actually cache coherent. Remember Cray shmem(): To use > > > data, you fetch it to be close to you, and then it isn't cache > > > coherent while it's local. It IS cache coherent in that the fetch > > > gets you the right data. > > > > Wrong. In the CS2 it _was_ cache coherent both locally and remotely. > > I wasn't discussing the CS2, I was discussing the T3E and SALC. Sorry if I > gave a different impression. I don't know of any other machines like the > CS2, nor do I think they're interesting. > > > Another of the reasons I don't like the SALC description is precisely > > this confusion about what the "local consistency" is intended to mean. > > You can take it up with Bob Numrich; I'm just the messenger. Bob invented > the shmem interface in the first place. shmem is interesting mainly because > it's cheaper to build hardware for it than for things like the CS2. > > > Making the NIC properly cache coherent is one of the main reasons to > > be on the processor bus, appearing as a second CPU. It allows the NIC > > to implement the full coherency protocol when accessing data (either > > bringing it in, or sending it out). > > That's far more expensive and difficult than the other benefit of getting on > the processor bus: reduced latency. > > -- g > > _______________________________________________ > Beowulf mailing list > Beowulf at beowulf.org > http://www.beowulf.org/mailman/listinfo/beowulf
- Previous message: Beowulf: A theorical approach
- Next message: Beowulf: A theorical approach
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
