[Beowulf] New HPCC results, and an MX question
Patrick Geoffray patrick at myri.com
Wed Jul 20 18:49:09 PDT 2005
Vincent Diepeveen wrote:
>>>There likely will be a difference, because average pingpong doesn't
>>>run on all the cpus. On a 4-cpu node, that can make a big difference.
>>
>>I believe the difference will not be that big. I will get my hands on a
>>quad in the next couple of weeks, I will look into it.
>
> The difference will be huge of course, network processors have a switch
> latency. That's why.
>
> If it must switch at the wrong moment that'll cost 50 us or something at
> certain network chips.

Switch latency is negligible in this problem, and in any event 50 us is not a realistic switch latency with modern hardware.

The real question is the following: do 4 processes running on 4 different CPUs greatly affect the latency of sending small messages to other nodes, compared to only one process running on one CPU? The answer, I argue, is "not much".

Assuming that all processes send at the exact same time, access to the PCI bus will be serialized, NIC processing will be serialized and access to the wire will be serialized. The most expensive resource in this pipeline for 0-byte messages is likely to be the NIC. So it boils down to the NIC overhead per send (or recv), and that is not big with MX (and will be further reduced in the future). In any event, it is not on the order of 10 us. With GM, it's a different story, as it does not do PIO for small messages.

> Additional there will be software layers that have to lock in some way.

You don't have to lock when doing os-bypass. At least, you don't have to lock with other processes (which is kinda expensive). We take a spinlock because we have at least one other thread in the lib. The gain of having such a thread outweighs the cost of the spinlock, no question about that.

> Locking + unlocking is already like half a microsecond extra, just like that.

Taking a spinlock on Opteron is ~50 ns. On Xeon or Nocona, it's a bit more (~150 ns).

> Tests at all processors at the same time make major sense.

Yes and no. Most networking people believe the job of a node is to send messages. Actually, it's mainly to compute, and sometimes to send messages. So, would running a pingpong test on multiple processors at the same time, sharing a NIC, be an interesting benchmark? Not really: it won't happen much in real codes that compute most of the time. I prefer to optimize other things that help the host compute faster.

> Any denial in advance that it will be the same speed is just ballony.

And I thought I was the bulliest on this list... I just give my opinion, and at least my opinion is backed up by first-hand experience. I don't know how to play chess, but I know my stuff.

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
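
The serialization argument in the message above (PCI bus, then NIC, then wire, each handling one message at a time) can be sketched as a small pipeline model. This is only an illustration of the reasoning, not a measurement: the function name and the per-stage costs below are hypothetical placeholders and are not MX, GM, or Myrinet figures. The point it shows is that, when several senders post at once, the extra delay seen by the last sender is roughly (senders - 1) times the cost of the slowest stage.

```python
def completion_times(stage_costs_ns, num_senders):
    """Model num_senders messages posted at time 0 through serialized stages.

    Each stage (e.g. PCI bus, NIC, wire) handles one message at a time.
    Returns the time in nanoseconds at which each message leaves the
    last stage, in the order the messages were served.
    """
    stage_free_at = [0] * len(stage_costs_ns)  # when each stage is next idle
    done = []
    for _ in range(num_senders):
        t = 0  # every sender posts at the same instant
        for stage, cost in enumerate(stage_costs_ns):
            start = max(t, stage_free_at[stage])  # wait for the stage to free up
            t = start + cost
            stage_free_at[stage] = t
        done.append(t)
    return done


# Hypothetical per-message stage costs in ns: PCI, NIC, wire.
# These values are assumptions chosen for illustration only.
ASSUMED_STAGE_COSTS_NS = [200, 500, 100]

one_sender = completion_times(ASSUMED_STAGE_COSTS_NS, 1)
four_senders = completion_times(ASSUMED_STAGE_COSTS_NS, 4)
print(one_sender)    # [800]
print(four_senders)  # [800, 1300, 1800, 2300]
```

With these assumed costs, the slowest stage (the NIC, at 500 ns) sets the spacing between successive messages, so the fourth sender finishes 1500 ns later than a lone sender would. Plugging in measured per-stage overheads for a real NIC would be needed to draw any actual conclusion.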
