[Beowulf] Maximizing intra-node communication performance
Joe Landman landman at scalableinformatics.com
Wed Dec 28 20:11:57 PST 2005
- Previous message: [Beowulf] Maximizing intra-node communication performance
- Next message: [Beowulf] p4_error
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Tahir:

Tahir Malas wrote:
> Hi all,
> Taking advice from a previous discussion, we have purchased a Tyan server
> with 8 dual-core Opteron 870 processors. Now I want to wonder how I can
> maximize the intra-node communication of the server. We have been using

By maximize, do you mean maximizing bandwidth? Minimizing latency? Both?

> LAM-MPI, but I think that TCP/IP protocol may degrade the performance.

In mpich 1.2.x using the ch_p4 device, I am not sure if it will automatically use shared memory for MPI processes running on the same machine. I suspect not. I have used ch_shmem with such units with some success, though you have to start worrying about contention for shared memory arenas in a quad system when you are using a shared memory device. Also, you need to make sure that memory and processes are pinned to the appropriate CPU (affinity scheduling using numactl and other bits).

> Has
> anybody tried new implementations of MPI, or anybody knows some other
> support for intra-node communication?

With mpich 1.2.x you could use ch_shmem. I have run into some performance issues with this in the recent past, where an 8 way run on a dual core quad unit using mpich and the ch_shmem device was not as fast as similar runs using other MPI stacks (mpich-ib, mpich-gm).

I have done some very recent work with MPI and compiler bits from Pathscale for the LAMMPS code (molecular dynamics) which have shown excellent scalability per node and across nodes. I have not been successful to date getting LAMMPS to run with LAM. LAM 7.x offers (IMO) some nice features/functionality relative to mpich 1.2.x.

The issues in running on large NUMA systems are significant. For large shared memory units with lots of memory controllers, you need to worry about first touch (usually more so with OpenMP) allocations. You really don't want lots of other things to get in the way of your performance, so time spent traversing a network stack is to be avoided. A good MPI implementation is in order.

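To illustrate the first-touch point above, here is a minimal sketch in C with OpenMP. It is not taken from the original message. It assumes a Linux-style policy in which a page is placed on the NUMA node of the thread that first writes to it, and that threads stay on their cores between the two loops. The function names alloc_first_touch and scaled_sum are hypothetical. The idea is to initialise the array with the same static loop schedule that the compute loop uses, so each thread mostly works on memory local to its own memory controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and zero them in parallel.
 * Under a first-touch policy (an assumption about the OS), each page
 * ends up near the thread that writes it first, so the array is
 * initialised with the same static schedule used by the compute loop. */
double *alloc_first_touch(size_t n)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;            /* the first write decides page placement */

    return a;
}

/* Hypothetical compute step: fill a[i] = s * i and return the sum.
 * The static schedule matches the initialisation loop above, so each
 * thread revisits the index range it touched first. */
double scaled_sum(double *a, size_t n, double s)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        a[i] = s * (double)i;
        sum += a[i];
    }
    return sum;
}
```

With GCC this would typically be built with -fopenmp; without that flag the pragmas are ignored and the code runs serially with the same result. Whether the placement actually helps depends on the thread count, the pinning in effect, and the allocation size, so it is worth measuring on the machine in question.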
If you will only run on individual nodes and never across nodes, OpenMP can be quite powerful. Mixed model (MPI across nodes, OpenMP on each node) is somewhat harder to do.

Joe

> Thanks in advance,
> Tahir Malas
> Bilkent University
> Electrical and Electronics Engineering Department
> Phone: +90 312 290 1385

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
