Bizarre problems when adding a PPC machine...
t mirrorsh at atlantech.net
Tue Oct 29 14:52:35 PST 2002
- Previous message: mpich and scheduling issues
- Next message: optimization for scyld environment
I am replying to a post I saw in the archives here dating from Jan 2002, primarily because it is the same problem I have been having since I added a PPC machine to my x86 Linux cluster running MPICH.

> I really hate to bother the mailing list but this one has me somewhat
> stumped. I have a four node cluster comprising Linux machines and one
> PPC machine. The Linux machines have been adequately tested and play
> well together. That PPC machine is another matter. When I include the
> PPC machine (a Mac 8500 running YellowDog Linux) in my network
> cluster... well things fall apart. Here's what appears on the console
> after running a simple test on my "root" node....
>
> > [john at adenine examples]$ ./mpirun -np 4 simpleio
> > p2_9722: p4_error: Could not allocate memory for commandline args:
> > 553648128

Someone on the list suggested that this was probably an endian problem and recommended contacting the authors of MPICH. Well, this has apparently been a problem since at least 1996 (from groups.google archives), yet I have not found a solution anywhere on the web (via google anyway). I also haven't had much luck talking to the MPICH authors in the past about bugs in the MPICH implementation, so instead I just fixed it myself.

Just in case anyone else ends up having this problem, I thought I'd post a possible solution so that at least it will be saved in some form of archive for posterity. A complete fix would be relatively time-consuming: it would involve changing how MPI reads/writes data in p4, or altering the configure scripts and other things to base machine "type" on something other than simply the operating system, or who knows. Here's a quick fix for anyone interested.

The problem is that MPICH (as of 1.2.4) cannot handle heterogeneous networks in which machines of different bytesex are all running Linux or BSD using the ch_p4 device. p4 seems to write data out in network order only if MPICH thinks the machine on the other end is of a different architecture.
The trouble is that MPICH determines a machine's architecture from the operating system it is running, not from its processor architecture. There is a quasi-fix in MPICH to compensate, but it is broken.

Step 1) On the Linux PPC machines, edit mpid/ch_p4/p4/include/p4_MD.h. Where it says:

    #if defined(LINUX)
    #define P4_MACHINE_TYPE "LINUX"
    #endif

replace "LINUX" with "LINUX_PPC".

In mpid/ch_p4/p4/lib/p4_MD.c, in the function data_representation (at the bottom of the file), remove the ENTIRE #ifdef WORDS_BIGENDIAN block and replace it with something like this:

    if (strcmp(machine_type, "LINUX_PPC") == 0) return 21;
    if (strcmp(machine_type, "LINUX_X86") == 0) return 2;

Step 2) On the Linux x86 machines, do the same, except that in p4_MD.h you replace "LINUX" with "LINUX_X86".

So you will have two "versions": MPICH 1.2.X-ppc and MPICH 1.2.X-x86. Or however you enjoy naming things.

This can be generalized to other machine types, e.g., NETBSD_X86, FREEBSD_ALPHA, LINUX_FOOZWITZ, so long as you add the requisite entries in the data_representation() function and hack p4_MD.h as necessary.

I don't know if this is the proper forum for this sort of thing, but at least there will now be a solution posted to a google-accessible archive. The MPICH folks can at some point create an actual general-purpose patch that isn't quite as hacky and put it in 1.2.5.

--
Stephen Lawler
More information about the Beowulf mailing list
