Poor scaling (was Re: Question about custers)
Bogdan Costescu bogdan.costescu at iwr.uni-heidelberg.de
Mon Feb 10 02:47:38 PST 2003
On Fri, 7 Feb 2003, Ken Chase wrote:

> Im curious, when people see really poor scaling on their clusters (HSI
> or GBE or 100BT, doesnt matter) at like 16 or 32 or more nodes (Im thinking
> CHARMM and Gromacs here), what do you do with the extra cpu?

Well, this is for the case when you have SMP nodes and run only one process per node. I agree that this scales better, but for administrative reasons this is not always the case...

Based on similar reasoning and Moore's law, we never bought a large number of nodes at one time for running CHARMM. Instead, we always buy multiples of 4 in the range 4-16, sometimes UP, sometimes SMP (the multiple of 4 is not because it's a power of 2 or somehow related to CHARMM, but because it's the number of computer cases that fit on one of our shelves :-)). Because the batches are often of different speeds and sometimes connected to different switches, it's really not efficient to run a job on nodes bought in different batches, so in this way we limit the maximum number of CPUs that can be allocated to a job.

> Just let it float away unused? Do you use it? Do you run other jobs on
> them at the same time?

We usually use the SMP nodes in 2 ways:

- 2 jobs: one parallel and one single. As CHARMM still has important features that do not run in parallel (e.g. normal modes), we run one of these (usually large-memory) jobs along with a (usually low-memory) parallel one. This requires SMP nodes with large amounts of memory (>= 1 GB).

- 2 parallel jobs. We have found (by trying, so don't take this as the definitive answer!) that the total throughput is higher; this however is true only when the jobs have similar data sizes - if the numbers of atoms differ by an order of magnitude or more, this is not true anymore.

> Do you nice those jobs to 19?

No, we usually let them run at normal priority. At least in the first case they don't seem to interfere with each other. In the second case, we know that there will be interference (even in the case of HSI - Myrinet here), so we just say "That's life" :-)

> Do you see your cache being thrashed by this

How do you quantify the cache thrashing? I've found that CHARMM scales surprisingly well with CPU speed for classical MD jobs and is not affected much by cache size; this comes only from job-level timing, since for various reasons I've never been able to include a CPU counter library in the "production" cluster kernels. Until now (P4-Xeon era) I haven't seen a significant speed improvement when compiling with PGI compilers vs. fort77+f2c+gcc2.x (which for me always generated faster code than g77 -O6 and all other optimizations present in the CHARMM Makefile). I haven't yet had time to test how the speed compares on Xeons with f2c+gcc-2.x vs. g77-3.x vs. PGI vs. Intel compilers.

Why I'm saying this: because I believe that the speed gain from gcc-3.x/PGI/Intel compilers would come from better memory access patterns which would use the cache more fully, and so the effect of cache thrashing would be more evident - please correct me if I'm talking rubbish.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
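
The message describes never letting a job span nodes bought in different batches, which caps the CPUs a job can get. The following is only a minimal sketch of that policy, not the scheduler setup actually used in Heidelberg: the Node fields, the batch labels and the allocate() helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str   # hypothetical host name
    batch: str  # hypothetical purchase-batch label
    cpus: int   # 1 for a UP node, 2 for an SMP node


def allocate(nodes, cpus_wanted):
    """Pick nodes from ONE batch covering cpus_wanted CPUs, else None.

    Assumption: a job may never mix nodes from different batches,
    because they differ in speed and may sit on different switches.
    """
    by_batch = {}
    for node in nodes:
        by_batch.setdefault(node.batch, []).append(node)

    def batch_cpus(label):
        return sum(n.cpus for n in by_batch[label])

    # Try the smallest batch first so large batches stay free for large jobs.
    for label in sorted(by_batch, key=batch_cpus):
        if batch_cpus(label) < cpus_wanted:
            continue
        chosen, total = [], 0
        for node in by_batch[label]:
            chosen.append(node)
            total += node.cpus
            if total >= cpus_wanted:
                return chosen
    return None  # no single batch is big enough
```

With one batch of four UP nodes and one batch of four SMP nodes, a request for 10 CPUs is refused even though 12 CPUs exist in total, which is exactly the limit on job size that the batch rule imposes.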
More information about the Beowulf mailing list
