[Beowulf] hang-up of HPC Challenge

Mikhail Kuzminsky kus at free.net
Tue Aug 19 16:45:43 PDT 2008


To narrow down the possible cause, I ran the plain HPL test instead 
of HPCC. HPL writes its output directly to the screen instead of 
to a file.

Using MPICH with np=8, I obtained a normal HPL result for N=35000, 
including the 3 "PASSED" strings for the ||Ax-b|| checks. BUT! Linux 
hangs immediately after these strings are printed.
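
As a rough sanity check on whether these problem sizes fit in RAM, the sketch below estimates the size of the dense N x N matrix alone. It assumes 8-byte double-precision elements and ignores HPL workspace, MPI buffers, and the extra arrays HPCC allocates for its other tests; the function name hpl_matrix_bytes is hypothetical and exists only for this illustration.

```python
def hpl_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed for one dense n x n matrix (assumes 8-byte doubles)."""
    return n * n * bytes_per_element


GB = 10**9  # decimal gigabytes, to match the "16 GB" / "14 GB" figures in the thread

for n in (10000, 15000, 20000, 35000):
    total = hpl_matrix_bytes(n)
    # With np=8 the matrix is distributed, so each process holds roughly 1/8.
    print(f"N={n}: {total / GB:.2f} GB total, ~{total / 8 / GB:.2f} GB per process")
```

Under these assumptions the N=35000 matrix takes about 9.8 GB, which is below the roughly 14 GB main memory limit mentioned in the quoted message. That is consistent with the absence of swapping, but it does not rule out a hardware memory fault.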

Mikhail 
  

In message from "Mikhail Kuzminsky" <kus at free.net> (Mon, 18 Aug 2008 
22:20:16 +0400):
>I ran a set of HPC Challenge benchmarks on ONE dual-socket quad-core 
>Opteron 2350 (Rev. B3) based server (8 logical CPUs).
>RAM size is 16 GB. The tests were run under SuSE 
>10.3/x86-64, with LAM MPI 7.1.4 and MPICH 1.2.7 from the SuSE 
>distribution, using ATLAS 3.9. Unfortunately there is only one such 
>cluster node, and I can't reproduce the run on another node :-(
>
>For N (matrix size) up to 10000 everything looks OK. But for larger N 
>(15000/20000/...), running hpcc (mpirun -np 8 hpcc) makes Linux 
>hang.
>
>In the "top" output I see 8 hpcc instances, each using about 100% of 
>a CPU, with reasonable amounts of virtual and RSS memory per hpcc 
>process and no swap usage. Usually there are no PTRANS 
>results in the hpccoutf.txt results file, but in a few cases (when I 
>"actively watched" the hpcc execution by issuing ps/top) I 
>saw reasonable PTRANS results but no HPLinpack results. One 
>time I obtained PTRANS, HPL and DGEMM results for N=20000, but the 
>hang came later, during the STREAM tests. Maybe the missing results 
>are simply because the output buffer had not yet been flushed to the 
>output file on disk when the hang occurred.
>
>One possible cause of the hangs is a memory hardware problem, but 
>what about possible software causes? 
>The hpcc executable is 64-bit and dynamically linked. 
>/etc/security/limits.conf is empty. The stack size limit (for the 
>user issuing mpirun) is "unlimited", the main memory limit is about 
>14 GB, and the virtual memory limit is about 30 GB. ATLAS was 
>compiled with 32-bit integers, but that is enough for such N values. 
>Even /proc/sys/kernel/shmmax is 2^63-1.
>
>What else could be causing the hang?
>
>Mikhail Kuzminskiy
>Computer Assistance to Chemical Research Center
>Zelinsky Institute of Organic Chemistry
>Moscow
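
The claim in the quoted message that 32-bit integers are enough for these N values can be checked with simple arithmetic. The sketch below assumes the worst case, where a single signed 32-bit integer has to hold an element count or offset spanning the whole N x N matrix; whether ATLAS actually computes such an offset in one 32-bit value is an assumption here, not something stated in the thread. The helper name fits_int32 is hypothetical.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def fits_int32(n):
    """True if n*n (the element count of an n x n matrix) fits in a signed 32-bit int."""
    return n * n <= INT32_MAX


# Largest N whose full element count still fits under this assumption.
max_n = math.isqrt(INT32_MAX)

for n in (10000, 20000, 35000, 50000):
    print(f"N={n}: n*n={n * n}, fits in int32: {fits_int32(n)}")
print(f"largest safe N under this assumption: {max_n}")
```

Under this assumption the limit is N = 46340, so N = 35000 and below stay within range, and integer overflow in the BLAS looks like an unlikely explanation for hangs at N = 15000 or 20000.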



