[Beowulf] Opteron memory rank limits with DDR-400

Vincent Diepeveen diep at xs4all.nl
Wed Jul 27 14:12:54 PDT 2005


Quad Opteron, dual-core, 1.8 GHz

Dmesg gives:
"AMD Opteron(tm) Processor 865 stepping 00"

All 16 banks filled with 256 registered+ecc PC3200 memory.

How do I check what clock the memory runs at?

Latency timings, measured with 250 MB of RAM per CPU (so that's 2 GB with
8 cores):
1 cpu   : 144-147 ns
2 cpu's : 174 ns
4 cpu's : 206 ns
8 cpu's : 234 ns

To test it with this program, do:
  gcc -O2 -o lat latencylinux.c
  ./lat 250000000       // single cpu eating 250MB
  ./lat 250000000 2     // dual eating 500MB in total
  ./lat 250000000 4     // quad
  ./lat 250000000 8     // 8 cpu's

etc.
Confirmed working up to 500 CPUs.

At 10:26 AM 7/27/2005 -0600, Josip Loncaric wrote:
>Hello,
>
>Can anyone confirm that Opteron processors Rev. E and later can operate 
>four dual-rank 2GB memory modules (8 ranks total) at full DDR-400 speed?
>
>AMD used to recommend no more than 4 ranks of DDR-400 memory.  See 
>http://forums.amd.com/lofiversion/index.php/t39745.html where the 
>relevant quote from AMD technical service reads:
>
>"AMD does recommend to downclock the memory of the system to 333MHz,
>if more than 4 ranks is used in the DIMM slots. What this means is
>that only 2 sticks of 2 rank memory is recommended to run at the full
>400MHz or 4 sticks of 1 rank memory. There is a memory timing issue
>with more than 4 ranks of memory, which is a limitation of the memory
>controller on the Opteron chips."
>
>In the past, this downclocking was automatically enforced by some 
>BIOSes, but supposedly there is no need to do so with currently shipping 
>Opteron Rev. E and later, provided that the motherboard also allows full 
>8 ranks at DDR-400.
>
>I'd just like to be sure...  Also, has anyone observed increased memory 
>latency with dual-rank modules?
>
>Sincerely,
>Josip
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
 *
 * This program rasml.c measures the Random Average Shared Memory Latency (RASML)
 * Thanks to Agner Fog for his excellent random number generator.
 *
 * This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
 *
 * Created by Vincent Diepeveen who hereby releases this under GPL
 * Feel free to look at the FSF (free software foundation) for what
 * GPL is and its conditions.
 *
 * Please don't confuse the times achieved here with two times the one-way
 * pingpong latency, though on ideally scaling supercomputers/clusters they
 * will be close. There are a few differences:
 *    a) this test is TLB thrashing
 *    b) this test tests ALL processors at the same time and not
 *       just 2 cpu's while the rest of the entire cluster is idle.
 *    c) this test ships 8 bytes, whereas one-way pingpong typically also
 *       gets used to test sizes of several kilobytes, or just returns a pong.
 *    d) this doesn't use MPI but shared memory, and the way such protocols
 *       are implemented possibly matters for latency.
 *
 * Vincent Diepeveen                 diep at xs4all.nl
 * Veenendaal, The Netherlands       10 june 2003
 *
 * First a few lines about the random number generator. Note that I modified Agner Fog's
 * RanRot very slightly. Basically its initialization has been done better and some dead
 * slow FPU code rewritten to fast 64 bits integer code.
 */

#define UNIX 1  /* set to 1 when you are under unix or using gcc-like compilers */
#define IRIX 1  /* this value only matters when UNIX is set to 1. For Linux set it to 0.
                 * Basically, allocating shared memory is done in a pretty buggy way
                 * in the linux kernel.
                 *
                 * Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
                 * and see for yourself how much shared memory YOU can allocate in linux.
                 *
                 * If that is not enough to run this benchmark, then raise it with:
                 *    echo <newsize> > /proc/sys/kernel/shmmax
                 * You must be root to do that, and it must be redone each time the system boots.
                 */
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.


#if UNIX
  #include <pthread.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>
  #include <sys/times.h>
  #include <sys/time.h>
  #include <unistd.h>
#else
  #include <windows.h>
  #include <winbase.h> // for GetTickCount()
  #include <process.h> // _spawnl
#endif

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define SWITCHTIME      60000   /* in milliseconds. Modify this to let a test run longer or shorter.
                                 * Basically it is a good idea to use about the number of cpu's times
                                 * a thousand for this. 30 seconds is fine for PC's, but a very
                                 * bad idea for supercomputers. I recommend several minutes
                                 * there, and at least a few hours for big supers if the partition
                                 * isn't started yet. If the partition is started, starting it at
                                 * 460 processors (SGI) should take 10 minutes, otherwise it takes
                                 * 3 hours to attach all.
                                 * Of course that lets a test take way, way longer.
                                 */
#define MAXPROCESSES     512    /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH  128    /* cache line length at the machine. Modify this if you want to */


#if UNIX
  #include <time.h>
 // #include <memory.h>
  #define FORCEINLINE       __inline
  /* UNIX and such this is 64 bits unsigned variable: */
  #define BITBOARD                     unsigned long long
#else
  #define FORCEINLINE       __forceinline
  /* in WINDOWS we also want to be 64 bits: */
  #define BITBOARD                     unsigned _int64
#endif

#define     STATUS_NOTSTARTED    0
#define     STATUS_ATTACH        1
#define     STATUS_GOATTACH      2
#define     STATUS_ATTACHED      3
#define     STATUS_STARTREAD     4
#define     STATUS_READ          5
#define     STATUS_MEASUREREAD   6
#define     STATUS_MEASUREDREAD  7

#define     STATUS_QUIT         10

struct ProcessState {
  volatile int status; /* one of the STATUS_* values defined above, e.g.
                        *  STATUS_NOTSTARTED (0)  = not started yet
                        *  STATUS_QUIT       (10) = quit
                        */

  /* now the numbers each cpu gathers. The name of the first number is what
   * cpu0 is doing and the second name what all the other cpu's were doing at that
   * time
   */
  volatile BITBOARD readread; /* */
  char dummycacheline[CACHELINELENGTH];
};

typedef struct {
  BITBOARD nentries; // number of entries of 64 bits used for cache.
  struct ProcessState ps[MAXPROCESSES];
} GlobalTree;

void     RanrotAInit(void);
float    ToNano(BITBOARD);
int      GetClock(void);
float    TimeRandom(void);

void     ParseBuffer(BITBOARD);
void     ClearHash(void);
void     DeAllocate(void);
int      DoNrng(BITBOARD);
int      DoNreads(BITBOARD);
int      DoNreadwrites(BITBOARD);
//void     TestLatency(float);
int      AllocateTree(void);
void     InitTree(int);
void     WaitForStatus(int,int);
void     PutStatus(int,int);
int      CheckStatus(int,int);
int      CheckAllStatus(int,int);
void     Slapen(int);
float    LoopRandom(void);



/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK  17
#define JJ  10
#define R1   5
#define R2   3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
 0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
 0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
 0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
 0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
 /*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2;          /* indexes into history buffer */

/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
    cpus;  // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];

 /******************************************************** AgF 1999-03-03 *
 *  Random Number generator 'RANROT' type B                               *
 *  by Agner Fog                                                          *
 *                                                                        *
 *  This is a lagged-Fibonacci type of random number generator with       *
 *  rotation of bits.  The algorithm is:                                  *
 *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b               *
 *                                                                        *
 *  The last k values of X are stored in a circular buffer named          *
 *  randbuffer.                                                           *
 *                                                                        *
 *  This version works with any integer size: 16, 32, 64 bits etc.        *
 *  The integers must be unsigned. The resolution depends on the integer  *
 *  size.                                                                 *
 *                                                                        *
 *  Note that the function RanrotAInit must be called before the first    *
 *  call to RanrotA or iRanrotA                                           *
 *                                                                        *
 *  The theory of the RANROT type of generators is described at           *
 *  www.agner.org/random/ranrot.htm                                       *
 *                                                                        *
 *************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0)
    r_p1 = KK - 1;
  if( --r_p2 < 0 )
    r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator.      */
void RanrotAInit(void) {
  int i;

  /* one can fill the randbuffer here with possible other values here */
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;

  /* randomize */
  for( i = 0; i < 300; i++ )
    (void)RanrotA();
}

/* Now the RASML code */
char *To64(BITBOARD x) {
  static char buf[256];
  char *sb;

  sb = &buf[0];
  #if UNIX
  sprintf(buf,"%llu",x);
  #else
  sprintf(buf,"%I64u",x);
  #endif
  return sb;
}

int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
 * to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
 * for linux or unix we can only guess. Too many experts there.
 */
  #if UNIX
  struct timeval timeval;
  struct timezone timezone;
  gettimeofday(&timeval, &timezone);
  return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
  #else
  return((int)GetTickCount());
  #endif
}

float ToNano(BITBOARD nps) {
  /* convert something from times a second to nanoseconds.
   * NOTE THAT THERE ARE SOMETIMES COMPILER BUGS IN OLD COMPILERS,
   * SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
   * NOT MODIFY THIS CODE */
  float tn;
  tn = 1000000000/(float)nps;
  return tn;
}

float TimeRandom(void) {
  /* timing the random number generator is very easy of course. Returns
   * number of random numbers a second that can get generated
   */
  BITBOARD bb=0,i,value,nps;
  float ns_rng;
  int t1,t2,took;

  printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
  printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
  printf("Please wait a few seconds.. "); fflush(stdout);
  value = 100000;
  took  = 0;
  while( took < 3000 ) {
    value <<= 2; //  x4
    t1 = GetClock();

    for( i = 0; i < value; i++ ) {
      bb ^= RanrotA();
    }
    t2 = GetClock();
    took = t2-t1;
  }

  nps = (1000*value)/(BITBOARD)took;

  #if UNIX
  printf("..took %i milliseconds to generate %llu numbers\n",took,value);
  printf("Speed of RNG = %llu numbers a second\n",nps);
  #else
  printf("..took %i milliseconds to generate %I64u numbers\n",took,value);
  printf("Speed of RNG = %I64u numbers a second\n",nps);
  #endif

  ns_rng = ToNano(nps);
  printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);


  return ns_rng;
}

void ParseBuffer(BITBOARD nbytes) {
  tree->nentries = nbytes/sizeof(BITBOARD);
  #if UNIX
  printf("Trying to allocate %llu entries. ",tree->nentries);
  printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
  #else
  printf("Trying to allocate %s entries. ",To64(tree->nentries));
  printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
  #endif
}

void ClearHash(void) {
  BITBOARD *hi,i,nentries = tree->nentries;
  /* clearing hashtable */
  printf("Clearing hashtable for processor %i\n",ProcessNumber);
  fflush(stdout);
  hi = hashtable[ProcessNumber];
  for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
    hi[i] = i;
}

void DeAllocate(void) {
  int i;
  #if UNIX
  shmctl(shm_tree,IPC_RMID,0);
  for( i = 0; i < cpus; i++ ) {
    shmctl(shm_hash[i],IPC_RMID,0);
  }
  #else
  UnmapViewOfFile(tree);
  for( i = 0; i < cpus; i++ ) {
    UnmapViewOfFile(hashtable[i]);
  }
  #endif
}

int DoNrng(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= (index+(BITBOARD)i2);
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;
  return(t2-t1);
}

int DoNreads(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= hashtable[i2][index];
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

#if 0
int DoNreadwrites(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2;

  nents = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD index = RanrotA()%nents;
    dummyres ^= hashtable[index];
    hashtable[index] = dummyres;
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

void TestLatency(float ns_rng) {
  BITBOARD n,nps_read,nps_rw,nps_rng;
  float ns,fns;
  int timetaken;

  printf("Doing random RNG test. Please wait..\n");
  n = 50000000; // 50 mln
  timetaken = DoNrng(n);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);

  /* READING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random read tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreads(n);
  nps_read = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_read);
  printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);

  /* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random readwrite tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreadwrites(n);
  nps_rw = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_rw);
  printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
  printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);

  printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif

int AllocateTree(void) { /* initialize the tree. returns 0 if error */
  #if UNIX
  shm_tree = shmget(
              ftok(".",'t'),
              sizeof(GlobalTree),IPC_CREAT|0777);
  if( shm_tree == -1 )
    return 0;
  tree = (GlobalTree *)shmat(shm_tree,0,0);
  if( tree == (GlobalTree *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  if( !ProcessNumber ) {
    HANDLE TreeFileMap;
    TreeFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0,
     (DWORD)sizeof(GlobalTree),"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  else { /* slaves also try to attach to the tree */
    HANDLE TreeFileMap;
    TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  #endif
  return 1;
}

int AttachAll(void) {
  #if UNIX
  #else
  HANDLE HashFileMap;
  #endif
  char hashname2[32] = {"RASM_Hash00"},hashname[32];
  int i,r;
  for( r = 0; r < cpus; r++ ) {
    i = ProcessNumber+r;
    i %= cpus;
    if( i == ProcessNumber )
      continue;
    #if UNIX
    shm_hash[i] = shmget(
              #if IRIX
              ftok(".",200+i),
              #else
              ftok(".",(char)i),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
    if( shm_hash[i] == -1 )
      return 0;
    hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
    if( hashtable[i] == (BITBOARD *)-1 )
      return 0;
    #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */

    strcpy(hashname,hashname2);
    hashname[9] += (i/10);
    hashname[10] += (i%10);

    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[i] == NULL )
      return 0;
    #endif
  }
  return 1;
}

int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
  char hashname[32] = {"RASM_Hash00"};
  #if UNIX
  shm_hash[ProcessNumber] = shmget(
              #if IRIX
              ftok(".",200+ProcessNumber),
              #else
              ftok(".",(char)ProcessNumber),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
  if( shm_hash[ProcessNumber] == -1 )
    return 0;
  hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
  if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  //if( !ProcessNumber ) {
    HANDLE HashFileMap;

    hashname[9] += (ProcessNumber/10);
    hashname[10] += (ProcessNumber%10);

    HashFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0,
     (DWORD)tree->nentries*8,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;
  //}
  //else { /* Slaves attach also try to attach to the tree */
 /*   HANDLE HashFileMap;
    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;*/
  //}
  #endif
  return 1;
}

int StartProcesses(int ncpus) {
  char buf[256];
  int i;
  /* returns 1 if ncpus-1 started ok */
  if( ncpus == 1 )
    return 1;

  for( i = 1 ; i < ncpus ; i++ ) {
    sprintf(buf,"%i_%i",i+1,ncpus);
    #if UNIX
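    {
      pid_t pid = fork();
      if( pid < 0 )
        return 0; /* fork failed */
      if( pid == 0 ) {
        execl(rasmexename,rasmexename,buf,(char *)NULL);
        _exit(127); /* only reached when execl failed; don't let the child keep forking */
      }
    }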
    #else
    (void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
     #endif
  }
  return 1;
}

void InitTree(int ncpus) {
  int i;

  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status   = STATUS_NOTSTARTED;
    tree->ps[i].readread = 0;
  }
}

void WaitForStatus(int ncpus,int waitforstate) {
  /* wait for all processors to have the same state */
  int i,badluck=1;

  while( badluck ) {
    badluck = 0;
    for( i = 0 ; i < ncpus ; i++ ) {
      if( tree->ps[i].status != waitforstate )
        badluck = 1;
    }
  }
}

void PutStatus(int ncpus,int statenew) {
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status = statenew;
  }
}

int CheckStatus(int ncpus,int statenew) {
  /* returns false when not all cpu's are in the new state */
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != statenew )
      return 0;
  }
  return 1;
}


int CheckAllStatus(int ncpus,int status) {
  /* Tries with a single loop to determine whether the other cpu's also finished
   *
   * returns:
   *     true  ==> when all the processes have this status
   *     false ==> when 1 or more are still busy measuring
   */
  int i,badluck=1;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != status ) {
      badluck = 0;
      break;
    }
  }
  return badluck;
}

void Slapen(int ms) {
  #if UNIX
  usleep(ms*1000); /* usleep() takes microseconds, so convert from milliseconds */
  #else
  Sleep(ms);       /* Sleep() takes milliseconds */
  #endif
}

float LoopRandom(void) {
  BITBOARD n,nps_rng;
  float fns;
  int timetaken;
  printf("Benchmarking random RNG test. Please wait..\n");
  n = 25000000; // gets doubled before the first run, so the first try is 50 mln
  timetaken = 0;
  while( timetaken < 500 ) {
    n += n;
    timetaken = DoNrng(n);
  }
  printf("timetaken=%i\n",timetaken);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);
  return fns;
}

/* Example showing how to use the random number generator: */
int main(int argc,char *argv[]) {
  /* allocate a big memory buffer parameter is in bytes.
   * don't hesitate to MODIFY this to how many gigabytes
   * you want to try.
   * The more the better i keep saying to myself.
   *
   * Note that under linux your maximum shared memory limit can be set with:
   *
   * echo <size> > /proc/sys/kernel/shmmax
   *
   * and under IRIX it is usually 80% of the total RAM onboard that can get allocated
   */

  BITBOARD nbytes,firstguess;
  float ns_rng,f_loop;
  int tottimes,t1,t2;


  if( argc <= 1 ) {
    printf("Latency test usage is: latency <buffer> <cpus>\n");
    printf("Where 'buffer' is the buffer size in bytes to allocate PER PROCESSOR\n");
    printf("and where 'cpus' is the number of processes that this test will try to use (1 = default)\n");
    return 1;
  }

  /* parse the input */
  nbytes = 0;
  cpus   = 1; // default

  if( strchr(argv[1],'_') == NULL ) { /* main startup process */
    int np = 0;
    #if UNIX
     #if FREEBSD
     nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
     #else
     nbytes = (BITBOARD)atoll(argv[1]);
     #endif
    #else
    nbytes = (BITBOARD)_atoi64(argv[1]);
    #endif

    printf("Welcome to RASM Latency!\n");
    printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");

    if( argc > 2 ) {
      cpus = 0;
      do {
        cpus *= 10;
        cpus += (int)(argv[2][np]-'1')+1;
        np++;
      } while( argv[2][np] >= '0' && argv[2][np] <= '9' );
    }
    //printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
    ProcessNumber = 0;

    /* check whether we are not getting out of bounds */
    if( cpus > MAXPROCESSES ) {
      printf("Error: Recompile with a bigger value for MAXPROCESSES. %i processors is too many\n",cpus);
      return 1;
    }

    /* find out the file name */
    #if UNIX
    strcpy(rasmexename,argv[0]);
    #else
    GetModuleFileName(NULL,rasmexename,2044);
    #endif
    printf("Stored in rasmexename = %s\n",rasmexename);
  }
  else { //   latency 2_452  ==>  means processor 2 out of 452.
    int np = 0;

    ProcessNumber = 0;
    do {
      ProcessNumber *= 10;
      ProcessNumber += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );

    ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]

    np++; // skip underscore

    cpus = 0;
    do {
      cpus *= 10;
      cpus += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );
    //printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
  }

  /* first we setup the random number generator. */
  RanrotAInit();

  /* initialize shared memory tree; it gets used for communication between the processes */
  if( !AllocateTree() ) {
    printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
    return 1;
  }

  if( !ProcessNumber )
    ParseBuffer(nbytes);

  nentries = tree->nentries;

  /* Now some stuff only the Master has to do */
  if( !ProcessNumber ) {
    /* Master: now let's time the pseudo random generators speed in nanoseconds a call */
    ns_rng = TimeRandom();
    f_loop = LoopRandom();

    printf("Trying to Allocate Buffer\n");
    t1 = GetClock();
    if( !AllocateHash() ) {
      printf("Error: Could not allocate buffer!\n");
      return 1;
    }
    t2 = GetClock();
    printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
    ClearHash(); // local hash
    t1 = GetClock();
    printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);

    /* so now hashtable is setup and we know quite some stuff. So it is time to
     * start all other processes */
    InitTree(cpus);

    printf("Starting Other processes\n");
    t1 = GetClock();
    if( !StartProcesses(cpus) ) {
      printf("Error: Could not start processes\n");
      DeAllocate();
      return 1;
    }
    t2 = GetClock();
    printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();
  }
  else { /* all Slaves do this */
    if( !AllocateHash() ) {
      printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
      return 1;
    }
    ClearHash(); // local hash
  }

  tree->ps[ProcessNumber].status = STATUS_ATTACH;
  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACH);
    t2 = GetClock();
    printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();

    /* now we can continue with the next phase that is attaching all the segments */
    PutStatus(cpus,STATUS_GOATTACH);
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
      Slapen(500);
    }
  }

  if( !AttachAll() ) {
    printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
    return 1;
  }
  tree->ps[ProcessNumber].status = STATUS_ATTACHED;

  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACHED);
    t2 = GetClock();
    printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
    PutStatus(cpus,STATUS_STARTREAD);
    printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
     (SWITCHTIME/1000),(SWITCHTIME%1000));
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
      Slapen(500);
    }
  }

  tree->ps[ProcessNumber].status = STATUS_READ;

  firstguess = 200000;
  tottimes   = 0;
  for( ;; ) {
    int timetaken = 0;
    if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
      /* this really MEASURES the readread */
      BITBOARD ntried = 0,avnumber;
      int totaltime=0;
      while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds */
        totaltime += DoNreads(firstguess);
        ntried += firstguess;
      }
      /* now put the average number of readreads into the shared memory */
      avnumber = (ntried*1000) / (BITBOARD)totaltime;
      tree->ps[ProcessNumber].readread = avnumber;

      /* show that it is finished */
      tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;

      /* now keep doing the same thing until status gets modified */
      while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
        (void)DoNreads(firstguess);
        if( !ProcessNumber ) {
          if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
            PutStatus(cpus,STATUS_QUIT);
            break;
          }
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
      BITBOARD nextguess;
      /* now software must try to determine how many reads a second are possible for that
       * process
       */
      //printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
      timetaken = DoNreads(firstguess);
      /* try to guess such that next test takes 1 second, or if test was too inaccurate
       * then double the number simply. also prevents divide by zero error ;)
       */
      if( timetaken < 400 )
        nextguess = firstguess*2;
      else
        nextguess = (firstguess*1000)/(BITBOARD)timetaken;
      firstguess = nextguess;
      if( !ProcessNumber ) {
        tottimes += timetaken;
        if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
          tottimes = 0;
          if( CheckStatus(cpus,STATUS_READ) ) {
            PutStatus(cpus,STATUS_MEASUREREAD);
          } /* waits another SWITCH time before starting to measure */
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
      break;
  }

  /* now do the latency tests
   */
  //TestLatency(ns_rng);
  tree->ps[ProcessNumber].status = STATUS_QUIT;
  if( !ProcessNumber ) {
    BITBOARD averagereadread;
    int i;
    averagereadread = 0;
    WaitForStatus(cpus,STATUS_QUIT);
    printf("the raw output\n");
    for( i = 0; i < cpus ; i++ ) {
      BITBOARD tr=tree->ps[i].readread;
      averagereadread += tr;
      printf("%llu ",tr);
    }
    printf("\n");
    averagereadread /= (BITBOARD)cpus;
    printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
    printf("Now for the final calculation it gets compensated:\n");
    printf("  Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
  }

  DeAllocate();
  return 0;
}

/* EOF latencyC.c */

