[Beowulf] RASM : random memory latency test

Vincent Diepeveen diep at xs4all.nl
Mon Jul 4 12:04:20 PDT 2005


Hi, 

I wrote this program to test random memory latencies lookups at SSI (single
system images) using shared memory. 

Latencies are NOT dependant upon memory frequency, just dependant upon the
speed the RAM works at (which sometimes IS related to cpu frequency in case
of opteron as the faster opterons allow 400Mhz ram.

Attached the program to test with. Just compile it and then start it with
for example:

  gcc -O2 -o lat latencylinux.c
(as root) :
  echo 450000000 > /proc/sys/kernel/shmmax

now you can run it for example at 2 processors, each processor 200MB shared
memory where the other processor(s) attach to :
  ./lat 200000000 2

if you want to run it at 4 processors (it used 800MB in total then) :
  ./lat 200000000 4

And so on. In fact the program has been bugfixed to run up to 500 processors. 

Of course random access to memory is SLOWER in general when you take a
larger buffer. That's not because of the L2 cache only. So doing tests with
each processor the same buffer, a lot bigger than the L2 cache, say 200MB
is a good thing to do. 

The reason why i wrote this test is a bit sad, but just for historic
consumption.

SGI was claiming 460 nanoseconds for it worst case at 512 processor
partition, even when TLB trashing and when doing reads from and to random
processors; which obviously wasn't true.

Yet for computerchess and searching algorithms in general it matters a lot
as huge caches get used. 

I tested it at 460 processors to be 5800 nanoseconds on average to get 8
bytes from a random processor at a random memory location, let alone worst
case.

Altix3000 is not better when using this test, to say polite. 

Amazingly it's pretty accurate to test pc's with too.

Take into account that it is just doing READS. It's not WRITING.

So from intel viewpoint this is an intel friendly test, because when
sharing a memory controller, reading can be done usually in parallel, but
writing can't.

So if for your software you notice a difference when writing at random to
memory, that's probably because you also WRITE to memory.

At 09:48 PM 7/4/2005 +0400, Mikhail Kuzminsky wrote:
>In message from Vincent Diepeveen <diep at xs4all.nl> (Mon, 04 Jul 2005 
>17:59:40 +0200):
>> ...
>> ...
>>Of course we take a large buffer. Around 400MB is the working set 
>>size for
>>the hashtable which i use for my chess software (which is reading 
>>randomly
>>a 8-64 bytes from the cache).
>>
>>Results:
>>   single cpu A64 : 91 ns  (cl2 memory)
>>   single cpu P4  : 220 ns (cl2 memory, bus overclocked)
>>   dual opteron   : 120 ns 
>>   quad opteron   : 133 ns
>>   dual xeon      : 280 ns (800Mhz bus)
>>   dual xeon      : 400 ns (533Mhz bus)
>The latencies should depends from processors frequencies (although
>RAM part is much higher),
>so what was the frequencies for A64/P4/Opteron/Xeon ? 
>
>And do I understand you correctly that you have 1/2/4 threads which
>perform "random" read of some bytes from main memory ? 
>
>>
>>So obviously things that do not fit in L2 cache, the opteron runs 
>>away with
>>it. Only if the executable is optimized in question by the intel c++
>>compiler it will have done stuff to run it faster at intel processors 
>>than >at opteron, 
>>then results do not look too bad for P4.
>If the results above are for "bad" (bad optimizing) compiler -
>in some sense it's the problem of compiler :-) Yes, old binary 
>software will work slow. But many, many HPC applications may be 
>compiled
>from source.
>BTW, more good results are for icc++ only - do you know
>something about PGI and PathScale compilers ?
>    
>> Yet that's a matter of 
>>optimizing
>>it for opteron better, which most software dudes do NOT do, as intel
>>delivers good support and AMD historically didn't deliver *any* kind 
>>of
>>support (they are improving now, but even then their math libraries 
>>are so
>>pathetic compared to the ease of the intel libraries that i can 
>>imagine at
>>least *that* part of
>>the problems).
>acml 2.1 gives me a set of good results for Opteron in comparison 
>w/MKL
>
>Yours
>Mikhail
>
>
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
 *
 * This program rasml.c measures the Random Average Shared Memory Latency (RASML)
 * Thanks to Agner Fog for his excellent random number generator.
 *
 * This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
 *
 * Created by Vincent Diepeveen who hereby releases this under GPL
 * Feel free to look at the FSF (free software foundation) for what
 * GPL is and its conditions.
 *
 * Please don't confuse the times achieved here with two times the one 
 * way pingpong latency, though at
 * ideal scaling supercomputers/clusters they will be close. There is a few
 * differences:
 *    a) this is TLB trashing
 *    b) this test tests ALL processors at the same time and not
 *       just 2 cpu's while the rest of the entire cluster is idle.
 *    c) this test ships 8 bytes whereas one way pingpong typical also
 *       gets used to test several kilobyte sizes, or just returns a pong.
 *    d) this doesn't use MPI but shared memory and the way such protocols are
 *       implemented matters possibly for latency.
 *
 * Vincent Diepeveen                 diep at xs4all.nl
 * Veenendaal, The Netherlands       10 june 2003
 *
 * First a few lines about the random number generator. Note that I modified Agner Fog's
 * RanRot very slightly. Basically its initialization has been done better and some dead
 * slow FPU code rewritten to fast 64 bits integer code.
 */

#define UNIX 1  /* put to 1 when you are under unix or using gcc a look like compilers */
#define IRIX 1  /* this value only matters when UNIX is set to 1. For Linux put to 0
                 * basically allocating shared memory in linux is pretty buggy done in
                 * its kernel.
                 *
                 * Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
                 * and look for yourself how much shared memory YOU can allocate in linux.
                 *
                 * If that is not enough to benchmark this program then try modifying it with:
                 *    echo <newsize> > /proc/sys/kernel/shmmmax
                 * Be sure you are root when doing that each time the system boots.
                 */
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.


#if UNIX
  #include <pthread.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>
  #include <sys/times.h>
  #include <sys/time.h>
  #include <unistd.h>
#else
  #include <windows.h>
  #include <winbase.h> // for GetTickCount()
  #include <process.h> // _spawnl
#endif

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define SWITCHTIME      60000   /* in milliseconds. Modify this to let a test run longer or shorter.
                                 * basically it is a good idea to use about the cpu number times
                                 * thousand for this. 30 seconds is fine for PC's, but a very
                                 * bad idea for supercomputers. I recomment several minutes
                                 * there, and at least a few hours for big supers if the partition isn't started yet
                                 * if the partition is started starting it at 460 processors (SGI) should 
                                 * take 10 minutes, otherwise it takes 3 hours to attach all. 
                                 * Of course that let's a test take way way longer.
                                 */
#define MAXPROCESSES     512    /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH  128    /* cache line length at the machine. Modify this if you want to */


#if UNIX
  #include <time.h>
 // #include <memory.h>
  #define FORCEINLINE       __inline
  /* UNIX and such this is 64 bits unsigned variable: */
  #define BITBOARD                     unsigned long long
#else
  #define FORCEINLINE       __forceinline
  /* in WINDOWS we also want to be 64 bits: */
  #define BITBOARD                     unsigned _int64
#endif

#define     STATUS_NOTSTARTED    0
#define     STATUS_ATTACH        1
#define     STATUS_GOATTACH      2
#define     STATUS_ATTACHED      3
#define     STATUS_STARTREAD     4
#define     STATUS_READ          5
#define     STATUS_MEASUREREAD   6
#define     STATUS_MEASUREDREAD  7

#define     STATUS_QUIT         10

struct ProcessState {
  volatile int status; /*  0  = not started yet
                        *  1  = ready to start reading
                        *
                        *  10 = quitted
                        * */

  /* now the numbers each cpu gathers. The name of the first number is what
   * cpu0 is doing and the second name what all the other cpu's were doing at that
   * time
   */
  volatile BITBOARD readread; /* */
  char dummycacheline[CACHELINELENGTH];
};

typedef struct {
  BITBOARD nentries; // number of entries of 64 bits used for cache.
  struct ProcessState ps[MAXPROCESSES];
} GlobalTree;

void     RanrotAInit(void);
float    ToNano(BITBOARD);
int      GetClock(void);
float    TimeRandom(void);

void     ParseBuffer(BITBOARD);
void     ClearHash(void);
void     DeAllocate(void);
int      DoNrng(BITBOARD);
int      DoNreads(BITBOARD);
int      DoNreadwrites(BITBOARD);
//void     TestLatency(float);
int      AllocateTree(void);
void     InitTree(int);
void     WaitForStatus(int,int);
void     PutStatus(int,int);
int      CheckStatus(int,int);
int      CheckAllStatus(int,int);
void     Slapen(int);
float    LoopRandom(void);



/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK  17
#define JJ  10
#define R1   5
#define R2   3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
 0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
 0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
 0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
 0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
 /*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2;          /* indexes into history buffer */

/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
    cpus;  // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];

 /******************************************************** AgF 1999-03-03 *
 *  Random Number generator 'RANROT' type B                               *
 *  by Agner Fog                                                          *
 *                                                                        *
 *  This is a lagged-Fibonacci type of random number generator with       *
 *  rotation of bits.  The algorithm is:                                  *
 *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b               *
 *                                                                        *
 *  The last k values of X are stored in a circular buffer named          *
 *  randbuffer.                                                           *
 *                                                                        *
 *  This version works with any integer size: 16, 32, 64 bits etc.        *
 *  The integers must be unsigned. The resolution depends on the integer  *
 *  size.                                                                 *
 *                                                                        *
 *  Note that the function RanrotAInit must be called before the first    *
 *  call to RanrotA or iRanrotA                                           *
 *                                                                        *
 *  The theory of the RANROT type of generators is described at           *
 *  www.agner.org/random/ranrot.htm                                       *
 *                                                                        *
 *************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0)
    r_p1 = KK - 1;
  if( --r_p2 < 0 )
    r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator.      */
void RanrotAInit(void) {
  int i;

  /* one can fill the randbuffer here with possible other values here */
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;

  /* randomize */
  for( i = 0; i < 300; i++ )
    (void)RanrotA();
}

/* Now the RASML code */
char *To64(BITBOARD x) {
  static char buf[256];
  char *sb;

  sb = &buf[0];
  #if UNIX
  sprintf(buf,"%llu",x);
  #else
  sprintf(buf,"%I64u",x);
  #endif
  return sb;
}

int GetClock(void) {
/* The accuracy is measured in millisecondes. The used function is very accurate according
 * to the NT team, way more accurate nowadays than mentionned in the MSDN manual. The accuracy
 * for linux or unix we can only guess. Too many experts there.
 */
  #if UNIX
  struct timeval timeval;
  struct timezone timezone;
  gettimeofday(&timeval, &timezone);
  return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
  #else
  return((int)GetTickCount());
  #endif
}

float ToNano(BITBOARD nps) {
  /* convert something from times a second to nanoseconds.
   * NOTE THAT THERE IS COMPILER BUGS SOMETIMES AT OLD COMPILERS
   * SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
   * NOT MODIFY THIS CODE */
  float tn;
  tn = 1000000000/(float)nps;
  return tn;
}

float TimeRandom(void) {
  /* timing the random number generator is very easy of course. Returns
   * number of random numbers a second that can get generated
   */
  BITBOARD bb=0,i,value,nps;
  float ns_rng;
  int t1,t2,took;

  printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
  printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
  printf("Please wait a few seconds.. "); fflush(stdout);
  value = 100000;
  took  = 0;
  while( took < 3000 ) {
    value <<= 2; //  x4
    t1 = GetClock();

    for( i = 0; i < value; i++ ) {
      bb ^= RanrotA();
    }
    t2 = GetClock();
    took = t2-t1;
  }

  nps = (1000*value)/(BITBOARD)took;

  #if UNIX
  printf("..took %i milliseconds to generate %llu numbers\n",took,value);
  printf("Speed of RNG = %llu numbers a second\n",nps);
  #else
  printf("..took %i milliseconds to generate %I64 numbers\n",took,value);
  printf("Speed of RNG = %I64u numbers a second\n",nps);
  #endif

  ns_rng = ToNano(nps);
  printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);


  return ns_rng;
}

void ParseBuffer(BITBOARD nbytes) {
  tree->nentries = nbytes/sizeof(BITBOARD);
  #if UNIX
  printf("Trying to allocate %llu entries. ",tree->nentries);
  printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
  #else
  printf("Trying to allocate %s entries. ",To64(tree->nentries));
  printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
  #endif
}

void ClearHash(void) {
  BITBOARD *hi,i,nentries = tree->nentries;
  /* clearing hashtable */
  printf("Clearing hashtable for processor %i\n",ProcessNumber);
  fflush(stdout);
  hi = hashtable[ProcessNumber];
  for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
    hi[i] = i;
}

void DeAllocate(void) {
  int i;
  #if UNIX
  shmctl(shm_tree,IPC_RMID,0);
  for( i = 0; i < cpus; i++ ) {
    shmctl(shm_hash[i],IPC_RMID,0);
  }
  #else
  UnmapViewOfFile(tree);
  for( i = 0; i < cpus; i++ ) {
    UnmapViewOfFile(hashtable[i]);
  }
  #endif
}

int DoNrng(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= (index+(BITBOARD)i2);
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;
  return(t2-t1);
}

int DoNreads(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= hashtable[i2][index];
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

#if 0
int DoNreadwrites(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2;

  nents = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD index = RanrotA()%nents;
    dummyres ^= hashtable[index];
    hashtable[index] = dummyres;
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

void TestLatency(float ns_rng) {
  BITBOARD n,nps_read,nps_rw,nps_rng;
  float ns,fns;
  int timetaken;

  printf("Doing random RNG test. Please wait..\n");
  n = 50000000; // 50 mln
  timetaken = DoNrng(n);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);

  /* READING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random read tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreads(n);
  nps_read = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_read);
  printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);

  /* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random readwrite tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreadwrites(n);
  nps_rw = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_rw);
  printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
  printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);

  printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif

int AllocateTree(void) { /* initialize the tree. returns 0 if error */
  #if UNIX
  shm_tree = shmget(
              ftok(".",'t'),
              sizeof(GlobalTree),IPC_CREAT|0777);
  if( shm_tree == -1 )
    return 0;
  tree = (GlobalTree *)shmat(shm_tree,0,0);
  if( tree == (GlobalTree *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  if( !ProcessNumber ) {
    HANDLE TreeFileMap;
    TreeFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
     (DWORD)sizeof(GlobalTree),"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  else { /* Slaves attach also try to attach to the tree */
    HANDLE TreeFileMap;
    TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  #endif
  return 1;
}

int AttachAll(void) {
  #if UNIX
  #else
  HANDLE HashFileMap;
  #endif
  char hashname2[32] = {"RASM_Hash00"},hashname[32];
  int i,r;
  for( r = 0; r < cpus; r++ ) {
    i = ProcessNumber+r;
    i %= cpus;
    if( i == ProcessNumber )
      continue;
    #if UNIX
    shm_hash[i] = shmget(
              #if IRIX
              ftok(".",200+i),
              #else
              ftok(".",(char)i),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
    if( shm_hash[i] == -1 )
      return 0;
    hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
    if( hashtable[i] == (BITBOARD *)-1 )
      return 0;
    #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */

    strcpy(hashname,hashname2);
    hashname[9] += (i/10);
    hashname[10] += (i%10);

    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[i] == NULL )
      return 0;
    #endif
  }
  return 1;
}

int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
  char hashname[32] = {"RASM_Hash00"};
  #if UNIX
  shm_hash[ProcessNumber] = shmget(
              #if IRIX
              ftok(".",200+ProcessNumber),
              #else
              ftok(".",(char)ProcessNumber),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
  if( shm_hash[ProcessNumber] == -1 )
    return 0;
  hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
  if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  //if( !ProcessNumber ) {
    HANDLE HashFileMap;

    hashname[9] += (ProcessNumber/10);
    hashname[10] += (ProcessNumber%10);

    HashFileMap = CreateFileMapping((HANDLE)0xFFFFFFFF,NULL,PAGE_READWRITE,0,
     (DWORD)tree->nentries*8,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;
  //}
  //else { /* Slaves attach also try to attach to the tree */
 /*   HANDLE HashFileMap;
    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;*/
  //}
  #endif
  return 1;
}

int StartProcesses(int ncpus) {
  char buf[256];
  int i;
  /* returns 1 if ncpus-1 started ok */
  if( ncpus == 1 )
    return 1;

  for( i = 1 ; i < ncpus ; i++ ) {
    sprintf(buf,"%i_%i",i+1,ncpus);
    #if UNIX
    if( !fork() )
      execl(rasmexename,rasmexename,buf,NULL);
    #else
    (void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
     #endif
  }
  return 1;
}

void InitTree(int ncpus) {
  int i;

  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status   = STATUS_NOTSTARTED;
    tree->ps[i].readread = 0;
  }
}

void WaitForStatus(int ncpus,int waitforstate) {
  /* wait for all processors to have the same state */
  int i,badluck=1;

  while( badluck ) {
    badluck = 0;
    for( i = 0 ; i < ncpus ; i++ ) {
      if( tree->ps[i].status != waitforstate )
        badluck = 1;
    }
  }
}

void PutStatus(int ncpus,int statenew) {
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status = statenew;
  }
}

int CheckStatus(int ncpus,int statenew) {
  /* returns false when not all cpu's are in the new state */
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != statenew )
      return 0;
  }
  return 1;
}


int CheckAllStatus(int ncpus,int status) {
  /* Tries with a single loop to determine whether the other cpu's also finished
   *
   * returns:
   *     true  ==> when all the processes have this status
   *     false ==> when 1 or more are still busy measuring
   */
  int i,badluck=1;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != status ) {
      badluck = 0;
      break;
    }
  }
  return badluck;
}

void Slapen(int ms) {
  #if UNIX
  usleep(ms*1000); /* 0.050 000 secondes, it is in microseconds! */
  #else
  Sleep(ms);     /* 0.050 seconds, it is in milliseconds */
  #endif
}

float LoopRandom(void) {
  BITBOARD n,nps_rng;
  float fns;
  int timetaken;
  printf("Benchmarking random RNG test. Please wait..\n");
  n = 25000000; // 50 mln
  timetaken = 0;
  while( timetaken < 500 ) {
    n += n;
    timetaken = DoNrng(n);
  }
  printf("timetaken=%i\n",timetaken);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);
  return fns;
}

/* Example showing how to use the random number generator: */
int main(int argc,char *argv[]) {
  /* allocate a big memory buffer parameter is in bytes.
   * don't hesitate to MODIFY this to how many gigabytes
   * you want to try.
   * The more the better i keep saying to myself.
   *
   * Note that under linux your maximum shared memory limit can be set with:
   *
   * echo <size> > /proc/sys/kernel/shmmax
   *
   * and under IRIX it is usually 80% from the total RAM onboard that can get allocated
   */

  BITBOARD nbytes,firstguess;
  float ns_rng,f_loop;
  int tottimes,t1,t2;


  if( argc <= 1 ) {
    printf("Latency test usage is: latency <buffer> <cpus>\n");
    printf("Where 'buffer' is the buffer in number of bytes to allocate PRO PROCESSOR\n");
    printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
    return 1;
  }

  /* parse the input */
  nbytes = 0;
  cpus   = 1; // default

  if( strchr(argv[1],'_') == NULL ) { /* main startup process */
    int np = 0;
    #if UNIX
     #if FREEBSD
     nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
     #else
     nbytes = (BITBOARD)atoll(argv[1]);
     #endif
    #else
    nbytes = (BITBOARD)_atoi64(argv[1]);
    #endif

    printf("Welcome to RASM Latency!\n");
    printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");

    if( argc > 2 ) {
      cpus = 0;
      do {
        cpus *= 10;
        cpus += (int)(argv[2][np]-'1')+1;
        np++;
      } while( argv[2][np] >= '0' && argv[2][np] <= '9' );
    }
    //printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
    ProcessNumber = 0;

    /* check whether we are not getting out of bounds */
    if( cpus > MAXPROCESSES ) {
      printf("Error: Recompile with a bigger stack for MAXPROCESSES. %i processors is too much\n",cpus);
      return 1;
    }

    /* find out the file name */
    #if UNIX
    strcpy(rasmexename,argv[0]);
    #else
    GetModuleFileName(NULL,rasmexename,2044);
    #endif
    printf("Stored in rasmexename = %s\n",rasmexename);
  }
  else { //   latency 2_452  ==>  means processor 2 out of 452.
    int np = 0;

    ProcessNumber = 0;
    do {
      ProcessNumber *= 10;
      ProcessNumber += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );

    ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]

    np++; // skip underscore

    cpus = 0;
    do {
      cpus *= 10;
      cpus += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );
    //printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
  }

  /* first we setup the random number generator. */
  RanrotAInit();

  /* initialize shared memory tree; it gets used for communication between the processes */
  if( !AllocateTree() ) {
    printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
    return 1;
  }

  if( !ProcessNumber )
    ParseBuffer(nbytes);

  nentries = tree->nentries;

  /* Now some stuff only the Master has to do */
  if( !ProcessNumber ) {
    /* Master: now let's time the pseudo random generators speed in nanoseconds a call */
    ns_rng = TimeRandom();
    f_loop = LoopRandom();

    printf("Trying to Allocate Buffer\n");
    t1 = GetClock();
    if( !AllocateHash() ) {
      printf("Error: Could not allocate buffer!\n");
      return 1;
    }
    t2 = GetClock();
    printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
    ClearHash(); // local hash
    t1 = GetClock();
    printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);

    /* so now hashtable is setup and we know quite some stuff. So it is time to
     * start all other processes */
    InitTree(cpus);

    printf("Starting Other processes\n");
    t1 = GetClock();
    if( !StartProcesses(cpus) ) {
      printf("Error: Could not start processes\n");
      DeAllocate();
    }
    t2 = GetClock();
    printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();
  }
  else { /* all Slaves do this */
    if( !AllocateHash() ) {
      printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
      return 1;
    }
    ClearHash(); // local hash
  }

  tree->ps[ProcessNumber].status = STATUS_ATTACH;
  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACH);
    t2 = GetClock();
    printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();

    /* now we can continue with the next phase that is attaching all the segments */
    PutStatus(cpus,STATUS_GOATTACH);
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
      Slapen(500);
    }
  }

  if( !AttachAll() ) {
    printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
    return 1;
  }
  tree->ps[ProcessNumber].status = STATUS_ATTACHED;

  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACHED);
    t2 = GetClock();
    printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
    PutStatus(cpus,STATUS_STARTREAD);
    printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
     (SWITCHTIME/1000),(SWITCHTIME%1000));
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
      Slapen(500);
    }
  }

  tree->ps[ProcessNumber].status = STATUS_READ;

  firstguess = 200000;
  tottimes   = 0;
  for( ;; ) {
    int timetaken = 0;
    if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
      /* this really MEASURES the readread */
      BITBOARD ntried = 0,avnumber;
      int totaltime=0;
      while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds */
        totaltime += DoNreads(firstguess);
        ntried += firstguess;
      }
      /* now put the average number of readreads into the shared memory */
      avnumber = (ntried*1000) / (BITBOARD)totaltime;
      tree->ps[ProcessNumber].readread = avnumber;

      /* show that it is finished */
      tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;

      /* now keep doing the same thing until status gets modified */
      while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
        (void)DoNreads(firstguess);
        if( !ProcessNumber ) {
          if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
            PutStatus(cpus,STATUS_QUIT);
            break;
          }
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
      BITBOARD nextguess;
      /* now software must try to determine how many reads a seconds are possible for that
       * process
       */
      //printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
      timetaken = DoNreads(firstguess);
      /* try to guess such that next test takes 1 second, or if test was too inaccurate
       * then double the number simply. also prevents divide by zero error ;)
       */
      if( timetaken < 400 )
        nextguess = firstguess*2;
      else
        nextguess = (firstguess*1000)/(BITBOARD)timetaken;
      firstguess = nextguess;
      if( !ProcessNumber ) {
        tottimes += timetaken;
        if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
          tottimes = 0;
          if( CheckStatus(cpus,STATUS_READ) ) {
            PutStatus(cpus,STATUS_MEASUREREAD);
          } /* waits another SWITCH time before starting to measure */
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
      break;
  }

  /* now do the latency tests
   */
  //TestLatency(ns_rng);
  tree->ps[ProcessNumber].status = STATUS_QUIT;
  if( !ProcessNumber ) {
    BITBOARD averagereadread;
    int i;
    averagereadread = 0;
    WaitForStatus(cpus,STATUS_QUIT);
    printf("the raw output\n");
    for( i = 0; i < cpus ; i++ ) {
      BITBOARD tr=tree->ps[i].readread;
      averagereadread += tr;
      printf("%llu ",tr);
    }
    printf("\n");
    averagereadread /= (BITBOARD)cpus;
    printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
    printf("Now for the final calculation it gets compensated:\n");
    printf("  Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
  }

  DeAllocate();
  return 0;
}

/* EOF latencyC.c */


More information about the Beowulf mailing list