Random Number Generation. Was Re: Beowulf Questions
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduMon Jan 6 07:15:37 PST 2003
- Previous message: Random Number Generation. Was Re: Beowulf Questions
- Next message: Random Number Generation. Was Re: Beowulf Questions
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 6 Jan 2003, Florent Calvayrac wrote: > > Taking this method a step further still, MPI latency might even able to > > be used (delta time between the compute nodes and the head) > > this is certainly much simpler to implement than downloading sprng, > (on http://sprng.cs.fsu.edu/ ) > or just implementing a node-number dependent recurrence formula on each > machine, as RGB is also doing. Measuring an MPI latency time > (in order of several thousands cycles), communicating back and forth, > and running the same calculation on another cluster to predict the > results as you suggest is also certainly faster and more elegant to > achieve independence of the streams. Sure, but a) this sort of thing is very dangerous and b) why bother? /dev/random is ALREADY doing a moderately sophisticated (and probably much more sophisticated and better thought through) job of doing this sort of thing. If the issue is one of "random" (by which I mean "uncorrelated by all measures of correlation") vs "uniform deviate generator" (a term that avoids the oxymoronic descriptor "random number generator":-) then it is extremely difficult to come up with entropy from ANY source or sources available on OTC systems on an OTC network sufficient to provide "truly random" numbers fast enough for a random number-hungry application. There are all kinds of subtle time correlations in nearly anything you choose to use as an entropy source short of a piece of specifically engineered hardware desirnged for that purpose, and you have to wait for several of the longest correlation times of each source in order for its next output to be "random" relative to the previous one. By definition, by the way. The problem is: what are these correlation times? They turn out in many cases to be order milliseconds to seconds, not nanoseconds to microseconds (relative to 32 bits). Or if you prefer, order of 100 microseconds per "random" bit. And there's the rub. If a pair of systems were executing some fairly deterministic MPI code and passing a systematic pattern of same-size messages, you would have to do a lengthy and nontrivial computation just to MEASURE the correlation time, and the correlation time for a nominally periodic process could be VERY LONG. If you fail to do this measurement and just "assume" that you know when the times are adequately decorrelated, you could end up with very significant correlation in your "random" number stream. Worse, since your measurement will be heavily state dependent, you have to ensure that a state with a longer correlation time than whatever you measure can "never" occur, which is all but impossible on a system running on a fixed clock. Basically, you should read "man 4 random" (which uses a variety of sources of entropy, not just one) before deciding that you can do better, and you should recognize that /dev/random is really only suitable for generating seeds or other infrequently required "really random" numbers. Or, to argue the other way, if you INSIST on using e.g. /dev/random as a source of a stream of "really random" random numbers, accepting that it will block until it accumulates enough entropy (by its internal measures) to return the next one AND accepting that abusing it by USING it to generate a stream in this way, where each number is almost by definition "barely random enough" according to the standards of "barely" implemented by the designers and may or may not be random "enough" for your application, then this is indeed a suitable sort of thing to implement on a cluster basis! Inserting /dev/random reads into my cpu_rate timing harness, I measure a /dev/random return rate on the order of 5 milliseconds. This is a very believable number, suggesting correlation times averaging maybe 1 millisecond in the entropy sources used (relative to 32 bits -- ~100 microsecond bit correlation times). In order to saturate even 100BT (call it two and a half million 32 bit random numbers per second) one would need, lessee, 200 per second per node into 2.5x10^6, um, order of 10^4 nodes, PRESUMING that entropy accumulates as rapidly on nodes with no keyboard, mouse, or significant multitasking as it does on my "noisy" desktop where I just ran the benchmark. Somehow I think that buying a hardware RNG device based on e.g. radioactivity or thermal noise would be cheaper and would probably work better as well. Now, if you only wish to generate "a few" really random (heh!;-) numbers to use as seeds for UDG's so that they can produce distinct sequences of uniform deviates (accepting whatever degree of high-dimensional correlation associated with the method, as usual), you simply won't "build" an entropy-based source of the random seed that is any MORE suitable than /dev/random, and implementing /dev/random is (with unsigned int seed): if ((fp = fopen("/dev/random","r")) == NULL) { if(verbose == 10) printf("Cannot open /dev/random, setting seed to 0\n"); seed = 0; } else { fread(&seed,sizeof(seed),1,fp); if(verbose == 10) printf("Got seed %u from /dev/random\n",seed); } gsl_rng_set(random,seed); Hard to get any simpler. Then, as noted before, you can either wear your pants with belt AND suspenders and keep track of all the seeds used to ensure that the gods of randomness don't whack you via the 1/2^32 chance of getting any given unsigned long int over the sequence of your jobs (a very good chance -- basically unity -- if you plan to run say a million sequences, btw) or you can accept the overlap and ignore it (reasonable for a few hundred or even a few thousand runs) or if you are REALLY going to use a LOT of runs (order 10^9), you can create a list of all the ulong ints and shuffle it with a good shuffle algorithm for starters. However, few UDG's will likely turn out to be adequately decorrelated if you start saturating the space of potential seeds -- iterated maps aren't really random and if you literally sample the space of starting points densely, you will almost certainly get significant but very difficult to detect correlations in the generated streams. So once again, your error estimates (based on independence) will be incorrect, and other problems can arise. As always, if you are planning to do anything with very large numbers of UDG's or random numbers or the like, you are well advised to do some moderately significant research on UDG's and randomness before implementing anything at all. For some applications (e.g. writing a game:-) it won't matter -- "any" UDG will suffice, as long as it yields independent play experiences. For others (doing a long running simulation for the purposes of publication) it is essential -- there are famous cases of results obtained at great expense and published in the most respected journals only to discover (to the embarrassment of the authors, the referees, everybody) that they were based on a "bad" UDG (relative to the purpose) and were, alas, cosmic debris, incorrect, totally erroneous, suitable for the trash can. Truth be told, I'd say that most of us who actually work in this game have this as one of our secret nightmares. The oxymoron is a real one. There are no RNG's, only UDG's, and it is very, very difficult to predict whether the correlations that you KNOW are present in a UDG will significantly affect your answers at any given level of precision, provided that you avoid the UDGs with known and obvious problems. Real Computer Scientists (and theoretical physicists, mathematicians, statisticians) spend careers studying this problem. There is almost a heisengbergian feeling to it -- if you only generate a few UD's, you can easily enough get adequately decorrelated ones but your statistical accuracy (in a simulation) will suck. The more UD's you need, the greater your statistical accuracy (presuming independence) but the greater the contamination of those results with the generally occult correlations. It Is Not Easy to know when one will hit the optimum -- the best results one can obtain that have a reliable estimate of statistical accuracy. Indeed, one way of determining the PRESENCE of the correlations is to look for statistically significant deviations from outcomes theoretically predicted for model problems for perfectly random numbers (e.g. the mean length of a random walk in N dimensions). When the outcome isn't known a priori, and one is using a UDG that passes most of the "simple" tests for correlation satisfactorily, one is basically engaged in a crap shoot...maybe your problem is the one that (eventually) will reveal a Weakness in your particular UDG, or UDGs. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: Random Number Generation. Was Re: Beowulf Questions
- Next message: Random Number Generation. Was Re: Beowulf Questions
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
