[Beowulf] OT: public random numbers?

Fri Aug 26 04:56:15 PDT 2011

On Aug 26, 2011, at 8:07 AM, Robert G. Brown wrote:

> On Fri, 26 Aug 2011, Vincent Diepeveen wrote:
>
>> If we assume that reality of life represents randomness, which is  
>> another
>> rather good question in how far that theory is plausible, then  
>> using that
>> assumption i'm very sure that the RNG's i investigated so far
>> have a distribution which is too perfect, more perfect than i have  
>> seen
>> in any reality.
>
> That's because you live in a different reality than everybody else,
> Vincent.

Or reality we live in might not be so random as we all guess...

But it's good that you took a look at the die-harder test now - which  
you didn't do before.

>
>> In fact most RNG's fill all slots faster than O ( n log n ), yet  
>> it's O ( n log n )
>> that they follow.
>
> In fact, they don't.
>
>> This is RNG's that have come through all tests as being a good and
>> very acceptabe RNG to be used.
>
> No, it's not.
>
>> Realize i'm no RNG expert, so all the names of all those tests.
>>
>> For me it's just push button technology. I just designed a test
>> and found it very odd that all RNG's have such perfect distributions
>> that they don't even miss a single slot.
>
> It's odd because your test is broken.
>
>>
>> I'd argue the only test that would be interesting to me to see how it
>> might be in reality is the lottery machine test - yet with really  
>> a lot
>> of balls. I'd prefer 10k balls over a 1000 in fact - yet for  
>> practical
>> reasons i would agree with a number of above a 1000.
>>
>> Paper fiddling is really not interesting to me there to prove  
>> anything,
>> as what i've seen in reality in randomness is total different from  
>> how
>> RNG's model that.
>
> Let's try a bit of "paper fiddling".  The expected number of filled  
> slots
> is (this is actual code, not pseudocode, for n slots):
>
>  nlogn = log10(n)*n;
>  expected = (n - n*pow(1.0-1.0/(1.0*n),nlogn));
>
> The reasoning is enormously simple.  The probability of a slot being
> empty after one pull is (1 - 1/n). After nlogn pulls, it is p_e = (1 -
> 1/n)^nlogn.  The probability of a slot being filled is thus 1 -  
> p_e, and
> given n slots n - n*(1-1/n)^nlogn of them "should" be filled, within
> random noise, n*(1-1/n)^nlogn of them "should" be empty.
>
> Well, I've got a random number generator tester harness, so I hacked
> your test into it. One major bug in your code, BTW, is using a modulus
> to generate your random numbers -- dunno what that's about, but if  
> your

EVERY PROGRAMMER IS DOING THIS TO USE RANDOM NUMBERS IN THEIR PROGRAM.

Apologies for the caps. I hope how important this is. You're claiming  
all programmers
use random numbers in a faulty manner?

This is important enough to further discuss about it.

As nearly always you need random numbers from within a given domain  
say 0.. n-1
So projecting a RNG onto that domain is pretty crucial. How would you  
want to do that in a correct manner?

In the slot test in fact a simple AND is enough.

> rng returned numbers between (say) 0 and 7 and you use it to generate
> numbers in the range 0 to 5 by means of r%5 then you'll get (for the
> sequence of numbers) 0 1 2 3 4 0 1 2.  Note well that you get twice as
> many 0's, 1's and 2's as 3's and 4's assuming a random draw on 0 to 7.
> So you aren't even testing a uniformly distributed sequence of  
> integers.
>
> Fixing this relatively minor bug, removing your breakout and actually
> counting up filledn for the full nlogn samples, and applying the  
> test to
> mt19937, we get:
>
> rgb at lilith|B:1009>./dieharder -r 6 -n 10000000 -p 1 -t 1 | head
> n = 10000000
> nlogn = 70000000
> table not all filled: filledn = 9990811, expected = 9990881
>
> We run it again:
> rgb at lilith|B:1010>./dieharder -r 6 -n 10000000 -p 1 -t 1 | head
> n = 10000000
> nlogn = 70000000
> table not all filled: filledn = 9990802, expected = 9990881
>
> We run it for R250 -- a well-known not-good generator:
> rgb at lilith|B:1012>./dieharder -g 16 -r 6 -n 10000000 -p 1 -t 1 | head
> n = 10000000
> nlogn = 70000000
> table not all filled: filledn = 9990794, expected = 9990881
>
> We run it on the literally infamous randu:
> n = 10000000
> nlogn = 70000000
> table not all filled: filledn = 9999482, expected = 9990881
>
> Note, Vincent, that the last two examples of correctly computed  
> results
> from known-terrible generators are much farther from the expected mean
> than mt19937, a well-known damn good one.  This suggests that your  
> test
> (perhaps unsurprisingly) has some sensitivity, not because some slots
> are or aren't empty, but because the NUMBER of slots that are or  
> aren't
> empty isn't quite correct.  Note also that in the "paper fiddling"
> analysis above, the use of nlogn is quite unimportant -- we could make
> this an independent variable and evaluate the table filling for any
> value m of pulls, as long as our expected value is n - n*(1 - 1/n)^m.
>
> If I have the energy, I'll see if the distribution of filledn around
> expected is e.g. Gaussian -- it seems pretty reasonable that it  
> would be
> -- with some expected or empirically computable variance.  If it is,
> then this can be fairly easily turned into an actual test that  
> returns a
> p-value that humans can use to make rational judgements, or rational
> humans can use to make judgements or something like that.  I doubt  
> that
> the test will have MUCH sensitivity -- modern generators are way too
> good to have their flaws picked out quite this simply, although
> Marsaglia's "monkey tests" do something very similar although a lot  
> more
> sophisticated mathematically (and arguably more sensitive) and do
> suffice to nail randu (anything nails randu) and semi-weak tests like
> R250.
>
> Now, let's see what we've learned from this fiddling.  One is that
> without it, you just waste a lot of people's time making egregious and
> false claims that belittle the tremendously sophisticated and  
> difficult
> work a whole lot of "fiddlers" have put into inventing, writing, and
> testing modern RNGs.  The truth is that >>all<< RNGs in dieharder  
> "pass"
> your test (if the test is "producing at least one zero") once your  
> test
> isn't broken.  We've learned that in fact, the best of the modern RNGs
> are damn good, and that you could work for five years trying to  
> invent a
> test that is good enough to fail any of them and still not succeed.
> Finally, we've learned that you should not, not, not take your
> Martingale to a casino and try the doubling strategy out to make  
> money,

It's not interesting to discuss - but yes this strategy makes money  
in casino's,
you just get thrown out of the casino and end up at the blacklist if  
you do.

For good chessplayers all this is not so tough. The casino's  
blacklist of people
too strong in blackjack is endless... ...this is practice for long  
than we live now...

So Casino reality is much simpler. They kick you out if you're good.

That's why they try to popularize poker now - you don't play against  
the casino there.

> or if you do put a firm upper bound -- something like 63 Euro -- on  
> what
> you're willing to lose with your base stake of 1 Euro.  That way you
> have maybe a 40% chance of doubling your 63 Euro before you go broke.
> Really, you should read the Wikipedia article I linked, in spite of  
> the
> fact that it presents more "paper fiddling".
>

>   Sincerely,
>
>       rgb
>
> (See P.S. comments below...)
>
>>>> n = 2^30; // 2 to the power 30
>>>> Function TestNumbersForRandomness(RNG,n) {
>>>> declare array hashtable[size n];
>>>> guessednlogn = 2 * (log n / log 2) * n;
>
> Why guess nlogn?  nlog is n*log10(n).  Why nlogn anyway?  Call it m  
> and
> make it a parameter.
>
>>>> for( i = 0 ; i < n ; i++ )
>>>>  hashtable[i] = FALSE;
>>>> ndraws = filledn = 0;
>>>> while( ndraws  < guessednlogn ) {
>>>>   randomnumber = RNG();
>>>>   r = randomnumber % n; //     randomnumber =  r  (mod n)
>
> no, r = n*RNG_UNIFORM(); where RNG_UNIFORM() is e.g. RNG/UINT_MAX.   
> Yes
> there are roundoff errors, but they are uniform and consistent and as
> you can see, don't affect this problem.  What you have isn't even  
> close
> to uniform -- it is badly nonrandom.
>
>>>>   if( hashtable[r] == FALSE ) {
>>>>      hashtable[r] = TRUE;
>>>>      filledn++;
>
>>>>      if( filledn >= n )
>>>>        break;
>
> Don't break.  Just count up filledn.  It will never be more than n now
> anyway, for any n, or any reasonable m. There probably is some  
> number of
> pulls that will raise "expected" to n, but it is pretty big  
> compared to
> n, way bigger than nlogn.
>
>>>>
>>>>  }
>>>>  ndraws++;
>>>> }
>>>> if( filledn >= n )
>>>>   print "With high degree of certainty data generated by a RNG\n");
>>>> else
>>>>   print "Not so sure it's a RNG\n";
>>>> }
>
> I'm guessing the correct statistic here is something like |expected -
> filledn|/expected, but as I said, I haven't really worked at it.  I
> haven't decided whether or not it is worth adding this to dieharder --
> without a formal derivation of the expected statistic it would be yet
> another empirical test, which means you're really comparing one RNG to
> another presumed better one, which I don't like.  And do I have  
> time to
> do the "fiddling" needed to do a proper derivation?  Aye, that's the
> rub...;-)
>
>    rgb
>
>>>> Regards,
>>>> Vincent
>>>>> -- both unpredictable and
>>>>> flat/decorrelated at all orders, and even though there aren't  
>>>>> really
>>>>> enough of them for my purposes, I've used them as one of the  
>>>>> (small)
>>>>> "gold standard" sources for testing dieharder even as I test  
>>>>> them.  For
>>>>> all practical purposes threefish or aes are truly random as  
>>>>> well and
>>>>> they are a lot faster and easier to use as gold standard  
>>>>> generators,
>>>>> though.
>>>>> I don't quite understand why the single site restriction is  
>>>>> important --
>>>>> this site has been up for years and I don't expect it to go  
>>>>> away soon;
>>>>> it is quite reliable.  I don't think there is anything secret  
>>>>> about how
>>>>> the numbers are generated, and I'll certify that the numbers it  
>>>>> produces
>>>>> don't make dieharder unhappy.  So 1 is fixable with a bit of  
>>>>> effort on
>>>>> your part; 6 I don't really understand but the guy who runs the  
>>>>> site is
>>>>> clearly willing to construct a custom feed for cash customers,  
>>>>> if there
>>>>> is enough value in whatever it is you are trying to do to pay for
>>>>> access.  If it's just a lottery, well, lord, I can think of a  
>>>>> dozen ways
>>>>> to make numbers so random that they'd be unimpeachable for any  
>>>>> sort of
>>>>> lottery, both unpredictable and uncorrelated, and they don't  
>>>>> any of them
>>>>> require any significant amount of entropy to get started.
>>>>> I will add one warning -- "randomness" is a rather stringent
>>>>> mathematical criterion, and is generally tested against the null
>>>>> hypothesis.  Amateurs who want to make random number generators  
>>>>> out of
>>>>> supposedly "random" data streams or fancy algorithms almost  
>>>>> invariably
>>>>> fail, sometimes spectacularly so.  There are a half dozen or more
>>>>> really, really good pseudorandom number generators out there  
>>>>> and it is
>>>>> easy to hotwire them together into an xor-based high entropy  
>>>>> stream that
>>>>> basically never repeats (feeding it a bit of real entropy now  
>>>>> and then
>>>>> as it operates).  I would strongly counsel you against trying  
>>>>> to take
>>>>> e.g. weather data and make something "random" out of it.   
>>>>> Unless you
>>>>> really know what you are doing, you will probably make  
>>>>> something that
>>>>> isn't at all random and may not even be unpredictable.  Even most
>>>>> sources of "quantum" randomness (which is at least possibly "truly
>>>>> random", although I doubt it) aren't flat, so that they carry the
>>>>> signature of their generation process unless/until you manage to
>>>>> transform them into something flat (difficult unless you KNOW the
>>>>> distribution they are producing).  Pseudorandom number  
>>>>> generators have
>>>>> the serious advantage of being amenable to at least some  
>>>>> theoretical
>>>>> analysis (so you can "guarantee" flatness out to some high
>>>>> dimensionality, say) as well as empirical testing with e.g.  
>>>>> dieharder.
>>>>> HTH,
>>>>>
>>>>>    rgb
>>>>>> Thanks,
>>>>>> David Mathog
>>>>>> mathog at caltech.edu
>>>>>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>>>>> Robert G. Brown	                       http://www.phy.duke.edu/ 
>>>>> ~rgb/
>>>>> Duke University Dept. of Physics, Box 90305
>>>>> Durham, N.C. 27708-0305
>>>>> Phone: 1-919-660-2567  Fax: 919-660-2525      
>>>>> email:rgb at phy.duke.edu
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
>>>>> Computing
>>>>> To change your subscription (digest mode or unsubscribe) visit  
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>>> Duke University Dept. of Physics, Box 90305
>>> Durham, N.C. 27708-0305
>>> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>>
>
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>
>