[Beowulf] Moores Law is dying
landman at scalableinformatics.com
Tue Apr 14 14:55:09 PDT 2009
Jon Forrest wrote:
> Joe Landman wrote:
>> ... so I see you have never used an interprocedural analysis (-ipa)
>> switch :)
>> Allows you to do things like, I dunno, inline one whole routine inside
>> another ...
> I've never used this but from your description I don't
> see how it leads to larger text sizes at runtime. After all, if you have
> routine A which is 10 bytes, and routine B which is 20 bytes,
> it would seem that they collectively take 30 bytes no matter
> if they stand alone or one inside the other. I might not
> be understanding this right, though.
More like N*20 bytes ... use the routine more than once :)
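A back-of-the-envelope sketch of that arithmetic, using the 10 and 20 byte routines from the example above. The function name and the cost model are assumptions for illustration only: it ignores call/return instructions and whatever the compiler cleans up after inlining, and it assumes no standalone copy of the callee is kept once it is inlined.

```python
def text_bytes(caller_bytes, callee_bytes, call_sites, inlined):
    """Rough program text size under a toy model (hypothetical helper).

    Ignores call/return overhead and post-inlining optimizations.
    """
    if inlined:
        # One copy of the callee body is pasted in at every call site.
        return caller_bytes + call_sites * callee_bytes
    # Out of line: the callee body exists exactly once, however often it is called.
    return caller_bytes + callee_bytes


for n in (1, 2, 10):
    print(n, text_bytes(10, 20, n, inlined=False), text_bytes(10, 20, n, inlined=True))
```

With one call site the two layouts come out the same (30 bytes), which is the case Jon describes. With 10 call sites the inlined text is 10 + 10*20 = 210 bytes against 30.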
>> Usually leads to much larger program text sizes.
>> This said, I have seen very large programs from RISC days hitting well
>> more than 1 GB of text. I haven't played with any recently though.
> Let's say this is about right. Do you see such programs getting
> even larger in the future?
>>> Why is sharing expensive in performance? It might take a little
>>> overhead to setup and manage, but why is having multiple virtual
>>> addresses map to the same physical memory expensive?
>> Contention. Memory hot spots. Been there, done that. We are about
>> to do this all over again (collectively).
> Naively I would think that text memory hot spots would be a good
> thing, because then all the benefits of caching would kick in.
> There would be no cache coherence overhead since text is read-only.
> Why is this a bad thing?
Ohhhh.... You *really* don't want your system brought to its knees over
false sharing. It's a great way to turn a large expensive machine into a
very slow large expensive machine. Listen to Greg Lindahl, and he'll
likely point to this as one of the great fallacies of 'why shared memory
is better' than distributed memory :) (not shoving words into his mouth,
so if he has changed his mind or thinks differently ... that's ok)
Imagine you are a processor, and you have written to a location in RAM.
So now your cache line is dirty, and waiting in the queue to be flushed
out. In your parallel program, along comes someone else who really,
really wants to read that cache line. Ok, so this forces you to a)
flush it now, b) mark that line as clean. Then the next CPU gets that
cache line, does its write, and whammo, some other CPU wants to do the
same thing to it as you did.
Sadly enough this is a common programming error in shared memory
programming. Think of it like you have a bunch of loops operating in
parallel, all trying to update the same counter, at once. In parallel.
Each update has to wait until it can grab the cache line, and then it
proceeds. The more updaters you have, the more contention for that
resource you have. Your performance scales as 1/N rather than (constant)*N.
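A toy model of the cache line bouncing described above. This is not a hardware simulation: it only counts how often a line has to change hands when CPUs take turns writing. The 64-byte line size, the 8-byte counters, the round-robin write order, and all function names are assumptions made for the sketch.

```python
LINE_BYTES = 64  # assumed cache line size; real hardware varies


def line_transfers(writes, line_bytes=LINE_BYTES):
    """Count how often a cache line moves from one writer to a different one.

    `writes` is a sequence of (cpu, byte_address) pairs in program order.
    """
    owner = {}      # cache line index -> CPU that wrote it last
    transfers = 0
    for cpu, addr in writes:
        line = addr // line_bytes
        if line in owner and owner[line] != cpu:
            transfers += 1   # the line must be pulled away from another CPU
        owner[line] = cpu
    return transfers


def round_robin_writes(n_cpus, updates_per_cpu, addr_of):
    """CPUs take turns; each writes to the address addr_of(cpu)."""
    return [(cpu, addr_of(cpu))
            for _ in range(updates_per_cpu)
            for cpu in range(n_cpus)]


n, k = 4, 1000
# One counter shared by everybody.
shared = line_transfers(round_robin_writes(n, k, lambda cpu: 0))
# Per-CPU counters packed 8 bytes apart: different variables, same cache line.
packed = line_transfers(round_robin_writes(n, k, lambda cpu: cpu * 8))
# Per-CPU counters padded out to one cache line each.
padded = line_transfers(round_robin_writes(n, k, lambda cpu: cpu * LINE_BYTES))
print(shared, packed, padded)
```

In this model the shared counter and the packed per-CPU counters both force a transfer on nearly every write (3999 of 4000), while the padded layout needs none. The packed case is the false sharing one: nobody touches anybody else's variable, yet they all fight over the same line.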
Now do this with a page at a time, say a buffer. Like, I dunno, an
Infiniband MPI buffer, or a 10 GbE MPI buffer. Throw more CPUs behind
this buffer, and force them to get in line to shoot data over to their
counterparts. The IB or 10 GbE resource becomes contended for, and as
you increase Ncpu, the contention and performance loss gets worse and
worse (this is basically what Doug Eadline is worried about).
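The arithmetic behind this, as a rough sketch. It assumes ideal fair sharing of a single link with no queueing or locking overhead, so it is an upper bound; real contention only makes the per-CPU number smaller. The function name and the 10 Gbit/s figure are illustrative, not measurements.

```python
def per_cpu_share(link_gbit_per_s, n_cpus):
    """Best-case share of one network link per CPU (hypothetical helper).

    Assumes perfectly fair sharing and zero contention overhead.
    """
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    return link_gbit_per_s / n_cpus


for n_cpus in (1, 2, 4, 8, 16):
    print(n_cpus, per_cpu_share(10.0, n_cpus))
```

Even in this best case, 8 CPUs behind one 10 GbE port get at most 1.25 Gbit/s each.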
There are ways you can work around some of this stuff. Share nothing is
one way, though this is hard to do at an OS level where you share IO
devices etc. Allocate some private memory queues, a scheduler, and
other bits (you have to do this with CUDA systems and most accelerators
to get reasonable performance).
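A minimal structural sketch of the private-queue idea, assuming a simple round-robin dispatcher standing in for the scheduler. Each worker reads only its own queue and accumulates into its own local variable, and the partial results are combined once at the end. The names are hypothetical, and CPython threads will not show the performance benefit; this only illustrates the layout.

```python
import queue
import threading


def run_share_nothing(items, n_workers=4):
    """Sum `items` using private per-worker queues and private partial sums.

    `items` must not contain None, which is used as the stop sentinel.
    """
    queues = [queue.Queue() for _ in range(n_workers)]
    partials = [0] * n_workers   # slot i is written only by worker i, once

    def worker(i):
        local = 0                # private accumulator, never shared
        while True:
            item = queues[i].get()
            if item is None:     # sentinel: no more work for this worker
                break
            local += item
        partials[i] = local

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for k, item in enumerate(items):      # the "scheduler": round-robin dispatch
        queues[k % n_workers].put(item)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return sum(partials), partials


total, partials = run_share_nothing(range(1, 101), n_workers=4)
print(total, partials)
```

The only shared step is the final `sum(partials)`, which happens once after all workers have finished rather than on every update.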
I know you might postulate that 32 bit text is effectively the CS
equivalent of "c" in physics ... you may approach it asymptotically, but
never actually get there ... but unlike in physics, there isn't really
an underlying reason why you might not get there.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615