landman at scalableinformatics.com
Wed Apr 27 15:46:54 PDT 2005
Vincent Diepeveen wrote:
> A raid5 array of 2 terabyte costs like $2000-$3000 and it can deliver
> 400-600MB/s i/o hands down when attached to a single machine. So if you
> make the 1 tflop processor, there is no need to worry!
I need to find out where you are getting your raids...
> Anything that has to do with huge calculations is in the first place cpu
> power limited. Not anything else.
There is a statement I like to make when I see comments like this.
"Gross generalizations tend to be incorrect".
If you think about it long enough, you can see the recursive humor.
There are many different factors that will affect the overall
performance of a machine on a particular code/data set. To illustrate
this, I often suggest the following gedankenexperiment.
Imagine you have a CPU that is infinitely fast, coupled to resources
that are not infinitely fast. This means that while operations take
exactly 0 time on the CPU, we haven't done a thing to make the memory or
IO faster. In this gedankenexperiment, how much of a speedup do you get
from an infinitely fast CPU? Memory moves still take time. Data
loading and storing still take time. Data motion is quickly becoming
one of the most critical aspects (if not the most critical) of
performance for a fair number of calculations. So unless all parts are
infinitely fast, you still have to pay for the data motion time, the IO
time, the memory->memory time, and the memory->CPU time (and the
CPU->memory time).
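
The gedankenexperiment can be put into numbers with a rough sketch. The
function name and the timings below are hypothetical, picked only for
illustration, and the model assumes that CPU, memory and IO time do not
overlap, so wall time is simply their sum. Real machines overlap some of
this work, so treat the result as a bound rather than a prediction.

```python
def infinite_cpu_speedup(cpu_s, memory_s, io_s):
    """Speedup when CPU time drops to zero and nothing else changes.

    Assumes the three components do not overlap, so wall time is their sum.
    """
    before = cpu_s + memory_s + io_s
    after = memory_s + io_s  # the infinitely fast CPU contributes 0 time
    if after == 0:
        return float("inf")  # purely CPU bound: the only unbounded case
    return before / after


# Hypothetical timings in seconds, chosen only for illustration.
cpu_bound = infinite_cpu_speedup(cpu_s=95.0, memory_s=4.0, io_s=1.0)
io_bound = infinite_cpu_speedup(cpu_s=10.0, memory_s=30.0, io_s=60.0)
print(f"mostly CPU bound: {cpu_bound:.1f}x")
print(f"mostly IO bound:  {io_bound:.2f}x")
```

Under these made-up timings the mostly CPU bound case gains a lot, while
the mostly IO bound case barely moves, because the memory and IO terms
are untouched by the faster CPU.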
In short, an infinitely fast CPU would reduce (possibly significantly)
the execution time of the class of applications that are purely CPU
bound (say, operating out of internal cache only).
It will do very little for a code which is IO or memory bandwidth or
> Big RAM is nice to have for most clever algorithms, but it is second most
> important. CPU power is most important. If there is some bottleneck that
> limits the RAM we have, do not worry!
> We will find a solution!
> The real bottleneck is in the end the number of instructions a cpu can
> process a second.
Not really. The bottleneck in performance is how full you can keep the
multiple pipelines of the processor. Branch statements tend to force
pipeline flushes. You can "handle" this with speculative execution.
Real memory accesses can bottleneck the memory subsystem, so real
processors allow specific mixtures of instructions in flight at once to
reduce resource contention. If you overflow any of the fixed CPU
resources, you can stall a pipeline while waiting for the contention to
be eliminated, or you can stall the entire CPU while flushing the TLB and
other shared resources. Basically you have multiple simultaneous zero
sum games (fixed number of operations per unit time, specific mixtures
of operations that maximize the performance of instructions in flight).
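
To get a feel for how much pipeline flushes cost, here is a first-order
sketch using the usual textbook model: effective cycles per instruction
equal the ideal figure plus the average flush cost per instruction. The
function name and every number below are assumptions for illustration,
not measurements of any real processor, and the model ignores memory
stalls, TLB flushes and the instruction-mix limits mentioned above.

```python
def effective_cpi(ideal_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once pipeline flushes are charged.

    ideal_cpi       -- cycles per instruction with every pipeline kept full
    branch_fraction -- fraction of instructions that are branches
    mispredict_rate -- fraction of branches that are mispredicted
    flush_penalty   -- cycles lost refilling the pipeline after a flush
    """
    stall_cycles = branch_fraction * mispredict_rate * flush_penalty
    return ideal_cpi + stall_cycles


# Hypothetical 4-wide machine: at best 0.25 cycles per instruction.
ideal = 0.25
actual = effective_cpi(ideal, branch_fraction=0.20,
                       mispredict_rate=0.05, flush_penalty=20)
print(f"ideal CPI {ideal:.2f}, effective CPI {actual:.2f}, "
      f"fraction of peak {ideal / actual:.0%}")
```

Even with a low assumed misprediction rate, the flush term is comparable
to the ideal CPI here, which is the sense in which peak instructions per
second is not the number that matters.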
Compilers are, as I indicated before, not particularly smart in most
cases, and they generate code locally that might not make sense
globally. Moreover, how instructions are ordered and presented to the
CPU will fundamentally impact the overall performance. Code optimizers
are, in a large sense, an attempt to better fit the emitted instructions
to the processor architecture, by rewriting loops, mathematical
constructs, and the like. Optimizers are not perfect.
Some architectures are pretty much impossible to write optimal code for
(turns out to be NP-hard), and you have to accept a set of compromises
at some point to avoid having your compilation take 24 hours (my MD
codes used to take about 24 hours to build on a Trace Multiflow, VLIW
The overall point of this is:
a) writing good code is hard
b) writing fast code is harder
c) CPUs don't automagically make things faster; compilers are implicated
in this mess
d) some optimizers are better left off :(
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615