>2 p4 processor systems

Robert G. Brown rgb at phy.duke.edu
Tue Aug 27 13:19:55 PDT 2002


On Tue, 27 Aug 2002, Brian LaMere wrote:

> For whatever reason, management has decided that real estate in the server
> room is an *extreme* issue.  There are plenty of empty racks, but hey...what
> do I know.
> 
> After testing, I found a single-cpu p4 system literally did almost exactly
> (averaged, less than 1% difference) the same amount as a dual p3 1ghz.
> 
> For whatever reason, it has been decided that getting 1U 6-way p3 1ghz
> systems at extreme costs would be better than simply getting a dual p4
> system.   The dual p4 would do 2/3 the work, at 1/7 the cost.  Maybe I
> didn't take that special math that has caused so many places so many issues
> lately, but...that just doesn't make sense to me.

I'm assuming that you mean a single 2.4 GHz P4 (or thereabouts) compared
to one dual P3 at 1 GHz, although you don't say.  I assume a clock
higher than 2x because (surprisingly) I personally find that a P4
doesn't quite scale with clock relative to the P3, for at least some of
my applications.

Still, the rest of your argument is flawless.  Here are some more
arguments to throw out there to support your point even more strongly:

  a) What is the power consumption of the 6xP3 loaded with your task(s)?
The dual P4's consumption is likely to be lower.  Depending on how your
server room is wired, it might well be that circuit capacity is far more
of a limiting issue than rack space.  See also the recent discussion on
harmonics and switching power supplies -- loading your circuits to
capacity could reveal all sorts of horrible problems if the server room
wiring wasn't done correctly.

  b) Same question backwards -- if it uses more power, it generates more
heat (in a 1U case!).  Just because you can find boxes out there in odd
configurations doesn't mean that they will run stably under all load
mixes.  A 1U 6xP3 might do just fine in an HA application (where the
per-CPU load is Poissonian and intermittent) but die a heat death in an
HPC application (where it generates the maximum possible amount of heat
all the time).  In many cases AC capacity is the fundamental limitation
on what you can put in a server room without VERY expensive renovations.

So, you need to figure out how many circuits there are in the server
room and what the maximum kVA they can supply under any circumstances is
(multiplying theoretical capacity by 0.6-0.8 to obtain real capacity, to
accommodate the harmonic problem, unless the room is on a harmonic
mitigating transformer or you use expensive power-factor-correcting
power supplies in all your nodes).  You also need to find out what the
peak AC capacity serving the space is (presuming that you need to
maintain an ambient temperature of <70F, ideally <60F).  You may find
that rack space is a total non-issue and can never BECOME an issue
unless somebody spends $100K or more on room renovations.
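
To make that concrete, here is a trivial back-of-the-envelope sketch of
the arithmetic.  Every number in it -- circuit count, derating factor,
per-node wattage -- is a made-up placeholder, so plug in your own
measurements before drawing any conclusions:

    # Rough usable-power check for a server room.
    # All figures below are illustrative placeholders, not real data.
    circuits = 4                          # dedicated 20 A / 120 V circuits
    kva_per_circuit = 20 * 120 / 1000.0   # 2.4 kVA nameplate each
    derate = 0.7                          # 0.6-0.8 harmonic/derating factor
    watts_per_node = 250                  # measured at FULL load, not idle

    usable_kva = circuits * kva_per_circuit * derate
    max_nodes = int(usable_kva * 1000 / watts_per_node)
    print("usable: %.1f kVA -> at most %d such nodes" % (usable_kva, max_nodes))

Remember that every one of those watts also ends up as heat that the AC
has to remove, so the same numbers feed straight into the cooling
estimate.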

 c) Your computation that with P4s you are buying 2/3 of the capacity
for 1/7 the money is still pessimistic.  That figure holds IF AND ONLY
IF your parallel code scales perfectly and Amdahl's law is all but
irrelevant (for example, doing nothing but embarrassingly parallel
computations for very long blocks of time).  Otherwise, fewer, faster
CPUs will often significantly outperform more, slower CPUs because they
complete the rate-limiting SERIAL fraction of the code many times
faster.  As a single (extreme) example, suppose your parallel
application has a 1/7 serial component.  It takes one minute to execute
the serial component on one of the P3s.  It then splits and takes one
more minute to complete the other six minutes' worth of work using all
six available P3 CPUs.

On the P4, it takes 30 seconds to complete the serial fraction (half the
time of the single P3).  It then completes each remaining sixth of the
computation in 30 seconds of CPU time, with two CPUs (taking a minute
and a half).  The two architectures complete in the SAME amount of time!
It would be fairer to state that the BEST the 6xP3 can do on ideal
parallel code is 1.5x the work of the 2xP4, where it could easily be a
lot less depending on where and how the code contains serial
bottlenecks.  YMMV, and the ratio is likely to be worse (for the P3s)
than your estimate, almost certainly never better.
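
If it helps to see that worked example spelled out, here is a minimal
sketch of the arithmetic.  The 1/7 serial fraction and the assumption
that one P4 CPU does the work of two P3 CPUs are just the numbers used
in the example above, not measurements:

    # Wall-clock time for a job with a serial fraction (Amdahl's law).
    # total_work is in "P3-minutes"; speed is per-CPU speed relative to a P3.
    def wall_time(total_work, serial_frac, ncpus, speed):
        serial = total_work * serial_frac / speed
        parallel = total_work * (1.0 - serial_frac) / (ncpus * speed)
        return serial + parallel

    work = 7.0   # 1 minute serial + 6 minutes parallelizable on a single P3
    print(wall_time(work, 1/7.0, ncpus=6, speed=1.0))   # 6x P3 -> 2.0 minutes
    print(wall_time(work, 1/7.0, ncpus=2, speed=2.0))   # 2x P4 -> 2.0 minutes
    # With NO serial fraction at all (perfect scaling), the 6xP3's edge
    # tops out at 1.5x:
    print(wall_time(work, 0.0, ncpus=2, speed=2.0) /
          wall_time(work, 0.0, ncpus=6, speed=1.0))     # -> 1.5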

 d) On a similar note, consider other hardware bottlenecks and
architectural limitations.  OK, so you have a 6xP3.  How much memory
does it hold, and how is it apportioned for use per CPU?  How is disk
shared?  How is the network shared?  Do all the CPUs access the same
memory, or does each CPU have its own memory space?  If the latter, how
do the CPUs communicate?

These are all CRITICAL questions, and would ordinarily be among the
FIRST questions addressed in any serious consideration of a
multiprocessor machine.  Consider some of the possible answers:

    i) 6 P3s share one block of three SDRAM slots, for a maximum memory
of 1.5 GB (or 256 MB per CPU).  Memory is shared, so all CPUs can access
all memory, but they can EASILY saturate the memory bus, so one might
easily end up with five CPUs waiting and doing NOTHING while the sixth
reads a big block of memory.  For certain classes of memory-bus-bound
code, your 6x P3 might be no faster than a 2x P3, while your 2x P4,
sharing 2 GB (or 1 GB per CPU), might also use DDR or RDRAM with a
faster FSB and be able to run flat out, or nearly so (see the rough
per-CPU bandwidth sketch after these examples).  This is presuming
SERIAL applications, not even parallel at all.  Big applications simply
won't scale 6-way on a box like that.

   ii) 6 P3s each have a block of three SDRAM slots, for a maximum
memory of 1.5 GB per CPU.  Now you have lots of memory, but have to pay
a penalty for CPUs to talk to each other.  What lives between them?  If
they are on a network of some sort, you need to know what sort, and
whether or not you have drivers for it in any given OS.  The system runs
MUCH hotter.  Things like DMA for any shared peripherals (e.g. NICs,
disks) become "interesting" (in the sense of the Chinese curse).
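
To put a very rough number on the shared-bus point in (i): the figures
below are ballpark PEAK bandwidths for PC133 SDRAM and dual-channel
PC800 RDRAM, and real sustained numbers will be substantially lower, but
the per-CPU share is the point:

    # Crude per-CPU share of peak memory bandwidth (ballpark peak figures).
    configs = {
        "6x P3 on one shared PC133 SDRAM bus": (6, 1.06),  # ~1.06 GB/s peak
        "2x P4 on dual-channel PC800 RDRAM":   (2, 3.2),   # ~3.2 GB/s peak
    }
    for name, (ncpus, peak_gb_s) in configs.items():
        print("%-38s %.2f GB/s per CPU" % (name, peak_gb_s / ncpus))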

Similar issues accompany all the other shared resources, including NICs
and disks and video controllers and anything else.  6-way sharing means
the potential for HUGE bottlenecks if you try to run ANY application
that bottlenecks on a shared resource.  Resources that are not shared
create bottlenecks or resource crunches in other places.  You are NOT
just comparing raw FLOPS or IPS; you have to look at the nonlinearities
and bottlenecks in the systems' design to have any idea of how parallel
performance will scale, even if all you want to do is run six more or
less independent tasks.

  e) Finally, in light of all of the above, what sort of operating
system support is there for the 6x system?  Perhaps you can run Linux,
perhaps not.  Presumably you can run SOMETHING, but that something might
be a proprietary OS with lots of custom drivers.  In this case the cost
of the hardware may be a tiny fraction of the total cost of ownership of
the 6x system, and if (as is not horribly unlikely) the system turns out
to be moderately unstable and not particularly aggressively maintained,
system downtime becomes yet another hidden cost waiting to bite you.

I mean, I personally would want to SEE and PLAY WITH the 6x P3 fairly
extensively before committing to it on this basis alone to ensure that
yes, I can install it and manage it and that it won't just crash the
minute I try to actually USE it.

I personally think that one would almost certainly have to be
fair-to-middling insane OR have one of a small class of very specific
tasks to accomplish to go with the 6x 1U system, and even then one
should probably look at the other "exotic" architectures (such as blade
computers using the Transmeta CPU) before deciding, as tasks that
actually can do well on the 6xP3 (with its likely mish-mash of memory
and peripherals) can probably ALSO do well on a blade computer that
would beat it in flops/U (over several U), heat and power
(significantly), and architectural cleanliness.  That's if you're going
to spend a lot of money anyway, of course; a single or dual P4 or Athlon
is likely to be your best price/performer and will probably perform MUCH
better in a real parallel or memory-intensive application either way.

HTH,

   rgb

> 
> So I'm trying to find out if anyone knows of a 4-way p4 system out there.
> I'm wanting to bring a couple dual-p4's in here just so they'll see that the
> performance far surpasses the current per-node performance we have on our
> cluster, but...brick wall.  The guy above me agrees with me, the guy above
> him won't talk to me about it.  He just gets all excited about a 6-way p3
> server in 1u.  Whoopie.
> 
> So...help?  Anyone know of any 4-way p4 systems?  And no, amd isn't an
> option (unfortunately).
> 
> Brian
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





