[Beowulf] Vector coprocessors AND CILK

Wed Mar 22 14:52:19 PST 2006

----- Original Message ----- 
From: "Bill Harman" <billharman at comcast.net>
To: "'Vincent Diepeveen'" <diep at xs4all.nl>; <daniel.pfenniger at obs.unige.ch>; 
"'Jim Lux'" <James.P.Lux at jpl.nasa.gov>
Cc: <beowulf at beowulf.org>
Sent: Wednesday, March 22, 2006 11:05 PM
Subject: RE: [Beowulf] Vector coprocessors AND CILK

> Should the question not be: how much are you willing to pay.  If you only
> get a 2X speed up in your run time for $10K, then you will buy more nodes.
> If you get a 10X, you may call it a wash and could go either way.  If you
> see a 1000X then you will pay a great deal more than $10K.  I have seen a
> bioinformatics application where the site installed a reconfigurable FPGA
> box and got an approx. 1400X improvement on a Smith-Waterman application,
> vs their Linux Cluster.

Only a 1400x speed improvement in bioinformatics?

That's very little.

I remember helping out a PhD student whose university in the USA uses a
very slow software program.
There is a commercial solution that calculates the same thing a factor of
100000 faster, and there are another 10 commercial solutions that are also
nearly 100k times faster.

However, they want a certain outcome, so this 1980s program is what his
professor prefers.
Its outcome supports the faculty's viewpoint on how certain things have
evolved from the past until now, and only that 1980s program, which now
eats some 100-400 CPUs nonstop, produces the outcome they like to conclude.

Note that in physics, too, a factor 1000 speedup is sometimes nothing that
makes you blink. I know examples of departments where a good programmer
started work and sped up their matrix calculations, at the precision they
needed, by about a factor of 1000 on average.

If I just look at what universities do in my own area, their research is
already 20 years out of date, or some wrong conclusion gets drawn simply
because they can't program or do not want to, as programming is seen as
'dirty work'.

Especially in Artificial Intelligence, you see endless numbers of
researchers who nonstop rewrite conclusions from old research, because
their own software attempt was so hopeless, such a beginner's attempt, that
compared to it Leiserson & co with Cilk look pretty holy.

"only a factor 50 slower".

> Economy of scale applies to the computer consumer mass
> market, but rarely does it apply in the more demanding HPC market, in
> general terms, for these type of products.

Economy in the end drives every technology.

You won't get funding right now for a Cray 4 processor machine @ 1GHz which
eats 500 kilowatts.

PC processors have simply won on production price.
My own attempt there demonstrated that very clearly.

With respect to the high end, it is just a matter of the time those
committees need to realize what a processor of today can and cannot do,
after which we'll see a huge number of bankruptcies, or downsizing of
companies, in the high end.

The university world will simply take time to also follow that path. To give 
concrete examples why:

My government here in Europe ordered a 416 processor Altix3000 delivering 2
Tflops on paper for, I guess, around 12.5 million euros. For one third of
that, the LOFAR project ordered a 12288 processor IBM machine delivering
tens of Tflops.
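
A rough sketch of the price/performance arithmetic behind that comparison,
in Python. The prices and processor counts are the ones quoted above (the
Altix price is itself a guess). The post only says "tens of Tflops" for the
IBM machine, so the value of 10 used here is an assumed, conservative
stand-in, and the function name is made up for illustration.

```python
def euro_per_tflop(price_euro: float, peak_tflops: float) -> float:
    """Purchase price divided by peak (on-paper) Tflops."""
    return price_euro / peak_tflops


# Figures quoted in the post.
altix_price = 12.5e6           # euros, "I guess around"
altix_tflops = 2.0             # peak, on paper
altix_cpus = 416

lofar_price = altix_price / 3  # "one third of that"
lofar_cpus = 12288
# ASSUMPTION: the post says only "tens of Tflops"; 10 is a conservative
# placeholder, not a figure from the post.
lofar_tflops = 10.0

altix = euro_per_tflop(altix_price, altix_tflops)
lofar = euro_per_tflop(lofar_price, lofar_tflops)

print(f"Altix: {altix:,.0f} euro per Tflop, "
      f"{altix_price / altix_cpus:,.0f} euro per processor")
print(f"LOFAR: {lofar:,.0f} euro per Tflop, "
      f"{lofar_price / lofar_cpus:,.0f} euro per processor")
print(f"Ratio (Altix / LOFAR) per Tflop: {altix / lofar:.1f}x")
```

Any larger value for lofar_tflops only widens the gap per Tflop.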

The professor on the committee for that SGI machine, the one writing the
reports, didn't even realize that the K8 processor has SSE2, nor that it
has 1024KB of L2.

He wrote this down in his report. Quote from page 18 (overview of recent
supercomputers):
   "due to shrinkage of components the chip now can harbour the secondary
   cache of 256KB and the memory controller".

This report was printed in January 2004 and written at the end of 2003.
Opteron was tested and showcased around the end of April 2003 and had a 1MB
L2 right from the start.

He just doesn't realize that the K8 has SSE2, for example.
They only realized it for the Intel Xeon processor.

Additionally, many of those "high performance" professors are only busy
with the L2 cache, whereas the most important cache is the L1 cache.

I'll forgive them. They are no longer actively working in the field, just
busy with meetings, doing good work for their students, and cashing in some
money on tens of committees.

Keeping up with their own field must be incredibly hard with all that
activity, so no hard words there. They earned their stripes in the past.

However, it's easy to demonstrate why only L1 is really important and the
rest is just of secondary importance.
Usually 90%+ of all accesses hit in L1, except for some databases which get
80% hits, which makes the size and speed of L2 completely irrelevant.
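
One way to put numbers on this is the usual average memory access time
estimate. The Python sketch below is only an illustration: the cycle counts
and L2 hit rates are assumed round numbers, not measurements of a K8 or a
P4, and the function name is made up. It shows how much of the average
access cost is left for L2 to influence as the L1 hit rate rises.

```python
def amat(l1_hit: float, l2_hit: float,
         l1_cycles: float = 3.0,
         l2_cycles: float = 15.0,
         mem_cycles: float = 200.0) -> float:
    """Average memory access time in cycles for a two-level cache.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses also pay the main-memory latency.
    The default latencies are ASSUMED values for illustration only.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


for l1_hit in (0.80, 0.90, 0.99):
    worse_l2 = amat(l1_hit, l2_hit=0.80)   # assumed weaker L2
    better_l2 = amat(l1_hit, l2_hit=0.95)  # assumed stronger L2
    print(f"L1 hit rate {l1_hit:.0%}: "
          f"{worse_l2:.2f} vs {better_l2:.2f} cycles, "
          f"gap {worse_l2 - better_l2:.2f}")
```

The gap between the two L2 variants scales with the L1 miss rate, so how
irrelevant L2 really is depends on the hit rates and latencies of the
program at hand.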

For my own program this was clearly demonstrated by Johan de Gelas, who
tested 2 nearly identical P4s that differed only in L2 size.
The difference was 0%.

If you're going to multiply things that aren't in registers yet, then
obviously the biggest problem is not the multiplication unit but the fact
that the Prescott P4 can only issue 1 instruction in 4 cycles, which means
you have a latency of 8 cycles before the next multiplication can take
place. On the A64, by contrast, you can do it every 3 cycles, as you can do
2 reads simultaneously.

Size of L2 is completely irrelevant in all those cases.

Of course, such a report is simply 1 year behind. After such a report is
released, another 6 months pass before a new system is bought.
By then, of course, the knowledge the decisions are based upon is 2 years
out of date; then comes another delay, and a year later a machine is
installed. Total time lost: 3 years. By then other companies already have a
new generation product which is either a lot cheaper or delivers more for
the same price.

Development goes fast in hardware in that sense.

But it's hard to deny that the PC processors have won everywhere.
So the only real interesting question now is: "how to cluster them?"

> Bill Harman,
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
> Behalf Of Vincent Diepeveen
> Sent: Wednesday, March 22, 2006 12:16 PM
> To: daniel.pfenniger at obs.unige.ch; Jim Lux
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] Vector coprocessors AND CILK
>
>
> ----- Original Message -----
> From: "Jim Lux" <James.P.Lux at jpl.nasa.gov>
> To: "Vincent Diepeveen" <diep at xs4all.nl>; <daniel.pfenniger at obs.unige.ch>
> Cc: <beowulf at beowulf.org>
> Sent: Wednesday, March 22, 2006 6:00 AM
> Subject: Re: [Beowulf] Vector coprocessors AND CILK
>
>
>> At 07:18 PM 3/21/2006, Vincent Diepeveen wrote:
>>
>>>----- Original Message ----- From: "Daniel Pfenniger"
>>><daniel.pfenniger at obs.unige.ch>
>>>To: "Jim Lux" <James.P.Lux at jpl.nasa.gov>
>>>Cc: <beowulf at beowulf.org>
>>>Sent: Thursday, March 16, 2006 6:32 PM
>>>Subject: Re: [Beowulf] Vector coprocessors
>>>
>>>
>>>
>>>If you produce such cards in low quantity you lose roughly 100 dollars
>>>on the PCI card, basically to royalties, and then you add the chip
>>>production price. 2 big chips, well, I do not know what price they are.
>>>Sounds expensive to me. I talked about 1 big chip for some other card.
>>>
>>>That chip had a price, when mass produced, of 50 dollar a chip.
>>
>> If it's a full custom chip, figure a "first chip" cost of $2M. (layout, a
>> couple spins, etc., but assuming you know basically what the chip is
>> supposed to do and how to do it)
>
> Most projects are indeed having a total cost of around $1M for such chips,
> including programming and design and package design for
> supermarkets/resellers/salesmen/stores.
>
>> I work with a fair number of very low volume but fairly complex chips
>> (intended for space applications, but not in Class S quality grade) and
>> they all seem to run about $5K to $10K each, which must be a sort of basic
>
> That must be something not sold in a shop then, but something intended to
> cream off the university world. Those universities waste money by the
> shitload, so for them $10k is affordable hands down.
>
> However, for products and cards that you want to sell to ordinary people
> who simply want a somewhat better card, a price of $10k is too much.
>
>> price for them to build small runs where there's not a huge NRE.  Things
>> like MOSIS (http://www.mosis.org/) (or Atmel's equivalent, the name of
>> which I forget) can be less expensive, but probably not for something of
>> this scale.  $5K probably covers the cost of running the wafer, dicing,
>> testing, and putting it in a package, in quantities of <100.
>>
>> So, to get the $50/chip cost, you need an order of 40,000-50,000 pieces.
>
> No no. 1000-5000.
>
> For 20000+ you can get the entire product, including packaging, down to
> far less money.
>
> This card, however, has 2 chips, not 1. That's a huge difference.
> Additionally, it clocks perhaps "only" at 250MHz, but it might be more
> complex technology than the chip we wanted to produce.
>
> We just wanted a single giant chip, also at a comparable, though a bit
> higher, MHz range. The MHz range is however less important than the
> product price.
>
> Of course, if this product were a success, more of them would be
> produced, which would probably reduce the price of the manufacturer's
> offering.
>
> Of successful products, such as chess computers, you sell around 100000
> when they do well.
>
> That's only for the *successful* products.
>
> Usually such numbers are only for low-priced items, like 99.95 euro
> computers.
>
> For small quantities, however, it was not possible to get the production
> price under $150 (excluding packaging and delivery, just card+chip). That
> simply meant the entire product could not be produced: the chip wouldn't
> carry any RAM, so a PC would outperform it, and in general a product
> sells for 4x its production price. So that would mean a pre-tax price of
> 600 dollars, or a final price for the customer (add roughly 20% VAT on
> products in Europe) == 600 euros.
>
> Now of course the dollar will go down big time, which means that
> effectively the sales price could perhaps become 499 euros, which is a
> very competitive price.
>
> Yet you'll have to have RAM on the card then to compete.
>
> It's easy to sell a lot of products if a product outperforms all software
> that is out there. Asking 1000-1500 euros per product is possible in that
> case; if it doesn't, then a price of 500-1000.
>
> So for example putting my chess program in hardware is nearly
> impossible, as it's too big (what I write in a few hundred lines of C
> code takes, by default, even when very well optimized, something like
> 50000 transistors in hardware, and the code is 2.2MB in total), and
> software is more efficient than hardware, because in software you can
> use all kinds of caches which in hardware are either too slow to access
> or too expensive to make.
>
> Another alternative is a real chess computer made of wood with real
> pieces and a chip inside it; that's of course interesting.
> The alternative is a PCI card with a single chip. That's cheaper. But it's
> 150 euros, 150 dollars without sales tax.
>
> The problem is that software on a K8 just completely outperforms such a
> hardware chip.
>
>>>So I estimate the bare production price of this card at around 250
>>>dollars. You don't want to lose big time on such a card of course.
>>>
>>>That means an importer price of 500 and a consumer price of at least
>>>1000 dollars.
>> When I was working for a developer of retail products, we'd figure retail
>> selling price is 10x material cost.  For products with high integration
>> (i.e. an ASIC) you'd probably go down to 5x.
>>
>>
>>>Now you skip the importer of course with such types of cards.
>>>
>>>According to my economics book a company can then follow 2 approaches.
>>>You can try to
>>>flood the market and sell 50 million of them, which means that the card
>>>will be priced at 1000 dollars.
>>
>> Don't need to sell that many.. a hundred thousand would probably do <grin>
>
> A good chess computer gets 100k hands down.
>
> True, in the past they sold even more, and it's slowly disappearing, as
> no one wants to invest in such products.
>
> Please note that there never was any chip producer involved. Usually they
> put a single chip in the chess computers, of like 30MHz, an SH7000 chip
> or so.
>
> That's really, really cheap. That's why they do not sell anymore of course.
> PC software and hardware have won out over the custom designs.
>
>>
>>>If you're serious and you want to buy 200 of their cards, then you're a
>>>big customer.
>>>Propose a secret deal to them, in the sense that you don't publicly
>>>reveal the price paid, and you sign that for the first 3 years you won't
>>>resell their cards, nor lend or rent them out to other persons. Under
>>>that condition you offer $200k for 200 cards.
>
>> But they're not going to even be able to cover a fraction of the
>> development cost for that.  But, perhaps, if they are thinking about
>> "buying market share" with OPM (other people's money). It's been done,
>> more than once.
>
> If you want to earn back your development costs with 1 client,
> then you better stop producing such a product.
>
> Only money wasting governments want to pay that much.
>
> Vincent
>
>>
>>
>> Jim
>>
>>
>