[Beowulf] Has anyone actually seen/used a cell system?
Vincent Diepeveen diep at xs4all.nl
Wed Sep 20 15:15:10 PDT 2006
----- Original Message -----
From: "Mark Hahn" <hahn at physics.mcmaster.ca>
To: <J.A.Delcorso at larc.nasa.gov>
Cc: <beowulf at beowulf.org>
Sent: Wednesday, September 20, 2006 6:51 PM
Subject: Re: [Beowulf] Has anyone actually seen/used a cell system?

>> Can anyone point me to a url, or tell me what their
>> experience is with this technology? Is it as fast as
>> it's purported to be?
>
> I haven't come anywhere near a Cell, but then again, I'm not sure I'd want
> to. 14.6 Gflops (64b, and assuming the full 8 SPE's) isn't bad, but then
> again, a 3 GHz Core2 dual-core is 24 Gflops, and almost certainly a lot
> more accessible, shipping now, runs linux, supported by compilers and
> goto-blas, etc.

Come on, let's do a realistic comparison. Assuming IBM didn't totally mess up, let's do an objective comparison for multiplication.

Gflops is simply an overrated measure. What determines the number of matrix elements you can multiply per second, more than anything else, is the slow instruction on most CPUs called multiply. It is 4 cycles or so on the P4 (SSE2) and 4 cycles on the K8. I haven't seen a Conroe document yet, but knowing that it also has just a SINGLE execution unit doing multiplies (and probably uses the SSE2 multiplication unit for the FPU, and for integers as well, or something like that), it probably also takes about 4 cycles. It is just possible that a multiplication there doesn't block all the other execution units (which is what the K8 seems to do).

For the NTT I'm doing here (a bug-free form of multiplication; with the FFT version you never know for sure that your result is correct, and you have to redo it a second time to be 100% sure), what matters is a multiplication of 64 x 64 bits == 128 bits. So that's obviously integer calculation.

If we compare Core2 there: Core2 is an ideal processor for about everything, yet it has 2 cores @ 3GHz.

2 cores @ 3GHz / 4 cycles = 1.5GHz multiply cycle, i.e. 1.5 billion multiplies per second

Now compare the CELL processor.
I'm not sure about its latest plans (I vaguely remember 4GHz as its target, and I would be amazed if IBM actually gets it to 4GHz). Most likely it will also manage about 4 cycles for a 64 x 64 bits == 128 bits multiply. Then we're speaking about 8 * 4GHz / 4 = 8GHz multiply cycle, i.e. 8 billion multiplies per second. That is potentially more than 5 times faster than Core2 (8 / 1.5 = 5.3) for the most time-consuming part of matrix multiplication, namely the multiplication unit.

Now there is something to be said for SSE here, which can multiply 2 at a time in one go. On the other hand, we do not know the CELL's specs there; according to one document I read (which could be totally outdated), it should be able to do more instructions per cycle than Core2. If not, then Core2 wins back a factor of 1.5 or so in speed there; still no big deal. CELL just beats it totally there.

Now it is of course obvious that the vast majority of the resources clusters spend on software go to matrix-multiplication-type software. That the CELL might be extremely weak at branch mispredicts, which makes it a self-destructing chip for my chess software, is the other part of the story. Say about 70% will be extremely happy with that chip, and 30% will just praise Core2 into the skies.

There is something positive about Core2, however, that CELL cannot claim: Core2 we can already order in a store.

> if you could readily get a 8-16x PCIE card with 2 or more Cell chips and a
> bunch of ~50 GB/s local memory, for cheap, it could be quite something.

Yeah, that's faster than most supercomputers for matrix calculations. And for a CHEAP price too.

To all the high-end guys who will then say, "oh ahhh au, but how about losing bits": well, nothing is as inaccurate as FFT calculations with floating-point roundoff everywhere. NTT is totally superior there (but a factor 2 slower). And if you really have no other argument than that, well, just run a SECOND cluster of Cells and let those do the calculation for you a second time.
That gives 100% verification that your FFT ran correctly too.

Of course another disadvantage of CELL will probably be limited RAM. Certain machines (orion!) that are relatively cheap and offer a couple of hundred gigabytes of RAM at an attractive price can really boost certain applications.

Yet pissing on CELL isn't really a good idea. If what you need is massive calculation power, then 8 cores @ 4GHz will of course kick silly 2 cores @ 3GHz, especially knowing that most chip manufacturers don't seem to have an especially fast multiply instruction on their chips.

Just measuring gflops is total madness. The N log N in those calculations is the number of multiplies. Make a chip with 2 integer multiplication units that don't block each other, and NTT in integers is faster than any SSE implementation of FFT, besides having 0 round-off errors. CELL is already quite ideal there in that it has 8 cores.

Yet of course it is wishful thinking that chips with 2 multiplication units will exist any time soon for a very cheap price (no, Itanium isn't a cheap chip; additionally, it's just 1.6GHz), which would simply speed up that calculation code by a factor of 2.

If I nonstop do integer multiplications on K8 dual-core chips, 2 chips (4 cores in total), then after a number of days the machine is sometimes just DEAD. Black screen, etcetera. The chips have simply failed. It only happens if you EXCLUSIVELY do NTT nonstop, so it seems that, at least for K8 dual-core chips, the multiplication unit is extremely weak and belongs to the worst-case path. That probably means that adding a second unit will not cost that many more transistors, but will decrease yields, making chip production a tad more expensive.

So please don't piss on a chip that hopefully has 8 such units instead of the 2 in today's chips. It is potentially at least a factor 4 faster at the same clock for such DSP-type code.

Vincent

>> Apparently RedHat is developing
>> EL 4.3 to run on the system?
>
> to an OS, it's basically a kinda low-end PPC chip with 8 very weird FP
> coprocessors, the latter not relevant to the OS...
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
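
The message's point that an NTT multiplies integers with zero round-off (unlike a floating-point FFT) can be illustrated with a small sketch. This is not the author's code. The names `MOD`, `ROOT`, `ntt` and `multiply` are made up for illustration, and the 30-bit prime below is an assumption chosen to keep the example short; the NTT described in the message presumably works modulo a much larger prime, which is why it needs 64 x 64 -> 128 bit products.

```python
# Illustrative sketch only: exact integer multiplication via a
# number-theoretic transform (NTT). MOD and ROOT are assumptions picked
# for this example, not the parameters used in the message above.
MOD = 998244353   # prime of the form 119 * 2**23 + 1
ROOT = 3          # a primitive root modulo MOD


def ntt(a, invert=False):
    """In-place iterative radix-2 NTT modulo MOD; len(a) must be a power of 2."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes. Every operation is an integer multiply or add
    # modulo MOD, so nothing is ever rounded.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # modular inverse (Fermat)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD


def multiply(x, y, base=10):
    """Multiply non-negative integers x and y exactly via NTT convolution.

    Valid while every convolution coefficient stays below MOD and the
    transform length stays at or below 2**23 (true for modest inputs).
    """
    def digits(v):
        out = []
        while v:
            v, d = divmod(v, base)
            out.append(d)
        return out or [0]

    a, b = digits(x), digits(y)
    size = 1
    while size < len(a) + len(b):
        size <<= 1
    a += [0] * (size - len(a))
    b += [0] * (size - len(b))
    ntt(a)
    ntt(b)
    c = [p * q % MOD for p, q in zip(a, b)]
    ntt(c, invert=True)
    # c now holds the exact digit convolution; fold in the carries.
    result = 0
    for coeff in reversed(c):
        result = result * base + coeff
    return result
```

Because every step is modular integer arithmetic, `multiply(x, y)` agrees with `x * y` exactly for inputs within the stated limits, with no error bound to analyse. The cost is dominated by the modular multiplies in the butterfly loop, which is where the integer multiply latency discussed in the message comes in.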
