[Beowulf] Itanium2
Vincent Diepeveen diep at xs4all.nl
Thu Sep 15 17:24:26 PDT 2005
- Previous message: [Beowulf] gethostbyname?
- Next message: [Beowulf] Dolphin PCI adapters suddenly invisible
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
At 04:19 PM 9/12/2005 -0400, Mark Hahn wrote:
>> >Google for "superlinear speedup". Most likely, as you split up your
>> >fixed problem size among more processors, more and more of it fits
>> >into the processor cache, where it runs much faster due to fewer main
>> >memory accesses.
>
>also google for "strong scaling" and contrast to "weak scaling".
>
>the former assumes a fixed problem size and a range of ncpus;
>the latter assumes a fixed problem *per* cpu. I suspect you'll have
>a hard time showing superlinear speedup under weak scaling ;)
>
>> This cache effect is quite profound on Altix since some of these have
>> something like 9 MB cache per processor. You can see this result on
>
>that's the irony: the it2 really works well when data is all in-cache,

I feel that the L1 and L2 of the Itanium2 are very strong for what they do. The Montecito improvements make sense, except that Montecito is a year late, and I will have to see at what speed it is clocked. Montecito should of course have 8 cores or so to really kick butt for the price it will cost.

Itanium2's real problem is that its IPC is too low for integer work. In practice there seem to be limits that keep it from going above an IPC of 2.0.

For just raw gflops, competing with dedicated, low-clocked, ultra-cheap boxes full of cheap gflop CPUs will of course be difficult for a huge CPU like Itanium2; it is too big and eats too much power to ever compete there in a big way. In that case the struggle should be for a CPU that is not only fast at floating point but, above all, outperforms any other single CPU at integer workloads.

From what I hear about its design, they correct a lot with Montecito: like Opteron, it can execute big programs faster thanks to more instruction cache being available to quickly feed the tiny L1. Right now, if I am correct, all of that has to come from the L3 cache of Itanium2, which makes it perform very weakly for big code sizes, especially integer code.

An additional problem is that Itanium2 regularly performs at effectively 2 instructions per cycle. Sure, it seems to be able to do 4 adds within 1 cycle, but it behaves really weakly on integer codes. Its IPC there is effectively no better than that of Opteron, which is an OOO (out-of-order) processor.

It's easy to look back and say that OOO simply wins because such a processor is cheaper to make: you can put many cores onto 1 die with something like 1024 KB of L2 in total, whereas Itanium2 is completely built around its huge caches. That means that if you put many cores onto 1 die, Itanium is really weak, because it NEEDS that huge L3 and it NEEDS that fast L1. Effectively, that makes it pretty hard with current technology to build a high-clocked quad-core or even octo-core Itanium.

Meanwhile, in January 2006 we will already see quad-core Opterons, which will be very nice for those who run integer workloads, and great of course for workloads that mostly manipulate data structures with only some floating point now and then. Itanium2 simply loses there. Montecito will be a major improvement in that respect, but it simply comes too late.

Building a cluster from dual quad-core Opteron nodes will become really interesting: an all-round performer that performs very well and is really cheap per node.

The only luck that Itanium2 worshippers have is that not a single manufacturer has yet produced a big SSI (single system image) partition based on its own interconnects and a fast routing network from existing manufacturers. If that existed, I doubt how vital IPF would still be.

From my viewpoint IPF looked like a genius solution a few years ago, but it has a major problem when what you want to do is put a big bunch of cores on 1 die.

In addition, for programmers who want to optimize their code really well, IA64 has the major disadvantage that it is nearly impossible to write inline assembly for it. Hand-optimizing just a small part of your code, because the compiler isn't smart enough, is really ugly on IA64, as you have to take too much into account when writing assembly for it.

So for those who want the utmost performance, x64 has won the battle against IA64 in the long term in that respect.

Vincent

>or can somehow be prefetch+streamed so that cache misses don't happen.
>once you start missing, performance becomes unexceptional - you can
>easily see this by looking at SpecFP results. there, the it2's excellent
>scores is mainly due to extremely high results in the 2-3 very smallest
>benchmark components.
>
>around here, it's mainly serial monte-carlo jobs that are so small that
>they're always in-cache. so the "high-end" it2 (and expensive) is best
>suited for the lowest-end jobs...
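
As a rough illustration of the strong versus weak scaling distinction quoted above, here is a minimal Python sketch. The function names and the timings are made up for illustration; they are not measurements from any Itanium2, Altix or Opteron system.

```python
def strong_scaling(t1, tn, ncpus):
    """Fixed TOTAL problem size, run on 1 CPU (t1) and on ncpus CPUs (tn).

    Returns (speedup, efficiency). Ideal speedup is ncpus, ideal efficiency 1.0.
    """
    speedup = t1 / tn
    return speedup, speedup / ncpus


def weak_scaling_efficiency(t1, tn):
    """Fixed problem size PER CPU: the total problem grows with the CPU count.

    Ideally the run time stays constant (tn == t1), so efficiency is t1 / tn.
    """
    return t1 / tn


def is_superlinear(t1, tn, ncpus):
    """True when the strong-scaling speedup exceeds the number of CPUs."""
    speedup, _ = strong_scaling(t1, tn, ncpus)
    return speedup > ncpus


if __name__ == "__main__":
    # Hypothetical timings in seconds (assumed, not measured).
    # Strong scaling: the same job takes 100 s on 1 CPU and 10 s on 8 CPUs,
    # for example because each CPU's share now fits in its cache.
    print(strong_scaling(100.0, 10.0, 8))
    print(is_superlinear(100.0, 10.0, 8))
    # Weak scaling: 8x the work on 8 CPUs takes 125 s instead of 100 s.
    print(weak_scaling_efficiency(100.0, 125.0))
```

Under weak scaling the per-CPU working set does not shrink as CPUs are added, so the cache effect that produces a superlinear result under strong scaling has nothing to work with.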
