[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
Robert G. Brown rgb at phy.duke.edu
Thu Mar 10 09:19:11 PST 2005
On Thu, 10 Mar 2005, Joe Landman wrote:

> Inverting the question, if you pay 4000$US per dual CPU compute node
> (+/- a bit depending upon technology, config, supplier), what price (if
> any) would you be willing to pay for an accelerator that offered you an
> order of magnitude more performance per node, on your code, and sat in
> the PCI-e/X or HTX slots? And also as important: how hard would you be
> willing to work/how much effort committed to program these things? This
> makes lots of assumptions, such as such a beast existing, your code
> being mapped or mappable to it, and you being interested in this.
>
> Part of what motivates this question are things like the Cray XD1 FPGA
> board, or PathScale's processors (unless I misunderstood their
> functions). Other folks have CPUs on a card of various sorts, ranging
> from FPGA to DSPs. I am basically wondering aloud what sort of demand
> for such technology might exist. I assume the answer starts with "if
> the price is right" ... the question is what is that price, what are
> the features/functionality, and how hard do people want to work on such
> bits.
>
> Note: As Jeff Layton pointed out many times, the GPUs in a number of
> machines are being used by at least one group for CFD, so you can think
> of these as a sort of dedicated attached processor. They are not
> general purpose, but highly specialized computational pipelines. If you
> could have a more general one, what would it look like, what would it
> do/emphasize, and how much would it cost? I know there is no one
> answer, but I thought it would be fun to extend Omri's question.

Problems with coprocessing solutions include:

a) Cost -- sometimes they are expensive, although they >>can<< yield commensurate benefits for some code as you point out.

b) Availability -- I don't just mean whether or not vendors can get them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts with a single manufacturer.

c) Usability.
They typically require "special tools" to use them at all: cross-compilers, special libraries, code instrumentation. All of these things require fairly major programming effort to implement in your code to realize the speedup, and tend to decrease the general-purpose portability of the result, tying you even more tightly (after investing all this effort) to the (probably one) manufacturer of the add-on.

d) Continued Availability -- They also not infrequently disappear without a trace (as "general purpose" coprocessors, not necessarily as ASICs) within a year or so of being released and marketed. This is because Moore's Law (ML) is brutal, and even if a co-processor DOES manage to speed up your actual application (and not just a core loop that comprises 70% of your actual application) by a factor of ten, that's at most four or five years of ML advances. If your code has a base of 30% or so that isn't sped up at all (fairly likely) then your application runs maybe 2-3 times as fast at best and ML eats it in 1-3 years.

e) Support. Using the tools and processors effectively requires a fair bit of knowledge, but there is usually a pitifully small set of other implementers of the non-mainstream technology and no good communications channels between them (with some exceptions, of course). You're likely to be mostly on your own while trying to get the tools installed, code written and debugged, and eventually made efficient. If the tool or processor turns out to be "broken" for your purpose, you aren't likely to get much help with this, either, as you're a fringe market (again, with possible exceptions).

Each of these alters the naive cost-benefit estimate of "Gee, it is 10x faster in my core loop and only makes my system cost 2x as much". Maybe it is 10x faster in the core loop that is 70% of your code, so that now the application runs in 0.37x the original time (0.30 + 0.70/10; good, but this now has to be compared to perhaps 0.5x the time available from getting 2x as many ordinary systems).
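
The arithmetic behind those numbers is just Amdahl's law plus a doubling-time estimate, and can be sketched in a few lines of Python. This is only a rough illustration: the function names are made up for the example, the 18-month doubling period is an assumption (the post gives no figure), and the 0.5x for twice as many ordinary nodes assumes ideal parallel scaling.

```python
import math


def runtime_fraction(accelerated_share, accelerator_speedup):
    """Amdahl's law: new run time as a fraction of the original run time."""
    untouched = 1.0 - accelerated_share
    return untouched + accelerated_share / accelerator_speedup


def years_until_overtaken(app_speedup, doubling_years=1.5):
    """Years for commodity hardware to match a given speedup.

    doubling_years is an ASSUMED Moore's Law doubling period (18 months);
    it is not a figure taken from the post.
    """
    return math.log2(app_speedup) * doubling_years


# Core loop is 70% of the run time and the coprocessor makes it 10x faster.
frac = runtime_fraction(0.70, 10.0)
app_speedup = 1.0 / frac

print(f"run time with coprocessor:     {frac:.2f}x original")
print(f"whole-application speedup:     {app_speedup:.1f}x")
# Assumes ideal scaling when the node count is doubled instead.
print(f"run time with 2x plain nodes:  {1.0 / 2.0:.2f}x original")
print(f"years for ML to match 10x:     {years_until_overtaken(10.0):.1f}")
print(f"years for ML to match the app: {years_until_overtaken(app_speedup):.1f}")
```

With these assumptions a full 10x lasts about five years and the roughly 2.7x whole-application gain lasts about two, in line with the ranges above. Changing `doubling_years` moves both figures proportionally.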

Maybe it takes you four months to get the cross-compiler installed and all your code ported and to then TWEAK the code so it really DOES give you the touted 10x speedup for your core loops, which may have to be reblocked and written using special instructions, which then also necessitates revalidating the results (in case bugs have crept in during the port).

Maybe the company that made the core DSP releases a new one in the meantime (they've got ML to contend with as well) and it has a different instruction set, so that a year from now when you want to expand the cluster you either re-instrument all the code again or rely on warehoused chips of the old variety.

Maybe in 1 year dual-core, 64-bit CPUs are released that effectively double, then double again, what you can get out of COTS systems at near constant cost, and your 32-bit CPU plus coprocessor suddenly is slower, less portable, AND more expensive.

Or not. Maybe it speeds things up 10x, costs only 2x, will be available for at least 3 more years, has a user base with hundreds of users and a dedicated mailing list, has commercial or open source compiler support that requires only minor tweaks or the use of standard library calls to get most of the benefit, and is built to a standard so that four companies make the actual chips, not just one. I'm just reviewing the questions one would like to ask.

Anecdotally I'm reminded of e.g. the 8087, Microway's old transputer sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and many other systems built over the years that tried to provide e.g. a vector co-processor in parallel with a regular general purpose CPU, sometimes on the same motherboard and bus, sometimes on daughterboards or even on little mini-network connections hung off the bus somehow.
None of these really caught on (except for the 8087, and it is an exercise for the studio audience as to why an add-on processor that really should have been a part of the original processor itself, made by the mfr of the actual crippled CPU from the beginning, succeeded), although nearly all of them were used by at least a few intrepid individuals to great benefit. Allowing that Nature is efficient in its process of natural selection, this seems like a genetic/memetic variation that generally lacks the CBA advantages required to make it a real success.

   rgb

--
Robert G. Brown    http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525
email:rgb at phy.duke.edu
