[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
Jim Lux James.P.Lux at jpl.nasa.gov
Thu Mar 10 11:48:19 PST 2005
At 09:19 AM 3/10/2005, Robert G. Brown wrote:
>On Thu, 10 Mar 2005, Joe Landman wrote:
>
> > Part of what motivates this question are things like the Cray XD1 FPGA
> > board, or PathScale's processors (unless I misunderstood their
> > functions). Other folks have CPUs on a card of various sorts, ranging
> > from FPGA to DSPs. I am basically wondering aloud what sort of demand
> > for such technology might exist. I assume the answer starts with "if
> > the price is right" ... the question is what is that price, what are
> > the features/functionality, and how hard do people want to work on such
> > bits.
>
>Problems with coprocessing solutions include:
>
>  a) Cost -- sometimes they are expensive, although they >>can<< yield
>commensurate benefits for some code as you point out.
>
>  b) Availability -- I don't just mean whether or not vendors can get
>them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts
>with a single manufacturer.

Definitely an issue.

>  c) Usability. They typically require "special tools" to use them at
>all. Cross-compilers, special libraries, code instrumentation. All of
>these things require fairly major programming effort to implement in
>your code to realize the speedup, and tend to decrease the
>general-purpose portability of the result, tying you even more tightly
>(after investing all this effort) with the (probably one) manufacturer
>of the add-on.

To a certain extent, though, this is being mitigated by things like
Signal Processing Workbench or Matlab, which have "plug-ins" to convert
generic algorithm descriptions (e.g. Simulink models) into runnable
code on the coprocessor or FPGA. As far as product lock-in goes, "in
theory" one could just recompile for a new target processor, although I
don't know if anyone's ever done this.
It does greatly reduce the "time and cost to demonstrate capability".

>  d) Continued Availability -- They also not infrequently disappear
>without a trace (as "general purpose" coprocessors, not necessarily as
>ASICs) within a year or so of being released and marketed. This is
>because Moore's Law is brutal, and even if a co-processor DOES manage to
>speed up your actual application (and not just a core loop that
>comprises 70% of your actual application) by a factor of ten, that's at
>most four or five years of ML advances. If your code has a base of 30%
>or so that isn't sped up at all (fairly likely) then your application
>runs maybe 2-3 times as fast at best and ML eats it in 1-3 years.

There are specialized applications, lending themselves to clusters, for
which this might not hold. If we look at Xilinx FPGAs, for instance,
while not quite doubling every 18 months, they ARE dramatically
increasing in speed and size fairly quickly. And it's not hugely
difficult to take a design that ran at speed X on a size Y Xilinx FPGA
and port it to speed A on a size B Xilinx FPGA.

Consider a classic big crunching ASIC/FPGA application, that of running
many correlators in parallel to demodulate very faint signals buried in
noise (specifically, raw data coming back from deep space probes), or
some applications in radio astronomy. In the latter case, particularly,
there's a lot of interest in taking an array of radio telescopes and
simultaneously forming many beams, so you can look in lots of directions
at once, to look for transient events that are "interesting" (like
supernovae).

The radio astronomy community is relatively poor (Paul Allen's interest
notwithstanding), so they've got an incentive to use cheap commodity
processing for their needs, but off-the-shelf PCs might not hack it.
They're looking at a lot of architectures that strongly resemble the
usual cluster...
data from all antennas streams into a raft of processors via Ethernet,
and each processor forms some subset of beams either in space or
frequency. They might have a coprocessor card in the machine that does
some of the early, really intensive beamforming computation. Take a look
at the Allen Telescope Array or at the Square Kilometer Array or at
LOFAR.

>Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer
>sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and
>many other systems built over the years that tried to provide e.g. a
>vector co-processor in parallel with a regular general purpose CPU,
>sometimes on the same motherboard and bus, sometimes on daughterboards
>or even on little mini-network connections hung off the bus somehow.
>
>None of these really caught on (except for the 8087, and it is an
>exercise for the studio audience as to why an add-on processor that
>really should have been a part of the original processor itself, made by
>the mfr of the actual crippled CPU from the beginning, succeeded),

That's pretty easy. In the good old days, you had an integer CPU and an
add-on FPU in almost all architectures. The FPU didn't have instruction
decoding, sequencing, or anything like that; it was more like an extra
ALU that tied to the internal bus. Just like having memory management
in a separate chip. Intel and Motorola both used this approach.

Intel did start to integrate the MMU into the chip with "segment
registers" on the 8086, except that it provided zip, zero, none, nada
memory protection. This was part of a strategy to keep the codebase
compatible with the 8080. After all, who in their right mind would
write a program bigger than 64K? The user application code would never
look at the segment registers, which would be managed by a multitasking
OS.
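A rough sketch of the segment-register arithmetic may help here. It assumes the standard 8086 real-mode rule (not spelled out in the message) that the physical address is the 16-bit segment shifted left four bits plus the 16-bit offset, wrapped to 20 bits. The names `physical_address` and `ADDRESS_MASK` are illustrative only.

```python
# Sketch of 8086 real-mode addressing: a 16-bit segment register selects
# a 64K window inside a 1 MB (20-bit) physical address space.

ADDRESS_MASK = 0xFFFFF  # the 8086 has 20 address lines


def physical_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # The segment is multiplied by 16 (shifted left 4 bits) and added to
    # the offset; on the 8086 the sum wraps around at 1 MB.
    return ((segment << 4) + offset) & ADDRESS_MASK


if __name__ == "__main__":
    # The same 16-bit offset lands in different physical memory depending
    # on the segment register value.
    print(hex(physical_address(0x1000, 0x0042)))
    print(hex(physical_address(0x2000, 0x0042)))
    # No protection: distinct segment:offset pairs can alias one byte.
    print(physical_address(0x1000, 0x0010) == physical_address(0x1001, 0x0000))
```

Code that only ever uses 16-bit offsets sees a 64K window, much as 8080 code would; whoever sets the segment register decides where that window sits, and nothing stops two windows from overlapping.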
Think of it as integrated "bank switching", which was quite popular in
the 8-bit processor world (and itself an outgrowth of how PDP-11 memory
management worked). It wasn't until the 80286 that there started to be
some more sophistication, and really, it was the 386 that made decent
memory management possible. Moto started with a virtual memory scheme
and paging, and so became the darling of software folks who had come to
expect such things from the PDP-11, DEC-10, DG, and even mainframe
world.

In any case, NONE of them could have fit the FPU on the die and had
decent yields. Besides, you're talking processors that cost $200-400
(in the 1980s), and processors with integrated FPUs would have cost
upwards of $1K-$1.5K (because of the lower yield). As fab technology
advanced, you could either build bigger, faster processors (in the
separate CPU/FPU model) or you could build integrated processors at the
same slow speed.

Even today, I'd venture to guess that the vast majority of CPU cycles
spent on PCs are integer-mode computations (bitblts and the like to make
windows work). It's not like you need FP to do Word or PowerPoint, or
even Excel. It's rendered 3D graphics that really drives FP performance
in the consumer market. This drives an interesting battle between the
graphics ASIC makers (so that an add-on card can do the rendering) and
the CPU makers (who want to put it onboard, so that the total system
cost is less), as well as the support provided by MS Windows to use
either one effectively. The game market clearly doesn't want to have to
try and support ALL the possible graphics cards out there. (It was a
nightmare trying to write high performance graphics applications back
in the late '80s and early '90s. The few skilled folks who were good at
it earned their shekels.)

>although nearly all of them were used by at least a few intrepid
>individuals to great benefit.
>Allowing that Nature is efficient in its
>process of natural selection, this seems like a genetic/memetic
>variation that generally lacks the CBA advantages required to make it a
>real success.
>
>   rgb

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
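The arithmetic in rgb's quoted point about Moore's Law can be checked with a short calculation. This is only a sketch: it applies Amdahl's law to the 70%/30% split and the factor of ten from the quote, and assumes the 18-month doubling period mentioned earlier in the message. The function names are made up for illustration.

```python
import math


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the run time is accelerated."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


def years_to_match(speedup: float, doubling_years: float = 1.5) -> float:
    """Years of steady doubling a plain CPU needs to gain `speedup`.

    The 1.5-year default is an assumption (the usual 18-month figure).
    """
    return math.log2(speedup) * doubling_years


if __name__ == "__main__":
    # Whole application sped up by a factor of ten.
    print(f"10x everywhere: matched in {years_to_match(10.0):.1f} years")
    # Only the 70% core loop sped up by ten; the other 30% is untouched.
    partial = amdahl_speedup(0.70, 10.0)
    print(f"70% at 10x: {partial:.2f}x overall, "
          f"matched in {years_to_match(partial):.1f} years")
```

Under these assumptions the full factor of ten is worth about five years of doubling, while the 70%-accelerated case comes out near 2.7x overall and is overtaken in roughly two years, which is in line with the "2-3 times" and "1-3 years" figures in the quote.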
