[Beowulf] Transputers vs Today doing lossless FFT - was Rackable SGI

Vincent Diepeveen diep at xs4all.nl
Sat Apr 4 17:19:45 PDT 2009


On Apr 5, 2009, at 12:38 AM, Nifty Tom Mitchell wrote:

> On Sat, Apr 04, 2009 at 12:19:02PM -0700, Greg Lindahl wrote:
>> On Sat, Apr 04, 2009 at 04:06:22PM +0200, Eugen Leitl wrote:
>>> On Fri, Apr 03, 2009 at 01:32:13PM -0700, Greg Lindahl wrote:
>>>
>>>>> Will have to do with embedded memory or stacked 3d memory a la
>>>>> http://www.cc.gatech.edu/~loh/Papers/isca2008-3Ddram.pdf
>>>>
>>>> We've been building bigger and bigger SMPs for a long time, making
>>>> changes to improve the memory system as needed. How is multicore any
>>>
>>> Off-die memory bandwidth and latency are limited, so many codes
>>> start running into memory bottlenecks even at a moderate number
>>> of cores.
>>
>> Exactly like shared-bus multiprocessors.
>>
>> The incremental method of solving this is what Opteron/Nehalem does.
>> The more radical method is what Origin/Altix did. It all comes down to
>> how many pins you're willing to commit to memory and how few pins you
>> can squeeze a memory bus down to; once you bump up against that limit,
>> then you need an improvement in packaging, which is no more radical
>> than what Opteron/Nehalem or Origin/Altix did. And, of course, all
>> this was invented before Opteron or Origin.
>
> The memory bus issue is a serious issue, but with a kilo-core chip
> one solution may surface with a reflection back to the Inmos Transputer.
>
> If there were a core-to-core, pipe-like bus, perhaps with a switch (not a
> full-blown any-to-any crossbar), that lets one core communicate via a
> pipe equivalent to one or more cores, the pipe need not be too deep.
> Perhaps 128 cache-line equivalents.  I do not think that it will be
> necessary for each core to connect to all the other cores directly, as
> long as data could flow across the die.  The interesting challenge for
> coding would be to take advantage of a data flow across the die.
>
> Two of the problems with the Transputer were the lack of virtual memory
> and that the network was fixed.
>

Hi,

Transputers were a bit before my time, but wasn't it the case that
they ran at something like 1 MHz with network links of nearly 10 Mbit/s?
The processor was reported to me as having really bad, ugly IPC;
I'm not sure whether I was misinformed there, so let's say an
effective IPC of 0.4.

So in short, the processor speed in MHz was nearly the same as the
datapath in Mbit/s, but the datapath far outperformed the number of
bytes per second the processor could produce.

Processor speed being nearly the same as the datapath is wishful
thinking today.

The equivalent of that today would be, say, 1024 cores at 1.5 GHz with
128-bit vectors and nearly 3 instructions a cycle. That is:

16 bytes * 3 instructions/cycle * 1.5 GHz = 72 GB/s from each processor,

which is 576 Gbit/s; including error correction and encoding overhead
(I'm no expert there), maybe 720 Gbit/s of lanes from each processor
going to the network.

In short, there is an ever bigger need for low-level programmers and
math-type guys who can rewrite algorithms into something that does a
lot of work within the L1 cache and limits the amount of bandwidth
used over the network. Not to speak of latency.

A while ago I already proposed rewriting some FFT-type codes into
integer versions, in order to limit the number of bytes stored to RAM:
integer transforms can effectively use more bits per 64-bit integer
than FFT transforms can safely store in doubles (at today's huge
transform sizes a floating-point FFT can only pack something on the
order of 16-20 bits per double, while a transform modulo a 64-bit
prime uses nearly the whole word). It would require more integer
resources on the CPUs though, especially for multiplication. The
additional advantage, for free, is that integer transforms have no
rounding errors, which at today's huge FFT sizes is not a small
advantage but already a real huge one. The last advantage is that when
such software codes work really well, you can put the logic in
hardware a lot more easily than floating point, which gives a huge
speed boost.

To give concrete numbers, an example:

Consider the Yap transform using the prime that fits within a 64-bit
register:

P = 2^64 - 2^32 + 1

Note that P - 1 = 2^32 * (2^32 - 1), so this P supports power-of-two
transform lengths up to 2^32.

Three "instructions" are needed, so to speak.
I cut-and-paste the C source code here from the FFT code.

Note that those interested can get the relevant code (maybe I need to
clean up some factorisation code that's inside, not sure):

FORCEINLINE uint64 plusmod(uint64 x,uint64 y) {  // returns :  x + y (mod P)
   // the 64-bit sum wraps iff it ends up below max(x,y); since
   // x + y >= (x|y) whenever no wrap occurs, (x|y) > sum detects the
   // overflow without needing the carry flag
   uint64 result=x+y, res1=result, res2=result-P;
   uint64 resultored=x|y, res3=result+4294967295;  // 4294967295 = 2^64 - P
   if( resultored > res1 )
     result = res3; // result + 2^64 - P
   if( res1 >= P )
     result = res2; // result - P
   return result;
}

FORCEINLINE uint64 minusmod(uint64 x,uint64 y) { // returns :  x - y (mod P)
   // when y > x the 64-bit subtraction wraps; the correct residue is P - (y - x)
   uint64 res1=y-x,result=x-y,res2=P-res1;
   if( y > x )
     result = res2;
   return result;
}

FORCEINLINE uint64 mulmod(uint64 x,uint64 y) { // returns :  x * y (mod P)
   uint64 result,reslo,reshi;
   uint64 hi_least,hi_most,hi_shifted,hi_addtolo;

   #if USEINTRINSICSX64
   reslo = _umul128(x,y,&reshi);
   #else
   reshi = mul_limbs(&reslo,x,y);
   #endif
   /* taking the modulo:
    *                           P = 2^64 - 2^32 + 1
    *  ==>  i * (2^64 - 2^32 + 1) == 0        (mod P)
    * <==>               2^64 * i == 2^32 * i - i
    *
    * so if i is the top (most significant) 32 bits of the 128 bits,
    * then :  2^96 * i == 2^64 * i - 2^32 * i == -i
    *
    * because 2^96 == -1 (mod P), the
    * reduction of i * 2^96 amounts to
    * subtracting i from the least significant part
    */

   // first a simplistic implementation, to see whether the DFT/NTT
   // with this P works at all; splits the hi quadword into 2 dwords:

   // the reduction of bits 96..127 (hi_most) waits until the end of
   // the modulo, so we start with bits 64..95 (hi_least)
   hi_least   = reshi&0x00000000ffffffff;
   hi_shifted = reshi<<32;
   hi_most    = reshi>>32;

   // now: 2^64 * i = 2^32 * i - i
   hi_addtolo = hi_shifted-hi_least;

   // now it is : reslo + hi_addtolo - hi_most;
   result = plusmod(reslo,hi_addtolo);
   result = minusmod(result,hi_most);
   return result;
}
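
As a sanity check on helpers like these, you can compare them against a
128-bit reference on compilers that have one. Below is a minimal test
harness; it is my own scaffolding, not part of the FFT code, and it
assumes GCC/Clang's unsigned __int128 on a 64-bit target. Compile it as
a separate file and link it against the three routines above:

/* smoke test: plusmod/minusmod/mulmod against an __int128 reference */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t uint64;

uint64 plusmod(uint64 x, uint64 y);    /* the three routines above */
uint64 minusmod(uint64 x, uint64 y);
uint64 mulmod(uint64 x, uint64 y);

static const uint64 PRIME = 0xffffffff00000001ULL;  /* 2^64 - 2^32 + 1 */

static uint64 rngstate = 88172645463325252ULL;
static uint64 rand64(void) {   /* xorshift64; plenty good for a smoke test */
  rngstate ^= rngstate << 13;
  rngstate ^= rngstate >> 7;
  rngstate ^= rngstate << 17;
  return rngstate;
}

int main(void) {
  for (long i = 0; i < 10000000; i++) {
    uint64 x = rand64() % PRIME, y = rand64() % PRIME;  /* residues in [0,P) */
    unsigned __int128 q = x;
    if (plusmod(x, y)  != (uint64)((q + y) % PRIME) ||
        minusmod(x, y) != (uint64)((q + PRIME - y) % PRIME) ||
        mulmod(x, y)   != (uint64)((q * y) % PRIME)) {
      printf("mismatch at x=%llu y=%llu\n",
             (unsigned long long)x, (unsigned long long)y);
      return 1;
    }
  }
  printf("all ok\n");
  return 0;
}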

That is a lot of C code for something that, done in hardware (and it
shouldn't be too complicated to do), would totally crush existing
floating-point FFTs.

Such a generic transform can get used for anything. There is just one
difference from the floating-point FFTs: the floating-point FFT that
runs in all that SSE2 code can have rounding error, and this transform
can't; AND this transform can store more bits in each 64-bit unit.

You can then use these instructions to speed up all matrix calculations
in a major way, which is something like 50% of all system time on
supercomputers, and it most definitely runs on all those GPU-type
clusters.
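
For completeness: to actually run such a transform you also need the
twiddle factors, i.e. roots of unity mod P. Here is a minimal sketch of
that, built on the mulmod above. Note my assumptions: g = 7 is commonly
quoted as a generator of the multiplicative group mod this particular P
(verify before trusting it), and the transform length n must divide
P - 1 = 2^32 * (2^32 - 1), so any power of 2 up to 2^32 works:

/* square-and-multiply exponentiation, built on the mulmod above */
uint64 powmod(uint64 base, uint64 e) {
   uint64 result = 1;
   while (e) {
     if (e & 1)
       result = mulmod(result, base);  /* pick up this bit of e */
     base = mulmod(base, base);        /* base^(2^k) */
     e >>= 1;
   }
   return result;
}

/* primitive n-th root of unity for an NTT of length n; requires that
 * n divides P-1, and assumes 7 generates the multiplicative group */
uint64 root_of_unity(uint64 n) {
   return powmod(7, (P - 1) / n);
}

A forward transform then looks exactly like the usual FFT butterflies,
with mulmod by powers of root_of_unity(n) replacing the complex twiddle
multiplies and plusmod/minusmod replacing the complex add/subtract.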

Thanks,
Vincent

>
> -- 
> 	T o m  M i t c h e l l
> 	Found me a new hat, now what?
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
>
