
Re: [oc] Beyond Transmeta...



In my humble opinion, the 1 bit array processor is best constructed using
an MIMDMP design (multiple instruction, multiple data, multiple phase).
In this thread's original example
 
    c = a + b
    d = c + e
 
the second statement in the pseudocode begins 1 bit time out of phase but
otherwise executes concurrently with the prior instruction. Unrelated
instructions would be arranged to execute concurrently, so that
 
    c = a + b
    d = c + e
    x = y + z
    p = c * x
    c = c + d
 
becomes (using 4-bit-wide numbers)
(use a Courier typeface for alignment; bit times occur right to left)
 
            aaaa+
            bbbb+
               /
           cccc+
           eeee+
              /
          dddd+
          cccc+
         cccc/  
     ppppppppp
              \
           cccc*
           xxxx*
               \
            yyyy+
            zzzz+
 
Note that, independent of the word width (4 in this case), the variables
become available for further calculation at 3 cycles. The full solution
of the product p = c*x completes in word width + 2 cycles. However, the
result is available for use as soon as its first bit becomes available.
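
The chaining of the first two statements can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: bits
travel least significant bit first, each adder is a 1-bit full adder with
a carry register, and each adder's output bit is usable by the next adder
one bit time later. The names SerialAdder and run_chain are made up for
this sketch.

```python
class SerialAdder:
    """1-bit full adder with a carry register (assumed building block)."""

    def __init__(self):
        self.carry = 0

    def step(self, x, y):
        # Consume one bit of each operand, emit one sum bit, keep the carry.
        total = x + y + self.carry
        self.carry = total >> 1
        return total & 1


def to_int(bits):
    """Reassemble an integer from bits listed least significant first."""
    return sum(bit << i for i, bit in enumerate(bits))


def run_chain(a, b, e, width):
    """Simulate c = a + b feeding d = c + e, one bit time per loop pass."""
    add1, add2 = SerialAdder(), SerialAdder()
    c_bits, d_bits, d_ready = [], [], []
    for t in range(width + 1):
        if t >= 1:
            # Stage 2 uses the c bit that stage 1 produced one bit time ago.
            i = t - 1
            d_bits.append(add2.step(c_bits[i], (e >> i) & 1))
            d_ready.append(t + 1)  # 1-based bit time at which this bit is valid
        if t < width:
            c_bits.append(add1.step((a >> t) & 1, (b >> t) & 1))
    return to_int(c_bits), to_int(d_bits), d_ready


c, d, d_ready = run_chain(3, 4, 5, 4)
print(c, d, d_ready)
```

In this model the first bit of d is valid after 2 bit times and the last
after width + 1, instead of the 2 x width a strictly sequential pair of
serial adds would need. Results wider than the word (carry out of the top
bit) are ignored here.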
 
Compare this with a RISC processor where an add occurs in 1 cycle (and
let's say the multiply occurs in 1 cycle as well): the sample program
takes 5 cycles, one for each statement. Now take the serial approach,
assume 32-bit words, and assume that the bit array clocks at 32x the
RISC. The 32-bit RISC instructions then take the equivalent of 32x5
(160) bit-array clocks, whereas in the serial approach the availability
of variables for computation begins at 3 cycles. That is in excess of
50x the RISC processor (160/3 is about 53).
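
The arithmetic behind that figure, written out as a small Python sketch.
All inputs are the numbers given above (5 statements, a 32x clock ratio,
3 bit times to first availability, word width + 2 for the full product);
the function names are invented for illustration.

```python
def risc_equivalent_clocks(instructions, clock_ratio):
    """RISC run time expressed in bit-array clocks."""
    return instructions * clock_ratio


def speedup(risc_clocks, serial_bit_times):
    """Ratio of RISC time to serial time, both in bit-array clocks."""
    return risc_clocks / serial_bit_times


WORD_BITS = 32
risc_clocks = risc_equivalent_clocks(instructions=5, clock_ratio=32)

# Measured to the point where results start to become available.
first_bits = speedup(risc_clocks, 3)

# Measured to the last bit of p = c*x, using the word width + 2 figure.
full_product = speedup(risc_clocks, WORD_BITS + 2)

print(risc_clocks, round(first_bits, 1), round(full_product, 1))
```

The 50x figure is the first of the two ratios: it measures when results
begin to be usable by a following serial operation. A consumer that needs
the complete word at once sees the second, smaller ratio.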
 
This simple example illustrates to some extent the power attainable
using multiple stream serial processing.
 
Conceptualization of this is one thing. Putting it into practice is another.
To put this into practice the program "compiler" must determine an
optimal configuration for routing the data and then "wire" the processor
to perform the task. PLDs illustrate that a "processor" can be rewired;
however, the popular PLDs are currently designed for bussing data for
parallel use, e.g. the result of an n-bit adder is available only after
the complete result is available, and not as the result propagates across
the width of the adder.
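
The scheduling half of that "compiler" job can be sketched as follows.
This is a toy, not a real tool: it assumes every operation (including
the multiply) can start one bit time after the operations that produce
its inputs, it handles only statements of the form "target = expression",
and it ignores routing and the number of available logic elements. The
function name schedule is hypothetical.

```python
import re


def schedule(statements):
    """Return the earliest start bit time for each statement, in order."""
    last_writer = {}  # variable name -> index of the statement that last set it
    starts = []
    for index, statement in enumerate(statements):
        target, expression = (part.strip() for part in statement.split("="))
        operands = re.findall(r"[A-Za-z_]\w*", expression)
        producers = [last_writer[v] for v in operands if v in last_writer]
        # Start one bit time after the latest producer, or at 0 if inputs
        # come from outside the program.
        starts.append(max((starts[p] + 1 for p in producers), default=0))
        last_writer[target] = index
    return starts


program = [
    "c = a + b",
    "d = c + e",
    "x = y + z",
    "p = c * x",
    "c = c + d",
]
print(schedule(program))
```

For the five-statement example this yields start times 0, 1, 0, 1, 2, so
the last statement begins emitting result bits during the third bit time,
which is consistent with the 3-cycle figure quoted earlier.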
 
An entirely new design of PLD would be required: one where the data
flows on route-programmable serial busses and the computational
(logic) elements are fast but relatively few in number. For example,
if the problem to solve and the available compiler were suited to
utilizing only 128 adders, then you would view the problem as one of
routing the variables and partial results.
 
The routing problem is non-trivial. The output of each of the 128 adders
could go to any one or any number of adders, including itself, as well as
to any number of other destinations. As the computation proceeds
the routing changes as required. This is somewhat like a massive
patch panel. The duration of a connection might be permanent or
it could be as fleeting as 1 bit time. The performance of the system
becomes dependent not only on the speed of the few components
(e.g. the adders) but just as much on the time it takes to reroute the
connections.
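
One way to picture the patch panel is as a list of timed connections from
which the active wiring at any bit time can be derived. The sketch below
is purely illustrative: the Route record, the crossbar_at function and
the port names (adder0.out and so on) are all invented here, and real
hardware would also have to account for the time needed to switch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    source: str                     # output port driving the connection
    dest: str                       # input port being driven
    start: int                      # first bit time the connection exists
    duration: Optional[int] = None  # bit times it lasts; None = permanent

    def active(self, t):
        if t < self.start:
            return False
        return self.duration is None or t < self.start + self.duration


def crossbar_at(routes, t):
    """Return {dest: source} for bit time t; each input has one driver."""
    config = {}
    for route in routes:
        if route.active(t):
            if route.dest in config:
                raise ValueError(f"{route.dest} has two drivers at bit time {t}")
            config[route.dest] = route.source
    return config


routes = [
    # One output fanned out to two destinations for 4 bit times.
    Route("adder0.out", "adder1.in_a", start=1, duration=4),
    Route("adder0.out", "mult0.in_a", start=1, duration=4),
    # A permanent connection from an adder back to itself.
    Route("adder2.out", "adder2.in_a", start=0),
]
for t in (0, 1, 5):
    print(t, crossbar_at(routes, t))
```

An output may appear as the source of many routes (fan-out), while the
check in crossbar_at rejects two sources driving the same input during
the same bit time.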
 
Although it would seem a logical extension of an optical circuit (OPLD),
I see no sense in waiting for these devices. If 50x performance is
attainable in a simple problem (as in the example above), it would
seem advantageous to offer this capability using current technology.
 
A manufacturer, such as Altera or Xilinx, could first test-market
this design in a new component and then, based on the experience
gained, make a device that integrates the two technologies into
one die. A second problem to solve is producing the compiler-like
program that generates the routing and scheduling information.
 
Obtaining a 50x performance gain over your competition should
provide enough incentive to pursue R&D in this area. This would
be something I would be interested in pursuing. Although this
is something the major players (Altera, Xilinx) should invest in,
it is a project that a startup could do. The startup could use
existing devices (or even a software emulator) to emulate
potential designs. After the necessary IP is protected with a
patent (application), you pursue raising capital to make the devices
or license the technology to one of the bigger players.
 
Jim Dempsey
 
 
----- Original Message -----
From: "Lars Segerlund" <lars.segerlund@comsys.se>
To: <cores@opencores.org>
Sent: Monday, February 10, 2003 6:50 AM
Subject: Re: [oc] Beyond Transmeta...

>
>   There are quite a lot of bit streaming techniques in use, look at a
> delta-sigma, or the ancient MILDAP which was a 1 bit array processor
> using SIMD. I think bit streaming operations are most useful for SIMD
> (IMHO), since it's the multiply/divide algorithms that are really hard
> to do.