Creating circuits has always been a rather different process from creating algorithms for Von Neumann style sequential processors. Traditionally circuits have been designed using paper and pencil and by drawing large circuit diagrams. Computer programming on the other hand was done with an assembler or a high-level computer language. Circuits were designed to be completely static in hardware, with all circuit elements operating simultaneously. Programs were transient, with only one instruction operating at a time. Clearly the methods used in creating software were very different from those used to create hardware.
Programmable hardware leads us to re-examine this dichotomy. We would like to use the high-level programming languages which make writing software so easy, yet they are strongly oriented towards single-execution, state-driven environments. High-level hardware development languages such as VHDL are available, but they are not well suited to generic description of algorithms.
A great deal of work has been done in recent years to adapt conventional programming languages to output hardware descriptions. These are successful in as much as they work, yet they leave a great deal to be desired in terms of the efficiency of the hardware they create. Early attempts [PageLuk91] converted a program into a large state machine. Effectively this transmuted the states of the program counter into a series of latches, creating hardware which had no higher degree of parallelism than the sequential code it was based on. More recent work has used dataflow analysis to create smaller, more efficient state machines which make better use of the inherently parallel nature of circuitry. For more detail see section 2.4.
One problem with any system based on an existing programming language is that information which is fundamentally important to efficient hardware implementation is often simply not available in the source language. Conventional programming languages assume that most arithmetic is done on ``a word'' - usually 32 or 64 bits. This is reasonable in a conventional architecture since the ALU will have this many bits available in any case and there is little to be gained from performing an operation on fewer bits. In hardware however all possible operations in the program may be expressed in silicon simultaneously, so for instance using a 64-bit register to store a single-bit flag is going to result in horribly inefficient use of hardware. Some projects modify conventional languages to attempt to patch these holes (Occam [PageLuk91] [CockshottShaw92] [ShawMilne92], Ruby [LukLok93] [Luk95] [GuoLuk95], and Handel [SpiveyPage93]), but there is still a long way to go before code generated by a programming language will use logic array space efficiently.
One aspect of computing which has been brought out by the advent of reconfigurable logic is that current programming techniques are very strongly influenced by the underlying architectures they're targeted at. Just as computer languages such as BASIC which are intended to be interpreted are very different from Algol-style languages which are intended for compilation, there is also a great difference between languages intended for sequential execution on conventional processors and languages capable of efficiently expressing algorithms for massively parallel fine-grained computation on logic arrays. Finding common ground between the two approaches is a daunting task.
Even more daunting is the prospect of finding good ways of expressing features such as run-time reconfiguration - in a conventional language this could be akin to the old-time evil of self-modifying code! Further reflection reveals that the similarities between run-time reconfiguration and self-modifying code are not so great. Self-modifying code acquired a bad reputation back in the days when it transformed machine code which was already difficult to understand into something completely incomprehensible. Generalised high level language constructs to support self-modifying code in a comprehensible way have never become widely available. The final straw for self-modifying code is that modern processors support self-modifying code very poorly due to their heavy use of caches.
To some extent modern compiler techniques redress the differences between conventional programming techniques and those required for logic arrays. Using data flow analysis a sequentially expressed algorithm can be converted into a graph of operations of which some may be executed in parallel. In future it seems likely that programming languages which express algorithms in a more universal form will be developed. It is unclear at this stage what these languages might look like.
In the interim a number of approaches have been taken to develop programming and circuit synthesis systems for current reconfigurable logic. The simplest approach is that used by SPLASH - partitioning is by hand with sequentially executed code being written in C and circuits for the logic array being written in VHDL [GokhaleHolmes91] [Lopresti91]. Perle takes a slightly different approach, converting cleverly formulated circuit descriptions written in C++ to netlists for synthesis [VuilleminBertin96]. Others have created even more sophisticated systems, translating programs partly into sequential code and partly circuitry, with automatic partitioning between the two [Athanas93] [ThomaeVandenbout92] [LukLok93] [WoForward94] [GokhaleMarks95]. See section 2.4.3 for more details.
An important stumbling block in creating code for logic arrays is that logic circuits are inherently space consuming and an entire program may not fit on a logic array. Usually only the most time-critical part of the program is converted into logic while the more algorithmically complex parts of the program are run on a conventional processor which operates in tandem with the logic array. This leads to the partitioning problem - which parts of a program to put in logic and which parts to run as sequential code?
As automatic programming tools for logic arrays are developed more and more work is being done on automatically partitioning programs into sections which can be run in logic and sections which are better run sequentially (see section 2.4.3). This problem alone is very complex and it is easy to see that the optimum result may be impossible to determine unless the data set and run profile is known, on the same basis that the ``halting problem'' prevents us from knowing information about a program's execution until it has actually executed.
The partitioning problem becomes even more complex when we cease to be dealing with program-and-forget logic arrays and begin to dynamically reconfigure them at runtime. In this case the partitioning must be performed over domains of both space and time, and details of the reprogrammability and context switching characteristics of the device must be taken into consideration.
The advent of dynamic reconfigurability offers the scope to choose at any given time throughout execution what modules are best to load. A number of researchers have proposed ``virtual logic'' which operates in much the same way as virtual memory - sections of logic are loaded into the logic array when required (see section 2.5) [LingAmano93] [Casselman93] [AlbaharnaCheung94] [AlbaharnaCheung94b] [Gokhale94] [SinghBellec94] [VuilleminBertin96]. This concept while initially attractive leads to some significant problems - circuits normally communicate by means of logic lines and it is not practical to simply ``chop off'' the connections to a section of circuit. Nor do circuits inherently ``know'' when they're blocked waiting on data and that they can be switched out of logic. A higher-level mechanism must be used to communicate between the modules in this case, and the flow of data between the modules must be carefully controlled to prevent ``thrashing'' of logic in and out of the configuration store. Current systems rely heavily on the application programmer to mediate reconfiguration and communications.
This kind of programmer-optimised communication mechanism is a long way from the ideal solution of platform-independent, automatic compilation from high-level source code. What's more the exact form of the communication is dependent in hardware implementation details such as the speed of reconfiguration and the size of the logic array. Ideally code would be compiled from a generic algorithm to operate optimally on whatever platform was available.
The purpose of this chapter is to determine what programming styles are most useful in a computer system based entirely on reconfigurable logic. This results of this analysis will be used in subsequent chapters as a guide for the development of architectural features for the Switchblade architecture. In the process a number of features useful as performance enhancements are also investigated.
Figure4.1
shows a plot of various architectures against axes of
``communication switching time'' and ``parallelism''. ``Communication
switching time'' here refers to the amount of time taken for one
section of code to communicate a result to another. In the case of
conventional sequential processors this is the time taken for a
subroutine call and in the case of logic array processors it is the
time taken to get a result from one section of the circuit to another.
The graph is not to scale - it is only intended to convey an
impression of the performance advantages of various architectures.
The overall performance of a processor is a combination of its
ability to perform computation at high speed (here quantified only as
``parallelism'') and its ability to communicate those results to the
places where they are needed. The slowest combination is on the top
left of the graph and the fastest is on the lower right. Conventional
processors have a fairly low level of parallelism - boosted mainly by
the fact that a single instruction can cause many logic gates to be
used simultaneously - and a moderately high function call overhead. A
logic array system where the entire program can fit on the array at
once has the highest level of parallelism and very little
communication overhead. It is also very likely to be prohibitively
expensive.
The more economical way to use logic arrays is some form of
run-time reconfiguration - array space is saved by configuring in
only the sections of the circuit which are needed at any given time.
Unfortunately reconfiguration itself is inherently expensive,
requiring a large amount of data to be transferred in swapping a
single configuration from memory. A solution based around multiple
configuration stores was developed
as part of the early work for this thesis and was published first in
[Saleeba93].
Multiple configuration stores exploit the
fact that most applications use only a relatively small number of
configurations, so offering multiple configurations stored within the
logic array itself drastically reduces the need to load
configurations from off chip. This means that most reconfigurations
can occur instantly by switching configurations stores, with only a
small percentage requiring transfers from off-chip.
Figure4.2
shows a program as a hierarchy of communicating parts. In
hardware these would be interconnected circuits, and in software they
would be groups of subroutines communicating by means of subroutine
calls. At the top of the hierarchy are the fundamental routines which
call the lower level routines. Most of the processing is in the lower
level routines and more of the algorithmic control is in the higher
routines. In general the partitioning process will move the more
heavily used leaf routines into hardware and leave the less frequently
used upper routines in software. The figure shows an actual
partitioning of a data analysis program, without yet
assigning modules to a final medium of hardware or software.
Just as having a massively parallel fine-grained computing
architecture affects the way we program it, so too do other
architectural features the array may have. In this thesis five
architectural features in particular will have an influence on the
algorithms we can devise -
Run-time reconfiguration allows us to change the form of the
circuit while the system is operating. This can be used to optimise
the system to operate faster according to runtime-determined
factors, or to swap sections of circuit in and out of the array as
they are needed. See section 2.5 for a survey of research into
run-time reconfiguration and section 3.1.1 for a more detailed
description.
Partial reconfiguration allows us to reconfigure just part of the
array while leaving the rest intact and still operating. This can be
useful for swapping between configurations - usually some of the
circuit remains constant while other parts may change. See section
3.1.2 for a more detailed description.
Multiple configuration stores permit very fast switches between
configurations. This has a strong effect on the viability of run-time
reconfiguration - if multiple configuration stores are available much
greater use can be made of rapid changes between configurations. See
section 3.1.3 for a more detailed description.
The use of multiple configuration stores in logic arrays is
relatively new - first described by Saleeba in
[Saleeba93].
Hydra was a system
designed with logic emulation in mind
[Maliniak94]. It provided
sixteen contexts as did DPGA, a logic array intended for computing
applications
[BolotskiDehon94]. VEGA
took a rather different approach with a very large number of contexts
- 1024
[BolotskiDehon94]. DPGA
results demonstrated that a 40%-50% reduction in array space could be
achieved using an eight context version of the architecture, given
hand-coded examples capable of exploiting the feature. The CHiP
architecture is an earlier example of a system using multiple
configuration stores, applied not to a logic array but to a
programmable interconnect system
[Snyder82].
Self reconfiguration is another architectural feature. It implies
that the device has control over its own reconfiguration. This opens
new vistas of possibility in algorithm-generating algorithms,
variable bit width arithmetic and data folding using self-modifying
circuitry. First described by Saleeba in
[Saleeba93], this feature is vital
for a self-contained reconfigurable processor such as that described
in this thesis. See section 3.1.4 for a lengthier description of
self reconfiguration.
The architecture presented in this thesis is unusual in that it
lacks a conventional processor acting as a host and controlling the
logic array. This makes some aspects easier - the partitioning
problem is less immediate for instance since everything must run on
the logic array in some way. On the other hand this approach leads to
new questions on how to control an autonomous logic array.
A great deal of research remains to be done in devising
programming techniques to exploit these architectural features. In
this section a small number of programming techniques are outlined
which exploit these architectural features.
These techniques are not necessarily mutually exclusive - several
may be applicable to a given algorithm but offer different speed/space
tradeoff advantages. Section 4.3 demonstrates this by applying
most of these techniques to an example prime number generation
algorithm.
It is not the purpose of this thesis to investigate programming
techniques in detail - only to investigate them in as much as they
impact the architecture.
One example of a programming style which is attractive in
hardware but not in software is an enumerative approach. Rather than
taking a mathematical approach to a problem - developing formulas and
having a general purpose processor calculate the results - the nature
of reconfigurable hardware favours an approach where a small amount
of clocked logic comes to the same result using an iterative
approach. Since the logic may be clocked at a very high rate it will
often exceed the speed of the formula-based approach and require less
logic resources. Logic arrays are very good at implementing counters
and adders in small amounts of space but given how expensive in die
area transcendental functions can be they are inherently less
attractive in this medium.
An example of this type of algorithm is the classic ``sieve'' prime
number generator. Compared to other methods of finding prime numbers
(discussed in section 4.3.1) it is very well suited to efficient hardware
implementation.
This programming approach is common to all hardware
implementations - it is not specifically related to configurable
logic. The following programming approaches on the other hand are all
specifically related to configurable hardware in some way.
Logic arrays tend to be capable of a high degree of parallelism
but also suffer some drawbacks from being better adapted to parallel
than sequential operations. Logic arrays are limited in size, and one
way this limitation shows is in the amount of code which can be on
the array at any one time. Overlays are a simplistic answer to this
problem. They allow the programmer to nominate to replace one section
of code with another at some point during execution. This is a
primitive form of run-time reconfiguration. Depending on the
architectural features of the target logic array it may or may not be
possible to load an overlay to only a limited section of the logic
array - it may be necessary to reconfigure the entire array at once.
Similarly the reconfiguration process may vary greatly in speed
depending on the reconfiguration interface used and whether multiple
configuration stores are available.
An overlay approach is used with most of the current generation
of run-time reconfigurable systems. The programmer must explicitly
nominate where one region of execution ends and the next overlay
begins. Methods such as automatic partitioning are
attempts to automate this process and make it less intrusive on the
programmer.
This approach is by no means limited to logic arrays. Overlays
were used extensively in the early days of microcomputers when memory
was at a premium. Large programs were loaded one section at a time
and when each section was completed the application would ``chain''
along to the next overlay, loading it from secondary storage before
execution.
This approach is an optimisation of the overlay approach which
uses the benefits of multiple configuration stores and/or partial
reconfiguration. While one configuration is operating, an unused part
of the configuration store is being reconfigured with a new
configuration preparatory to it being required later. Given the slow
speed of reconfiguration, overlapping reconfiguration with computation
can result in good speedups for algorithms which reconfigure often.
Conventional processors have a single internal data word width
and all operations are capable of operating on this width. Modern
RISC processors aggressively optimise the ALU for this word width to
obtain the results of most operations in a minimum time quantum. As a
result there is no motivation for the programmer to produce programs
which use data items containing less than the full word width. Logic
arrays on the other hand can be considered massively parallel
single-bit computers. Every bit counts, so using a C ``int'' to
represent a flag is likely to waste a large number of bits and a
large amount of array area.
Current programming languages are not well designed to provide
this kind of control over final word length. What's more word length
may vary along a computational path as accruing results change in
size. Automatic methods of determining word length are limited by the
inability to determine in advance what the size of a result will be,
in much the same way that the ``halting problem'' prevents
foreknowledge of whether a program will terminate.
In the short term it is possible to add some simple extensions to
conventional languages to specify word widths. This allows data paths
and arithmetic to operate on a programmer-specified word length,
saving a great deal of logic array area and allowing more
functionality to be crammed into the logic array. Some work on
variable word length arithmetic on reconfigurable processors was done
on Perle
[VuilleminBertin96].
Another approach is to allow variable word lengths. This
relies on run-time reconfigurability to change the width of a data
path during the execution of a program. For some algorithms a
compiler can determine that a data path width will be different when
a module is used at various different points. Separate configurations
can be generated for each of these data path widths and each regarded
as a separate entity - much as some programming languages allow
function overloading by data type.
This approach is one example of the use of special cases over the
time domain, which is discussed in more detail in section 4.2.3.7.1.
Programmers are often troubled by the same problem as compilers -
sometimes it's hard to know in advance what word length will be
required. Run-time reconfiguration allows us another option - to
reconfigure data path widths if necessary. Conceptually a circuit
could route overflow bits to an ``interrupt'' equivalent which would
cause the system to reconfigure the specific data path with more
bits. In reality this would be very expensive, requiring that the
very expensive circuit synthesis processes be repeated for the new
word width. It may however prevent catastrophic program failure and
data loss so it would almost certainly be preferable to the
alternative.
Traditional programming wisdom abhors the creation of
self-modifying code. Conventional programming tools also usually
preclude it. In reconfigurable logic arrays we stand to gain a great
deal from this technique though, particularly through the use of a
constructive approach to programming.
As was mentioned earlier, the current generation of conventional
processors are poorly suited to run-time reconfiguration due to their
use of caches. The constructive approach and many of the other
approaches described here require run-time reconfiguration so it is
clear that they are poorly supported by conventional processors, and
even if they were, the performance advantages which they offer by
expoiting the fine-grained parallelism of logic arrays would not be
available on a conventional processor anyway. These factors go a long
way toward explaining why approaches such as the constructive approach
are not popular in programming already.
Often a program starts with a very limited data set. As it runs
it acquires more data from an I/O source and operates on it, adding
it to the data it already has, often with a complexity increasing
as the internal data set increases in size. This added time cost can
be reduced or avoided by exploiting the logic array's inherent
parallelism - processing of a new data item causes a new circuit to
be generated which operates in parallel with all others when the next
data item is read. The array space used increases until some physical
bound is reached or the algorithm terminates.
The constructive approach is extremely powerful in that it
reduces many algorithms to The ``constructive approach'' has a major limitation in its
assumption that the data set will fit entirely within the logic
array. The staged approach is similar in concept except that it
assumes that the data set will not fit within the logic array. It
relies on external memory to store intermediate results and run-time
created overlays which are loaded sequentially. Each overlay contains
a subset of the constructed circuitry. The system passes the input
through the first stage overlay and results are stored in memory.
Then the array is reconfigured and the temporary results are passed
back through the second stage overlay. Once all the overlays have been
used, a new set of results have been found. These new results may
cause the creation of a new overlay which will be used in subsequent
passes. In many cases the staged approach provides a large constant
speedup over a conventional approach.
A half-way step between the constructive and staged approaches is
possible if multiple configuration stores are available. Up to the
limit of the number of configuration stores available, constructed
overlays can be kept in alternate stores on the logic array. When a
new data item is read the multiple configurations can quickly be
flipped through, performing the processing for several overlays before
requiring a memory write. This greatly reduces the overhead caused
by memory accesses.
Data folding is where some variables which are constant over the
duration of a program run can be treated instead as constants,
programming them directly into the configuration for the one run
[Foulk93]. This can be considered a
limited (but important) case of the constructive approach. The
parameters are part of the input set but the constructive approach is
applied only for those parameters where it is considered worthwhile.
This approach can significantly increase the density of
configurations if the folded variables are heavily used.
Another possibility is to take data folding, but not limit it to
the duration of a program. Instead, algorithms can run-time
reconfigure themselves to take different fundamental values as the
algorithm progresses. Changing the value of one of these programmed-in
variables would be relatively expensive but will still be viable in
cases where the changes of value are infrequent.
Data folding at run-time is exactly equivalent to a subset of the
now spurned technique of self-modifying code. The technique fell out
of favour due to the difficulty of understanding and maintaining
self-modifying programs (which were mostly written in assembly language).
These days programming and compiler techniques have advanced to the
point where it is quite easy to conceive of compilers automatically
producing self-modifying code as an optimisation without the
programmer needing to be concerned about it. Given the lack of
advantage to be obtained from self-modifying code on conventional
processors and the caching problems that it causes it is not often
used. In logic arrays however there is potentially great advantage to
be obtained from this technique so a return to the old technique of
self-modification may be a profitable area for future research.
Conventional sequential processors can be considered opposite in
approach to logic array processors - they achieve speed by performing
complex calculations in sequence where logic arrays achieve speed by
performing simple operations in parallel. Sequential processors
suffer the disadvantage that highly regular operations must still be
executed sequentially or with a low degree of parallelism. Logic
arrays are excellent at fine-grained parallelism but poor at
complex, irregular algorithms.
Each is an extreme response to the problem of executing arbitrary
algorithms. Each works for all problems, but tends to offer poor
performance for problems it's poorly suited to. Clearly there must be
a middle ground. MATRIX
[MirskyDehon96]
is the result of one attempt to
find this middle ground, but unfortunately it failed to perform to
expectations.
Reconfigurable logic arrays give us the option of configuring a
sequential processor on to the array if we desire to. Many
researchers have implemented conventional processors on logic arrays
[GrunbacherJaud92]
[Davidson93]
[SurmannUngering92]
[PetersenHutchings95]
[Meier95]
[IseliSanchez93]
[Brunvand92].
Some researchers have taken a further step by
parameterising these processors so they may be adapted to the
specific application
[Meier95]
[IseliSanchez93]
[Brunvand92].
Others have investigated the
possibilities of creating customised instruction set processors in
reconfigurable logic
[AthanasSilverman93]
[LewisVanierssel93]
[Witting95a]
[WirthlinGilson94]
[WirthlinHutchings95]
[WirthlinHutchings95b].
One answer to the problem of finding a middle ground between logic
and sequential computation is to implement sequential processing units
on the logic array. What's more, these sequential processing units can
be customised to the level of regularity expressed by the algorithm. A
relatively simple algorithm might get by with a sequential controller
implemented by a finite state machine in a few latches. A more complex
and irregular algorithm might be best implemented by a processor
configuration running specialised instructions out of memory.
The problem can also be attacked on a more local level. Some
parts of an algorithm might exhibit regularity but others not. In
this case a sequential processor could be configured in only when
needed. Another case might contain highly regular functional units
being controlled by a complex algorithm. A small sequential processor
controlling an area of logic would be better suited to this. In yet
another case complex independent operations with a high level of
local communication might call for an array of sequential processors
to be configured into the array.
Some method of automatically generating sequential processors
well-suited to specific tasks within an algorithm is needed. Neither
completely-sequential nor completely-logic solutions are ideal in all
cases - an approach combining the two is more likely to provide an
efficient solution. The balance of sequential processor size versus
speed will depend on a number of factors including how overall
performance is affected by its speed, how regular the process is that
it controls and how its space use affects the ability of the rest of
the system to perform.
Ideally a compiler would locate sections of an algorithm
involving sequence and determine the best style of sequential
processor for the application. In most cases this would be as simple
as a few latches. In more involved cases quite a complex processor
might result. These ideas are worthy of a research project and thesis
in their own right.
The ultimate extension of the approaches covered in this section
is to apply multiple approaches simultaneously. In some cases the
approaches are compatible and in other cases they are mutually
exclusive. When mutually exclusive, one approach may be favoured
over another dependent on conditions encountered at runtime. It may
be desirable to offer an algorithm in multiple forms which are
selected from depending on the circumstances. The prime number generator
example described in section 4.3 demonstrates this using two versions
of the same algorithm at different times - an initial constructive
phase followed by a staged approach when array space becomes
scarce.
A number of models for programming reconfigurable systems have
been developed. For the most part these models are based on
expediency and the need to interact with the tools available.
Bertin's concept of ``Programmable Active Memories'' is elegant but the
resultant ``Perle'' system was programmed in a fairly conventional
manner
[VuilleminBertin96].
Some systems attempt to provide a
programming model which conceals the distinction between hardware and
software by having the compiler automatically partition between the
two
[Athanas93]
[ThomaeVandenbout92]
[LukLok93]
[WoForward94]
[GokhaleMarks95].
These are not particularly efficient at creating circuits
from conventional code yet, but are improving. Some researchers
conceive of ``virtual logic'' arrays where sections of code are
switched in and out of the logic array at runtime as required
[Gokhale94].
The methods of communication between these virtual
circuits are problematic - conventional circuits are not capable of
being switched in and out transparently in this way.
The Switchblade architecture described in this thesis
pursues a virtual logic model. It attempts to improve the
feasibility of this model by providing specific hardware support for
logic virtualisation and addresses the problem of communication
between virtual logic modules.
Section 4.2 has described a number of approaches to algorithm design,
each of which provides definite advantages under certain
circumstances. Some current systems provide a division between
algorithms to be implemented in hardware and those to be executed in
software. Using the approaches described above it can be seen that in
a purely logic array based system the distinction between sequential
code and logic can be quite blurred, and in fact there are a number
of other ways of expressing the same algorithms. This possibility is
not merely a theoretical curiosity - significant performance
improvements can be found through appropriate choice of approach.
As with the current distinction between sequential code and
logic, algorithms must currently be specifically developed by hand to
use whichever approach is best. The alternative is to develop
a compiler which can automatically determine which of the approaches is
suitable and automatically transmute the algorithm into one based on
a different approach. This is not a trivial problem, and may well be
an important research area of the future.
The current popular methods of expressing algorithms all tend to
be biased toward the traditional sequential mode of computation.
This is unsurprising since they are designed by current programmers -
who are accustomed to programming using sequential languages. It is
difficult to translate these algorithms into circuit definitions
efficiently. It is likewise difficult to translate circuit
definitions into sequential algorithms efficiently. It is even more
difficult to convert algorithms into functionally equivalent but
faster algorithms based on some assessment of what approach may be
more suitable for given situations.
From this it seems clear that our current algorithms are too
involved with their execution environment. They are based on
assumptions about that environment which may not be true in a
changeable reconfigurable logic array system. Some way of neatly
describing the essence of an algorithm without these assumptions is
needed. A meta-algorithm, if you will.
Unfortunately the solution to this problem is not clear. It's also
beyond the scope of this thesis. However this must be considered a
prime research area in the next few years if reconfigurable computing
is to advance.
This project was designed to demonstrate the programming
approaches discussed earlier in this chapter, and to test the
supporting features offered by the DRAMA
architecture. The DRAMA
architecture was designed in concert with the programming approaches
to provide a test bed for their evaluation. As well as creating a
sizeable and potentially useful application for the DRAMA
architecture as proof of concept, the primes project explored the
performance advantages of DRAMA's architectural features
[SaleebaPose96]
[SaleebaPose96a].
Prime number generation was chosen as it did not have a bias
towards floating point or other computation better suited to
special-purpose coprocessors. Similar algorithms could find useful
application in cryptography, where prime numbers are heavily used in
creating code keys
[RivestShamir79].
The algorithm presented here could also be adapted to factorisation,
and hence code breaking.
The generation of prime numbers tends to reduce to a related
problem - determining if a given number has any factors. A great deal
of work has been done on factorisation algorithms over the years, and
Knuth gives a good survey of these
[Knuth81].
An early method was devised by Fermat
[Dickson52],
and is the basis for many of the other methods.
Probabilistic techniques for primality testing are also possible. A
Monte Carlo-type method for factorisation was described by Pollard
[Pollard75],
and a probabilistic primality test
[SolovayStrassen77]
[SolovayStrassen78],
was found to be fast compared to
other methods. While it did not produce deterministic results, the
chance of error could be shown to be vanishingly small.
A continued fractions approach is well suited to finding large factors
of large numbers
[LehmerPowers31]
[BrilliartMorrison75].
This method produces factors very quickly, with
Sieve techniques have the advantage of not requiring complex maths,
which makes them uniquely suited to hardware implementation.
A survey of sieve methods is given in
[Wunderlich67].
A sieve machine is described in
[Lehmer33].
A 1965 hardware delay-line implementation was capable of
processing one million numbers per second. A similar technique is also
adopted for this paper as sieves provide the greatest effect for the
least hardware.
A simple sieve requires one bit of storage for every number tested.
The algorithm used here modifies the algorithm so that the sieve
values are generated on the fly rather than stored.
The algorithm used in this device uses a type of parallel-sieve
technique and relies on the fact that nearly all potential primes
will be eliminated very quickly by an initial, limited sieve. In fact,
on average the algorithm eliminates close to two numbers as primes
per clock cycle. The trickle of numbers which are not immediately
eliminated must be exhaustively checked for primality at considerably
greater expense.
One advantage of the algorithm is that it is able to exploit the
primes already produced to check for factors in later numbers. The
availability of these primes drastically reduces the search space when
looking for factors.
The algorithm is split into two main stages. These stages do not go
through their sequence only once - the algorithm is designed to cycle
between the stages indefinitely as more and more primes are generated.
In fact the algorithm has no termination condition - results will be
generated until the storage space for primes is exhausted. Even then,
the algorithm can conceivably continue producing results to a stream
device up to the square of the last prime stored.
Stated simply, the algorithm generates primes by checking each
successive integer for divisibility by each prime number less than or
equal to its square root. Any integer with no prime factors is added
to the list of primes.
The reason for splitting the algorithm into two stages is that
the issues involved in creating the first-guess ``candidate'' primes
are different from those involved with checking all the possible
factors of a candidate. Since the vast majority of numbers can be
eliminated in the first phase it makes sense to optimise this case as
heavily as possible.
The first stage generates an initial set of candidate primes by
testing each successive number and eliminating the primes from two
upwards as factors. The upper limit on number of primes that can be
checked is determined by the size of the available logic array. This
stage produces a list of candidate primes in a buffer external to the
logic array. These candidate primes have been checked for divisibility
by low-valued primes, but may not have had all possible prime factors
checked.
The second stage uses specially created logic array configurations to
quickly check batches of higher-valued prime numbers against the lists
of candidate primes. The results are stored in a second external
buffer. When all the candidates have been checked against the primes
which a configuration has been designed to check for, a new
configuration is created to check for the next batch of possible
factors. The candidates are passed back through the logic array to the
first buffer, eliminating more potential primes in the process. This
stage can repeat until all the potential factors have been exhausted,
and the contents of the final buffer are declared prime.
This algorithm is an example of the ``staged approach'' described
earlier in this chapter. The sieve algorithm is also inherently an
``enumerative approach'' algorithm.
Candidate generation is the first stage of the algorithm. The
candidate generator is capable of generating a candidate prime every
few clock cycles. This candidate has already been checked for
divisibility by a number of low-valued primes.
In essence the procedure is to count upwards through the integers,
testing a new integer for primality on each clock cycle. As an
optimisation only the odd integers are checked. These checks are
carried out with a high degree of parallelism, with a number of factor
checks occurring every cycle.
To qualify as a candidate prime, an integer must not have a prime
factor in a limited range from 2 upwards. The upper limit is
determined by the size of the logic array since each additional
factor checked incurs an expense in logic elements. Given infinite
circuit resources the first stage could generate all the primes
itself, obviating the need for the latter stage. Unfortunately,
realistic limits on the number of available array elements
necessitate the use of a second stage.
Every single clock cycle another number is checked for primality. All
the factors which have factor checkers in hardware are checked in
parallel, simultaneously. If on any given clock cycle all the
factor-checking units fail to find a factor, the number is saved on
the list of candidate primes.
Ideally as many factors as possible should be checked before a number
is declared a candidate and passed to the next stage. Section 3 gives
simulation results describing the relationship between the number of
factors checked by this first stage and the overall performance of the system.
The factor-checking circuits are very simple. Essentially each
one is a divide-by-n counter, where n is the prime
factor to be tested. Each counter starts at zero and increments up to
n-1, after which it returns to zero. Since the period of the
counter is the same as the value of the factor, each zero output of
the counter corresponds to another number divisible by that factor.
Figure4.3
shows a partial circuit of the first stage.
Interestingly, the fact that each clock indicates a step by two
rather than one does not affect the period of the counters,
only the point at which they are initialised. In other words if the
counter for prime factor seven is reset to zero on the clock cycle
which corresponds to the value seven, then the next time it will
indicate a factor is seven clocks later, on the cycle corresponding
to the value twenty-one. The factor for fourteen never occurs as it
is even and there are no clock cycles for even numbers.
By ORing the return-to-zero signals on each counter it is possible to
determine if any of the factor checkers has found a factor each cycle.
If none of the factor checkers has found a factor the value is a
candidate prime and can be stored.
Any prime generated by the circuit is either a candidate prime or
a true prime. True primes will be generated from the time
that the circuit starts running until the tested value reaches the
square of ``the first prime which cannot be implemented as a factor
checker''. This is the first value which has a factor which is not
testable by the candidate generator due to the limited logic
resources.
True primes are stored directly to the result buffer in main memory
using the external memory interface.
When the circuit is first started it knows only about the primes
one and two. As it discovers more primes additional factor checkers
are added for these new values. This is possible due to the
self-reconfiguring nature of DRAMA. An interface is provided
which permits the device to change its own configuration as it
operates. In this case the feature is used to create a new factor
checker containing a divide-by-n counter where n is the
value just checked.
The self-generation phase is an example of the ``constructive
approach'' described earlier in this chapter.
The algorithm for creating the factor checker is simple enough but
expensive of logic array space. For this reason it is stored in a
separate configuration, as will be described shortly. The first thing
it does is find an appropriate area of free logic to configure. This
is easy on DRAMA as elements can be allocated linearly. A
configuration for a simple ripple counter is created, with the reset
function being determined by the bit pattern of the factor being
checked. The generator's main OR function is modified to include the
output of the new circuit.
The progress of the first stage can be split into three phases -
firstly the self-generation phase (which also produces the low-valued
true primes), next the true prime generation phase, and finally the
candidate generation phase, which can recur indefinitely. See
figure4.4.
It may seem overly complex to switch between
phases this way, but in fact most of the circuit remains identical -
the change in phase merely indicates that a small subsection of the
array has switched from one configuration to another. This is an
advantage of the ``multiple configuration'' feature of DRAMA,
which permits arbitrary sections of circuit to change configuration
instantly on receipt of circuit stimuli.
Multiple configuration is also used heavily during the self-generation
phase. When a new configuration is to be created the normal clocking
scheme and some of the supporting logic is switched out while the new
factor checker is constructed by a different configuration. When this
is complete the normal generator circuitry is switched back in and
continues as if uninterrupted.
Yet another configuration is used for calculating squares. The square
of the first prime which can't fit in the device as a factor
checker is used to determine the transition from true prime
generation to candidate prime generation. On the transition from the
self-generation phase to the true prime phase the discovery of the
first true prime causes the system to switch configurations, perform a
simple shift/add multiply to square the current number, and then
modify the configuration again.
The modified configuration sets an AND gate to monitor the
current value being tested and inform the system when it's time to
change to candidate generation phase. In non-reconfigurable systems
this function would be performed by a latch and a comparator, at
greater expense in logic array elements. With self-reconfiguration we
can use a simple AND gate and modify it to suit. This is an example of
run-time data folding, one of the approaches described earlier.
Realistically speaking it is not strictly necessary to start with
a bare-bones candidate generator and bootstrap an initial candidate
generator from there. It would be easy enough to generate a list of
low-valued primes and create a configuration for the initial
candidate generator beforehand, however the algorithm as described
vividly demonstrates the power of self-reconfiguration.
Candidate primes are stored differently from true primes - they are
saved to one of two external RAMs connected to the logic array. These
RAMs act as a double-buffered temporary result area. At any time the
logic array may be reading one buffer and writing to the other. While
later stages use double-buffering, the candidate generator needs only to
write results to one buffer.
The way that the values are stored is slightly novel. Rather than
storing the absolute binary value of the candidate prime, the values
are stored as differences from the previous value. The difference
between primes is always even, so the least significant bit need not
be stored. Also, simulation shows that for the first million primes on
average candidate primes were 16.5 higher than the previous
candidate. None were found greater than 154 from the previous
candidate, so 16 bits was deemed ample for storing each
difference. This is clearly more
economical than storing the (potentially very high valued) numbers in
their raw form.
Another optimisation might be Huffman coding, which would drastically
improve the efficiency of use of the buffers, and hence improve the
overall performance of the system.
When the candidate buffer fills up, the first stage must come to
an end. Control will be passed to the second stage by means of
switching configurations. At some later stage when all the available
candidate primes have been processed by the later stages, control
will return to the first stage so that more candidates can be
generated. This highlights an important feature of the way
DRAMA handles its multiple configurations - when a
configuration is switched out its state is stored with it.
This means that restoring the candidate generator to the exact place
where it left off is simply a matter of switching its configuration
back in.
The purpose of the second stage is to check the list of candidate
primes and eliminate any with factors. Since the hardware is
incapable of checking all the factors simultaneously, the algorithm
divides the task into groups of factors to be checked and iteratively
checks each group of factors against the list of candidates. On each
iteration the candidates are read one by one from the source buffer,
checked for factors, and candidates for which no factors have been
found are written to the destination buffer.
After each pass a configuration with the next set of factors to check
is established and the process is repeated. The only difference is
that the source buffer and destination buffer swap their roles with
each pass, performing a sort of ``sifting'' action on the candidates as
they are ``shaken'' back and forwards between the two buffers.
The process of checking factors in the second stage is similar in
concept to that used in the first stage, though the procedure is
slightly more complex. The checkers (called ``clearing units'' here to
distinguish them from the factor checkers of the first stage) use the
same technique of counting up the moduli of each of the factors being
tested, however unlike stage one they can't count in predictable
steps.
The stream of candidate primes coming in from the source buffer
is encoded as differences, so instead of incrementing the modulus
counters by a fixed step as in stage one, the difference is
added to the values of each of the counters.
As with stage one the counters have a period equal to the value
of the factor being cleared. In this case though the counters count
from a value p part way through the output range up to a state
with all bits set. This approach is designed to minimise the circuit
complexity and hence space that each clearing unit consumes. When a
counter overflows p is added to the counter to return it to
the appropriate range
(figure 4.5).
In this way
the period of the counters is kept correct. p is simply the
two's complement of the period.
When a counter overflows the output is quickly compared to
zero before the addition takes place. If it was non-zero on all the
clearing units, no factors were found and the candidate is written to
the destination buffer. If on the other hand factors were found, then
this candidate can be discarded. The difference values of discarded
candidates are accrued in an output difference register. When the next
factorless candidate is written it will have the extra differences
added to it to keep the differences correctly synchronised.
The time taken to check a candidate against the factors in a
configuration is two cycles - the first cycle adds the difference to
the counter and checks if the result is zero. The second cycle adds
p to the counter if an overflow occurred. These two steps each
trigger parallel computation on all the factor clearers on the array,
eliminating many factors in one hit.
The configuration of factor clearing circuits takes a double-buffered
approach. A small area of the logic array is devoted to a
reconfiguration controller which operates in parallel with factor
clearing. As all the candidate primes are sifted across from one
buffer to the other, testing against one set of factors, the
reconfiguration controller is fetching the next set of factors
and creating a configuration to perform the task of clearing these factors.
Two configuration stores are set aside for factor clearing circuits.
At any given time during the second stage one of the configurations is
being used to clear factors and the other is being configured with the
next set of factors. After each iteration the roles of the two
configurations are reversed, double-buffering the reconfiguration
process.
This reconfiguration controller is a simple state machine designed to
use the minimum amount of logic array space. Its job is greatly
simplified by the fact that the configurations already exist in an
almost complete state in the configuration store, and are designed
so that only relatively few changes need be made to modify them to
check for different factors.
Of course this same function could have been performed by
statically-configured logic but reconfiguration has the advantage
of allowing the size of the clearing units to be
determined by necessity, allowing more to be squeezed into the
available space.
A side effect of the approach taken here is that the same
configurations are created again and again as each new set of
candidates from stage one are checked against the same factors. This
initially appeared to be an expensive choice, and storing the
configurations off-chip was considered. The storage requirements of
this would have been prohibitive however, and a simplified design of
clearing unit helped reduce the complexity of the reconfiguration
controller to a point where the cost of creating the configurations
from scratch was acceptable.
Before a configuration can be used, each counter must be initialised
with the correct modulus value for the starting candidate value modulo
that factor. After the reconfiguration controller has finished its
task for each factor, the same logic array area is replaced with a
configuration for a simplified long divider. This divides the
candidate by the factor, and stores the modulus in the appropriate
counter by means of the self-reconfiguration interface.
When a configuration is first used, its state is such that one cycle
of the ``add p'' state occurs, bringing the modulus into the
correct range.
The second stage is complete when all the factors up to the square
root of the largest candidate have been tested. After configuration,
the square of the next higher factor is calculated. Verified
candidates below this limit are true primes, and can be output as
results. Otherwise they still have remaining factors to check, and are
passed to the destination buffer for the next pass.
When the external memory runs out of space it is still possible to
produce results, as long as there is some type of stream device to
output them to. The highest number which can be checked for primality
is the square of the largest number stored in the result area. The
result numbers are necessary as they are used in generating the second
stage factor clearers. This maximum value is likely a very large
number, and reaching it would probably require considerable patience
on the part of the operators and their descendants.
The algorithm was programmed in a form suitable for running under the
DRAMA simulator. It was found that the design could
fit in a multiple-configuration device with eight alternate
configurations available.
The effectiveness of the algorithm is highly dependent on the size of
the available logic array. The more numbers eliminated in the first
stage, and hence the fewer false candidates produced, the more
effectively the external buffers will be used, and hence the faster
the system will run. Also, the second stage has a considerable
overhead in the size of the reconfiguration controller. The effect of
this can only be minimised by use of a relatively large logic array.
Simulations were performed with various sizes of array. The
results given in
table 4.1
show the percentage of false candidates produced for various array
sizes. These results were collected for a short run of primes up to
one million. The proportion of false candidates is determined by the
number of factor checkers. This in turn is determined by the size of
the logic array. For comparison the number of factor clearers from
the second stage is also shown. Note that since the size of the
factors change from pass to pass the number of them which can fit in
the logic array varies. For this reason the number of factor clearers
is expressed as an average.
The proportion of false candidate primes reduces gradually as the
number of factor checkers is increased. The reason for the more marked
drop towards the end of the sample space is that the 141 prime factors
being checked for reach the value 811, most of the way to the square
root of the maximum value being tested.
Figure 4.6
shows the rate of output of prime
numbers graphed against the size of the logic array and the runtime
elapsed. The first obvious feature of the graph is that the rate of
production of primes is very irregular. In fact the graph
approximates
figure 4.4,
with flat candidate
generation phases alternating with steep factor clearing stages.
The larger logic arrays are able to produce considerably more primes
in the initial spate - the number of primes they can generate without
going to the second stage varies with the square of the number of
factor checkers.
The flat stage indicates the candidate generation phase, and takes
about the same time in each case. Of course the larger arrays
produce a smaller proportion of false candidates.
The reconfiguration overhead in the second stage is large enough that
a 300-element array is only just viable. Its performance is very bad
compared to larger arrays, indicative of ``thrashing''. As the
capacity of the arrays increases the number of primes which are
produced in the second stage increases dramatically.
It is to be expected that for longer runs the rate of
production of primes will decrease gradually over time. Not only do
primes become more rare the higher their values become, it is also
more expensive to check them for primality.
Given a pessimistic clock speed of 10MHz and an array size of a
thousand elements, the expected output is on the order of a million
primes per second for the first few seconds of operation. By
comparison a 100MHz SGI Indy took 18.47s to calculate the first
million primes using the highly-optimised ``primes'' program which
comes standard with BSD UNIX. The machine in question was an IP22
model Indy with a MIPS R4000 RISC processor, 32Mb of RAM, 8k byte
instruction and data primary caches and a 1Mb unified secondary
cache.
One potential problem with the second stage occurs if the number of
candidates in the buffer is small enough that they are processed more
quickly than the next configuration can be loaded. Interlocks exist to
prevent premature changeover of configurations, but if this situation
was to occur often the efficiency of the process would be greatly
reduced. Simulation showed that in practice this situation occurs
rarely, if ever. In fact it might be considered worthwhile to reduce
the length of the external buffers from the experimental size of 64k
words if their expense is a significant consideration. This would increase
the chance of premature completion, but this does not appear to be a
limiting factor in operation.
As an exercise in converting a familiar algorithm to an unusual
architecture this project showed very clearly that the types of
algorithms which are likely to perform well on conventional computers
may not necessarily be well adapted to other platforms. The
approaches used here were specifically geared to reconfigurable logic
array architectures, and the performance gains over an algorithm on a
conventional architecture were considerable.
The experience gained in this project indicates that new
programming approaches and architectures to facilitate them must be
developed to fully exploit the benefits offered by this technology.
The architectural features of DRAMA which proved most useful
were multiple configuration stores, partial reconfiguration and
self-reconfiguration. The disadvantage of exploiting these features
was a large increase in complexity, and a corresponding increase in
the incidence of bugs. Automatic tools could potentially alleviate
these problems in the future.
As a fast prime number generator, the device is quite successful. The
main disadvantage is the algorithmic complexity of checking every
potential factor of each candidate prime. An algorithmic improvement
could be achieved by adding a third stage to the algorithm with a
continued fractions approach to factorisation. This would of course
add further complexity to the system, but would improve performance
once the numbers being produced started to reach high values.
Programming for computing devices based on logic arrays is not the
same as programming for conventional general purpose processors.
Circuit synthesis from conventional algorithms is possible but fails
to exploit the abilities of the logic array, resulting in inferior
performance.
There are a number of general approaches to algorithm design which
can result in significantly improved performance over conventional
methods when the algorithms are targeted to logic arrays. In
particular certain approaches can exploit specific architectural
features which may be available on a reconfigurable logic array.
Functionally equivalent algorithms can be expressed in many different
ways, each of which may be better suited to a target architecture with
different attributes or to a different data set.
Von Neumann style conventional processors are merely one data
point in the spectrum of possible processors. Logic arrays are
another. The advent of reconfigurable logic arrays gives rise to the
possibility of gradations between these two extremes, ranging from
fully sequential programs to bit-parallel circuits.
The split between programming techniques for conventional
processors and programming techniques which are better suited to
logic array processors is an indication of a larger problem - some
more general way of expressing algorithms which is less dependent on
the target architecture is needed. The future may hold systems which
can automatically produce an appropriate algorithm from a combination
of a stated function and a description of the operating environment.
In the meantime we still have a great deal to learn about how
best to program logic arrays in order to extract the maximum
performance out of them. The programming approaches outlined in this
chapter when combined with architectural features to complement them
result in some significant advances in performance over previous
systems.
Figure 4.1 - Effects of communication and parallelism on speed
4.2.1 Example of control flow
Figure 4.2 - A program as a hierarchy of communicating parts
4.2.2 Impact of architectural features
4.2.3 Programming techniques
4.2.3.1 Enumerative approach
4.2.3.2 Overlay approach
4.2.3.3 Double-buffered approach
4.2.3.4 Limited word length arithmetic
4.2.3.5 Variable word lengths - temporally partitioned
4.2.3.6 Variable word lengths - on demand
4.2.3.7 Constructive approach
4.2.3.8 Staged approach
4.2.3.9 Data folding
4.2.3.10 Run-time data folding
4.2.3.11 Sequence/space tradeoffs
4.2.3.12 Algorithm folding over the time domain
4.2.4 A model for programming dynamic logic arrays
4.2.4.1 The future
4.3 Prime number generation case study
4.3.1 Primes
4.3.2 Algorithm
4.3.3 Candidate generation
4.3.3.1 Counting factors
4.3.3.2 True primes
4.3.3.3 Self-generation
4.3.4 Phases
4.3.4.1 Calculating squares
4.3.4.2 Storing candidates
4.3.4.3 Buffer full
4.3.5 Clearing factors
4.3.5.1 Factor clearing
4.3.5.2 Configuring the clearer
4.3.5.3 Prepping the moduli
4.3.5.4 Producing results
4.3.6 Simulation
4.3.7 Review of the primes study
4.4 Conclusions